A proxy server sits between the client and the target server as an intermediary that forwards requests, and it can modify, log, or cache requests and responses. Common uses include:

- **Privacy**: hide the client's real IP address to avoid tracking.
- **Bypassing restrictions**: access geo-restricted content and get around firewalls or network censorship.
- **Web scraping**: avoid IP bans and distribute crawling across many addresses to raise collection throughput.
- **Load balancing**: spread requests across multiple servers for better performance and reliability.
- **Access control**: monitor and filter employee traffic on corporate networks.
- **Caching**: cache frequently used resources to cut bandwidth use and speed up access.
| Proxy type | Protocol | Characteristics | Typical use cases |
|---|---|---|---|
| HTTP proxy | HTTP/HTTPS | Handles only HTTP traffic; no other protocols | Web browsing, API calls, basic crawlers |
| HTTPS proxy | HTTPS | Supports SSL/TLS encryption for higher security | Security-sensitive scenarios such as banking and e-commerce |
| SOCKS4 proxy | SOCKS4 | TCP connections only, no authentication | Basic anonymity, simple TCP applications |
| SOCKS5 proxy | SOCKS5 | TCP and UDP, authentication support, most complete feature set | BT downloads, gaming, VPN alternative |
| Transparent proxy | HTTP | Does not modify requests; invisible to the client | Corporate network monitoring, content caching |
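
Each of these proxy types maps to a URL scheme in the `proxies` dictionary that requests accepts. A minimal sketch of the mapping (the proxy hosts and ports are placeholders, not real servers):

```python
# Hypothetical proxy endpoints -- substitute your own host:port values.
http_proxy = {'http': 'http://proxy.example.com:3128',
              'https': 'http://proxy.example.com:3128'}    # HTTP proxy (CONNECT for HTTPS)
socks4_proxy = {'http': 'socks4://proxy.example.com:1080',
                'https': 'socks4://proxy.example.com:1080'}  # SOCKS4, no auth
socks5_proxy = {'http': 'socks5://proxy.example.com:1080',
                'https': 'socks5://proxy.example.com:1080'}  # SOCKS5, DNS resolved locally
socks5h_proxy = {'http': 'socks5h://proxy.example.com:1080',
                 'https': 'socks5h://proxy.example.com:1080'}  # SOCKS5, DNS resolved on the proxy
```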
```python
import requests

# The simplest HTTP proxy setup
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print("IP as seen through the proxy:", response.json())

# Setting only an HTTP proxy (HTTPS requests will not use the proxy)
proxies_http_only = {'http': 'http://10.10.1.10:3128'}
response = requests.get('http://example.com', proxies=proxies_http_only)
# Note: HTTPS requests need an 'https' entry -- either an HTTPS proxy or an
# HTTP proxy that supports the CONNECT method.
```
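
One way to see this scheme matching in practice is to compare the origin IP reported for an HTTP and an HTTPS request when only the `'http'` key is set. A small sketch (the proxy address is a placeholder, and it assumes no proxy environment variables are set):

```python
import requests

proxies_http_only = {'http': 'http://10.10.1.10:3128'}  # hypothetical proxy

# Goes through the proxy: the URL scheme 'http' matches the 'http' key.
via_proxy = requests.get('http://httpbin.org/ip', proxies=proxies_http_only)

# Goes direct: there is no 'https' key, so no proxy is selected.
direct = requests.get('https://httpbin.org/ip', proxies=proxies_http_only)

print("http  ->", via_proxy.json()['origin'])  # the proxy's IP
print("https ->", direct.json()['origin'])     # your own IP
```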
```python
import requests

# Different proxy servers for different protocols
proxies = {
    'http': 'http://http-proxy.example.com:8080',
    'https': 'https://https-proxy.example.com:8443',
    'ftp': 'ftp://ftp-proxy.example.com:2121',
}

# HTTP request (routed through the HTTP proxy)
response_http = requests.get('http://httpbin.org/ip', proxies=proxies)
# HTTPS request (routed through the HTTPS proxy)
response_https = requests.get('https://httpbin.org/ip', proxies=proxies)

print("HTTP proxy IP:", response_http.json())
print("HTTPS proxy IP:", response_https.json())
# Note: requests mainly supports HTTP/HTTPS; FTP proxying needs extra handling.
```
```python
import requests

# Attempting to exclude certain addresses from the proxy
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
    'no': 'pass'  # this key is NOT recognized by requests
}

# Requests does not support excluding domains directly in the proxies dict.
# Workaround: decide per URL whether to use the proxy.
def get_with_proxy(url, use_proxy=True):
    """Choose whether to use the proxy based on a flag."""
    if use_proxy:
        proxies = {'http': 'http://10.10.1.10:3128'}
    else:
        proxies = None
    return requests.get(url, proxies=proxies)

# Skip the proxy for internal addresses
internal_urls = ['http://internal-api.company.com', 'http://192.168.1.1']
external_urls = ['http://api.github.com', 'http://httpbin.org']

for url in internal_urls:
    response = get_with_proxy(url, use_proxy=False)
    print(f"Direct access to {url}: {response.status_code}")

for url in external_urls:
    response = get_with_proxy(url, use_proxy=True)
    print(f"Via proxy to {url}: {response.status_code}")
```
```python
import os
import requests

# Set environment variables (affects all requests calls in this process)
os.environ['HTTP_PROXY'] = 'http://10.10.1.10:3128'
os.environ['HTTPS_PROXY'] = 'http://10.10.1.10:3128'
os.environ['NO_PROXY'] = 'localhost,127.0.0.1,192.168.1.1'

# All requests now use the proxy automatically
response = requests.get('http://httpbin.org/ip')
print("Via environment-variable proxy:", response.json())

# Note: environment variables have lower priority than the proxies parameter;
# even with them set, a proxies argument on a request still takes precedence.

# Remove the proxy environment variables
os.environ.pop('HTTP_PROXY', None)
os.environ.pop('HTTPS_PROXY', None)
```
- The `HTTP_PROXY` and `HTTPS_PROXY` environment variables affect not just requests but other libraries and tools that honor them.
- `NO_PROXY` lists addresses that should bypass the proxy (comma-separated).
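
If you want requests to ignore those environment variables entirely, turn off `Session.trust_env`; this disables env-derived proxy (and `.netrc`) handling:

```python
import requests

session = requests.Session()
session.trust_env = False  # ignore HTTP_PROXY/HTTPS_PROXY/NO_PROXY and .netrc

# Even with proxy environment variables set, this request goes direct
# unless a proxies argument is given explicitly.
response = session.get('http://httpbin.org/ip')
print(response.json())
```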
Many proxy servers require a username and password. Requests supports several ways to supply proxy credentials:

```python
import requests

# Method 1: embed the username and password in the proxy URL (recommended)
proxies = {
    'http': 'http://user:pass@10.10.1.10:3128',
    'https': 'http://user:pass@10.10.1.10:3128',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print("Via authenticated proxy:", response.status_code)

# Method 2: use requests.auth.HTTPProxyAuth
from requests.auth import HTTPProxyAuth

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
auth = HTTPProxyAuth('user', 'pass')
response = requests.get('http://httpbin.org/ip', proxies=proxies, auth=auth)
print("Using HTTPProxyAuth:", response.status_code)
```
```python
import requests
from urllib.parse import quote_plus

# Usernames or passwords containing special characters must be URL-encoded
username = 'user@domain.com'
password = 'pass#word!123'

encoded_username = quote_plus(username)
encoded_password = quote_plus(password)

# Use the encoded credentials in the proxy URL
proxy_url = f'http://{encoded_username}:{encoded_password}@10.10.1.10:3128'
proxies = {'http': proxy_url, 'https': proxy_url}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print("Proxy with encoded credentials:", response.status_code)

# requests.utils.quote works as well
from requests.utils import quote

proxy_url = f'http://{quote(username, safe="")}:{quote(password, safe="")}@10.10.1.10:3128'
proxies = {'http': proxy_url}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print("Using requests.utils.quote:", response.status_code)
```
```python
import base64
import requests

# Custom proxy authentication handler
class CustomProxyAuth(requests.auth.AuthBase):
    """Custom proxy authentication class."""

    def __init__(self, username, password):
        self.username = username
        self.password = password

    def __call__(self, r):
        # Add a Proxy-Authorization header to the outgoing request
        credentials = f'{self.username}:{self.password}'
        encoded_credentials = base64.b64encode(credentials.encode()).decode()
        r.headers['Proxy-Authorization'] = f'Basic {encoded_credentials}'
        return r

# Use the custom authentication class
proxies = {
    'http': 'http://10.10.1.10:3128',
}
auth = CustomProxyAuth('user', 'pass')
response = requests.get('http://httpbin.org/ip', proxies=proxies, auth=auth)
print("Custom proxy auth:", response.status_code)

# Or set the header directly
proxies = {'http': 'http://10.10.1.10:3128'}
headers = {
    'Proxy-Authorization': 'Basic ' + base64.b64encode(b'user:pass').decode()
}
response = requests.get('http://httpbin.org/ip', proxies=proxies, headers=headers)
print("Direct Proxy-Authorization header:", response.status_code)
```
| Authentication method | Pros | Cons |
|---|---|---|
| Credentials in the URL (`http://user:pass@proxy:port`) | Simple and concise | Password appears in the URL in plain text and may end up in logs |
| `HTTPProxyAuth` class | Standard approach; supports more authentication mechanisms | Extra import; slightly more code |
| Custom auth class | Flexible; can support complex authentication flows | More complex to implement; more code |
| Environment variable (`http_proxy=http://user:pass@proxy:port`) | Global setting; no code changes needed | Poor security; every request shares the same credentials |
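
Whichever method you choose, credentials are best kept out of source code. A minimal sketch, assuming the credentials live in environment variables of your own naming (`MY_PROXY_USER` / `MY_PROXY_PASS` are placeholders):

```python
import os
import requests
from urllib.parse import quote_plus

# Hypothetical variable names -- set them in your shell, not in code.
user = quote_plus(os.environ['MY_PROXY_USER'])
password = quote_plus(os.environ['MY_PROXY_PASS'])

proxy_url = f'http://{user}:{password}@10.10.1.10:3128'
response = requests.get('http://httpbin.org/ip',
                        proxies={'http': proxy_url, 'https': proxy_url})
print(response.status_code)
```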
SOCKS (Socket Secure) is a network transport protocol used to relay traffic between a client and a server. Unlike an HTTP proxy, a SOCKS proxy operates at a lower layer and can relay arbitrary TCP/UDP traffic.
```python
import requests

# Note: SOCKS proxies require an extra dependency:
#   pip install requests[socks]   (or: pip install PySocks)

# SOCKS5 proxy
proxies = {
    'http': 'socks5://user:pass@127.0.0.1:1080',
    'https': 'socks5://user:pass@127.0.0.1:1080'
}

# SOCKS4 proxy (no authentication)
proxies_socks4 = {
    'http': 'socks4://127.0.0.1:1080',
    'https': 'socks4://127.0.0.1:1080'
}

# SOCKS5h proxy (DNS is resolved on the proxy server, improving privacy)
proxies_socks5h = {
    'http': 'socks5h://127.0.0.1:1080',
    'https': 'socks5h://127.0.0.1:1080'
}

try:
    response = requests.get('http://httpbin.org/ip', proxies=proxies)
    print("Via SOCKS5 proxy:", response.json())
except Exception as e:
    print(f"SOCKS proxy error: {e}")
    print("Make sure PySocks is installed: pip install PySocks")

# Verify that a SOCKS proxy is working
def test_socks_proxy(proxy_url):
    """Test whether a SOCKS proxy is usable."""
    proxies = {'http': proxy_url, 'https': proxy_url}
    test_urls = [
        'http://httpbin.org/ip',
        'https://httpbin.org/ip',
        'http://icanhazip.com'
    ]
    for url in test_urls:
        try:
            response = requests.get(url, proxies=proxies, timeout=10)
            print(f"{url}: OK (IP: {response.text.strip()})")
        except Exception as e:
            print(f"{url}: failed ({e})")

# Try different SOCKS proxies
print("\nTesting SOCKS proxies:")
test_socks_proxy('socks5://127.0.0.1:1080')
test_socks_proxy('socks5h://127.0.0.1:1080')
```
```python
import requests
import socks   # provided by PySocks
import socket

# Method 1: configure PySocks directly (lower level)
def use_global_socks_proxy():
    """Route all socket connections through a SOCKS proxy."""
    original_socket = socket.socket  # keep a reference so we can restore it later

    # Make the SOCKS proxy the default for new sockets
    socks.set_default_proxy(
        socks.SOCKS5,        # SOCKS version
        "127.0.0.1",         # proxy address
        1080,                # proxy port
        username='user',     # optional
        password='pass'      # optional
    )
    socket.socket = socks.socksocket

    # Every socket connection now goes through the SOCKS proxy
    response = requests.get('http://httpbin.org/ip')
    print("Via PySocks global proxy:", response.json())

    # Restore the default socket
    socks.set_default_proxy()
    socket.socket = original_socket

# Method 2: use the requests proxies parameter (recommended)
proxies = {
    'http': 'socks5://user:pass@127.0.0.1:1080',
    'https': 'socks5://user:pass@127.0.0.1:1080'
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print("Via requests proxies parameter:", response.json())
```
```python
import socket

import requests
import socks
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Custom adapter that configures a SOCKS proxy at the socket level
class SOCKSAdapter(HTTPAdapter):
    """Custom adapter that wires up a SOCKS proxy before building the pool."""

    def __init__(self, proxy_url, **kwargs):
        self.proxy_url = proxy_url
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        # Parse the proxy URL
        from urllib.parse import urlparse
        parsed = urlparse(self.proxy_url)

        # Configure the SOCKS proxy (note: this patches sockets process-wide)
        if parsed.scheme.startswith('socks'):
            socks_version = {
                'socks4': socks.SOCKS4,
                'socks5': socks.SOCKS5,
                'socks5h': socks.SOCKS5
            }.get(parsed.scheme, socks.SOCKS5)
            socks.set_default_proxy(
                socks_version,
                parsed.hostname,
                parsed.port,
                username=parsed.username,
                password=parsed.password
            )
            socket.socket = socks.socksocket
        super().init_poolmanager(*args, **kwargs)

# Use the custom adapter
session = requests.Session()

# Retry policy
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504]
)

# Mount the SOCKS adapter
adapter = SOCKSAdapter('socks5://127.0.0.1:1080', max_retries=retry_strategy)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Send a request
response = session.get('http://httpbin.org/ip')
print("Via custom SOCKS adapter:", response.json())
```
```python
import requests

# Configure SOCKS proxies on a Session
session = requests.Session()
session.proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080'
}

# Or use a context manager
with requests.Session() as session:
    session.proxies = {
        'http': 'socks5h://127.0.0.1:1080',
        'https': 'socks5h://127.0.0.1:1080'
    }

    # Note: requests has no Session-level timeout attribute;
    # pass timeout= on each individual request instead.

    # Configure retries
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    # Send several requests (all through the same SOCKS proxy)
    urls = [
        'http://httpbin.org/ip',
        'https://httpbin.org/ip',
        'http://httpbin.org/user-agent'
    ]
    for url in urls:
        try:
            response = session.get(url, timeout=30)
            print(f"{url}: {response.status_code}")
        except Exception as e:
            print(f"{url}: error - {e}")
```
A Session object lets you manage proxy settings in one place instead of repeating them on every request.
```python
import requests

# Create a Session and configure proxies
session = requests.Session()

# Proxies set on the Session apply to all of its requests
session.proxies.update({
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
})

# Direct assignment also works
session.proxies = {
    'http': 'http://user:pass@proxy1.example.com:8080',
    'https': 'https://user:pass@proxy2.example.com:8443',
}

# Every request sent through this Session now uses the proxies
response1 = session.get('http://httpbin.org/ip')
response2 = session.get('https://api.github.com')
print(f"Request 1 IP: {response1.json()}")
print(f"Request 2 status: {response2.status_code}")

# Temporarily override the Session's proxies for one request.
# Note: proxies=None does NOT bypass session.proxies -- per-request settings
# are merged with the Session's, and None-valued keys remove the merged entries.
response3 = session.get(
    'http://internal-api.company.com',
    proxies={'http': None, 'https': None}  # this request goes direct
)
print(f"Request 3 status: {response3.status_code}")
```
```python
import requests

class DynamicProxySession:
    """Session wrapper that rotates through a list of proxies."""

    def __init__(self, proxy_list=None):
        self.session = requests.Session()
        self.proxy_list = proxy_list or []
        self.current_proxy_index = 0
        # Default request headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        })

    def get_next_proxy(self):
        """Return the next proxy in round-robin order."""
        if not self.proxy_list:
            return None
        proxy = self.proxy_list[self.current_proxy_index]
        self.current_proxy_index = (self.current_proxy_index + 1) % len(self.proxy_list)
        return proxy

    def request(self, method, url, **kwargs):
        """Send a request through the next proxy in the rotation."""
        proxy = self.get_next_proxy()
        if proxy:
            kwargs.setdefault('proxies', {})
            # SOCKS URLs already carry a scheme; plain host:port entries get http://
            if proxy.startswith('socks'):
                kwargs['proxies'].update({'http': proxy, 'https': proxy})
            else:
                kwargs['proxies'].update({
                    'http': f'http://{proxy}',
                    'https': f'http://{proxy}'
                })
            print(f"Using proxy: {proxy}")
        return self.session.request(method, url, **kwargs)

    def get(self, url, **kwargs):
        return self.request('GET', url, **kwargs)

    def post(self, url, **kwargs):
        return self.request('POST', url, **kwargs)

# Example usage
proxy_list = [
    'user1:pass1@proxy1.example.com:8080',
    'user2:pass2@proxy2.example.com:8080',
    'socks5://user3:pass3@proxy3.example.com:1080',
    '192.168.1.100:3128',
]

dynamic_session = DynamicProxySession(proxy_list)

# Each request uses a different proxy from the list
for i in range(5):
    try:
        response = dynamic_session.get('http://httpbin.org/ip')
        print(f"Request {i+1}: {response.json()}")
    except Exception as e:
        print(f"Request {i+1} failed: {e}")
```
```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

class HealthyProxySession:
    """Proxy-backed Session with an upfront health check."""

    def __init__(self, proxy_list, check_url='http://httpbin.org/ip', timeout=5):
        self.session = requests.Session()
        self.check_url = check_url
        self.timeout = timeout
        # Test the proxies and keep the usable ones
        self.healthy_proxies = self._check_proxies_health(proxy_list)
        print(f"Initial proxies: {len(proxy_list)}")
        print(f"Healthy proxies: {len(self.healthy_proxies)}")

    def _check_proxy_health(self, proxy):
        """Check a single proxy's health."""
        error = 'unknown'
        try:
            proxies = {
                'http': f'http://{proxy}',
                'https': f'http://{proxy}'
            }
            start_time = time.time()
            response = self.session.get(
                self.check_url,
                proxies=proxies,
                timeout=self.timeout
            )
            response_time = time.time() - start_time
            if response.status_code == 200:
                return {
                    'proxy': proxy,
                    'healthy': True,
                    'response_time': response_time,
                    'ip': response.json().get('origin', 'unknown')
                }
            error = f'status code {response.status_code}'
        except Exception as exc:
            error = str(exc)
        return {
            'proxy': proxy,
            'healthy': False,
            'response_time': None,
            'error': error
        }

    def _check_proxies_health(self, proxy_list):
        """Check several proxies concurrently."""
        healthy_proxies = []
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = {
                executor.submit(self._check_proxy_health, proxy): proxy
                for proxy in proxy_list
            }
            for future in as_completed(futures):
                result = future.result()
                if result['healthy']:
                    healthy_proxies.append(result)
                    print(f"✓ proxy {result['proxy']} healthy "
                          f"(response time: {result['response_time']:.2f}s)")
                else:
                    print(f"✗ proxy {result['proxy']} unavailable")
        # Sort by response time, fastest first
        healthy_proxies.sort(key=lambda x: x['response_time'])
        return healthy_proxies

    def get_fastest_proxy(self):
        """Return the fastest healthy proxy."""
        if self.healthy_proxies:
            return self.healthy_proxies[0]['proxy']
        return None

    def request(self, method, url, **kwargs):
        """Send a request through the fastest proxy."""
        fastest_proxy = self.get_fastest_proxy()
        if fastest_proxy:
            kwargs.setdefault('proxies', {})
            kwargs['proxies'].update({
                'http': f'http://{fastest_proxy}',
                'https': f'http://{fastest_proxy}'
            })
        return self.session.request(method, url, **kwargs)

# Example usage
proxy_list = [
    'proxy1.example.com:8080',
    'proxy2.example.com:8080',
    'proxy3.example.com:8080',
    'proxy4.example.com:8080',
]

healthy_session = HealthyProxySession(
    proxy_list,
    check_url='http://httpbin.org/ip',
    timeout=10
)

# Send a request through the fastest proxy
response = healthy_session.request('GET', 'http://httpbin.org/headers')
print(f"Response via fastest proxy: {response.status_code}")
print(f"Reported IP: {response.json().get('headers', {}).get('X-Forwarded-For', 'unknown')}")
```
For applications that issue large volumes of requests (such as web crawlers), a proxy pool helps avoid IP bans and improves the request success rate. A proxy pool manages multiple proxies automatically, providing load balancing and failover.
```python
import random
import time
from collections import defaultdict

import requests

class SimpleProxyPool:
    """A minimal proxy pool."""

    def __init__(self, proxy_list=None):
        self.proxies = proxy_list or []
        self.proxy_stats = defaultdict(lambda: {'success': 0, 'fail': 0, 'last_used': 0})
        # Rotation strategy
        self.strategy = 'round-robin'  # round robin
        # self.strategy = 'random'     # random choice
        # self.strategy = 'weighted'   # weighted by success rate

    def add_proxy(self, proxy):
        """Add a proxy to the pool."""
        if proxy not in self.proxies:
            self.proxies.append(proxy)

    def remove_proxy(self, proxy):
        """Remove a proxy from the pool."""
        if proxy in self.proxies:
            self.proxies.remove(proxy)
            self.proxy_stats.pop(proxy, None)

    def get_proxy(self):
        """Pick a proxy according to the configured strategy."""
        if not self.proxies:
            return None
        if self.strategy == 'round-robin':
            # Rotate the list: take the head, move it to the tail
            proxy = self.proxies[0]
            self.proxies = self.proxies[1:] + [proxy]
            return proxy
        elif self.strategy == 'random':
            return random.choice(self.proxies)
        elif self.strategy == 'weighted':
            # Weighted selection: higher success rate -> higher weight.
            # Build cumulative weights, then pick by a random threshold.
            total_weight = 0
            weighted_proxies = []
            for proxy in self.proxies:
                stats = self.proxy_stats[proxy]
                success_rate = stats['success'] / max(1, stats['success'] + stats['fail'])
                weight = max(1, int(success_rate * 100))  # weight in 1-100
                total_weight += weight
                weighted_proxies.append((proxy, total_weight))
            if not weighted_proxies:
                return None
            rand = random.randint(1, total_weight)
            for proxy, cumulative_weight in weighted_proxies:
                if rand <= cumulative_weight:
                    return proxy
        return self.proxies[0]

    def record_result(self, proxy, success=True):
        """Record the outcome of using a proxy (the defaultdict creates entries on demand)."""
        if success:
            self.proxy_stats[proxy]['success'] += 1
        else:
            self.proxy_stats[proxy]['fail'] += 1
        self.proxy_stats[proxy]['last_used'] = time.time()

    def get_stats(self):
        """Return statistics for every proxy in the pool."""
        stats = []
        for proxy in self.proxies:
            proxy_stat = self.proxy_stats[proxy]
            total = proxy_stat['success'] + proxy_stat['fail']
            success_rate = proxy_stat['success'] / max(1, total) * 100
            stats.append({
                'proxy': proxy,
                'success': proxy_stat['success'],
                'fail': proxy_stat['fail'],
                'success_rate': f"{success_rate:.1f}%",
                'last_used': time.strftime('%Y-%m-%d %H:%M:%S',
                                           time.localtime(proxy_stat['last_used']))
            })
        return stats

# Example usage
proxy_pool = SimpleProxyPool([
    'proxy1.example.com:8080',
    'proxy2.example.com:8080',
    'proxy3.example.com:8080',
])

# Pick a strategy
proxy_pool.strategy = 'weighted'

# Simulate using the pool
session = requests.Session()
for i in range(10):
    proxy = proxy_pool.get_proxy()
    try:
        response = session.get(
            'http://httpbin.org/ip',
            proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
            timeout=5
        )
        proxy_pool.record_result(proxy, success=True)
        print(f"Request {i+1}: proxy {proxy} succeeded")
    except Exception as e:
        proxy_pool.record_result(proxy, success=False)
        print(f"Request {i+1}: proxy {proxy} failed - {e}")

# Inspect the statistics
print("\nProxy pool stats:")
for stat in proxy_pool.get_stats():
    print(f"{stat['proxy']}: success rate {stat['success_rate']} "
          f"(success: {stat['success']}, fail: {stat['fail']})")
```
```python
import threading
import time
from datetime import datetime
from queue import Empty, Queue

import requests

class AdvancedProxyPool:
    """Advanced proxy pool with automatic health checks and maintenance."""

    def __init__(self,
                 initial_proxies=None,
                 health_check_url='http://httpbin.org/ip',
                 check_interval=300,  # 5 minutes
                 max_failures=3):
        self.proxies = {}  # proxy dict: {proxy: {'health': True/False, ...}}
        self.proxy_queue = Queue()  # queue of healthy proxies
        self.health_check_url = health_check_url
        self.check_interval = check_interval
        self.max_failures = max_failures
        self.lock = threading.Lock()
        self.session = requests.Session()

        # Register the initial proxies
        if initial_proxies:
            for proxy in initial_proxies:
                self.add_proxy(proxy)

        # Start the background health-check thread
        self.health_check_thread = threading.Thread(
            target=self._health_check_worker,
            daemon=True
        )
        self.health_check_thread.start()

    def add_proxy(self, proxy):
        """Register a proxy with the pool."""
        with self.lock:
            if proxy in self.proxies:
                return
            self.proxies[proxy] = {
                'health': False,  # starts unhealthy until checked
                'success_count': 0,
                'failure_count': 0,
                'last_check': None,
                'last_success': None,
                'response_time': None,
                'added_at': datetime.now()
            }
        # Check the new proxy immediately (outside the lock -- the check
        # re-acquires it, and threading.Lock is not reentrant)
        self._check_proxy_health(proxy)

    def get_proxy(self, timeout=10):
        """Fetch a healthy proxy, or None if none becomes available in time."""
        try:
            proxy = self.proxy_queue.get(timeout=timeout)
            # Put it back so proxies are used in rotation
            self.proxy_queue.put(proxy)
            return proxy
        except Empty:
            return None

    def _check_proxy_health(self, proxy):
        """Check a single proxy's health."""
        try:
            start_time = time.time()
            response = self.session.get(
                self.health_check_url,
                proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
                timeout=10
            )
            response_time = time.time() - start_time
            if response.status_code == 200:
                with self.lock:
                    self.proxies[proxy].update({
                        'health': True,
                        'success_count': self.proxies[proxy]['success_count'] + 1,
                        'last_check': datetime.now(),
                        'last_success': datetime.now(),
                        'response_time': response_time
                    })
                # Add to the healthy queue if it isn't already there
                if proxy not in list(self.proxy_queue.queue):
                    self.proxy_queue.put(proxy)
                return True
        except Exception:
            pass

        # The check failed
        with self.lock:
            self.proxies[proxy].update({
                'health': False,
                'failure_count': self.proxies[proxy]['failure_count'] + 1,
                'last_check': datetime.now()
            })
            failure_count = self.proxies[proxy]['failure_count']
        # Past the failure threshold, drop the proxy from the queue
        # (it stays in the dict for monitoring)
        if failure_count >= self.max_failures:
            self._remove_from_queue(proxy)
        return False

    def _remove_from_queue(self, proxy):
        """Remove a proxy from the healthy queue."""
        # Rebuild the queue without the given proxy
        new_queue = Queue()
        while not self.proxy_queue.empty():
            item = self.proxy_queue.get()
            if item != proxy:
                new_queue.put(item)
        self.proxy_queue = new_queue

    def _health_check_worker(self):
        """Background thread that re-checks all proxies periodically."""
        while True:
            time.sleep(self.check_interval)
            proxies_to_check = list(self.proxies.keys())
            for proxy in proxies_to_check:
                self._check_proxy_health(proxy)
            self._log_status()

    def _log_status(self):
        """Log the pool's current status."""
        with self.lock:
            total = len(self.proxies)
            healthy = sum(1 for p in self.proxies.values() if p['health'])
            queue_size = self.proxy_queue.qsize()
        print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] "
              f"pool status: total={total}, healthy={healthy}, queued={queue_size}")

    def get_status(self):
        """Return a snapshot of the pool's state."""
        with self.lock:
            status = {
                'total_proxies': len(self.proxies),
                'healthy_proxies': sum(1 for p in self.proxies.values() if p['health']),
                'queue_size': self.proxy_queue.qsize(),
                'proxies': {}
            }
            for proxy, info in self.proxies.items():
                status['proxies'][proxy] = {
                    'health': info['health'],
                    'success_count': info['success_count'],
                    'failure_count': info['failure_count'],
                    'last_success': info['last_success'].strftime('%Y-%m-%d %H:%M:%S')
                                    if info['last_success'] else None,
                    'response_time': info['response_time']
                }
            return status

# Example usage
if __name__ == '__main__':
    # Create the pool
    proxy_pool = AdvancedProxyPool(
        initial_proxies=[
            'proxy1.example.com:8080',
            'proxy2.example.com:8080',
            'proxy3.example.com:8080',
        ],
        health_check_url='http://httpbin.org/ip',
        check_interval=60,  # check every minute
        max_failures=3
    )

    # Give the initial health checks time to finish
    time.sleep(5)

    # Send requests through the pool
    session = requests.Session()
    for i in range(5):
        proxy = proxy_pool.get_proxy(timeout=5)
        if proxy:
            try:
                response = session.get(
                    'http://httpbin.org/headers',
                    proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
                    timeout=10
                )
                print(f"Request {i+1}: proxy {proxy} succeeded "
                      f"(status: {response.status_code})")
            except Exception as e:
                print(f"Request {i+1}: proxy {proxy} failed - {e}")
        else:
            print(f"Request {i+1}: no healthy proxy available")
        time.sleep(2)

    # Inspect the pool
    print("\nDetailed pool status:")
    status = proxy_pool.get_status()
    print(f"Total proxies: {status['total_proxies']}")
    print(f"Healthy proxies: {status['healthy_proxies']}")
    print(f"Queue size: {status['queue_size']}")

    # Keep running to observe the periodic health checks
    try:
        while True:
            time.sleep(10)
    except KeyboardInterrupt:
        print("\nDone")
```
```python
import requests
from bs4 import BeautifulSoup

class FreeProxyScraper:
    """Scraper for free proxy lists."""

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def scrape_proxies(self):
        """Collect proxies from several free proxy sites."""
        proxies = []
        sources = [
            self._scrape_free_proxy_list,
            self._scrape_proxyscrape,
            self._scrape_geonode,
        ]
        for source in sources:
            try:
                proxies.extend(source())
            except Exception as e:
                print(f"Scrape failed {source.__name__}: {e}")
        # Deduplicate
        unique_proxies = list(set(proxies))
        print(f"Collected {len(proxies)} proxies, {len(unique_proxies)} after dedup")
        return unique_proxies

    def _scrape_free_proxy_list(self):
        """Scrape free-proxy-list.net."""
        url = 'https://free-proxy-list.net/'
        response = self.session.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        proxies = []
        table = soup.find('table', {'id': 'proxylisttable'})
        if table:
            rows = table.find_all('tr')[1:]  # skip the header row
            for row in rows:
                cols = row.find_all('td')
                if len(cols) >= 2:
                    ip = cols[0].text.strip()
                    port = cols[1].text.strip()
                    proxies.append(f'{ip}:{port}')
        return proxies

    def _scrape_proxyscrape(self):
        """Scrape proxyscrape.com."""
        url = ('https://api.proxyscrape.com/v2/?request=getproxies&protocol=http'
               '&timeout=10000&country=all&ssl=all&anonymity=all')
        response = self.session.get(url, timeout=10)
        # One IP:PORT per line
        proxies = []
        for line in response.text.strip().split('\n'):
            if ':' in line:
                proxies.append(line.strip())
        return proxies

    def _scrape_geonode(self):
        """Scrape geonode.com."""
        url = ('https://proxylist.geonode.com/api/proxy-list'
               '?limit=100&page=1&sort_by=lastChecked&sort_type=desc')
        response = self.session.get(url, timeout=10)
        data = response.json()
        proxies = []
        for proxy in data.get('data', []):
            ip = proxy.get('ip')
            port = proxy.get('port')
            if ip and port:
                proxies.append(f'{ip}:{port}')
        return proxies

    def test_and_filter(self, proxies, test_url='http://httpbin.org/ip', timeout=5):
        """Test the proxies and keep the working ones."""
        valid_proxies = []
        print(f"Testing {len(proxies)} proxies...")
        for i, proxy in enumerate(proxies):
            try:
                response = requests.get(
                    test_url,
                    proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
                    timeout=timeout
                )
                if response.status_code == 200:
                    valid_proxies.append(proxy)
                    print(f"✓ [{i+1}/{len(proxies)}] proxy {proxy} works")
                else:
                    print(f"✗ [{i+1}/{len(proxies)}] proxy {proxy} unusable")
            except Exception:
                print(f"✗ [{i+1}/{len(proxies)}] proxy {proxy} unusable")
        print(f"Done: {len(proxies)} tested, {len(valid_proxies)} valid")
        return valid_proxies

# Example usage
if __name__ == '__main__':
    scraper = FreeProxyScraper()

    # Collect proxies
    proxies = scraper.scrape_proxies()

    # Test them (cap the count so this stays quick)
    test_proxies = proxies[:20]  # only the first 20
    valid_proxies = scraper.test_and_filter(test_proxies)

    # Use the valid ones
    if valid_proxies:
        print("\nUsable free proxies:")
        for proxy in valid_proxies:
            print(f"  {proxy}")

        # Feed them into the SimpleProxyPool defined earlier
        proxy_pool = SimpleProxyPool(valid_proxies)
        session = requests.Session()
        for i in range(3):
            proxy = proxy_pool.get_proxy()
            try:
                response = session.get(
                    'http://httpbin.org/ip',
                    proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
                    timeout=10
                )
                print(f"Proxy {proxy} succeeded, IP: {response.json().get('origin')}")
            except Exception as e:
                print(f"Proxy {proxy} failed: {e}")
    else:
        print("No usable free proxies found")
```
| Problem | Likely causes | Fixes |
|---|---|---|
| Proxy connection times out | Proxy server down, network issues, misconfigured proxy | Check the address and port, increase the timeout, try another proxy |
| Proxy authentication fails | Wrong username/password, unsupported auth scheme, proxy misconfiguration | Verify the credentials; check which auth schemes the proxy supports |
| Some sites unreachable | Proxy blocked by the target site, poor IP reputation, no HTTPS support | Switch proxies, use an elite (high-anonymity) proxy, check protocol support |
| Very slow | Overloaded proxy, high network latency, distant proxy location | Pick a geographically closer proxy; rotate through a proxy pool |
| SOCKS proxy not working | PySocks not installed, protocol version mismatch, proxy failure | Run `pip install PySocks`; check the protocol version |
| Local proxy not applied | Settings overridden, conflicting environment variables, proxy not running | Check setting precedence; make sure the proxy server is running |
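
Several of these fixes boil down to "try another proxy". A minimal failover sketch (the proxy list is a placeholder) that walks a list until one proxy answers:

```python
import requests

def get_with_failover(url, proxy_list, timeout=5):
    """Try each proxy in turn; return the first successful response."""
    last_error = None
    for proxy in proxy_list:
        try:
            return requests.get(
                url,
                proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
                timeout=timeout
            )
        except requests.RequestException as exc:
            last_error = exc
            print(f"proxy {proxy} failed: {exc}")
    raise RuntimeError(f"all proxies failed, last error: {last_error}")

# Hypothetical proxies -- replace with real ones
response = get_with_failover('http://httpbin.org/ip',
                             ['proxy1.example.com:8080', 'proxy2.example.com:8080'])
print(response.json())
```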
```python
import socket
import time

import requests

def debug_proxy_connection(proxy_url, test_urls=None):
    """Diagnose proxy connection problems step by step."""
    print(f"Debugging proxy: {proxy_url}")
    print("=" * 50)

    # Parse the proxy URL
    from urllib.parse import urlparse
    parsed = urlparse(proxy_url)
    print(f"Scheme: {parsed.scheme}")
    print(f"Host: {parsed.hostname}")
    print(f"Port: {parsed.port}")
    print(f"Username: {parsed.username}")
    print(f"Password: {'*' * len(parsed.password) if parsed.password else 'none'}")

    # 1. Test raw network connectivity
    print("\n1. Network connectivity:")
    try:
        socket.create_connection((parsed.hostname, parsed.port), timeout=5)
        print(f"✓ Can reach proxy server {parsed.hostname}:{parsed.port}")
    except Exception as e:
        print(f"✗ Cannot reach proxy server: {e}")
        return

    # 2. Test proxy functionality
    print("\n2. Proxy functionality:")
    test_urls = test_urls or [
        'http://httpbin.org/ip',
        'https://httpbin.org/ip',
        'http://icanhazip.com',
    ]
    session = requests.Session()
    for url in test_urls:
        try:
            start_time = time.time()
            response = session.get(url, proxies={'http': proxy_url, 'https': proxy_url},
                                   timeout=10)
            response_time = time.time() - start_time
            if response.status_code == 200:
                print(f"✓ {url}: OK ({response_time:.2f}s)")
                # IP-echo endpoints show the IP seen through the proxy
                if 'ip' in url or 'icanhazip' in url:
                    print(f"  proxy IP: {response.text.strip()}")
            else:
                print(f"✗ {url}: failed - status {response.status_code}")
        except requests.exceptions.ProxyError as e:
            print(f"✗ {url}: proxy error - {e}")
        except requests.exceptions.ConnectTimeout as e:
            print(f"✗ {url}: connect timeout - {e}")
        except requests.exceptions.SSLError as e:
            print(f"✗ {url}: SSL error - {e}")
        except Exception as e:
            print(f"✗ {url}: unexpected error - {e}")

    # 3. Direct connection, for comparison
    print("\n3. Direct connection (for comparison):")
    for url in test_urls[:1]:  # only the first URL
        try:
            start_time = time.time()
            response = session.get(url, timeout=10)
            response_time = time.time() - start_time
            if response.status_code == 200:
                print(f"✓ {url}: OK ({response_time:.2f}s)")
                print(f"  direct IP: {response.text.strip()}")
        except Exception as e:
            print(f"✗ {url}: failed - {e}")

    # 4. Identify the proxy's anonymity level
    print("\n4. Proxy anonymity:")
    # Infer the proxy type from the headers it forwards
    try:
        response = session.get(
            'http://httpbin.org/headers',
            proxies={'http': proxy_url, 'https': proxy_url},
            timeout=10
        )
        headers = response.json().get('headers', {})
        # Check for common proxy-revealing headers
        proxy_headers = ['Via', 'X-Forwarded-For', 'X-Real-IP', 'Proxy-Connection']
        for header in proxy_headers:
            if header in headers:
                print(f"  proxy header found: {header} = {headers[header]}")
        if any(h in headers for h in proxy_headers):
            print("  → proxy type: transparent")
        else:
            print("  → proxy type: anonymous or elite")
    except Exception as e:
        print(f"  proxy type detection failed: {e}")

    print("\nDebugging finished")

# Example usage
if __name__ == '__main__':
    proxies_to_test = [
        'http://10.10.1.10:3128',
        'http://user:pass@proxy.example.com:8080',
        'socks5://127.0.0.1:1080',
    ]
    for proxy in proxies_to_test:
        debug_proxy_connection(proxy)
        print("\n" + "=" * 50 + "\n")
```
```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# Tune the connection pool
adapter = HTTPAdapter(
    pool_connections=10,  # number of connection pools to cache (one per host)
    pool_maxsize=100,     # maximum connections kept per pool
    max_retries=3,        # retry count
    pool_block=False      # whether to block while waiting for a connection
)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Separate timeouts for each phase of the request
proxies = {'http': 'http://proxy:8080'}
response = requests.get(
    'http://example.com',
    proxies=proxies,
    timeout=(3.05, 27)  # (connect timeout, read timeout)
)

# Note: requests has no Session-level default timeout -- assigning
# session.timeout has no effect. Pass timeout= on every request,
# or wrap the Session as in the sketch below.
```
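
Since requests itself offers no Session-wide default timeout, a small subclass can inject one. A minimal sketch (the `TimeoutSession` name and the default value are our own, not a requests API):

```python
import requests

class TimeoutSession(requests.Session):
    """Session that applies a default timeout unless one is given."""

    def __init__(self, timeout=10):
        super().__init__()
        self.default_timeout = timeout

    def request(self, method, url, **kwargs):
        # Only fill in the timeout when the caller did not set one
        kwargs.setdefault('timeout', self.default_timeout)
        return super().request(method, url, **kwargs)

session = TimeoutSession(timeout=10)
response = session.get('http://httpbin.org/ip')  # uses the 10s default
print(response.status_code)
```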
The requests library offers powerful, flexible proxy support covering HTTP, HTTPS, and SOCKS proxies along with several authentication methods. Key takeaways:

- Use the `proxies` parameter to set proxies for a single request or for a whole Session.
- SOCKS proxies require the extra `PySocks` dependency and support the SOCKS4/SOCKS5 protocols.

Whether you need to reach restricted content or are building a web crawler, properly configured proxies are key to keeping your application running reliably. Always follow the target site's terms of service and use proxy services responsibly.