Implementing an IP Proxy Pool for a Python Crawler


Multithreaded crawling, or simply getting past anti-crawler measures, often requires sending requests through proxy IPs. Below is a basic way to do that.

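Many websites publish lists of proxy IPs. Reliability is more or less proportional to what you pay: the free proxies offered by sites in China are mostly unusable, so a dependable proxy there means a paid one. Free lists hosted abroad are somewhat better, and some of those proxies are reasonably reliable.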

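After a quick search I picked one site. I had planned to scrape the IPs from its pages by hand, but it turns out a ready-made txt file can be downloaded directly: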

http://www.thebigproxylist.com/

After downloading the file, try fetching the Baidu home page through each proxy in turn:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import urllib.request

# The downloaded file holds one "host:port" entry per line
with open("c:\\temp\\thebigproxylist-17-12-20.txt", "r") as fp:
    lines = fp.readlines()

for ip in lines:
    ip = ip.strip()  # drop the trailing newline
    if not ip:
        continue
    try:
        print("Current proxy IP " + ip)
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        # Install globally so that urlopen() goes through this proxy
        urllib.request.install_opener(opener)
        url = "http://www.baidu.com"
        data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        print("OK")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

The output:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP 137.74.168.174:80
OK
-----------------------------
Current proxy IP 103.28.161.68:8080
OK
-----------------------------
Current proxy IP 91.151.106.127:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 177.136.252.7:3128
-----------------------------
Current proxy IP 47.89.22.200:80
OK
-----------------------------
Current proxy IP 118.69.61.57:8888
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 192.241.190.167:8080
OK
-----------------------------
Current proxy IP 185.124.112.130:80
OK
-----------------------------
Current proxy IP 83.65.246.181:3128
OK
-----------------------------
Current proxy IP 79.137.42.124:3128
OK
-----------------------------
Current proxy IP 95.0.217.32:8080
-----------------------------
Current proxy IP 104.131.94.221:8080
OK
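The loop above installs every opener globally and calls urlopen() without a timeout, so a dead proxy can stall the run for a long time. A minimal sketch of a per-proxy check follows. The name check_proxy and the 5-second default timeout are assumptions made for illustration and are not part of the original script.

```python
import urllib.request


def check_proxy(proxy, url="http://www.baidu.com", timeout=5):
    """Return True if `url` can be fetched through the HTTP proxy `proxy`.

    `proxy` is a "host:port" string. The function name and the default
    timeout are illustrative assumptions, not part of the original script.
    """
    handler = urllib.request.ProxyHandler({"http": proxy})
    # Build a private opener instead of calling install_opener(), so the
    # process-wide default opener is left untouched.
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return True
    except Exception:
        # Any failure (refused connection, timeout, HTTP error) counts as unusable
        return False
```

Calling check_proxy("137.74.168.174:80") would test the first address from the output above; whether that proxy is still alive is not something this sketch can promise.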

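This approach only suits a fairly stable IP source, though. If the proxies are unstable, a downloaded text file can go stale quickly, so it is better to fetch the latest addresses dynamically. Many sites provide an API that can be queried in real time.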
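One way to keep a list fresh is to cache it and fetch it again once it is older than some limit. The sketch below assumes a hypothetical RefreshingProxyList class. The fetch callable stands in for whatever source is used (a file download, an API call), and the 300-second limit is an arbitrary placeholder.

```python
import time


class RefreshingProxyList:
    """Cache a proxy list and re-fetch it once it is older than max_age seconds.

    Hypothetical helper for illustration; `fetch` is any zero-argument
    callable that returns a list of "host:port" strings.
    """

    def __init__(self, fetch, max_age=300, clock=time.monotonic):
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock        # injectable so the class is easy to test
        self._proxies = []
        self._fetched_at = None    # None means "never fetched"

    def get(self):
        """Return the cached list, refreshing it first if it is missing or stale."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            self._proxies = list(self._fetch())
            self._fetched_at = now
        return list(self._proxies)
```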

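Using the same site again, this time we call its API. The request has to be disguised as coming from a browser before the site will serve it.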

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import urllib.request

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install globally
urllib.request.install_opener(opener)
data = urllib.request.urlopen("http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf").read().decode("utf8")
ippool = data.split("\n")

for ip in ippool:
    # Keep only the first comma-separated field of each line
    ip = ip.split(",")[0].strip()
    if not ip:
        continue
    try:
        print("Current proxy IP " + ip)
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        url = "http://www.baidu.com"
        page = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        print("OK")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

The output:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP 213.233.57.134:80
HTTP Error 403: Forbidden
-----------------------------
Current proxy IP 144.76.81.79:3128
OK
-----------------------------
Current proxy IP 45.55.132.29:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 180.254.133.124:8080
OK
-----------------------------
Current proxy IP 5.196.215.231:3128
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 177.99.175.195:53281
HTTP Error 503: Service Unavailable
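The script splits the API response on newlines and keeps the first comma-separated field of each line. That step can be isolated into a small function that drops blank lines, such as the empty entry produced by a trailing newline. In this sketch parse_proxy_list is a hypothetical name, and the sample response is made up for illustration, since the fields after the first comma are not described here.

```python
def parse_proxy_list(text):
    """Extract "host:port" strings from a newline-separated response.

    Only the first comma-separated field of each line is kept.
    """
    proxies = []
    for line in text.splitlines():
        address = line.split(",")[0].strip()
        if address:  # skip blank lines
            proxies.append(address)
    return proxies


# Made-up sample; the fields after the first comma are placeholders.
sample = "144.76.81.79:3128,x,y\n180.254.133.124:8080,x,y\n\n"
print(parse_proxy_list(sample))
# prints ['144.76.81.79:3128', '180.254.133.124:8080']
```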

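Reading the file and testing the proxies one at a time in a plain for loop is really slow, so I tried switching to multiple threads, which is much faster: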

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import queue
import threading
import urllib.request

# Number of threads
n_thread = 10
# Create the queue (named job_queue so it does not shadow the queue module)
job_queue = queue.Queue()


class ThreadClass(threading.Thread):
    def __init__(self, job_queue):
        # Initialize the base Thread once; daemon threads exit with the main thread
        super().__init__(daemon=True)
        # Queue this thread takes its jobs from
        self.queue = job_queue

    def run(self):
        while True:
            # Get a job from the queue
            host = self.queue.get()
            print(self.name + ":" + host)
            try:
                proxy = urllib.request.ProxyHandler({"http": host})
                opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
                # Use this thread's own opener: install_opener() sets a single
                # global opener that the other threads would overwrite
                url = "http://www.baidu.com"
                data = opener.open(url).read().decode("utf-8", "ignore")
                print("OK")
                print("-----------------------------")
            except Exception as err:
                print(err)
                print("-----------------------------")
            # Signal to the queue that the job is done
            self.queue.task_done()


# Create and start the worker threads
for i in range(n_thread):
    t = ThreadClass(job_queue)
    t.start()

# Read the file line by line
with open("c:\\temp\\thebigproxylist-17-12-20.txt", "r") as hostfile:
    for line in hostfile:
        host = line.strip()
        if host:
            # Put each proxy on the queue
            job_queue.put(host)

# Wait on the queue until everything has been processed
job_queue.join()
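The same fan-out can also be written with the standard library's concurrent.futures.ThreadPoolExecutor, which manages the work queue and the worker threads internally. This is an alternative sketch, not the approach used above: filter_working is a hypothetical name, and check is any callable that takes a "host:port" string and returns True or False.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=10):
    """Run check(proxy) concurrently and return the proxies for which it is true.

    Results keep the input order because executor.map yields in input order.
    """
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


# Demonstration with a stand-in check that needs no network access:
# it "accepts" only proxies on port 8080.
candidates = ["103.28.161.68:8080", "47.89.22.200:80", "95.0.217.32:8080"]
print(filter_working(candidates, lambda p: p.endswith(":8080")))
# prints ['103.28.161.68:8080', '95.0.217.32:8080']
```

In real use, check would be a function that fetches a page through the proxy. It should catch its own exceptions and return False, because an exception raised inside check propagates out of executor.map and aborts the whole run.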
