肥宅钓鱼网
当前位置: 首页 钓鱼百科

python爬虫库详解(python爬虫必备构建代理IP池)

时间:2023-08-10 作者: 小编 阅读量: 1 栏目名: 钓鱼百科

python爬虫必备构建代理IP池如果一个固定的ip在短暂的时间内,快速大量的访问一个网站,很容易被服务器查出异常从而被封掉ip代理IP简单的说,就是通过ip代理,从不同的ip进行访问,这样就不会被封掉ip了本次项目就是自己动手构建一。

python爬虫库详解?如果一个固定的ip在短暂的时间内,快速大量的访问一个网站,很容易被服务器查出异常从而被封掉ip代理IP简单的说,就是通过ip代理,从不同的ip进行访问,这样就不会被封掉ip了本次项目就是自己动手构建一个免费的代理ip池,接下来我们就来聊聊关于python爬虫库详解?以下内容大家不妨参考一二希望能帮到您!

python爬虫库详解

如果一个固定的ip在短暂的时间内,快速大量的访问一个网站,很容易被服务器查出异常从而被封掉ip。代理IP简单的说,就是通过ip代理,从不同的ip进行访问,这样就不会被封掉ip了。本次项目就是自己动手构建一个免费的代理ip池。


'''#1分析目标网页(快代理,一个获得免费代理IP的网站),确定爬取的url路径,headers参数'''url = 'https://www.kuaidaili.com/free/inha/1/'headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

'''#1分析目标网页(快代理,一个获得免费代理IP的网站),确定爬取的url路径,headers参数'''url = 'https://www.kuaidaili.com/free/inha/1/'headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

In [5]:

#2发送请求 ---》requests模拟游览器发送请求,获取响应数据response=requests.get(url=url,headers=headers).text#print(response)

In [6]:

#3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip=re.findall(r"\"IP\">(.*?)</td>",response)#print(len(all_ip))all_type=re.findall(r"\"类型\">(.*?)</td>",response)#print(len(all_type))all_port=re.findall(r"\"PORT\">(.*?)</td>",response)all_data=zip(all_type,all_ip,all_port)for i in enumerate(all_data):print(i)

(0, ('HTTP', '123.55.114.77', '9999'))(1, ('HTTP', '36.248.133.117', '9999'))(2, ('HTTP', '123.163.117.62', '9999'))(3, ('HTTP', '1.197.204.52', '9999'))(4, ('HTTP', '175.155.140.34', '1133'))(5, ('HTTP', '115.218.5.120', '9000'))(6, ('HTTP', '182.87.38.156', '9000'))(7, ('HTTP', '60.168.207.253', '1133'))(8, ('HTTP', '114.99.13.141', '1133'))(9, ('HTTP', '119.108.172.169', '9000'))(10, ('HTTP', '36.248.133.81', '9999'))(11, ('HTTP', '114.239.29.206', '9999'))(12, ('HTTP', '1.196.177.81', '9999'))(13, ('HTTP', '175.44.109.13', '9999'))(14, ('HTTP', '113.124.86.220', '9999'))

In [14]:

#上面是爬取一页代理IP的代码,我们进行修改,用一个for循环爬取多页all_datas=[] #用一个变量接收各个页的ipimport timefor page in range(2):#先爬取三页试试url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page 1) #url需要更改的地方用{}.foemat进行传参headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}#2发送请求 ---》requests模拟游览器发送请求,获取响应数据response=requests.get(url=url,headers=headers).text#print(response)#3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip=re.findall(r"\"IP\">(.*?)</td>",response)#print(len(all_ip))all_type=re.findall(r"\"类型\">(.*?)</td>",response)#print(len(all_type))all_port=re.findall(r"\"PORT\">(.*?)</td>",response)all_data=zip(all_type,all_ip,all_port)for i in enumerate(all_data):#print(i)all_datas.append(i) #将变量all_datas依次接收iptime.sleep(1)#设置休眠时间一秒,防止请求服务器太频繁被服务器拒绝请求

In [15]:

print(all_datas)

[(0, ('HTTP', '123.55.114.77', '9999')), (1, ('HTTP', '36.248.133.117', '9999')), (2, ('HTTP', '123.163.117.62', '9999')), (3, ('HTTP', '1.197.204.52', '9999')), (4, ('HTTP', '175.155.140.34', '1133')), (5, ('HTTP', '115.218.5.120', '9000')), (6, ('HTTP', '182.87.38.156', '9000')), (7, ('HTTP', '60.168.207.253', '1133')), (8, ('HTTP', '114.99.13.141', '1133')), (9, ('HTTP', '119.108.172.169', '9000')), (10, ('HTTP', '36.248.133.81', '9999')), (11, ('HTTP', '114.239.29.206', '9999')), (12, ('HTTP', '1.196.177.81', '9999')), (13, ('HTTP', '175.44.109.13', '9999')), (14, ('HTTP', '113.124.86.220', '9999')), (0, ('HTTP', '175.42.68.174', '9999')), (1, ('HTTP', '182.34.34.20', '9999')), (2, ('HTTP', '113.194.140.60', '9999')), (3, ('HTTP', '113.194.48.139', '9999')), (4, ('HTTP', '123.55.114.42', '9999')), (5, ('HTTP', '175.43.151.240', '9999')), (6, ('HTTP', '123.169.118.141', '9999')), (7, ('HTTP', '123.55.101.33', '9999')), (8, ('HTTP', '1.197.203.69', '9999')), (9, ('HTTP', '120.83.122.236', '9999')), (10, ('HTTP', '36.248.132.246', '9999')), (11, ('HTTP', '58.22.177.123', '9999')), (12, ('HTTP', '36.248.133.68', '9999')), (13, ('HTTP', '121.232.199.252', '9000')), (14, ('HTTP', '114.99.15.142', '1133'))]

In [16]:

'''用一个方法检测代理ip,去掉一些质量差的ip'''def check_ip(all_datas):headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}good_id=[]for i, (type, ip, port) in all_datas:id_dict = {}#print('{}:{}:{}'.format(type, ip, port))id_dict[type] = ip':'port#print(id_dict)try:#获取百度的页面,如果响应时间在0.1秒内,则认为是好用的IPres=requests.get('https://www.baidu.com',headers=headers,proxies=id_dict,timeout=0.1)if res.status_code==200:good_id.append(id_dict)except Exception as error:print(id_dict,error)return good_idgood_id=check_ip(all_datas)print('好用的ip有:',good_id)

好用的ip有: [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}]

In [19]:

#将获取IP的代码也装封成一个函数,根据需要爬取的页数进行传参def get_id(pages):all_datas = []for page in range(pages):url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page1)headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}# 2发送请求 ---》requests模拟游览器发送请求,获取响应数据response = requests.get(url=url, headers=headers).text# print(response)# 3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip = re.findall(r"\"IP\">(.*?)</td>", response)# print(len(all_ip))all_type = re.findall(r"\"类型\">(.*?)</td>", response)# print(len(all_type))all_port = re.findall(r"\"PORT\">(.*?)</td>", response)all_data = zip(all_type, all_ip, all_port)for i in enumerate(all_data):# print(i)all_datas.append(i)time.sleep(1)print("获取的ip有{}个".format(len(all_datas)))return all_datasdef check_ip(page=4):'''检测代理ip的方法'''all_datas=get_id(page)headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}good_id=[]for i, (type, ip, port) in all_datas:id_dict = {}id_dict[type] = ip':'porttry:res=requests.get('https://www.baidu.com',headers=headers,proxies=id_dict,timeout=0.1)if res.status_code==200:good_id.append(id_dict)except Exception as error:print(id_dict,error)return good_id

In [20]:

#可以运行一个函数check_ip来获得好用的IP,方便在其他模块进行调用,因为获取代理IP不是爬虫的最终目的#需要在其他代码里获取代理IP时,直接从获取代理IP的模块中导入 这个函数即可if __name__=='__main__':good_ID=check_ip(4)print("好的代理IP",good_ID)

获取的ip有60个好的代理IP [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.23:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.209:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.104.138.12:3000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.148.211:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.198.72.73:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.29.17:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.123.89:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '140.255.184.238:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.27.180:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.167:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.37:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.221.227:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.108.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.222:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.102.66:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.185:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.35.160.131:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.249.53.36:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '112.111.217.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '139.155.41.15:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.7.209:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '163.125.112.207:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.15.151:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.143.39:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '221.224.136.211:35101', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.250.156.31:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.101.237.216:9999', 'klab_external_proxy_service_port': '80'}]

    推荐阅读
  • 富贵树种植技巧(什么时间种最好)

    富贵树种植技巧种植的时间:栽培最佳时间在10月下旬降霜后或是4月中上旬发芽前进行;移栽时间在春、夏、秋三季均可,移栽时若在生长期需要带土移栽。5月上旬以前或10月中旬以后是引种根最佳时间。注意事项:富贵树对肥料的要求不高,但每20~25天施肥一次,氮、磷、钾复合肥会使其生长的更加旺盛。富贵树要保持充足的水分,每两天必须浇灌一次,每次浇灌时最好把土全部浇湿润,水分的充裕,可以提高贵树的光合作用。

  • 杜鹃花能浇淘米水吗(淘米水可以直接浇杜鹃花吗)

    杜鹃花能浇淘米水吗?接下来我们就一起去研究一下吧!杜鹃花能浇淘米水吗能浇,但不能直接使用淘米水不仅仅是针对杜鹃花,对于任何植物我们都不能使用淘米水直接灌溉,因为没有处理过的淘米水直接浇灌,那么淘米水就会在土壤中发酵,发酵会导致发热从而烧坏植物的根部,对植物的影响非常大的,想要用淘米水浇灌,需要先进行处理才行。

  • 一起倒计时跨年文案(适合跨年发的微信朋友圈文案)

    我只希望你在新的一年里平平安安,快快乐乐。新年和往常一样,愿望也不一定非要在过年期盼。2020全面小康的漏网之鱼。凡是过往,皆为序章。

  • 车险买完后怎么退(车险怎么退)

    随后签上自己的名字或是盖章,完成之后把退保申请报告出具给车险公司。一般状况来说,商业车险全是可以退保的,车险公司会测算应当退回的保费。可是要想退保却并不容易,务必符合2个标准。首先是车险的保险单务必仍在有效期限,次之在有效期内被保险的车子并未存有报案或是理赔。

  • 梧怎么组词(梧的意思)

    接下来我们就一起去了解一下吧!梧怎么组词梧字组词:魁梧、梧桐、枝梧、碧梧、鸿梧、开梧、梧楸、梧子、抵梧、梧丘、强梧。“吾”义为“中立的”,引申为“正直的”。“木”与“吾”联合起来表示“一种树干高而直的树木”。魁梧,汉语词汇,拼音是kuíwu,形容体貌高大雄伟,亦作“魁吾”。

  • 趣味物理实验高中(观生活物理现象)

    接着,王老师又以生活中烧开水的现象要求同学们进行观察和讨论:围绕生活中烧开水冒“白气”这一生活现象展开多维度探究,进行实验,在这一过程中观察、思考、感知、感悟。最后,王老师和同学们共同合作完成一个与凭直觉判断结论相反的实验,一串串的物理现象激起学生的好奇心和求知欲,激活学生思维的浪花,学生在观察中品味物理学之美,享其探究之乐,悟其物理之理!

  • 显微镜的原理和一些细节(显微镜的原理和一些细节)

    目镜的放大倍数越高,目镜的焦距越短,两片透镜之间的距离也就越小,目镜也就越短2、为什么物镜的放大倍数越高,长度越长不同放大倍数的物镜的内部结构是不一样的,简单的说,放大倍数越大,内部越复杂,镜片越多,物镜也就越长.具体见下3.物镜中的油镜是什么?

  • 聚乙烯塑料和聚丙烯的区别(它与热塑性塑料相比有什么优势)

    聚丙烯原料丰富,综合性能优良,容易加工成型,生产成本较低,所以用途广泛。聚丙烯以丙烯单体为主要组分聚合而成。聚丙烯按照参加聚合的单体组成,可分为均聚级和共聚级两种。共聚级聚丙烯有较高的冲击强度,而无规共聚丙烯除了有较高的冲击强度外,还有很好的透明度。与其它通用热塑性塑料比较,聚丙烯具有比重小、刚性好、强度高、耐挠曲,以及有高于100℃的耐热温度和良好的耐化学腐蚀性等优点。

  • 孕妇嘴巴上火起泡怎么办(孕妇嘴巴上火起泡怎么办用什么药)

    药物涂抹患处孕妇嘴巴上火起泡时,可能与缺乏维生素B有关,可以压碎维生素B2、C片,用粉末与蜂蜜调和涂抹起泡处,能促进泡泡萎缩,促进创面愈合,对于孕妇也无不良反应。土豆外敷嘴上起泡,属于中医“上火”范畴,而土豆恰好能够起到消肿解毒、健脾的效果,虽然西医看来它是病毒感染,但更深层次的病因在于维生素,特别是b族维生素的缺乏。外涂红霉素软膏嘴巴上火起泡时,可适量的涂抹一些红霉素软膏。

  • 墨西哥网络发达吗(龙腾网网友讨论)

    墨西哥有一些非常好的世界大学,世界各地有很多的留学生到墨西哥来留学。墨西哥的经济和1994年签订的《北美自贸协议》的伙伴密切相关,尤其是和美国。墨西哥被世界银行归类为中上收入国家。一些分析家认为墨西哥是新兴工业国。到2050年,墨西哥可能会成为世界第五大或第七大经济体。墨西哥与世界上的任何国家都没有冲突。墨西哥是中立的,所以不需要担心遭到核打击。