
Python Crawler Libraries Explained (A Crawler Essential: Building a Proxy IP Pool)

Date: 2023-08-10  Author: site editor (小编)


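If a single fixed IP sends a large number of requests to a website within a short time, the server can easily flag the traffic as abnormal and ban that IP. A proxy IP, put simply, lets the requests go out through other IPs, so the crawler's own address is much less likely to be banned. This project builds a free proxy IP pool by hand.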
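As a minimal sketch of the idea, assume the pool is a list of (type, ip, port) tuples like the ones this article collects. A crawler could then pick a different entry for each request. The helper names `to_proxies` and `pick_proxies` are hypothetical. The only library behaviour relied on is that requests accepts a `proxies` dict keyed by lowercase URL scheme.

```python
import random

def to_proxies(proxy_type, ip, port):
    """Build a requests-style proxies dict from one pool entry (hypothetical helper)."""
    address = "{}://{}:{}".format(proxy_type.lower(), ip, port)
    # Assumption: the proxy can also tunnel HTTPS, so both schemes use it.
    return {"http": address, "https": address}

def pick_proxies(pool, rng=random):
    """Choose one pool entry at random so consecutive requests can use different IPs."""
    proxy_type, ip, port = rng.choice(pool)
    return to_proxies(proxy_type, ip, port)

# Two entries taken from the sample output in this article.
pool = [("HTTP", "123.55.114.77", "9999"), ("HTTP", "36.248.133.117", "9999")]
proxies = pick_proxies(pool)
# A crawler would then call, for example:
#   requests.get(target_url, headers=headers, proxies=proxies)
```

Free proxies go offline often, so the pool is best treated as a list to re-check rather than a fixed configuration.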


# Step 1: analyze the target page (Kuaidaili, a site that lists free proxy IPs)
# and decide on the URL to crawl and the headers to send.
import re        # used in step 3 to parse the page
import requests  # used in step 2 to send the request

url = 'https://www.kuaidaili.com/free/inha/1/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

In [5]:

# Step 2: send the request. requests imitates a browser and returns the response data.
response = requests.get(url=url, headers=headers).text
# print(response)

In [6]:

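# Step 3: parse the data. Regular expressions (re) are the most powerful option;
# jsonpath, bs and parsel are alternatives.
all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
# print(len(all_ip))
all_type = re.findall(r"\"类型\">(.*?)</td>", response)  # "类型" ("type") is the literal label used on the page
# print(len(all_type))
all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
all_data = zip(all_type, all_ip, all_port)
for i in enumerate(all_data):
    print(i)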

(0, ('HTTP', '123.55.114.77', '9999'))
(1, ('HTTP', '36.248.133.117', '9999'))
(2, ('HTTP', '123.163.117.62', '9999'))
(3, ('HTTP', '1.197.204.52', '9999'))
(4, ('HTTP', '175.155.140.34', '1133'))
(5, ('HTTP', '115.218.5.120', '9000'))
(6, ('HTTP', '182.87.38.156', '9000'))
(7, ('HTTP', '60.168.207.253', '1133'))
(8, ('HTTP', '114.99.13.141', '1133'))
(9, ('HTTP', '119.108.172.169', '9000'))
(10, ('HTTP', '36.248.133.81', '9999'))
(11, ('HTTP', '114.239.29.206', '9999'))
(12, ('HTTP', '1.196.177.81', '9999'))
(13, ('HTTP', '175.44.109.13', '9999'))
(14, ('HTTP', '113.124.86.220', '9999'))
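The parsing step can be tried offline on a small fragment. This is only a sketch: `sample_html` is a hypothetical fragment shaped like the cells the three patterns expect (a `data-title` attribute followed by the value), and the real page markup may differ. The function name `parse_proxies` is also made up for the example.

```python
import re

# Hypothetical fragment; the attribute name data-title is an assumption.
sample_html = '''
<tr><td data-title="IP">123.55.114.77</td><td data-title="PORT">9999</td><td data-title="类型">HTTP</td></tr>
<tr><td data-title="IP">175.155.140.34</td><td data-title="PORT">1133</td><td data-title="类型">HTTP</td></tr>
'''

def parse_proxies(html):
    """Return a list of (type, ip, port) tuples found in html."""
    ips = re.findall(r'"IP">(.*?)</td>', html)
    types = re.findall(r'"类型">(.*?)</td>', html)
    ports = re.findall(r'"PORT">(.*?)</td>', html)
    # zip() silently truncates to the shortest list, so compare lengths first.
    if not (len(ips) == len(types) == len(ports)):
        raise ValueError("column counts differ; the page layout may have changed")
    return list(zip(types, ips, ports))

print(parse_proxies(sample_html))
# [('HTTP', '123.55.114.77', '9999'), ('HTTP', '175.155.140.34', '1133')]
```

The length check matters because the three lists are matched independently. If one pattern stops matching, zip would otherwise pair the wrong values or return nothing without any error.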

In [14]:

# The code above crawls one page of proxy IPs. Wrap it in a for loop to crawl several pages.
import time

all_datas = []  # collects the IPs from every page
for page in range(2):  # try two pages first
    url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page + 1)  # the page number is filled in with {} and .format
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    # Step 2: send the request. requests imitates a browser and returns the response data.
    response = requests.get(url=url, headers=headers).text
    # print(response)
    # Step 3: parse the data. Regular expressions (re) are the most powerful option;
    # jsonpath, bs and parsel are alternatives.
    all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
    # print(len(all_ip))
    all_type = re.findall(r"\"类型\">(.*?)</td>", response)
    # print(len(all_type))
    all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
    all_data = zip(all_type, all_ip, all_port)
    for i in enumerate(all_data):
        # print(i)
        all_datas.append(i)  # add each entry to all_datas
    time.sleep(1)  # pause one second so the server does not reject overly frequent requests

In [15]:

print(all_datas)

[(0, ('HTTP', '123.55.114.77', '9999')), (1, ('HTTP', '36.248.133.117', '9999')), (2, ('HTTP', '123.163.117.62', '9999')), (3, ('HTTP', '1.197.204.52', '9999')), (4, ('HTTP', '175.155.140.34', '1133')), (5, ('HTTP', '115.218.5.120', '9000')), (6, ('HTTP', '182.87.38.156', '9000')), (7, ('HTTP', '60.168.207.253', '1133')), (8, ('HTTP', '114.99.13.141', '1133')), (9, ('HTTP', '119.108.172.169', '9000')), (10, ('HTTP', '36.248.133.81', '9999')), (11, ('HTTP', '114.239.29.206', '9999')), (12, ('HTTP', '1.196.177.81', '9999')), (13, ('HTTP', '175.44.109.13', '9999')), (14, ('HTTP', '113.124.86.220', '9999')), (0, ('HTTP', '175.42.68.174', '9999')), (1, ('HTTP', '182.34.34.20', '9999')), (2, ('HTTP', '113.194.140.60', '9999')), (3, ('HTTP', '113.194.48.139', '9999')), (4, ('HTTP', '123.55.114.42', '9999')), (5, ('HTTP', '175.43.151.240', '9999')), (6, ('HTTP', '123.169.118.141', '9999')), (7, ('HTTP', '123.55.101.33', '9999')), (8, ('HTTP', '1.197.203.69', '9999')), (9, ('HTTP', '120.83.122.236', '9999')), (10, ('HTTP', '36.248.132.246', '9999')), (11, ('HTTP', '58.22.177.123', '9999')), (12, ('HTTP', '36.248.133.68', '9999')), (13, ('HTTP', '121.232.199.252', '9000')), (14, ('HTTP', '114.99.15.142', '1133'))]
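In the printed list the index from enumerate restarts at 0 on every page, so it does not identify an entry across pages. One possible clean-up, sketched here with the hypothetical helper `unique_proxies`, drops the index and keeps the first occurrence of each (type, ip, port) tuple. The repeated entry in `sample` is added on purpose for the demonstration and does not come from the article's data.

```python
def unique_proxies(indexed_entries):
    """Drop the per-page index and keep the first occurrence of each entry."""
    seen = set()
    result = []
    for _index, entry in indexed_entries:
        if entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result

sample = [
    (0, ("HTTP", "123.55.114.77", "9999")),
    (1, ("HTTP", "36.248.133.117", "9999")),
    (0, ("HTTP", "175.42.68.174", "9999")),
    (1, ("HTTP", "123.55.114.77", "9999")),  # repeated on purpose for the demo
]
print(unique_proxies(sample))
# [('HTTP', '123.55.114.77', '9999'), ('HTTP', '36.248.133.117', '9999'), ('HTTP', '175.42.68.174', '9999')]
```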

In [16]:

# Check the proxy IPs with a function and drop the low-quality ones.
def check_ip(all_datas):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    good_id = []
    for i, (proxy_type, ip, port) in all_datas:
        id_dict = {}
        # print('{}:{}:{}'.format(proxy_type, ip, port))
        # Caveat: requests looks up proxies by lowercase URL scheme ('http', 'https'),
        # so an uppercase 'HTTP' key is not applied to the https:// request below.
        id_dict[proxy_type] = ip + ':' + port
        # print(id_dict)
        try:
            # Fetch the Baidu home page; a response within 0.1 seconds counts as a usable IP.
            res = requests.get('https://www.baidu.com', headers=headers, proxies=id_dict, timeout=0.1)
            if res.status_code == 200:
                good_id.append(id_dict)
        except Exception as error:
            print(id_dict, error)
    return good_id

good_id = check_ip(all_datas)
print('Working IPs:', good_id)

Working IPs: [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}]
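Every address in the list above passes the check, which is worth a second look. A stricter variant might look like the sketch below. It rests on stated assumptions: the proxy type is lowercased and registered for both http and https so that requests actually routes through it, and the timeout is relaxed to 3 seconds, a value chosen only for illustration. `build_proxies`, `requests_fetch`, `filter_good` and `fake_fetch` are hypothetical names. The fetch function is passed in so the logic can be run without a network.

```python
def build_proxies(proxy_type, ip, port):
    """Build a proxies dict keyed by lowercase scheme (hypothetical helper)."""
    address = "{}://{}:{}".format(proxy_type.lower(), ip, port)
    return {"http": address, "https": address}

def requests_fetch(url, proxies, timeout):
    """Default fetcher: return the HTTP status code obtained through the proxy."""
    import requests  # imported lazily so the sketch loads without requests installed
    return requests.get(url, proxies=proxies, timeout=timeout).status_code

def filter_good(entries, fetch=requests_fetch, url="https://www.baidu.com", timeout=3.0):
    """Keep the proxies that return status 200 within timeout seconds."""
    good = []
    for proxy_type, ip, port in entries:
        proxies = build_proxies(proxy_type, ip, port)
        try:
            if fetch(url, proxies, timeout) == 200:
                good.append(proxies)
        except Exception as error:  # timeouts, refused connections, bad proxies
            print(proxies, error)
    return good

# Offline demonstration with a stand-in fetcher (an assumption for the demo only).
def fake_fetch(url, proxies, timeout):
    if proxies["http"].endswith(":9999"):
        return 200
    raise TimeoutError("simulated timeout")

entries = [("HTTP", "123.55.114.77", "9999"), ("HTTP", "175.155.140.34", "1133")]
print(filter_good(entries, fetch=fake_fetch))
```

With the default `requests_fetch`, a proxy that is down raises an exception and is dropped instead of being counted as good.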

In [19]:

# Wrap the IP-fetching code in a function as well; the number of pages to crawl is passed as an argument.
import re
import time
import requests

def get_id(pages):
    all_datas = []
    for page in range(pages):
        url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page + 1)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
        # Step 2: send the request. requests imitates a browser and returns the response data.
        response = requests.get(url=url, headers=headers).text
        # print(response)
        # Step 3: parse the data. Regular expressions (re) are the most powerful option;
        # jsonpath, bs and parsel are alternatives.
        all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
        # print(len(all_ip))
        all_type = re.findall(r"\"类型\">(.*?)</td>", response)
        # print(len(all_type))
        all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
        all_data = zip(all_type, all_ip, all_port)
        for i in enumerate(all_data):
            # print(i)
            all_datas.append(i)
        time.sleep(1)
    print("Fetched {} IPs".format(len(all_datas)))
    return all_datas

def check_ip(page=4):
    '''Check the proxy IPs.'''
    all_datas = get_id(page)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    good_id = []
    for i, (proxy_type, ip, port) in all_datas:
        id_dict = {}
        id_dict[proxy_type] = ip + ':' + port
        try:
            res = requests.get('https://www.baidu.com', headers=headers, proxies=id_dict, timeout=0.1)
            if res.status_code == 200:
                good_id.append(id_dict)
        except Exception as error:
            print(id_dict, error)
    return good_id

In [20]:

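# Calling check_ip alone returns the usable IPs, which makes it easy to call from other modules,
# since collecting proxy IPs is not the crawler's final goal.
# When other code needs proxy IPs, import this function from the proxy module.
if __name__ == '__main__':
    good_ID = check_ip(4)
    print("Good proxy IPs", good_ID)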

Fetched 60 IPs
Good proxy IPs [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.23:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.209:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.104.138.12:3000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.148.211:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.198.72.73:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.29.17:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.123.89:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '140.255.184.238:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.27.180:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.167:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.37:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.221.227:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.108.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.222:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.102.66:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.185:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.35.160.131:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.249.53.36:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '112.111.217.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '139.155.41.15:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.7.209:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '163.125.112.207:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.15.151:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.143.39:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '221.224.136.211:35101', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.250.156.31:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.101.237.216:9999', 'klab_external_proxy_service_port': '80'}]
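Checking 60 addresses one after another means waiting on each request in turn. A thread pool from the standard library is one way to run the checks in parallel. The sketch below assumes a checker that takes one (type, ip, port) entry and returns True or False. `check_all` and `port_is_9999` are hypothetical names, and the stand-in checker exists only so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def check_all(entries, check_one, max_workers=10):
    """Run check_one(entry) for every entry in parallel and keep the entries that pass."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so results line up with entries.
        results = list(executor.map(check_one, entries))
    return [entry for entry, ok in zip(entries, results) if ok]

# Stand-in checker for an offline demo; a real one would send a request
# through the proxy and return True on a 200 response.
def port_is_9999(entry):
    _proxy_type, _ip, port = entry
    return port == "9999"

entries = [
    ("HTTP", "123.55.114.77", "9999"),
    ("HTTP", "175.155.140.34", "1133"),
    ("HTTP", "36.248.133.117", "9999"),
]
print(check_all(entries, port_is_9999))
# [('HTTP', '123.55.114.77', '9999'), ('HTTP', '36.248.133.117', '9999')]
```

Threads suit this task because each check spends most of its time waiting on the network rather than computing.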
