
Python Crawler Libraries Explained (Building a Proxy IP Pool, a Must-Have for Python Crawlers)




Python Crawler Libraries Explained

If a fixed IP makes a large number of rapid requests to a website within a short time, the server can easily flag the activity as anomalous and ban that IP. A proxy IP, put simply, means routing requests through proxies so they come from different IPs, and your own IP does not get banned. In this project we build a free proxy IP pool by hand.
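To make this concrete, here is a minimal sketch of what routing a single request through a proxy looks like with the requests library; the proxy address is only an illustrative placeholder, not a working proxy.

import requests

# A minimal sketch: pass a proxies dict so the request goes out through the proxy
# instead of your own IP. The address below is a placeholder for illustration only.
proxies = {
    'http': 'http://123.55.114.77:9999',
    'https': 'http://123.55.114.77:9999',
}
resp = requests.get('https://www.baidu.com', proxies=proxies, timeout=5)
print(resp.status_code)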


'''#1 Analyze the target page (Kuaidaili, a site that offers free proxy IPs), and determine the URL path to crawl and the headers parameter'''
import requests  # used to send the HTTP requests below
import re        # used to parse the responses with regular expressions

url = 'https://www.kuaidaili.com/free/inha/1/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}



#2 Send the request --- use requests to mimic a browser request and get the response data
response = requests.get(url=url, headers=headers).text
#print(response)


#3 Parse the data --- re (regular expressions) is the most powerful option; alternatives include jsonpath, bs (BeautifulSoup), parsel, etc.
all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
#print(len(all_ip))
all_type = re.findall(r"\"类型\">(.*?)</td>", response)  # "类型" ("type") is the literal attribute value in the page's HTML
#print(len(all_type))
all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
all_data = zip(all_type, all_ip, all_port)
for i in enumerate(all_data):
    print(i)

(0, ('HTTP', '123.55.114.77', '9999'))
(1, ('HTTP', '36.248.133.117', '9999'))
(2, ('HTTP', '123.163.117.62', '9999'))
(3, ('HTTP', '1.197.204.52', '9999'))
(4, ('HTTP', '175.155.140.34', '1133'))
(5, ('HTTP', '115.218.5.120', '9000'))
(6, ('HTTP', '182.87.38.156', '9000'))
(7, ('HTTP', '60.168.207.253', '1133'))
(8, ('HTTP', '114.99.13.141', '1133'))
(9, ('HTTP', '119.108.172.169', '9000'))
(10, ('HTTP', '36.248.133.81', '9999'))
(11, ('HTTP', '114.239.29.206', '9999'))
(12, ('HTTP', '1.196.177.81', '9999'))
(13, ('HTTP', '175.44.109.13', '9999'))
(14, ('HTTP', '113.124.86.220', '9999'))
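The comment above notes that jsonpath, bs (BeautifulSoup) and parsel are alternatives to regular expressions. As a rough illustration of the BeautifulSoup route, assuming the table cells carry data-title attributes such as data-title="IP" (which is what the regexes above appear to match), the same extraction could be sketched like this; adjust the selectors if the page's markup differs.

from bs4 import BeautifulSoup

# Sketch of the same parsing step with BeautifulSoup instead of re.
# Assumes <td data-title="IP">, <td data-title="PORT"> and <td data-title="类型"> cells,
# which is what the regular expressions above are matching against.
soup = BeautifulSoup(response, 'html.parser')
ips = [td.get_text(strip=True) for td in soup.select('td[data-title="IP"]')]
ports = [td.get_text(strip=True) for td in soup.select('td[data-title="PORT"]')]
types = [td.get_text(strip=True) for td in soup.select('td[data-title="类型"]')]
for item in enumerate(zip(types, ips, ports)):
    print(item)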


#The code above crawls a single page of proxy IPs; now we modify it and use a for loop to crawl multiple pages
all_datas = []  #one variable to collect the IPs from every page
import time
for page in range(2):  #try two pages first
    url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page + 1)  #the part of the URL that changes is filled in with {}.format
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    #2 Send the request --- use requests to mimic a browser request and get the response data
    response = requests.get(url=url, headers=headers).text
    #print(response)
    #3 Parse the data --- re is the most powerful option; alternatives include jsonpath, bs, parsel, etc.
    all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
    #print(len(all_ip))
    all_type = re.findall(r"\"类型\">(.*?)</td>", response)
    #print(len(all_type))
    all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
    all_data = zip(all_type, all_ip, all_port)
    for i in enumerate(all_data):
        #print(i)
        all_datas.append(i)  #append each IP tuple to all_datas
    time.sleep(1)  #sleep for one second so we don't hit the server too often and get our requests rejected


print(all_datas)

[(0, ('HTTP', '123.55.114.77', '9999')), (1, ('HTTP', '36.248.133.117', '9999')), (2, ('HTTP', '123.163.117.62', '9999')), (3, ('HTTP', '1.197.204.52', '9999')), (4, ('HTTP', '175.155.140.34', '1133')), (5, ('HTTP', '115.218.5.120', '9000')), (6, ('HTTP', '182.87.38.156', '9000')), (7, ('HTTP', '60.168.207.253', '1133')), (8, ('HTTP', '114.99.13.141', '1133')), (9, ('HTTP', '119.108.172.169', '9000')), (10, ('HTTP', '36.248.133.81', '9999')), (11, ('HTTP', '114.239.29.206', '9999')), (12, ('HTTP', '1.196.177.81', '9999')), (13, ('HTTP', '175.44.109.13', '9999')), (14, ('HTTP', '113.124.86.220', '9999')), (0, ('HTTP', '175.42.68.174', '9999')), (1, ('HTTP', '182.34.34.20', '9999')), (2, ('HTTP', '113.194.140.60', '9999')), (3, ('HTTP', '113.194.48.139', '9999')), (4, ('HTTP', '123.55.114.42', '9999')), (5, ('HTTP', '175.43.151.240', '9999')), (6, ('HTTP', '123.169.118.141', '9999')), (7, ('HTTP', '123.55.101.33', '9999')), (8, ('HTTP', '1.197.203.69', '9999')), (9, ('HTTP', '120.83.122.236', '9999')), (10, ('HTTP', '36.248.132.246', '9999')), (11, ('HTTP', '58.22.177.123', '9999')), (12, ('HTTP', '36.248.133.68', '9999')), (13, ('HTTP', '121.232.199.252', '9000')), (14, ('HTTP', '114.99.15.142', '1133'))]


'''Use a function to test the proxy IPs and drop the low-quality ones'''
def check_ip(all_datas):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    good_id = []
    for i, (type, ip, port) in all_datas:
        id_dict = {}
        #print('{}:{}:{}'.format(type, ip, port))
        id_dict[type] = ip + ':' + port
        #print(id_dict)
        try:
            #request the Baidu homepage; if it responds within 0.1 seconds, treat the proxy as usable
            res = requests.get('https://www.baidu.com', headers=headers, proxies=id_dict, timeout=0.1)
            if res.status_code == 200:
                good_id.append(id_dict)
        except Exception as error:
            print(id_dict, error)
    return good_id

good_id = check_ip(all_datas)
print('好用的ip有:', good_id)  # '好用的ip有' means 'the usable IPs are'

好用的ip有: [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}]


#Wrap the IP-fetching code into a function as well, and pass in the number of pages to crawl
def get_id(pages):
    all_datas = []
    for page in range(pages):
        url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page + 1)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
        # 2 Send the request --- use requests to mimic a browser request and get the response data
        response = requests.get(url=url, headers=headers).text
        # print(response)
        # 3 Parse the data --- re is the most powerful option; alternatives include jsonpath, bs, parsel, etc.
        all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
        # print(len(all_ip))
        all_type = re.findall(r"\"类型\">(.*?)</td>", response)
        # print(len(all_type))
        all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
        all_data = zip(all_type, all_ip, all_port)
        for i in enumerate(all_data):
            # print(i)
            all_datas.append(i)
        time.sleep(1)
    print("获取的ip有{}个".format(len(all_datas)))  # '获取的ip有{}个' means 'fetched {} IPs'
    return all_datas

def check_ip(page=4):
    '''Function that tests the proxy IPs'''
    all_datas = get_id(page)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    good_id = []
    for i, (type, ip, port) in all_datas:
        id_dict = {}
        id_dict[type] = ip + ':' + port
        try:
            res = requests.get('https://www.baidu.com', headers=headers, proxies=id_dict, timeout=0.1)
            if res.status_code == 200:
                good_id.append(id_dict)
        except Exception as error:
            print(id_dict, error)
    return good_id


#Now a single function, check_ip, returns the usable IPs, which makes it easy to call from other modules, since fetching proxy IPs is never a crawler's final goal
#When other code needs proxy IPs, it simply imports this function from the proxy-IP module
if __name__ == '__main__':
    good_ID = check_ip(4)
    print("好的代理IP", good_ID)  # '好的代理IP' means 'good proxy IPs'

获取的ip有60个好的代理IP [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.23:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.209:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.104.138.12:3000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.148.211:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.198.72.73:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.29.17:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.123.89:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '140.255.184.238:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.27.180:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.167:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.37:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.221.227:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.108.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.222:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.102.66:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.185:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.240:9999', 
'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.35.160.131:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.249.53.36:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '112.111.217.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '139.155.41.15:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.7.209:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '163.125.112.207:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.15.151:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.143.39:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '221.224.136.211:35101', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.250.156.31:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.101.237.216:9999', 'klab_external_proxy_service_port': '80'}]
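As a usage sketch, assuming the functions above are saved in a module named proxy_pool.py (the filename is hypothetical), another crawler could import check_ip and pick a proxy from the pool for each request. Note that the dicts built above use the page's type string ('HTTP') as the key, while requests looks up proxies by lowercase scheme, so the sketch normalizes the keys before passing them in.

import random
import requests
from proxy_pool import check_ip   # hypothetical module name for the code above

good_ips = check_ip(4)            # build a small pool from 4 pages of Kuaidaili
proxy = random.choice(good_ips)   # rotate: pick a random proxy for each request
# Normalize {'HTTP': 'ip:port'} to {'http': 'ip:port'} so requests recognizes the scheme key.
proxies = {scheme.lower(): addr for scheme, addr in proxy.items()}
resp = requests.get('https://www.baidu.com', proxies=proxies, timeout=5)
print(resp.status_code)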
