77百科网
当前位置: 首页 生活百科

python爬虫库详解(python爬虫必备构建代理IP池)

时间:2023-08-10 作者: 小编 阅读量: 1 栏目名: 生活百科

python爬虫库详解?如果一个固定的ip在短暂的时间内,快速大量的访问一个网站,很容易被服务器查出异常从而被封掉ip代理IP简单的说,就是通过ip代理,从不同的ip进行访问,这样就不会被封掉ip了本次项目就是自己动手构建一个免费的代理ip池,接下来我们就来聊聊关于python爬虫库详解?以下内容大家不妨参考一二希望能帮到您!

python爬虫库详解

如果一个固定的ip在短暂的时间内,快速大量的访问一个网站,很容易被服务器查出异常从而被封掉ip。代理IP简单的说,就是通过ip代理,从不同的ip进行访问,这样就不会被封掉ip了。本次项目就是自己动手构建一个免费的代理ip池。


'''#1分析目标网页(快代理,一个获得免费代理IP的网站),确定爬取的url路径,headers参数'''url = 'https://www.kuaidaili.com/free/inha/1/'headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

'''#1分析目标网页(快代理,一个获得免费代理IP的网站),确定爬取的url路径,headers参数'''url = 'https://www.kuaidaili.com/free/inha/1/'headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

In [5]:

#2发送请求 ---》requests模拟游览器发送请求,获取响应数据response=requests.get(url=url,headers=headers).text#print(response)

In [6]:

#3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip=re.findall(r"\"IP\">(.*?)</td>",response)#print(len(all_ip))all_type=re.findall(r"\"类型\">(.*?)</td>",response)#print(len(all_type))all_port=re.findall(r"\"PORT\">(.*?)</td>",response)all_data=zip(all_type,all_ip,all_port)for i in enumerate(all_data):print(i)

(0, ('HTTP', '123.55.114.77', '9999'))(1, ('HTTP', '36.248.133.117', '9999'))(2, ('HTTP', '123.163.117.62', '9999'))(3, ('HTTP', '1.197.204.52', '9999'))(4, ('HTTP', '175.155.140.34', '1133'))(5, ('HTTP', '115.218.5.120', '9000'))(6, ('HTTP', '182.87.38.156', '9000'))(7, ('HTTP', '60.168.207.253', '1133'))(8, ('HTTP', '114.99.13.141', '1133'))(9, ('HTTP', '119.108.172.169', '9000'))(10, ('HTTP', '36.248.133.81', '9999'))(11, ('HTTP', '114.239.29.206', '9999'))(12, ('HTTP', '1.196.177.81', '9999'))(13, ('HTTP', '175.44.109.13', '9999'))(14, ('HTTP', '113.124.86.220', '9999'))

In [14]:

#上面是爬取一页代理IP的代码,我们进行修改,用一个for循环爬取多页all_datas=[] #用一个变量接收各个页的ipimport timefor page in range(2):#先爬取三页试试url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page 1) #url需要更改的地方用{}.foemat进行传参headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}#2发送请求 ---》requests模拟游览器发送请求,获取响应数据response=requests.get(url=url,headers=headers).text#print(response)#3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip=re.findall(r"\"IP\">(.*?)</td>",response)#print(len(all_ip))all_type=re.findall(r"\"类型\">(.*?)</td>",response)#print(len(all_type))all_port=re.findall(r"\"PORT\">(.*?)</td>",response)all_data=zip(all_type,all_ip,all_port)for i in enumerate(all_data):#print(i)all_datas.append(i) #将变量all_datas依次接收iptime.sleep(1)#设置休眠时间一秒,防止请求服务器太频繁被服务器拒绝请求

In [15]:

print(all_datas)

[(0, ('HTTP', '123.55.114.77', '9999')), (1, ('HTTP', '36.248.133.117', '9999')), (2, ('HTTP', '123.163.117.62', '9999')), (3, ('HTTP', '1.197.204.52', '9999')), (4, ('HTTP', '175.155.140.34', '1133')), (5, ('HTTP', '115.218.5.120', '9000')), (6, ('HTTP', '182.87.38.156', '9000')), (7, ('HTTP', '60.168.207.253', '1133')), (8, ('HTTP', '114.99.13.141', '1133')), (9, ('HTTP', '119.108.172.169', '9000')), (10, ('HTTP', '36.248.133.81', '9999')), (11, ('HTTP', '114.239.29.206', '9999')), (12, ('HTTP', '1.196.177.81', '9999')), (13, ('HTTP', '175.44.109.13', '9999')), (14, ('HTTP', '113.124.86.220', '9999')), (0, ('HTTP', '175.42.68.174', '9999')), (1, ('HTTP', '182.34.34.20', '9999')), (2, ('HTTP', '113.194.140.60', '9999')), (3, ('HTTP', '113.194.48.139', '9999')), (4, ('HTTP', '123.55.114.42', '9999')), (5, ('HTTP', '175.43.151.240', '9999')), (6, ('HTTP', '123.169.118.141', '9999')), (7, ('HTTP', '123.55.101.33', '9999')), (8, ('HTTP', '1.197.203.69', '9999')), (9, ('HTTP', '120.83.122.236', '9999')), (10, ('HTTP', '36.248.132.246', '9999')), (11, ('HTTP', '58.22.177.123', '9999')), (12, ('HTTP', '36.248.133.68', '9999')), (13, ('HTTP', '121.232.199.252', '9000')), (14, ('HTTP', '114.99.15.142', '1133'))]

In [16]:

'''用一个方法检测代理ip,去掉一些质量差的ip'''def check_ip(all_datas):headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}good_id=[]for i, (type, ip, port) in all_datas:id_dict = {}#print('{}:{}:{}'.format(type, ip, port))id_dict[type] = ip':'port#print(id_dict)try:#获取百度的页面,如果响应时间在0.1秒内,则认为是好用的IPres=requests.get('https://www.baidu.com',headers=headers,proxies=id_dict,timeout=0.1)if res.status_code==200:good_id.append(id_dict)except Exception as error:print(id_dict,error)return good_idgood_id=check_ip(all_datas)print('好用的ip有:',good_id)

好用的ip有: [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}]

In [19]:

#将获取IP的代码也装封成一个函数,根据需要爬取的页数进行传参def get_id(pages):all_datas = []for page in range(pages):url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page1)headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}# 2发送请求 ---》requests模拟游览器发送请求,获取响应数据response = requests.get(url=url, headers=headers).text# print(response)# 3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip = re.findall(r"\"IP\">(.*?)</td>", response)# print(len(all_ip))all_type = re.findall(r"\"类型\">(.*?)</td>", response)# print(len(all_type))all_port = re.findall(r"\"PORT\">(.*?)</td>", response)all_data = zip(all_type, all_ip, all_port)for i in enumerate(all_data):# print(i)all_datas.append(i)time.sleep(1)print("获取的ip有{}个".format(len(all_datas)))return all_datasdef check_ip(page=4):'''检测代理ip的方法'''all_datas=get_id(page)headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}good_id=[]for i, (type, ip, port) in all_datas:id_dict = {}id_dict[type] = ip':'porttry:res=requests.get('https://www.baidu.com',headers=headers,proxies=id_dict,timeout=0.1)if res.status_code==200:good_id.append(id_dict)except Exception as error:print(id_dict,error)return good_id

In [20]:

#可以运行一个函数check_ip来获得好用的IP,方便在其他模块进行调用,因为获取代理IP不是爬虫的最终目的#需要在其他代码里获取代理IP时,直接从获取代理IP的模块中导入 这个函数即可if __name__=='__main__':good_ID=check_ip(4)print("好的代理IP",good_ID)

获取的ip有60个好的代理IP [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.23:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.209:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.104.138.12:3000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.148.211:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.198.72.73:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.29.17:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.123.89:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '140.255.184.238:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.27.180:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.167:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.37:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.221.227:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.108.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.222:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.102.66:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.185:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.35.160.131:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.249.53.36:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '112.111.217.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '139.155.41.15:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.7.209:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '163.125.112.207:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.15.151:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.143.39:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '221.224.136.211:35101', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.250.156.31:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.101.237.216:9999', 'klab_external_proxy_service_port': '80'}]

    推荐阅读
  • 最好的学数学方法(数学学习方法)

    建议可以买一些教材,里面会有一些做题技巧。有些同学在课堂上埋头苦干,一节课下来,书上满满当当,但这样其实是效率不高的。数学重要的是思维,只要弄清楚了知识点,相似的题型都几乎会做。所以在课堂上最好是紧跟老师的步伐。掌握了知识点后也要实践,通过做课外题更好地发现学习的弱项。

  • 萆薢的功效与作用(萆薢的功效与作用有哪些)

    萆薢的功效与作用萆薢是一种味苦性平的中药材,可以入肝经、肾经和胃经,利湿去浊和祛风除湿是它最重要的功效,平时多用于人类淋浊白带和腰腿疼痛等多疾病的治疗,另外萆薢对人类的湿热疮毒也有一定的治疗效果。萆薢可治尿频萆薢是一种对尿频治疗效果出色的中药材,在治疗时可以把萆薢洗净以后加工成末,然后用白酒调成像桐子大小的药丸,每天食用七十粒,食用时要空腹用盐汤送下,连用一星期能起到良好的治疗效果。

  • 10种最流行的配色(女人会配色才更精致)

    女人会配色才更精致#抗老好物推荐##品牌好物#丹纳在《艺术哲学》中是这样写的:色彩之于形象,有如伴奏之于歌词,不但如此,有时色彩竟是歌词,而形象只是伴奏色彩搭配是最基本的穿搭功底,再高级的衣服若是没有协调的色彩作为配合。

  • 尿检可以提前多久取尿(体检怎么取尿样)

    尿液检查是很重要的检查项目,可能有些人不知道,取尿样最好是中段尿才最为稳定。所谓中段尿,就是在一次排尿的中途时间里取尿。如果取的尿不是中段,最明显的误差就是会让尿液的pH值偏高或者偏低,还有一些数据也会有轻微的浮动。但这种影响不会有“质”的变化,比如没有炎症会测出炎症来,或是有炎症测不出来。来源:北京12320在聆听

  • 五年级汉字的演变(汉字的起源与演变)

    五年级汉字的演变一、起源与演变“走”字始见于商代甲骨文,其字形像一个人两臂一上一下,是摆动双臂跑起来的象形。楷书字形有“大止”、“夭止”、“土之”、“土止”四种,汉代汉语确定为“走”的字形。引申为趋向、走向,后指步行,现代汉语的“走”相当于古代汉语的“步”。走又由脚步的移动引申为移动、离开、改变、泄露等义。用走作意符的字大多与跑或快走有关。

  • 两个月的贵宾犬吃什么东西(宠物挑食怎么办)

    贵宾犬也称“贵妇犬”,又称“卷毛狗”,不过它在德语中是水花飞溅的意思。法国的长卷毛犬、葡萄牙水犬、西班牙猎犬都有可能是贵宾犬的祖先。贵宾犬在法国被视为国犬,它气质独特,造型多变,赢得了许多人的欢心。贵宾犬活泼,性情优良,极易近人,是一种忠诚的犬种。为了保持它的美貌,需要定期给狗狗洗澡,记得洗澡后一定要吹干啊,不然狗狗会感冒的。饮食篇贵宾犬平常喜欢动,而且十分活泼,热情开朗。

  • iphone键盘表情符号怎么启用(令人兴奋的iPhone技巧可让您解锁隐藏的文本表情符号功能)

    苹果在其iPhone上安装了大量巧妙的黑客技术,但并非都是显而易见的。幸运的是,一位iPhone粉丝透露了一个出色的隐藏表情符号功能。这将解锁一个隐藏页面,让您发送带有效果的消息。现在您要点击顶部的屏幕,然后选择使用回声发送。他们可以一遍又一遍地重播“回声”。如果您看不到此功能,请确保您使用的是最新版本的iOS。要进行检查,请进入“设置”>“通用”>“软件更新”。如果您还没有更新到最新的iOS更新,赶快更新吧。

  • 岛国猎奇神剧(这么羞耻的岛国雷剧)

    《哥哥太爱我了怎么办》据说,这部剧中的玛丽苏直接拉响了防空警报。17岁的女主濑户花告白12次都接连失败。有些网友甚至认为,电视剧中的主角比原著还苏。凭借《殿下,给点利息》提名2017年第40届日本电影学院奖最佳新人。同时对几个男的产生好感,一度让人怀疑她是否滥情。完美又多金的跑车男跟妹妹告白却被哥哥警告。隔断兄妹两人的血缘关系。剧情再次发生反转。6月30日《哥哥太爱我了》电影版上映,期待。

  • 发票丢失怎么办(如何处理发票丢失)

    接下来我们就一起去了解一下吧!发票丢失怎么办发票丢失,应于丢失当日书面报告主管税务机关,并在报刊和电视等传播媒介上公告声明作废。请办理上述手续后,按以下程序办理。丢失发票的消费者到销货单位取得该发票存根联复印件;到销货方所在地主管税务机关盖章确认并登记备案;由销货单位重新开具与原销货发票存根联内容一致的销货发票。