77百科网
当前位置: 首页 生活百科

python爬虫库详解(python爬虫必备构建代理IP池)

时间:2023-08-10 作者: 小编 阅读量: 1 栏目名: 生活百科

python爬虫库详解?如果一个固定的ip在短暂的时间内,快速大量的访问一个网站,很容易被服务器查出异常从而被封掉ip代理IP简单的说,就是通过ip代理,从不同的ip进行访问,这样就不会被封掉ip了本次项目就是自己动手构建一个免费的代理ip池,接下来我们就来聊聊关于python爬虫库详解?以下内容大家不妨参考一二希望能帮到您!

python爬虫库详解

如果一个固定的ip在短暂的时间内,快速大量的访问一个网站,很容易被服务器查出异常从而被封掉ip。代理IP简单的说,就是通过ip代理,从不同的ip进行访问,这样就不会被封掉ip了。本次项目就是自己动手构建一个免费的代理ip池。


'''#1分析目标网页(快代理,一个获得免费代理IP的网站),确定爬取的url路径,headers参数'''url = 'https://www.kuaidaili.com/free/inha/1/'headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

'''#1分析目标网页(快代理,一个获得免费代理IP的网站),确定爬取的url路径,headers参数'''url = 'https://www.kuaidaili.com/free/inha/1/'headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

In [5]:

#2发送请求 ---》requests模拟游览器发送请求,获取响应数据response=requests.get(url=url,headers=headers).text#print(response)

In [6]:

#3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip=re.findall(r"\"IP\">(.*?)</td>",response)#print(len(all_ip))all_type=re.findall(r"\"类型\">(.*?)</td>",response)#print(len(all_type))all_port=re.findall(r"\"PORT\">(.*?)</td>",response)all_data=zip(all_type,all_ip,all_port)for i in enumerate(all_data):print(i)

(0, ('HTTP', '123.55.114.77', '9999'))(1, ('HTTP', '36.248.133.117', '9999'))(2, ('HTTP', '123.163.117.62', '9999'))(3, ('HTTP', '1.197.204.52', '9999'))(4, ('HTTP', '175.155.140.34', '1133'))(5, ('HTTP', '115.218.5.120', '9000'))(6, ('HTTP', '182.87.38.156', '9000'))(7, ('HTTP', '60.168.207.253', '1133'))(8, ('HTTP', '114.99.13.141', '1133'))(9, ('HTTP', '119.108.172.169', '9000'))(10, ('HTTP', '36.248.133.81', '9999'))(11, ('HTTP', '114.239.29.206', '9999'))(12, ('HTTP', '1.196.177.81', '9999'))(13, ('HTTP', '175.44.109.13', '9999'))(14, ('HTTP', '113.124.86.220', '9999'))

In [14]:

#上面是爬取一页代理IP的代码,我们进行修改,用一个for循环爬取多页all_datas=[] #用一个变量接收各个页的ipimport timefor page in range(2):#先爬取三页试试url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page 1) #url需要更改的地方用{}.foemat进行传参headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}#2发送请求 ---》requests模拟游览器发送请求,获取响应数据response=requests.get(url=url,headers=headers).text#print(response)#3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip=re.findall(r"\"IP\">(.*?)</td>",response)#print(len(all_ip))all_type=re.findall(r"\"类型\">(.*?)</td>",response)#print(len(all_type))all_port=re.findall(r"\"PORT\">(.*?)</td>",response)all_data=zip(all_type,all_ip,all_port)for i in enumerate(all_data):#print(i)all_datas.append(i) #将变量all_datas依次接收iptime.sleep(1)#设置休眠时间一秒,防止请求服务器太频繁被服务器拒绝请求

In [15]:

print(all_datas)

[(0, ('HTTP', '123.55.114.77', '9999')), (1, ('HTTP', '36.248.133.117', '9999')), (2, ('HTTP', '123.163.117.62', '9999')), (3, ('HTTP', '1.197.204.52', '9999')), (4, ('HTTP', '175.155.140.34', '1133')), (5, ('HTTP', '115.218.5.120', '9000')), (6, ('HTTP', '182.87.38.156', '9000')), (7, ('HTTP', '60.168.207.253', '1133')), (8, ('HTTP', '114.99.13.141', '1133')), (9, ('HTTP', '119.108.172.169', '9000')), (10, ('HTTP', '36.248.133.81', '9999')), (11, ('HTTP', '114.239.29.206', '9999')), (12, ('HTTP', '1.196.177.81', '9999')), (13, ('HTTP', '175.44.109.13', '9999')), (14, ('HTTP', '113.124.86.220', '9999')), (0, ('HTTP', '175.42.68.174', '9999')), (1, ('HTTP', '182.34.34.20', '9999')), (2, ('HTTP', '113.194.140.60', '9999')), (3, ('HTTP', '113.194.48.139', '9999')), (4, ('HTTP', '123.55.114.42', '9999')), (5, ('HTTP', '175.43.151.240', '9999')), (6, ('HTTP', '123.169.118.141', '9999')), (7, ('HTTP', '123.55.101.33', '9999')), (8, ('HTTP', '1.197.203.69', '9999')), (9, ('HTTP', '120.83.122.236', '9999')), (10, ('HTTP', '36.248.132.246', '9999')), (11, ('HTTP', '58.22.177.123', '9999')), (12, ('HTTP', '36.248.133.68', '9999')), (13, ('HTTP', '121.232.199.252', '9000')), (14, ('HTTP', '114.99.15.142', '1133'))]

In [16]:

'''用一个方法检测代理ip,去掉一些质量差的ip'''def check_ip(all_datas):headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}good_id=[]for i, (type, ip, port) in all_datas:id_dict = {}#print('{}:{}:{}'.format(type, ip, port))id_dict[type] = ip':'port#print(id_dict)try:#获取百度的页面,如果响应时间在0.1秒内,则认为是好用的IPres=requests.get('https://www.baidu.com',headers=headers,proxies=id_dict,timeout=0.1)if res.status_code==200:good_id.append(id_dict)except Exception as error:print(id_dict,error)return good_idgood_id=check_ip(all_datas)print('好用的ip有:',good_id)

好用的ip有: [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}]

In [19]:

#将获取IP的代码也装封成一个函数,根据需要爬取的页数进行传参def get_id(pages):all_datas = []for page in range(pages):url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page1)headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}# 2发送请求 ---》requests模拟游览器发送请求,获取响应数据response = requests.get(url=url, headers=headers).text# print(response)# 3解析数据 --》最强的是re正则,其他的如jsonpath和bs、parsel等等all_ip = re.findall(r"\"IP\">(.*?)</td>", response)# print(len(all_ip))all_type = re.findall(r"\"类型\">(.*?)</td>", response)# print(len(all_type))all_port = re.findall(r"\"PORT\">(.*?)</td>", response)all_data = zip(all_type, all_ip, all_port)for i in enumerate(all_data):# print(i)all_datas.append(i)time.sleep(1)print("获取的ip有{}个".format(len(all_datas)))return all_datasdef check_ip(page=4):'''检测代理ip的方法'''all_datas=get_id(page)headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}good_id=[]for i, (type, ip, port) in all_datas:id_dict = {}id_dict[type] = ip':'porttry:res=requests.get('https://www.baidu.com',headers=headers,proxies=id_dict,timeout=0.1)if res.status_code==200:good_id.append(id_dict)except Exception as error:print(id_dict,error)return good_id

In [20]:

#可以运行一个函数check_ip来获得好用的IP,方便在其他模块进行调用,因为获取代理IP不是爬虫的最终目的#需要在其他代码里获取代理IP时,直接从获取代理IP的模块中导入 这个函数即可if __name__=='__main__':good_ID=check_ip(4)print("好的代理IP",good_ID)

获取的ip有60个好的代理IP [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.23:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.209:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.104.138.12:3000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.148.211:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.198.72.73:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.29.17:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.123.89:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '140.255.184.238:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.27.180:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.167:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.37:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.221.227:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.108.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.222:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.102.66:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.185:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.35.160.131:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.249.53.36:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '112.111.217.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '139.155.41.15:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.7.209:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '163.125.112.207:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.15.151:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.143.39:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '221.224.136.211:35101', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.250.156.31:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.101.237.216:9999', 'klab_external_proxy_service_port': '80'}]

    推荐阅读
  • 总有一天你会在我身旁(你会出现在我身边)

    结果收到了许多朋友的评论。投稿的女生是一位在牛津大学读书的学霸。课程结束后,两个人虽然分开了,但是联系却没有断掉。他们约定要去世界上最好的大学读书。随后就相继收到了牛津大学的录取通知书。女生一转身,男生手里拿着戒指,笑眯眯地单膝跪地看着她。这场由图书馆开始的爱情,在图书馆得到了它最宏大又浪漫的回应。他携带着他的过去,现在,连同他所期待的未来。好的爱情,是不在乎相遇时机的。

  • 济南离婚房屋 济南离婚房产纠纷

    网上申请流程网上申请析产双方分别登录山东政务服务网济南分站或下载“泉城办”手机APP,注册用户并进行实名认证,输入不动产证号,确认不动产登记信息,填写析产份额情况,上传资料后提交。领取证书办证申请经不动产登记机构和税务部门预审通过后,申请人可通过网上缴费,并根据领证短信通知内容,携带身份证明及其他材料到就近的不动产登记大厅领取不动产权证书。使用不动产电子证书的原证书可申请公告注销

  • 魔芋丝煮多久(魔芋丝怎么煮)

    以下内容希望对你有帮助!魔芋丝煮多久10分钟左右,首先把魔芋丝放入碗中,撒1勺砂糖、半勺鸡精。然后继续撒1勺食盐、1大勺辣椒粉。加2勺陈醋、2勺蒸鱼豉油。再加上适量葱花,淋适量热油。最后撒点香菜末,搅拌均匀后稍微腌制几分钟即可。

  • 有效治疗胃溃疡的方法(患了胃溃疡怎么办)

    一些胃溃疡患者因为对胃溃疡缺认识,没有引起足够的重视,从而导致病情反复发作,非常影响日常工作和生活。幽门螺杆菌感染、药物、饮食、遗传因素等,都是诱发胃溃疡发生的常见因素。一般胃溃疡需要接受4到8周的治疗才能够得到缓解,特殊情况下,可能还会适当延长治疗周期,所以应坚持用药,不能急于求成。因此胃溃疡患者要保持良好的心态,学会释放工作和生活中的压力。

  • 防衣物掉色的小妙招(揭秘衣物掉色的原因)

    我们实验所选的商品材质都是以棉为主,“菲比多尔”玫红色女童连衣裙成分为100%棉。汗渍色牢度不合格的话,主要是夏天的衣服或贴身的内衣在穿着时会经汗液的浸渍而掉色。耐干摩擦色牢度不合格的衣物会在使用过程中,因为产品的不同部位受到的摩擦程度不同,掉色的程度不同。以上是我们从客服处索取到的“菲比多尔”服装检测报告,报告中显示衣服色牢度各项完全符合国家标准,所以我们也放心的将此连衣裙用于掉色实验。

  • 怎样保鲜菜类及水果(如何保鲜菜类及水果)

    以下内容大家不妨参考一二希望能帮到您!怎样保鲜菜类及水果番茄通风保鲜。把番茄放在阴凉的地方,并把袋子扎紧。吃完的蔬菜沙拉,用保鲜膜封口,上面再放一张白纸。买来草莓可用醋水泡着,等吃的时候记得用水再洗一下。把香蕉头部用保鲜膜封好,可以让香蕉保存更长久。茄子买来先别洗,用保鲜膜裹好就行。等吃的时候再洗它就行了。

  • 金桔和山楂能一起吃吗(金桔和山楂可以一起吃吗?)

    2、金桔性温,吃起来肉质脆嫩,味道酸甜多汁,其中还含有丰富的维生素、有机酸、金桔甙、蛋白质、矿物质等营养成分,是一种营养价值很高的水果;而山楂味酸甘,性微温,其中含有丰富的维生素C、脂肪酸、胡萝卜素、黄酮类成分,金桔和山楂在食物性质和营养成分上并无冲突,所以一般情况下,是可以一起吃的。

  • 明日之后升级七级庄园需要什么材料(简介明日之后升级七级庄园需要什么材料)

    明日之后升级七级庄园需要什么材料庄园升级的核心就是熟练度等级,有三种熟练度等级,达到对应等级就可以升级庄园。熟练度才是庄园的核心,等级也是庄园的核心,升级庄园的材料也可以提前准备,提前2天准备材料即可。商队任务有大量的熟练度,可以把每周的商队任务清理掉,获得大量的熟练度,用来升级。熟练度提升是非常多的,玩家可以刷任务,把所有的任务都做完,悬赏任务也做完。

  • 汽车踩油门有滋滋声怎么回事(汽车踩油门怎么会有滋滋声)

    在发动机右边大多数是发电机的皮带;左边的话是节气门的问题;有前轮刹车片可能大部分原因是由于刹车卡钳,小部分卡死影响到。可能是装上的时候把刹车卡钳的活塞压回去时候压坏了导致一定程度卡死。踩油门有异响滋滋声是变速箱要换机油了。油门,内燃机上控制燃料供量的装置。油门踏板又称加速踏板。是汽车燃料供给系的一部分。通过控制其踩踏量,来控制发动机节气门开度,控制进气量,电脑控制油量,从而控制发动机的转速。

  • 投资资金被骗(里全是骗子演大戏)

    2月4日,云南省昆明市公安局富民县分局在执法办案中心举行涉案财物发还仪式,将打击诈骗犯罪案侦中追缴的20万元涉案资金返还受害人。张女士送锦旗感谢民警。经审讯深挖,六名犯罪嫌疑人对犯罪事实供认不讳。经过对花某及其亲属的政策宣传和规劝工作,2020年12月21日,花某主动投案自首。花某到案后,民警积极动员花某及其家属退赔涉案资金,最终,花某家属主动退赔了诈骗张女士的20万元人民币。