OK, so a while ago, out of boredom, I wrote a post about automatically fetching free proxies (link).
Now that we have a list of free proxies, there is a lot we can do with it: for example, crawl a site without the risk of getting our IP banned, or drive some extra traffic to a site. Below is the first version; it is only meant to offer a rough idea. I originally implemented this with urllib2, but the problem was that it could not keep the session alive, which means it could not really behave like a browser, so I went looking for a new approach.
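What makes the difference is that a PhantomJS session driven through Selenium keeps its cookies and JavaScript state between page loads, much like a real browser does. A minimal sketch, assuming phantomjs is installed and on your PATH:

from selenium import webdriver

driver = webdriver.PhantomJS()            # headless browser with its own persistent session
driver.get("http://www.503error.com/")
print driver.get_cookies()                # cookies set by the first page load...
driver.get("http://www.503error.com/")    # ...are reused by the next request in the same session
driver.quit()

Here is the full first version: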
#coding:utf-8
#maple_m@hotmail.com
#www.503error.com
#Zhang xiaoMing

import urllib2
import cookielib
import time
import json
import random

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class Spide:
    def __init__(self, proxy_ip, proxy_type, proxy_port, use_proxy=False):
        print 'using the proxy info :', proxy_ip
        self.proxy_ip = proxy_ip
        self.proxy_type = proxy_type
        self.proxy_port = proxy_port
        self.proxy = urllib2.ProxyHandler({proxy_type: proxy_ip + ":" + proxy_port})
        self.usercode = ""
        self.userid = ""
        # keep cookies across urllib2 requests so the session survives
        self.cj = cookielib.LWPCookieJar()
        if use_proxy:
            # route urllib2 traffic through the proxy while keeping the cookie jar
            self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj), self.proxy)
        else:
            self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        urllib2.install_opener(self.opener)

    def add_view(self):
        """Open the homepage with PhantomJS through the proxy, then visit one random post."""
        print '--->start adding view'
        print '--->proxy info', self.proxy_ip
        # hand the proxy settings straight to the PhantomJS process
        service_args = [
            '--proxy=' + self.proxy_ip + ':' + self.proxy_port,
            '--proxy-type=' + self.proxy_type,
        ]
        # spoof a desktop browser user agent
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
            "(KHTML, like Gecko) Chrome/15.0.87"
        )
        driver = webdriver.PhantomJS(
            executable_path='/home/zhangzhiming/git/phantomjs/bin/phantomjs',
            service_args=service_args,
            desired_capabilities=dcap)
        driver.set_page_load_timeout(90)
        driver.get("http://www.503error.com/")
        # collect the post titles on the homepage and follow one of them at random
        soup = BeautifulSoup(driver.page_source, 'xml')
        titles = soup.find_all('h1', {'class': 'entry-title'})
        ranCount = random.randint(0, len(titles) - 1)
        print 'random find a link of the website to access , random is :', ranCount
        randomlink = titles[ranCount].a.attrs['href']
        driver.get(randomlink)
        driver.quit()  # quit (not close) so the PhantomJS process does not linger
        print 'finish once'

    def get_proxy(self):
        """Fetch a single proxy (protocol/ip/port) from the first free-proxy site."""
        proxy_info_json = ""
        print '-->using the ip ' + self.proxy_ip + ' to get the proxy info'
        try:
            reqRequest_proxy = urllib2.Request('url2')
            reqRequest_proxy.add_header('Accept', '*/*')
            reqRequest_proxy.add_header('Accept-Language', 'zh-CN,zh;q=0.8')
            reqRequest_proxy.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36')
            reqRequest_proxy.add_header('Content-Type', 'application/x-www-form-urlencoded')
            proxy_info = urllib2.urlopen(reqRequest_proxy).read()
            print proxy_info
            proxy_info_json = json.loads(proxy_info)
        except Exception, e:
            print 'proxy has a problem'
        return proxy_info_json

    def get_proxys100(self):
        """Fetch a list of proxies from the second free-proxy site."""
        proxy_info_json = []  # empty list if the request fails, so the caller can still iterate
        print '-->using the ip ' + self.proxy_ip + ' to get the proxy list'
        try:
            reqRequest_proxy = urllib2.Request('url1')
            reqRequest_proxy.add_header('Accept', '*/*')
            reqRequest_proxy.add_header('Accept-Language', 'zh-CN,zh;q=0.8')
            reqRequest_proxy.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36')
            reqRequest_proxy.add_header('Content-Type', 'application/x-www-form-urlencoded')
            proxy_info = urllib2.urlopen(reqRequest_proxy).read()
            proxy_info_json = json.loads(proxy_info)
        except Exception, e:
            print 'proxy has a problem'
        return proxy_info_json


if __name__ == "__main__":
    # first round: fetch the proxy list without using a proxy
    print 'START ADDING VIEW:'
    print 'Getting the new proxy info for the first time'
    print '----------------------------------------------------------------'
    for count in range(1):
        test = Spide(proxy_ip='youproxyip', proxy_type='http', proxy_port='3128', use_proxy=False)
        proxy_list = test.get_proxy()
        print '->this is the :', count
        print '->getting the new proxy info:'
        print '->using the proxy to get the proxy list in case we get banned'
        print '->proxy info', proxy_list
        proxy100 = test.get_proxys100()
        for proxy1 in proxy100:
            try:
                print 'proxy1:', proxy1
                Spide1 = Spide(proxy_ip=proxy1['ip'], proxy_type=proxy1['type'], proxy_port=proxy1['port'], use_proxy=True)
                print 'before add view'
                Spide1.add_view()
                print '->sleep 15 s'
                time.sleep(15)
                # sleep a random extra time so the visits look less automated
                ranTime = random.randint(10, 50)
                print '->sleep random time:', ranTime
                time.sleep(ranTime)
                print '-> getting new proxy '
                #proxy_list = Spide1.get_proxy()
            except Exception, e:
                print '->something wrong, trying the next proxy'
A few quick notes:
The whole flow is: 1 get a proxy -> 2 visit the homepage -> 3 grab the list of blog posts on the homepage and open one at random -> 4 wait a random number of seconds -> go back to step 1.
1: Change youproxyip to a proxy IP you already own, or leave it as it is, because use_proxy=False is passed for that first Spide instance; in that case just make sure you can reach the two proxy-list sites used in the code without a proxy (a quick check is sketched right after these notes).
2: /home/zhangzhiming/git/phantomjs/bin/phantomjs is the path where phantomjs is installed on my machine; change it to wherever your phantomjs binary lives (a path-lookup sketch also follows the notes).
3: The code contains two methods for fetching proxies, and the example only uses one of them. Don't flame me for the loop that only runs once; this version originally had an outer loop (a sketch of restoring it follows these notes as well).
4: I am not going to publish the free-proxy URLs here; leave a comment if you need them (url1 and url2 in the code stand in for the hidden free-proxy sites).
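For note 1, a quick sanity check, as a minimal sketch only: it assumes url1 and url2 are replaced with the real free-proxy endpoints and simply verifies they are reachable without a proxy before the loop starts.

import urllib2

for url in ['url1', 'url2']:   # placeholders, same as in the script above
    try:
        urllib2.urlopen(url, timeout=10).read()
        print url, 'is reachable without a proxy'
    except Exception, e:
        print url, 'is NOT reachable:', e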
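For note 2, instead of hard-coding the path you can first try to resolve the phantomjs binary from PATH and only fall back to a fixed location. A small sketch, assuming phantomjs may be on PATH:

from distutils.spawn import find_executable
from selenium import webdriver

# use the binary found on PATH if there is one, otherwise fall back to the fixed path
phantomjs_path = find_executable('phantomjs') or '/home/zhangzhiming/git/phantomjs/bin/phantomjs'
driver = webdriver.PhantomJS(executable_path=phantomjs_path)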
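For note 3, this is roughly how the outer loop can be put back. It is only a sketch: the 100 rounds and the sleep range are my own placeholder values, not the original ones.

if __name__ == "__main__":
    for count in range(100):  # outer loop: many rounds instead of the single pass above (assumed value)
        test = Spide(proxy_ip='youproxyip', proxy_type='http', proxy_port='3128', use_proxy=False)
        proxy100 = test.get_proxys100()          # refresh the proxy list every round
        for proxy1 in proxy100:
            try:
                Spide1 = Spide(proxy_ip=proxy1['ip'], proxy_type=proxy1['type'],
                               proxy_port=proxy1['port'], use_proxy=True)
                Spide1.add_view()                # one homepage visit plus one random post
                time.sleep(random.randint(10, 50))  # random pause between visits
            except Exception, e:
                print '->something wrong, trying the next proxy'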