
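A multithreaded Python program that scrapes, stores and validates proxy servers

Time: 2022-03-07 06:49:50

So I decided to rewrite it in Python, since Python supports multithreading.

I had not used Python for more than a year and had nearly forgotten much of its syntax and many of its language features. After three days of working on it in my spare time, the program is finally ready to share.

The source code is below. I hope experienced readers can help me improve it.

This is only the second program I have written since I started learning Python, so parts of it are surely less concise than they could be. Advice from experts is welcome.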

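Current features:

1. Automatically fetches proxy lists from 12 websites and saves them to a database.

2. Automatically verifies whether each proxy is usable, and stores the response time measured during verification as the basis for judging the proxy's speed.

3. Writes proxy information to separate files by category: verified, unverified, high-anonymity, ordinary anonymous, and transparent proxies.

4. Supports the output formats xml, htm, csv, txt and tab; the fields and layout of each file type can be customized.

5. Easy to extend: adding a new site to scrape only requires changing one global variable and adding two functions (the interface is documented in detail).

6. Uses sqlite as the database: small, convenient, simple, zero configuration, zero installation; you can carry it away in your back pocket.

7. Multithreaded fetching and multithreaded verification.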
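As a rough illustration of item 6, here is a minimal sketch of the storage side. It is written for current Python 3 and the standard `sqlite3` module rather than the Python 2.4 and pysqlite2 setup used in the listing, and the reduced table layout and the helper names `open_memory_db` and `add_proxy` are assumptions made for this example only. It registers a `unix_timestamp()` SQL function (SQLite has none built in) and inserts rows through placeholders.

```python
import sqlite3
import time


def my_unix_timestamp():
    # SQLite has no unix_timestamp() function, so we supply one ourselves.
    return int(time.time())


def open_memory_db():
    """Open an in-memory database with a reduced, illustrative proxier table."""
    db = sqlite3.connect(":memory:")
    # Register the Python function under the SQL name unix_timestamp (0 arguments).
    db.create_function("unix_timestamp", 0, my_unix_timestamp)
    db.execute(
        "CREATE TABLE proxier ("
        " ip TEXT PRIMARY KEY, port INTEGER NOT NULL,"
        " type INTEGER NOT NULL DEFAULT -1, active INTEGER,"
        " time_added INTEGER NOT NULL DEFAULT 0, area TEXT DEFAULT '--')"
    )
    return db


def add_proxy(db, ip, port, proxy_type=-1, area="--"):
    """Insert one proxy; return True if it was new, False if the ip already existed."""
    # Placeholders let sqlite3 quote the values, so no manual string cleaning is needed.
    cur = db.execute(
        "INSERT OR IGNORE INTO proxier (ip, port, type, time_added, area)"
        " VALUES (?, ?, ?, unix_timestamp(), ?)",
        (ip, int(port), proxy_type, area),
    )
    return cur.rowcount == 1
```

Because `ip` is the primary key, a second `add_proxy` call for the same address is skipped instead of raising an error, which plays the role of the bare `try/except` around the insert in the original program.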

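My runtime environment: Windows XP + Python v2.4. Other versions have not been tested.

Program download: click here (242 KB)

The code is commented in great detail, so even Python beginners should be able to follow it. The regular expressions used to scrape and parse the 12 websites are all annotated as well.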
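To give an idea of what an annotated regular expression looks like, here is a small sketch of a parser for a site numbered 14. The site, its HTML layout and the `TYPE_CODES` table are hypothetical and chosen only for illustration; the return shape `[ip, port, type, area]` and the type codes 2, 1, 0, -1 follow the interface described in the program's docstring. The sketch is written for current Python 3.

```python
import re

# re.VERBOSE lets the pattern span several lines and carry a comment per field.
PROXY_ROW = re.compile(r"""
    <td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*   # ip
    <td>(\d{2,5})</td>\s*                   # port
    <td>([^<]*)</td>\s*                     # type label
    <td>([^<]*)</td>                        # area
    """, re.VERBOSE)

# Assumed labels used by the hypothetical site, mapped to the program's type codes.
TYPE_CODES = {"high anonymity": 2, "anonymous": 1, "transparent": 0}


def parse_page_14(html=""):
    """Return a list of [ip, port, type, area] entries found in one page."""
    ret = []
    for ip, port, label, area in PROXY_ROW.findall(html):
        proxy_type = TYPE_CODES.get(label.strip().lower(), -1)  # -1: unknown type
        ret.append([ip, port, proxy_type, area.strip()])
    return ret
```

One thing to keep in mind with `re.VERBOSE`: unescaped spaces inside the pattern are ignored, so a literal space has to be written as `\s` or `\ `.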

# -*- coding: gb2312 -*-
# vi:ts=4:et
"""
The program can currently fetch proxy lists from the following websites:
/
http://www.pass-/
/
/
http://www.my-/
http://www.samair.ru/proxy/
/
http://proxylist.sakura.ne.jp/
/
/
/
/

Q: How do I add a new website of my own and have the program fetch it automatically?
A: Look at the definitions of the following functions in the source code.
   The number at the end of each function name starts at 1 and increases;
   it has currently reached 13.

    def build_list_urls_1(page=5):
    def parse_page_1(html=''):

    def build_list_urls_2(page=5):
    def parse_page_2(html=''):
    .......
    def build_list_urls_13(page=5):
    def parse_page_13(html=''):

   All you have to do is add the two functions build_list_urls_14 and parse_page_14.

   Suppose you want to fetch /somepath/showlist.asp?page=1 ... through
   /somepath/showlist.asp?page=8, 8 pages in total. Then build_list_urls_14
   should be defined as follows. Set the default value of the page parameter
   to the number of pages you want to fetch (8), so that all 8 pages are fetched.

    def build_list_urls_14(page=8):
        .....
        return [    # a one-dimensional list; each element is the absolute URL of a page to fetch
            '/somepath/showlist.asp?page=1',
            '/somepath/showlist.asp?page=2',
            '/somepath/showlist.asp?page=3',
            ....
            '/somepath/showlist.asp?page=8'
        ]

   Next, write a function parse_page_14(html='') that analyses the html of the
   pages returned by the function above and extracts the proxy addresses.
   Note: the program calls this function once for every page returned by
   build_list_urls_14; the html argument is the html text of that page.

   ip:   must be a numeric IP in xxx.xxx.xxx.xxx form, not any other format
   port: must be a number of 2 to 5 digits
   type: must be one of the numbers 2, 1, 0, -1, which stand for the proxy type:
         2: high-anonymity proxy  1: ordinary anonymous proxy
         0: transparent proxy    -1: type cannot be determined
   area: country or region of the proxy, must be converted to utf8 encoding

    def parse_page_14(html=''):
        ....
        return [
            [ip,port,type,area],
            [ip,port,type,area],
            .....
            [ip,port,type,area]
        ]

   Finally, and most importantly: increase the global variable web_site_count by 1
    web_site_count=14

Q: I followed the instructions above and added one custom site. How do I add another?
A: Now that you know how to add build_list_urls_14 and parse_page_14,
   add these two functions in the same way
    def build_list_urls_15(page=5):
    def parse_page_15(html=''):
   and update the global variable web_site_count=15
"""

import urllib,time,random,re,threading,string

web_site_count=13   # number of websites to fetch from
day_keep=2          # delete invalid proxies kept in the database for more than day_keep days
indebug=1

thread_num=100                      # start thread_num threads to check proxies
check_in_one_call=thread_num*10     # maximum number of proxies checked in one run

skip_check_in_hour=1    # do not verify the same proxy again within skip_check_in_hour hours
skip_get_in_hour=8      # minimum interval (hours) between two fetches of new proxies

proxy_array=[]          # proxies waiting to be added to the database
update_array=[]         # check results waiting to be written to the database

db=None                 # global database object
conn=None
dbfile='proxier.db'     # database file name

target_url="/"          # URL requested through the proxy during verification
target_string="030173"  # if the returned html contains this string,
target_timeout=30       # and the response time is below target_timeout seconds,
                        # the proxy is considered valid

# File format of the exported proxy data; leave this variable empty to skip the export
output_type=''
output_type='xml'       # the following formats are available, default is xml
                        # xml
                        # htm
                        # tab   tab separated, Excel compatible
                        # csv   comma separated, Excel compatible
                        # txt   xxx.xxx.xxx.xxx:xx format

# Output file names; make sure this list has six elements
output_filename=[
    'uncheck',          # proxies that have not been checked yet
    'checkfail',        # proxies that were checked and marked invalid
    'ok_high_anon',     # valid high-anonymity proxies, sorted by speed, fastest first
    'ok_anonymous',     # valid ordinary anonymous proxies, sorted by speed, fastest first
    'ok_transparent',   # valid transparent proxies, sorted by speed, fastest first
    'ok_other'          # valid proxies of unknown type, sorted by speed
]

# Format of the output data. Supported columns:
# _ip_ , _port_ , _type_ , _status_ , _active_ ,
# _time_added_, _time_checked_ ,_time_used_ , _speed_, _area_

output_head_string=''   # header string of the output file
output_format=''        # format of each data line
output_foot_string=''   # footer string of the output file

if output_type=='xml':
    output_head_string="<?xml version='1.0' encoding='gb2312'?><proxylist>\n"
    output_format="""<item>
    <ip>_ip_</ip>
    <port>_port_</port>
    <speed>_speed_</speed>
    <last_check>_time_checked_</last_check>
    <area>_area_</area>
</item>
"""
    output_foot_string="</proxylist>"
elif output_type=='htm':
    output_head_string="""<table border=1 width='100%'>
<tr><td>Proxy</td><td>Last checked</td><td>Speed</td><td>Area</td></tr>
"""
    output_format="""<tr>
<td>_ip_:_port_</td><td>_time_checked_</td><td>_speed_</td><td>_area_</td>
</tr>
"""
    output_foot_string="</table>"
else:
    output_head_string=''
    output_foot_string=''

if output_type=="csv":
    output_format="_ip_, _port_, _type_, _speed_, _time_checked_, _area_\n"

if output_type=="tab":
    output_format="_ip_\t_port_\t_speed_\t_time_checked_\t_area_\n"

if output_type=="txt":
    output_format="_ip_:_port_\n"


# Write the output files
def output_file():
    global output_filename,output_head_string,output_foot_string,output_type
    if output_type=='':
        return
    fnum=len(output_filename)
    content=[]
    for i in range(fnum):
        content.append([output_head_string])

    conn.execute("select * from `proxier` order by `active`,`type`,`speed` asc")
    rs=conn.fetchall()

    for item in rs:
        type,active=item[2],item[4]
        if active is None:
            content[0].append(formatline(item))     # not checked
        elif active==0:
            content[1].append(formatline(item))     # invalid proxy
        elif active==1 and type==2:
            content[2].append(formatline(item))     # high anonymity
        elif active==1 and type==1:
            content[3].append(formatline(item))     # ordinary anonymous
        elif active==1 and type==0:
            content[4].append(formatline(item))     # transparent
        elif active==1 and type==-1:
            content[5].append(formatline(item))     # unknown type
        else:
            pass

    for i in range(fnum):
        content[i].append(output_foot_string)
        f=open(output_filename[i]+"."+output_type,'w')
        f.write(string.join(content[i],''))
        f.close()

# Format one record for output
def formatline(item):
    global output_format
    arr=['_ip_','_port_','_type_','_status_','_active_',
         '_time_added_','_time_checked_','_time_used_',
         '_speed_','_area_']
    s=output_format
    for i in range(len(arr)):
        s=string.replace(s,arr[i],str(formatitem(item[i],i)))
    return s

# Each database field needs some processing: Chinese text must be encoded,
# date fields must be converted
def formatitem(value,colnum):
    global output_type
    if (colnum==9):
        value=value.encode('cp936')
    elif value is None:
        value=''

    if colnum==5 or colnum==6 or colnum==7:     # time_xxxed
        value=string.atof(value)
        if value<1:
            value=''
        else:
            value=formattime(value)

    if value=='' and output_type=='htm':
        value='&nbsp;'  # keep empty table cells visible (entity restored; it was lost in the listing)
    return value


def check_one_proxy(ip,port):
    global update_array
    global check_in_one_call
    global target_url,target_string,target_timeout

    url=target_url
    checkstr=target_string
    timeout=target_timeout
    ip=string.strip(ip)
    proxy=ip+':'+str(port)
    proxies = {'http': 'http://'+proxy+'/'}
    opener = urllib.FancyURLopener(proxies)
    opener.addheaders = [
        ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')
    ]
    t1=time.time()

    # append a random parameter so that cached responses are not returned
    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())

    try:
        f = opener.open(url)
        s= f.read()
        pos=s.find(checkstr)
    except:
        pos=-1
        pass
    t2=time.time()
    timeused=t2-t1
    if (timeused<timeout and pos>=0):   # find() returns -1 when the string is absent
        active=1
    else:
        active=0
    update_array.append([ip,port,active,timeused])
    print len(update_array),' of ',check_in_one_call," ",ip,':',port,'--',int(timeused)


def get_html(url=''):
    opener = urllib.FancyURLopener({})      # do not use a proxy
    # www.my- needs the Cookie below to be fetched correctly
    opener.addheaders = [
        ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)'),
        ('Cookie','permission=1')
    ]
    t=time.time()
    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())
    try:
        f = opener.open(url)
        return f.read()
    except:
        return ''


################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_1(page=5):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('/page%(num)01d.html'%{'num':i})
    return ret

def parse_page_1(html=''):
    matches=re.findall(r'''
        <td>([\d\.]+)<\/td>[\s\n\r]*    #ip
        <td>([\d]+)<\/td>[\s\n\r]*      #port
        <td>([^\<]*)<\/td>[\s\n\r]*     #type
        <td>([^\<]*)<\/td>              #area
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='anonymous'):
            type=1
        elif (type=='high anonymity'):
            type=2
        elif (type=='transparent'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '1',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_2(page=1):
    return ['/ProxyList/fresh-proxy-list.shtml']

def parse_page_2(html=''):
    # in VERBOSE mode an unescaped space is ignored, hence Elite\sProxy
    matches=re.findall(r'''
        ((?:[\d]{1,3}\.){3}[\d]{1,3})\:([\d]+)      #ip:port
        \s+(Anonymous|Elite\sProxy)[+\s]+           #type
        (.+)\r?\n                                   #area
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='Anonymous'):
            type=1
        else:
            type=2
        ret.append([ip,port,type,area])
        if indebug:print '2',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_3(page=15):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.samair.ru/proxy/proxy-%(num)02d.htm'%{'num':i})
    return ret

def parse_page_3(html=''):
    matches=re.findall(r'''
        <tr><td><span\sclass\="\w+">(\d{1,3})<\/span>\.    #ip(part1)
        <span\sclass\="\w+">
        (\d{1,3})<\/span>                                   #ip(part2)
        (\.\d{1,3}\.\d{1,3})                                #ip(part3,part4)
        \:\r?\n(\d{2,5})<\/td>                              #port
        <td>([^<]+)</td>                                    #type
        <td>[^<]+<\/td>
        <td>([^<]+)<\/td>                                   #area
        <\/tr>''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]+"."+match[1]+match[2]
        port=match[3]
        type=match[4]
        area=match[5]
        if (type=='anonymous proxy server'):
            type=1
        elif (type=='high-anonymous proxy server'):
            type=2
        elif (type=='transparent proxy'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '3',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_4(page=3):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.pass-/proxy/index.php?page=%(n)01d'%{'n':i})
    return ret

def parse_page_4(html=''):
    matches=re.findall(r"""
        list\('(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'       #ip
        \,'(\d{2,5})'                                       #port
        \,'(\d)'                                            #type
        \,'([^']+)'\)                                       #area
        \;\r?\n""",html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='1'):     # for the meaning of type, see the javascript part of the fetched page
            type=1
        elif (type=='3'):
            type=2
        elif (type=='2'):
            type=0
        else:
            type=-1
        if indebug:print '4',ip,port,type,area
        area=unicode(area, 'cp936')
        area=area.encode('utf8')
        ret.append([ip,port,type,area])
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_5(page=12):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('/index2.asp?page=%(num)01d'%{'num':i})
    return ret

def parse_page_5(html=''):
    matches=re.findall(r"<font color=black>([^<]*)</font>",html)
    ret=[]
    # the cells come in groups of three: area, ip, port
    for index, match in enumerate(matches):
        if (index%3==0):
            ip=matches[index+1]
            port=matches[index+2]
            type=-1     # this site does not give the proxy type
            if indebug:print '5',ip,port,type,match
            area=unicode(match, 'cp936')
            area=area.encode('utf8')
            ret.append([ip,port,type,area])
        else:
            continue
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_6(page=3):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('/proxy%(num)01d.html'%{'num':i})
    return ret

def parse_page_6(html=''):
    # note: the source listing has two zero-width characters (U+200C, U+200D)
    # in front of the colon on the port line; they are kept as they appear
    matches=re.findall(r'''<tr>
        <td>([^&]+)                     #ip
        ‌‍\:([^<]+)                       #port
        </td>
        <td>HTTP</td>
        <td>[^<]+</td>
        <td>([^<]+)</td>                #area
        </tr>''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=-1         # this site does not give the proxy type
        area=match[2]
        if indebug:print '6',ip,port,type,area
        area=unicode(area, 'cp936')
        area=area.encode('utf8')
        ret.append([ip,port,type,area])
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_7(page=1):
    return ['/http_highanon.txt']

def parse_page_7(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=2
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '7',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_8(page=1):
    return ['/http.txt']

def parse_page_8(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=-1
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '8',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_9(page=6):
    page=page+1
    ret=[]
    for i in range(0,page):
        ret.append('http://proxylist.sakura.ne.jp/index.htm?pages=%(n)01d'%{'n':i})
    return ret

def parse_page_9(html=''):
    matches=re.findall(r'''
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})    #ip
        \:(\d{2,5})                             #port
        <\/TD>[\s\r\n]*
        <TD>([^<]+)</TD>                        #area
        [\s\r\n]*
        <TD>([^<]+)</TD>                        #type
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[3]
        area=match[2]
        if (type=='Anonymous'):
            type=1
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '9',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_10(page=5):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('/page%(n)01d.html'%{'n':i})
    return ret

def parse_page_10(html=''):
    matches=re.findall(r'''
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})    #ip
        <\/td>[\s\r\n]*
        <td[^>]+>(\d{2,5})<\/td>                #port
        [\s\r\n]*
        <td>([^<]+)<\/td>                       #type
        [\s\r\n]*
        <td>([^<]+)<\/td>                       #area
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='high anonymity'):
            type=2
        elif (type=='anonymous'):
            type=1
        elif (type=='transparent'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '10',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_11(page=10):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.my-/list/proxy.php?list=%(n)01d'%{'n':i})

    ret.append('http://www.my-/list/proxy.php?list=s1')
    ret.append('http://www.my-/list/proxy.php?list=s2')
    ret.append('http://www.my-/list/proxy.php?list=s3')
    return ret

def parse_page_11(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]

    # the proxy level is stated once per page
    if (html.find('(Level 1)')>0):
        type=2
    elif (html.find('(Level 2)')>0):
        type=1
    elif (html.find('(Level 3)')>0):
        type=0
    else:
        type=-1

    for match in matches:
        ip=match[0]
        port=match[1]
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '11',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_12(page=4):
    ret=[]
    ret.append('/plr4.html')
    ret.append('/pla4.html')
    ret.append('/pld4.html')
    ret.append('/pls4.html')
    return ret

def parse_page_12(html=''):
    matches=re.findall(r'''
        onMouseOver\="s\(\'(\w\w)\'\)"              #area
        \sonMouseOut\="d\(\)"\s?c?l?a?s?s?\=?"?
        (\w?)                                       #type
        "?>
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})        #ip
        \:(\d{2,5})                                 #port
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[2]
        port=match[3]
        area=match[0]
        type=match[1]
        if (type=='A'):
            type=2
        elif (type=='B'):
            type=1
        else:
            type=0
        ret.append([ip,port,type,area])
        if indebug:print '12',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

def build_list_urls_13(page=3):
    url='/'
    html=get_html(url)
    matchs=re.findall(r"""href\='([^']+)'>(?:high_anonymous|anonymous|transparent)\sproxy\slist<\/a>""",html,re.VERBOSE)
    return map(lambda x: url+x, matchs)

def parse_page_13(html=''):
    html_matches=re.findall(r"eval\(unescape\('([^']+)'\)",html)
    matches=[]      # stays empty when the page has no escaped block
    if (len(html_matches)>0):
        conent=urllib.unquote(html_matches[0])
        matches=re.findall(r"""<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})<\/td><td>(\d{2,5})<\/td><\/tr>""",conent,re.VERBOSE)
    ret=[]
    if (html.find('<title>Checked Proxy Lists - proxylist_high_anonymous_')>0):
        type=2
    elif (html.find('<title>Checked Proxy Lists - proxylist_anonymous_')>0):
        type=1
    elif (html.find('<title>Checked Proxy Lists - proxylist_transparent_')>0):
        type=0
    else:
        type=-1

    for match in matches:
        ip=match[0]
        port=match[1]
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '13',ip,port,type,area
    return ret

################################################################################
# by Go_Rush(阿舜) from /
################################################################################

# Thread class
class TEST(threading.Thread):
    def __init__(self,action,index=None,checklist=None):
        threading.Thread.__init__(self)
        self.index =index
        self.action=action
        self.checklist=checklist

    def run(self):
        if (self.action=='getproxy'):
            get_proxy_one_website(self.index)
        else:
            check_proxy(self.index,self.checklist)


def check_proxy(index,checklist=[]):
    for item in checklist:
        check_one_proxy(item[0],item[1])


def patch_check_proxy(threadCount,action=''):
    global check_in_one_call,skip_check_in_hour,conn
    threads=[]
    if (action=='checknew'):        # check all newly added proxies that were never checked
        orderby=' `time_added` desc '
        strwhere=' `active` is null '
    elif (action=='checkok'):       # re-check proxies that passed verification before
        orderby=' `time_checked` asc '
        strwhere=' `active`=1 '
    elif (action=='checkfail'):     # re-check proxies that failed verification before
        orderby=' `time_checked` asc '
        strwhere=' `active`=0 '
    else:                           # check all proxies
        orderby=' `time_checked` asc '
        strwhere=' 1=1 '
    sql="""
        select `ip`,`port` FROM `proxier` where
            `time_checked` < (unix_timestamp()-%(skip_time)01s)
            and %(strwhere)01s
            order by %(order)01s
            limit %(num)01d
        """%{'num':check_in_one_call,
             'strwhere':strwhere,
             'order':orderby,
             'skip_time':skip_check_in_hour*3600}
    conn.execute(sql)
    rows = conn.fetchall()

    check_in_one_call=len(rows)

    # compute how many proxies each thread will check
    if len(rows)>=threadCount:
        num_in_one_thread=len(rows)/threadCount
    else:
        num_in_one_thread=1

    threadCount=threadCount+1
    print "Now verifying the following proxy servers....."
    for index in range(1,threadCount):
        # assign each thread its checklist, and leave the remainder to the last thread
        checklist=rows[(index-1)*num_in_one_thread:index*num_in_one_thread]
        if (index+1==threadCount):
            checklist=rows[(index-1)*num_in_one_thread:]

        t=TEST(action,index,checklist)
        t.setDaemon(True)
        t.start()
        threads.append((t))
    for thread in threads:
        thread.join(60)
    update_proxies()    # write all check results to the database


def get_proxy_one_website(index):
    global proxy_array
    func='build_list_urls_'+str(index)
    parse_func=eval('parse_page_'+str(index))
    urls=eval(func+'()')
    for url in urls:
        html=get_html(url)
        print url
        proxylist=parse_func(html)
        for proxy in proxylist:
            ip=string.strip(proxy[0])
            port=string.strip(proxy[1])
            if (re.compile("^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$").search(ip)):
                type=str(proxy[2])
                area=string.strip(proxy[3])
                proxy_array.append([ip,port,type,area])


def get_all_proxies():
    global web_site_count,conn,skip_get_in_hour

    # check when proxies were last added, to avoid fetching repeatedly within a short time
    rs=conn.execute("select max(`time_added`) from `proxier` limit 1")
    last_add=rs.fetchone()[0]
    if (last_add and my_unix_timestamp()-last_add<skip_get_in_hour*3600):
        print """
Fetching of proxy lists skipped!
The most recent fetch took place at: %(t)1s
That is less than the minimum fetch interval of %(n)1d hours ago.
If you really want to fetch proxies now, change the global variable skip_get_in_hour
"""%{'t':formattime(last_add),'n':skip_get_in_hour}
        return

    print "Now fetching proxy lists from the following "+str(web_site_count)+" websites...."
    threads=[]
    count=web_site_count+1
    for index in range(1,count):
        t=TEST('getproxy',index)
        t.setDaemon(True)
        t.start()
        threads.append((t))
    for thread in threads:
        thread.join(60)
    add_proxies_to_db()


def add_proxies_to_db():
    global proxy_array
    count=len(proxy_array)
    for i in range(count):
        item=proxy_array[i]
        sql="""insert into `proxier` (`ip`,`port`,`type`,`time_added`,`area`) values('
"""+item[0]+"',"+item[1]+","+item[2]+",unix_timestamp(),'"+clean_string(item[3])+"')"
        try:
            conn.execute(sql)
            # %% prints a literal percent sign
            print "%(num)2.1f%%\t"%{'num':100*(i+1)/count},item[0],":",item[1]
        except:
            pass    # e.g. the ip is already in the database (primary key)


def update_proxies():
    global update_array
    for item in update_array:
        sql='''
            update `proxier` set `time_checked`=unix_timestamp(),
                `active`=%(active)01d,
                `speed`=%(speed)02.3f
            where `ip`='%(ip)01s' and `port`=%(port)01d
            '''%{'active':item[2],'speed':item[3],'ip':item[0],'port':item[1]}
        try:
            conn.execute(sql)
        except:
            pass

# sqlite has no unix_timestamp function, so we have to implement it ourselves
def my_unix_timestamp():
    return int(time.time())

def clean_string(s):
    tmp=re.sub(r"['\,\s\\\/]", ' ', s)
    return re.sub(r"\s+", ' ', tmp)

def formattime(t):
    # show the time in UTC+8
    return time.strftime('%c',time.gmtime(t+8*3600))


def open_database():
    global db,conn,day_keep,dbfile

    try:
        from pysqlite2 import dbapi2 as sqlite
    except:
        print """
This program uses sqlite as its database, and needs pysqlite to run.
To access sqlite from python, download the pysqlite module (272kb) from
/tracker/pysqlite/wiki/pysqlite#Downloads
Download (Windows binaries for Python 2.x)
"""
        raise SystemExit

    try:
        db = sqlite.connect(dbfile,isolation_level=None)
        db.create_function("unix_timestamp", 0, my_unix_timestamp)
        conn = db.cursor()
    except:
        print "Could not open the sqlite database; make sure the script directory is writable"
        raise SystemExit

    sql="""
    /* ip: only proxies given as a plain ip address (xxx.xxx.xxx.xxx) */
    /* type: proxy type 2: high anonymity 1: ordinary anonymous 0: transparent -1: unknown */
    /* status: not used by the program yet, reserved for later extension */
    /* active: whether the proxy is usable 1: usable 0: not usable */
    /* speed: response time of the request; the smaller, the faster */

    CREATE TABLE IF NOT EXISTS `proxier` (
        `ip` varchar(15) NOT NULL default '',
        `port` int(6) NOT NULL default '0',
        `type` int(11) NOT NULL default '-1',
        `status` int(11) default '0',
        `active` int(11) default NULL,
        `time_added` int(11) NOT NULL default '0',
        `time_checked` int(11) default '0',
        `time_used` int(11) default '0',
        `speed` float default NULL,
        `area` varchar(120) default '--',   /* location of the proxy server */
        PRIMARY KEY (`ip`)
    );
    /*
    CREATE INDEX IF NOT EXISTS `type` ON proxier(`type`);
    CREATE INDEX IF NOT EXISTS `time_used` ON proxier(`time_used`);
    CREATE INDEX IF NOT EXISTS `speed` ON proxier(`speed`);
    CREATE INDEX IF NOT EXISTS `active` ON proxier(`active`);
    */
    PRAGMA encoding = "utf-8";      /* the database is stored in utf-8 */
    """
    conn.executescript(sql)
    conn.execute("""DELETE FROM `proxier`
                    where `time_added`< (unix_timestamp()-?)
                    and `active`=0""",(day_keep*86400,))

    conn.execute("select count(`ip`) from `proxier`")
    m1=conn.fetchone()[0]
    if m1 is None:
        return

    conn.execute("""select count(`time_checked`)
                    from `proxier` where `time_checked`>0""")
    m2=conn.fetchone()[0]

    if m2==0:
        m3,m4,m5=0,"not checked yet","not checked yet"
    else:
        conn.execute("select count(`active`) from `proxier` where `active`=1")
        m3=conn.fetchone()[0]
        conn.execute("""select max(`time_checked`), min(`time_checked`)
                        from `proxier` where `time_checked`>0 limit 1""")
        rs=conn.fetchone()
        m4,m5=rs[0],rs[1]
        m4=formattime(m4)
        m5=formattime(m5)
    print """
%(m1)1d proxies in total, %(m2)1d of them have been verified, %(m3)1d verified as valid.
Most recent check: %(m4)1s
Oldest check: %(m5)1s
Hint: proxies last checked more than 24 hours ago should be verified again
"""%{'m1':m1,'m2':m2,'m3':m3,'m4':m4,'m5':m5}


def close_database():
    global db,conn
    conn.close()
    db.close()
    conn=None
    db=None


if __name__ == '__main__':
    open_database()
    get_all_proxies()
    patch_check_proxy(thread_num)
    output_file()
    close_database()
    print "All work is done"
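The way patch_check_proxy divides the rows among threads (equal slices, with the remainder going to the last thread) can be sketched on its own. The version below is only an illustration for current Python 3, not the code of the program: `split_checklist`, `check_all` and the `check_one` callback are names made up for this example, and the real network check is replaced by whatever function the caller passes in. A lock around the shared result list is added here as a precaution.

```python
import threading


def split_checklist(rows, thread_count):
    """Split rows into at most thread_count slices; the last slice takes the remainder."""
    if not rows:
        return []
    size = max(1, len(rows) // thread_count)   # rows per thread, at least one
    count = min(thread_count, len(rows))       # never start more threads than rows
    chunks = [rows[i * size:(i + 1) * size] for i in range(count - 1)]
    chunks.append(rows[(count - 1) * size:])
    return chunks


def check_all(rows, check_one, thread_count=4, join_timeout=60):
    """Run check_one(ip, port) for every row on worker threads and collect the results."""
    results = []
    lock = threading.Lock()

    def worker(chunk):
        for ip, port in chunk:
            outcome = check_one(ip, port)
            with lock:                          # protect the shared list
                results.append((ip, port, outcome))

    threads = [threading.Thread(target=worker, args=(chunk,), daemon=True)
               for chunk in split_checklist(rows, thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(join_timeout)                    # same 60 second wait as in the program
    return results
```

For example, `check_all(rows, lambda ip, port: port % 2 == 0, thread_count=3)` runs a dummy check on three threads; in the real program the callback would be the proxy test itself.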
