Scraping novels from Qidian (起点中文网) with a Python crawler
Hello everyone! This article walks through building a Python crawler that scrapes novels from Qidian (起点中文网), a site owned by China Literature (阅文集团). The idea comes from a project of mine: the desktop assistant 启帆助手.
Preparation
Libraries used:
- urllib.request (standard library)
- lxml.etree (third-party; install with pip install lxml)
Code walkthrough
Step 1: import the libraries we need

```python
from urllib import request
from lxml import etree
```
Step 2: set the request headers and the novel's URL (using one of the author's own books as an example)

```python
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"  # urllib needs a full URL; a bare path like "/info/1020546097" raises ValueError
```
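Constructing a `Request` does no network I/O, so you can sanity-check the header locally before crawling. This is a minimal sketch; the book URL here is assumed from Qidian's usual URL scheme.

```python
from urllib import request

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# Building the Request object parses the URL and stores the headers,
# but does not open a connection, so this runs offline.
req = request.Request("https://book.qidian.com/info/1020546097", headers=header)
# urllib normalizes header names to "Capitalized-lowercase" form
print(req.get_header('User-agent'))
```

Note that urllib stores header keys capitalized, so the lookup key is `'User-agent'`, not `'User-Agent'`.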
Step 3: fetch the catalogue page and parse out each chapter's title and link

```python
req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links
# print(Lit_tit_list)
# print(Lit_href_list)
```
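You can try the two XPath expressions offline on a hand-written snippet that mimics the catalogue markup (the chapter names and hrefs below are made up for illustration):

```python
from lxml import etree

# Minimal stand-in for Qidian's chapter list markup
sample = '''
<ul class="cf">
  <li><a href="//read.qidian.com/chapter/abc">Chapter 1</a></li>
  <li><a href="//read.qidian.com/chapter/def">Chapter 2</a></li>
</ul>
'''
html = etree.HTML(sample)
titles = html.xpath('//ul[@class="cf"]/li/a/text()')  # text nodes of each <a>
hrefs = html.xpath('//ul[@class="cf"]/li/a/@href')    # href attribute of each <a>
print(titles)
print(hrefs)
```

The hrefs come back protocol-relative (starting with `//`), which is why the next step prepends `"http:"`.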
Step 4: fetch each chapter's text and save it to disk as a .txt file

```python
for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src  # hrefs are protocol-relative ("//...")
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    # 'a' appends, so re-running the script duplicates content; use 'w' to overwrite
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
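One caveat: the chapter title is used directly as a filename, and titles can contain characters that are invalid in Windows filenames (`?`, `:`, `"`, etc.). A defensive helper, not part of the original script, might look like this (`safe_filename` is a made-up name):

```python
import re

def safe_filename(title: str) -> str:
    # Replace the characters Windows forbids in filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('第1章 他是谁?'))
```

You would then build the filename as `safe_filename(tit) + ".txt"` inside the loop.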
Full code

```python
from urllib import request
from lxml import etree

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"  # urllib needs a full URL, not a bare path

req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links
# print(Lit_tit_list)
# print(Lit_href_list)

for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src  # hrefs are protocol-relative ("//...")
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
```
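The loop above fires requests back-to-back, which can get the crawler rate-limited or blocked. A small pacing helper (a sketch; the name `paced` and the delay value are made up) keeps a fixed gap between requests:

```python
import time

DELAY_SECONDS = 0.5  # illustrative value; tune to taste

def paced(items, delay=DELAY_SECONDS):
    """Yield items one at a time, sleeping between each to throttle request rate."""
    for i, item in enumerate(items):
        if i:  # no sleep before the first item
            time.sleep(delay)
        yield item

# Demo with dummy chapter ids and a tiny delay
fetched = [c for c in paced(["chapter-1", "chapter-2"], delay=0.01)]
print(fetched)
```

In the real script you would wrap the chapter loop: `for tit, src in paced(zip(Lit_tit_list, Lit_href_list)):`.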
Results
Here are the scraped .txt files:
Alright, that's it for this article. Bye~