Scraping novels from Qidian (起点中文网) with a Python crawler
Hello everyone! This article walks through building a Python crawler that scrapes novels from Qidian (起点中文网), a site owned by China Literature (阅文集团). The idea comes from a project of mine: the desktop assistant 启帆助手.
Preparation
Libraries used:
- urllib.request (standard library)
- lxml.etree (third-party; install with pip install lxml)
Code walkthrough
Step 1: import the libraries we need

```python
from urllib import request
from lxml import etree
```
Step 2: set the request headers and the novel's URL (using one of the author's own books as an example)

```python
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"  # urllib needs a full URL; a bare path like "/info/1020546097" raises ValueError
```
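Constructing a `Request` does no network I/O, so you can sanity-check the header locally before crawling. This is a minimal sketch; the book URL here is assumed from Qidian's usual URL scheme.

```python
from urllib import request

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# Building the Request object parses the URL and stores the headers,
# but does not open a connection, so this runs offline.
req = request.Request("https://book.qidian.com/info/1020546097", headers=header)
# urllib normalizes header names to "Capitalized-lowercase" form
print(req.get_header('User-agent'))
```

Note that urllib stores header keys capitalized, so the lookup key is `'User-agent'`, not `'User-Agent'`.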
Step 3: fetch the catalogue page and parse out each chapter's title and link

```python
req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links
# print(Lit_tit_list)
# print(Lit_href_list)
```
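You can try the two XPath expressions offline on a hand-written snippet that mimics the catalogue markup (the chapter names and hrefs below are made up for illustration):

```python
from lxml import etree

# Minimal stand-in for Qidian's chapter list markup
sample = '''
<ul class="cf">
  <li><a href="//read.qidian.com/chapter/abc">Chapter 1</a></li>
  <li><a href="//read.qidian.com/chapter/def">Chapter 2</a></li>
</ul>
'''
html = etree.HTML(sample)
titles = html.xpath('//ul[@class="cf"]/li/a/text()')  # text nodes of each <a>
hrefs = html.xpath('//ul[@class="cf"]/li/a/@href')    # href attribute of each <a>
print(titles)
print(hrefs)
```

The hrefs come back protocol-relative (starting with `//`), which is why the next step prepends `"http:"`.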
Step 4: fetch each chapter's text and save it to disk as a .txt file

```python
for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src  # hrefs are protocol-relative ("//...")
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    # 'a' appends, so re-running the script duplicates content; use 'w' to overwrite
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
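One caveat: the chapter title is used directly as a filename, and titles can contain characters that are invalid in Windows filenames (`?`, `:`, `"`, etc.). A defensive helper, not part of the original script, might look like this (`safe_filename` is a made-up name):

```python
import re

def safe_filename(title: str) -> str:
    # Replace the characters Windows forbids in filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('第1章 他是谁?'))
```

You would then build the filename as `safe_filename(tit) + ".txt"` inside the loop.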
Full code

```python
from urllib import request
from lxml import etree

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"  # urllib needs a full URL, not a bare path

req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links
# print(Lit_tit_list)
# print(Lit_href_list)

for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src  # hrefs are protocol-relative ("//...")
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
```
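The loop above fires requests back-to-back, which can get the crawler rate-limited or blocked. A small pacing helper (a sketch; the name `paced` and the delay value are made up) keeps a fixed gap between requests:

```python
import time

DELAY_SECONDS = 0.5  # illustrative value; tune to taste

def paced(items, delay=DELAY_SECONDS):
    """Yield items one at a time, sleeping between each to throttle request rate."""
    for i, item in enumerate(items):
        if i:  # no sleep before the first item
            time.sleep(delay)
        yield item

# Demo with dummy chapter ids and a tiny delay
fetched = [c for c in paced(["chapter-1", "chapter-2"], delay=0.01)]
print(fetched)
```

In the real script you would wrap the chapter loop: `for tit, src in paced(zip(Lit_tit_list, Lit_href_list)):`.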
Results
Here are the scraped .txt files:
Alright, that's it for this article. Bye~