Python Frozen tutorial: scraping and analyzing Frozen movie reviews with a Python crawler, and generating a word cloud from the keywords

Date: 2021-03-07 10:06:04

import requests

from lxml import etree

import time

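# NOTE: the scheme and host are missing from this URL as published; prepend the site's base URL before running.
# %d is the offset of the first review on the page (20 reviews per page).
url = "/subject/25887288/reviews?start=%d"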

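'''
Tip: press Ctrl+R (find and replace), match the copied headers with the regex (.*?): (.*)
and replace them with '$1': '$2', so that each header line becomes a dict entry.
'''

# The request headers must be passed as a dict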

headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',

# 'Accept-Encoding': 'gzip, deflate, br',

'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',

'Cache-Control': 'max-age=0',

'Connection': 'keep-alive',

'Cookie': 'bid=2kVXwdWZeMw; ll="108301"; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1577408430%2C%22https%3A%2F%%2Flink%3Furl%3DZ9ckN1KJW1JO8LetqcHNw_EpRXfP8M7thIHnPSuuwCunbidpPUid9bMYnngvA-dS%26wd%3D%26eqid%3Da4ee110000486cf8000000065e0557a7%22%5D; _pk_ses.100001.4cf6=*; ap_v=0,6.0; __utma=30149280.1534063425.1569670281.1569670281.1577408430.2; __utmz=30149280.1577408430.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmc=30149280; __utmc=223695111; __utma=223695111.528350327.1577408430.1577408430.1577408430.1; __utmz=223695111.1577408430.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmt=1; __gads=ID=0379800c7797856f:T=1577408438:S=ALNI_Man_1mimIVqMBoWKYD4NxKArBMQQQ; __yadk_uid=ddEOZgUO3cWlTAq7yge0eufiInrHkSje; __utmt=1; __utmb=30149280.1.10.1577408430; _vwo_uuid_v2=DD6DFBB7AA0D3A60734ECB8E8B5188216|fbc6c67dc4eac6e41dd6678dfe11a683; dbcl2="208536278:6OQfpDVXgQc"; push_noty_num=0; push_doumail_num=0; ck=8rj9; _pk_id.100001.4cf6=4cb7f01585a36431.1577408430.1.1577408979.1577408430.; __utmb=223695111.21.10.1577408430',

'Host': '',

'Sec-Fetch-Mode': 'navigate',

'Sec-Fetch-Site': 'none',

'Sec-Fetch-User': '?1',

'Upgrade-Insecure-Requests': '1',

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36 Edg/79.0.309.56',

}

if __name__ == '__main__':
    # The output is tab-separated text; the with block closes the file even if a request fails
    with open("./bxqy.csv", mode="w", encoding="utf-8") as fp:
        fp.write("author\tcomment\treply\n")
        # Fetch 25 pages; each page starts at review offset 20 * i
        for i in range(25):
            url_movies = url % (20 * i)
            response = requests.get(url_movies, headers=headers)
            response.encoding = 'utf-8'
            text = response.text
            # To inspect the raw page, dump it to a file:
            # with open('./movies.html', 'w', encoding='utf-8') as f:
            #     f.write(text)
            html = etree.HTML(text)
            # In XPath, locate nodes by attribute, e.g. div[@id=...] or div[@class=...]
            comments = html.xpath('//div[@class="review-list "]/div[@data-cid]')
            for comment in comments:
                author = comment.xpath(".//a[@class='name']/text()")[0].strip()  # author name
                # Second text node of the review summary, minus its last character
                content = comment.xpath(".//div[@class='short-content']/text()")[1][:-1].strip()
                # Text of the reply link, minus its last two characters
                reply = comment.xpath(".//a[@class='reply ']/text()")[0][:-2].strip()
                # print("%s | %s | %s" % (author, content, reply))
                fp.write("%s\t%s\t%s\n" % (author, content, reply))
            # fp.write("-----------------------------Page %d------------------------------\n" % (i + 1))
            print("-----------------------------Page %d saved successfully------------------------------\n" % (i + 1))
            time.sleep(1)  # pause for one second between requests
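
The find-and-replace tip in the script's docstring can also be done in Python itself. What follows is only a minimal sketch and not part of the original script: `headers_from_text` is a hypothetical helper name, and the sample header lines are made up for illustration. It splits each `Name: value` line at the first colon, so values that contain colons themselves stay intact.

```python
def headers_from_text(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split at the first colon only; values (URLs, times) may contain colons
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            continue  # skip lines that are not in 'Name: value' form
        headers[name.strip()] = value.strip()
    return headers


# Made-up sample input, in the form the browser's developer tools show it
raw_headers = """
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Upgrade-Insecure-Requests: 1
"""
print(headers_from_text(raw_headers))
```

The resulting dict can be passed as the `headers` argument of `requests.get`, in the same way as the hand-written `headers` dict in the script.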

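The script stops once the reviews are saved to `bxqy.csv`; the keyword and word-cloud step named in the title is not shown. As a rough sketch of how the keyword counts could be produced from that file, the code below uses only the standard library. Everything in it is an assumption rather than the author's method: `keyword_frequencies` and `simple_tokenize` are hypothetical names, the sample rows are made up, and the regex tokenizer is a placeholder. Chinese review text would need a real word segmenter (for example `jieba.lcut` from the third-party jieba package) passed in as `tokenize`.

```python
import csv
import io
import re
from collections import Counter


def keyword_frequencies(tsv_file, tokenize, stopwords=frozenset()):
    """Count tokens in the 'comment' column of a tab-separated file object."""
    counts = Counter()
    # QUOTE_NONE: the script writes raw text, so quote characters carry no meaning
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for word in tokenize(row.get("comment") or ""):
            # Drop single-character tokens and stopwords
            if len(word) > 1 and word not in stopwords:
                counts[word] += 1
    return counts


def simple_tokenize(text):
    # Placeholder tokenizer: runs of word characters, lower-cased
    return [w.lower() for w in re.findall(r"\w+", text)]


# Made-up sample in the same layout the script writes (author, comment, reply)
sample = io.StringIO(
    "author\tcomment\treply\n"
    "user_a\tThe songs are great and the snow looks great\t3\n"
    "user_b\tElsa and the snow\t0\n"
)
freq = keyword_frequencies(sample, simple_tokenize, stopwords={"the", "and", "are"})
print(freq.most_common(3))
```

To run it on the real output, the sample can be replaced with `open("./bxqy.csv", encoding="utf-8", newline="")`. A word-to-count mapping like `freq` is the kind of input that the third-party wordcloud package accepts through `WordCloud.generate_from_frequencies`; rendering Chinese text there also requires a font that contains Chinese glyphs.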