
Python 3 Web Scraping in Practice (Part 1): Scraping Douban's Now-Playing Movies and Drawing Word Clouds

Date: 2021-02-16 17:15:35

Reference: /88325/

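Tasks:

1. Scrape the id and title of every now-playing movie from the Douban Movies home page.

2. Open each movie's detail page and extract the popular short comments shown on it.

3. Draw a word cloud from each movie's short comments.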

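Python version: 3.5

Preparation:

1. Third-party libraries:

requests, jieba, wordcloud, pandas, matplotlib, BeautifulSoup (the bs4 package), and numpy. The re module is also used, but it is part of Python's standard library and needs no installation.

If installing wordcloud with pip fails, download the matching .whl file and install that instead. Download page: https://www.lfd.uci.edu/~gohlke/pythonlibs/

2. Other files

1) simhei.ttf font file: search for it online (for example on Baidu) and download it.

2) stopwords.txt stop word file: search for it online and download it.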

I. Scraping the id and title of every now-playing movie from the Douban Movies home page

Step 1: Request the page and get its HTML

import requests

# The host was dropped from this URL in the source; prepend the site's base address before running.
root_url = "/cinema/nowplaying/nanjing/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(root_url, headers=headers)
html = response.text

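The code above retrieves the page's HTML. You can also open /cinema/nowplaying/nanjing/ in a browser and use Inspect to view the HTML structure (the original screenshot is not reproduced here).

Step 2: Analyze the HTML and extract the information we need

We want every movie that is currently playing. The page structure shows that all movie entries sit inside the div with id='nowplaying', and that each movie is an li tag with class='list-item', so BeautifulSoup can extract them.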

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')

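Because we need each movie's id and title, we look inside each entry's HTML to find where they are stored.

The id can be read directly from the id attribute of each list-item li, while the title is in the a tag inside the li with class='stitle'. So we create an empty list and loop over the entries, storing one dictionary of id and title per movie.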

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['id']
    for tag_img_item in item.find_all('li', class_='stitle'):
        nowplaying_dict['name'] = tag_img_item.a.text.strip()
        nowplaying_list.append(nowplaying_dict)

print(nowplaying_list) shows the resulting list: one dictionary per now-playing movie, holding its id and title (the original result screenshot is not reproduced here).
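To try the extraction logic without a network request, the sketch below walks a small HTML fragment and builds the same list of id/title dictionaries. It swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra installs. SAMPLE_HTML, the second movie entry, and the NowPlayingParser class are assumptions made for illustration; the real page is larger, and this sketch does not check that entries sit inside the nowplaying div.

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like the structure described above.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul class="lists">
    <li class="list-item" id="4920389">
      <ul><li class="stitle"><a href="#"> 头号玩家 </a></li></ul>
    </li>
    <li class="list-item" id="1234567">
      <ul><li class="stitle"><a href="#">示例电影</a></li></ul>
    </li>
  </ul>
</div>
"""


class NowPlayingParser(HTMLParser):
    """Collect {'id', 'name'} dicts from li.list-item and li.stitle > a."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None          # movie dict currently being filled in
        self._in_stitle = False       # inside <li class="stitle">
        self._in_title_link = False   # inside the <a> of that li

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "list-item" in classes:
            self._current = {"id": attrs.get("id"), "name": ""}
        elif tag == "li" and "stitle" in classes:
            self._in_stitle = True
        elif tag == "a" and self._in_stitle:
            self._in_title_link = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title_link and self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title_link:
            self._in_title_link = False
            self._in_stitle = False
            if self._current is not None:
                self._current["name"] = self._current["name"].strip()
                self.movies.append(self._current)
                self._current = None


parser = NowPlayingParser()
parser.feed(SAMPLE_HTML)
movies = parser.movies
print(movies)
```

BeautifulSoup expresses the same traversal in a few find_all calls, which is why the tutorial uses it; the hand-written parser only makes the matching rules explicit.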

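II. Opening each movie's detail page and extracting the popular short comments

Step 1: Get each movie's page URL

Opening the first movie (here, 头号玩家, Ready Player One) shows that its URL is /subject/4920389/?from=playing_poster. Other movies' pages use the same structure with only the id changed, so substituting a movie's id gives its page URL. The ids come from looping over the movie list built above. Here we use the first movie as the example.

First, fetch the HTML of the first movie's page: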

first_movie = nowplaying_list[0]
# As with root_url, the host is missing from this path in the source.
comment_url = "/subject/" + first_movie['id'] + "/?from=playing_poster"
resp = requests.get(comment_url, headers=headers)
first_movie_html = resp.text
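Since only the id changes between movies, the path construction can be wrapped in a small helper and applied to the whole list. This is a minimal sketch: movie_detail_path and sample_list are hypothetical names, the second entry is made up, and the helper returns only the path because the host does not appear in the source.

```python
def movie_detail_path(movie_id):
    """Return the detail-page path for one movie id (host not included)."""
    return "/subject/" + str(movie_id) + "/?from=playing_poster"


# Hypothetical sample in the same shape as nowplaying_list.
sample_list = [
    {"id": "4920389", "name": "头号玩家"},
    {"id": "1234567", "name": "示例电影"},
]

# Map each title to the path of its detail page.
detail_paths = {m["name"]: movie_detail_path(m["id"]) for m in sample_list}
for name, path in detail_paths.items():
    print(name, path)
```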

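Step 2: Parse the page and collect the comments

Inspect the HTML structure of the comment section (the original screenshot is not reproduced here).

The page loads only the popular short comments it displays. Because there are so many comments, this walkthrough uses just the short comments on the first page (the complete code extracts the first ten pages).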
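The ten-page variant in the complete code pages through the comment listing with start and limit query parameters, 20 comments per page. The sketch below generates those paths; comment_page_paths is a hypothetical helper, and the host is again left out because it is not given in the source.

```python
from urllib.parse import urlencode


def comment_page_paths(movie_id, pages=10, page_size=20):
    """Build the comment-listing path for each page of one movie."""
    paths = []
    for page in range(pages):
        # A list of pairs keeps the parameter order fixed on every Python version.
        query = urlencode([("start", page * page_size), ("limit", page_size)])
        paths.append("/subject/{}/comments?{}".format(movie_id, query))
    return paths


paths = comment_page_paths("4920389")
print(paths[0])
print(paths[-1])
```

Page n starts at offset n * 20, so the tenth page (index 9) starts at 180.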

The comments are in the p tag under each div with class='comment'. The parsing code is:

soup1 = BeautifulSoup(first_movie_html, 'lxml')
comment_div_list = soup1.find_all('div', class_='comment')
first_movie_comment_list = []
for item in comment_div_list:
    first_p = item.find_all('p')[0]
    # .string is None when the <p> holds more than a single text node; skip those
    if first_p.string is not None:
        first_movie_comment_list.append(first_p.string)

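print(first_movie_comment_list) shows the comments collected for the first movie (the original output screenshot is not reproduced here).

Step 3: Clean the data

The comments are stored as a list, one short comment per user, interspersed with punctuation and other symbols that carry no meaning for our purposes. We strip those out with a regular expression that keeps only Chinese characters, and join the comments into a single string to prepare for word segmentation. The cleaning code is: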

import re

comments = ''  # join all comments into a single string
for k in range(len(first_movie_comment_list)):
    comments = comments + (str(first_movie_comment_list[k])).strip()
pattern = re.compile(r'[\u4e00-\u9fa5]+')  # match runs of Chinese characters, which drops punctuation
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)  # the movie's comment text, Chinese characters only

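print(cleaned_comments) shows the cleaned comment string (the original output is not reproduced here).

At this point we have comment text that contains only Chinese characters.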
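The cleaning step can be checked on a couple of made-up comments. The sketch below applies the same character range to a hypothetical sample list; keep_chinese and the sample strings are illustrative names and data, not part of the original code. Note that the pattern also discards digits and Latin letters, not just punctuation.

```python
import re

# Same character range as the tutorial's pattern.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(comment_list):
    """Join the comments and keep only runs of Chinese characters."""
    joined = ''.join(str(c).strip() for c in comment_list)
    return ''.join(CHINESE_RUN.findall(joined))


# Hypothetical comments mixing Chinese text, punctuation, digits and English.
sample = ['特效很棒！！10/10 ', ' 剧情一般，but还行~']
cleaned = keep_chinese(sample)
print(cleaned)  # 特效很棒剧情一般还行
```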

III. Drawing a word cloud from each movie's short comments

Step 1: Segment the cleaned comment string into words (using the jieba library)

import jieba
import pandas as pd

segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})  # one row per token for this movie

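pandas is used here to hold the tokens in tabular form; print(words_df.head()) shows the first five rows (the original output is not reproduced here).

Step 2: Remove stop words from the comments

The comments contain many high-frequency words that carry no meaning, so they need to be removed. Download a stop word file, stopwords.txt, into the same directory as the code.

The code for removing stop words: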

stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t",
                        names=['stopword'], encoding='gbk')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
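The isin filter above keeps every token that is not in the stop word column. The same idea in plain Python, as a rough sketch with made-up data: the token list below is hypothetical (it is not actual jieba output), and remove_stopwords is an illustrative helper rather than part of the original code.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that do not appear in the stop word collection."""
    stopword_set = set(stopwords)  # set lookup keeps each membership test cheap
    return [t for t in tokens if t not in stopword_set]


# Hypothetical tokens and stop words for illustration.
tokens = ['这部', '电影', '的', '特效', '很', '棒', '的']
stopwords = ['的', '很']
filtered = remove_stopwords(tokens, stopwords)
print(filtered)  # ['这部', '电影', '特效', '棒']
```

Order and duplicates among the remaining tokens are preserved, which matters because the next step counts occurrences.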

Step 3: Count the words so the word cloud can be sized by frequency

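# Count how often each word occurs, most frequent first.
# The original dict-style .agg({'计数': numpy.size}) renaming is rejected by pandas 1.0 and later,
# so the count column is built with size() instead and numpy is no longer needed here.
words_stat = words_df.groupby('segment').size().reset_index(name='count')
words_stat = words_stat.sort_values(by='count', ascending=False)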

print(words_stat.head()) shows the five most frequent words in this movie's short comments (the original output is not reproduced here).
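For a quick check of the counting logic without pandas, the standard library's collections.Counter produces the same kind of word-to-frequency table. This is an alternative sketch, not the approach used in the tutorial; count_words and the token list are hypothetical.

```python
from collections import Counter


def count_words(tokens, top_n=1000):
    """Return (word, count) pairs for the top_n most frequent tokens."""
    return Counter(tokens).most_common(top_n)


# Hypothetical tokens left after stop word removal.
tokens = ['特效', '剧情', '特效', '彩蛋', '特效', '剧情']
top = count_words(tokens, top_n=2)
print(top)  # [('特效', 3), ('剧情', 2)]

# dict(top) gives a word -> count mapping, the shape a word cloud is built from.
frequencies = dict(top)
```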

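Step 4: Render the word cloud

This needs the font file simhei.ttf (download it into the program's root directory). The image file is named after the movie. The code is: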

import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# Map each of the 1000 most frequent words to its count
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

# Save the word cloud image, named after the movie
try:
    print('Generating word cloud for ' + first_movie['name'] + '...')
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)
    plt.savefig(first_movie['name'].strip())
except Exception:
    print('This movie has no comments, so there is no word cloud...')

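The final result was shown as an image in the original post (not reproduced here).

That completes the project. It is a small scraping exercise suited to beginners. I reimplemented it myself after reading the blog post referenced at the top, which was good practice: you learn more by trying things yourself than by only reading. This is also my first blog post, so it is not polished. Corrections are welcome, and let's improve together.

The complete code follows: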

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""
Scrape the id and title of every movie now playing on Douban, then segment
each movie's short comments and draw a word cloud.
requests + BeautifulSoup + jieba + wordcloud
"""
import re
import time
import warnings

import jieba
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud

warnings.filterwarnings('ignore')
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}


# Get the id and title of every movie now playing
def get_all_nowplaying_movie(url):
    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []  # one {'id': ..., 'name': ...} dict per movie
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['id']
        for tag_img_item in item.find_all('li', class_='stitle'):
            nowplaying_dict['name'] = tag_img_item.a.text.strip()
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list


# Note: this function reads one page of the full short-comment listing (the
# caller requests the first ten pages). The URL differs from the detail page
# used in the walkthrough, but the parsing is similar.
def get_each_moive_page_comment(each_movie, page_num):
    # The host is missing from this path in the source; prepend it before running.
    comment_url = '/subject/' + each_movie['id'] + '/comments?start=' + str(page_num * 20) + '&limit=20'
    resp = requests.get(comment_url, headers=headers)
    each_movie_html = resp.text
    # Parse the page
    soup1 = BeautifulSoup(each_movie_html, 'lxml')
    comment_div_list = soup1.find_all('div', class_='comment')
    each_movie_comment_list = []
    for item in comment_div_list:
        if item.find_all('p')[0].string is not None:
            each_movie_comment_list.append(item.find_all('p')[0].string)
    comments = ''  # join all comments into a single string
    for k in range(len(each_movie_comment_list)):
        comments = comments + (str(each_movie_comment_list[k])).strip()
    return comments


def data_clearing(comments, i, each_movie):
    pattern = re.compile(r'[\u4e00-\u9fa5]+')  # match runs of Chinese characters, which drops punctuation
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)  # the movie's comment text, Chinese characters only
    # Word segmentation
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})  # one row per token
    # Remove stop words
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t",
                            names=['stopword'], encoding='gbk')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    # Count word frequencies (dict-style .agg renaming is rejected by pandas 1.0 and later)
    words_stat = words_df.groupby('segment').size().reset_index(name='count')
    words_stat = words_stat.sort_values(by='count', ascending=False)
    # print(words_stat.head())
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    # print(word_frequence)
    print('Movie ' + str(i) + '...')
    try:
        print('Generating word cloud for ' + each_movie['name'] + '...')
        wordcloud = wordcloud.fit_words(word_frequence)
        plt.imshow(wordcloud)
        plt.savefig(str(i) + '_' + each_movie['name'].strip())
    except Exception:
        print('This movie has no comments, so there is no word cloud...')


def main():
    # The host is missing from this path in the source; prepend it before running.
    root_url = "/cinema/nowplaying/nanjing/"
    all_moive = get_all_nowplaying_movie(root_url)
    i = 1
    for each_movie in all_moive:
        movie_comments = ''
        try:
            for index_page in range(0, 10):  # fetch ten pages of comments
                index_page_comment = get_each_moive_page_comment(each_movie, index_page)
                movie_comments += index_page_comment
                time.sleep(2)
            # print(movie_comments)
            data_clearing(movie_comments, i, each_movie)
            print('\n')
            i += 1
        except Exception:
            print('This movie has no short comments......')
            time.sleep(2)


if __name__ == "__main__":
    main()