Scraping the Douban Books Top 250 and saving it as a local CSV file that opens in Excel (step-by-step, with common pitfalls)

Time: 2023-07-02 14:40:37

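1. Goal

Save the Douban Books Top 250 ranking to a local file that Excel can open, including the title, author, rating, review count, short blurb, and URL of each book. The libraries used are requests, re, BeautifulSoup, and csv.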

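2. Analyzing the URLs

Open the Douban Books Top 250 page: /top250

Page 1: /top250

Page 2: /top250?start=25

Page 3: /top250?start=50

…

Page 10: /top250?start=225

The first page can equally be written as: /top250?start=0

The pattern is clear: only the number at the end of the URL changes, in steps of 25. So first define a function that builds all the URLs and stores them in a list.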

def get_all_url():  # build the URLs of all ten pages
    urls = []
    for i in range(0, 250, 25):
        # the query parameter is "start", matching the URLs listed above
        url_1 = '/top250?start={}'.format(i)
        urls.append(url_1)
    return urls

The function returns the urls list.
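The offset rule can also be expressed as a small helper that maps a page number to its start value. This is only a minimal sketch: the names `page_to_start` and `build_urls` are hypothetical, the page size of 25 is taken from the URLs listed above, and the relative `/top250` path is kept exactly as it appears in this article.

```python
PAGE_SIZE = 25  # books per page, as observed in the URLs above


def page_to_start(page):
    """Return the start offset for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * PAGE_SIZE


def build_urls(pages=10):
    """Build one URL per page, using the path shown in this article."""
    return ['/top250?start={}'.format(page_to_start(p)) for p in range(1, pages + 1)]


urls = build_urls()
print(len(urls))
print(urls[0], urls[-1])
```

Writing the rule this way makes it easy to check that page 1 maps to offset 0 and page 10 to offset 225 before any request is sent.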

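3. Analyzing the page content

Open any of the URLs, right-click a book title, and choose Inspect.

The browser's developer tools open on the element for that title.

Comparing several titles shows that each one can be located through div class="pl2".

First import the required libraries, then use requests to fetch the page and BeautifulSoup to parse it. The header must be supplied so that the request looks like it comes from a browser; otherwise the response may be empty.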

import requests  # fetches the web page
import re  # regular expressions for filtering the data
from bs4 import BeautifulSoup  # parses the web page
import csv  # creates the CSV file and writes the data

url_2 = '/top250?start=50'
header = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                  'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                  'Version/13.0.3 Mobile/15E148 Safari/604.1'
}
res = requests.get(url_2, headers=header)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'lxml')
# index 1 picks the second book on the page as a test case
data_name = soup.find_all('div', class_='pl2')[1]
names = data_name.a.get('title')
href = data_name.a.get('href')

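Printing the results as a test returns the book title and its URL.

The same method locates the author, the rating, and the review count: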

data_author = soup.find_all('p', class_='pl')[1]
authors = data_author.get_text().split('/')[0]  # the author is the first '/'-separated field
data_score = soup.find_all('span', class_='rating_nums')[1]
scores = data_score.get_text()
data_msg = soup.find_all('span', class_='pl')[1].get_text()
msgs = re.findall(r'\d+', data_msg)[0]  # raw string avoids an invalid-escape warning

Printing these values confirms that the fields are extracted correctly.
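The two text-cleaning steps used above (taking the first '/'-separated field and pulling the first run of digits) can be tried without any network access. The following is a minimal sketch under stated assumptions: the helper names `first_field` and `first_number` are hypothetical, and the sample strings are made up to imitate the layout rather than copied from the site.

```python
import re


def first_field(info_line):
    """Return the text before the first '/' with surrounding whitespace removed."""
    return info_line.split('/')[0].strip()


def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r'\d+', text)
    return match.group(0) if match else None


# Sample strings are invented for illustration only.
sample_info = ' Author Name / Translator / Publisher / 2006-5 / 29.00 '
sample_count = '(\n    12345 reviews\n)'

print(first_field(sample_info))
print(first_number(sample_count))
print(first_number('no reviews yet'))
```

Unlike `re.findall(...)[0]`, the `re.search` variant returns None instead of raising IndexError when the text contains no digits, which may be worth handling if a book has no count at all.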

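Locating the short blurb is slightly more involved. If the blurb elements are collected directly and indexed by position, some titles end up paired with the wrong blurb. The cause is that some books have no blurb at all, so every later entry shifts by one. The fix is to locate the parent element of each book first and search for the blurb inside it. A book without a blurb then returns None, which is handled with one extra check.

Locating in two steps: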

data_com = soup.find_all('td', valign='top')[3].find('span', class_='inq')
if data_com is not None:
    comments = data_com.get_text()
else:
    comments = '无'  # placeholder for books without a blurb

Printing the result confirms that the blurb now matches the book.
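The misalignment problem and the two-step fix can be reproduced on a tiny hand-written page. This is a sketch only: the markup below is invented and merely reuses the class names mentioned in this article, the function name `titles_with_blurbs` is hypothetical, and the standard-library 'html.parser' backend is used in place of 'lxml' so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the class names used in this article.
SAMPLE_HTML = """
<table><tr>
  <td valign="top"><img src="a.jpg"></td>
  <td valign="top">
    <div class="pl2"><a href="/book/1" title="Book A">Book A</a></div>
    <span class="inq">A one-line blurb.</span>
  </td>
</tr></table>
<table><tr>
  <td valign="top"><img src="b.jpg"></td>
  <td valign="top">
    <div class="pl2"><a href="/book/2" title="Book B">Book B</a></div>
  </td>
</tr></table>
"""


def titles_with_blurbs(html):
    """Pair each title with its own blurb, using '无' when the blurb is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    cells = soup.find_all('td', valign='top')
    pairs = []
    # Each book occupies two cells; the second one (odd index) holds the text.
    for cell in cells[1::2]:
        title = cell.find('div', class_='pl2').a.get('title')
        blurb = cell.find('span', class_='inq')
        pairs.append((title, blurb.get_text() if blurb is not None else '无'))
    return pairs


print(titles_with_blurbs(SAMPLE_HTML))
```

Because the search for `span.inq` is scoped to one book's cell, a missing blurb affects only that book instead of shifting every entry after it.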

At this point every field has been obtained. The next step is to write this book's information to a local CSV file, using the csv library.

with open('c:/豆瓣读书TOP250.csv', 'w', newline='', encoding='utf-8-sig') as file:
    book_info = csv.writer(file)
    # header row: number, title, author, rating, review count, blurb, URL
    book_info.writerow(('序号', '书名', '作者', '评分', '评论数', '简评', '网址'))
    book_info.writerow((1, names, authors, scores, msgs, comments, href))

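After running this, a new file named '豆瓣读书TOP250.csv' appears on drive C. Opening it shows that the row was written successfully.

Points to note:

open('c:/豆瓣读书TOP250.csv', 'w', newline='', encoding='utf-8-sig')

If newline='' is omitted, every row in the output is followed by a blank line.

If only encoding='utf-8' is used, the text appears garbled when the CSV file is opened in Excel. Opening the file in a text editor such as Notepad and saving it again also fixes the garbling. With 'utf-8-sig', the file displays correctly as soon as it is opened.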
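The difference between the two encodings is the byte order mark (BOM) that 'utf-8-sig' writes at the start of the file, which Excel generally uses to recognize the file as UTF-8. The following is a minimal sketch that writes the same rows with both encodings into a temporary folder and compares the first three bytes; the helper names `write_sample` and `first_bytes` and the sample row are assumptions made for illustration.

```python
import csv
import os
import tempfile


def write_sample(path, encoding):
    """Write a header and one row to path using the given encoding."""
    with open(path, 'w', newline='', encoding=encoding) as file:
        writer = csv.writer(file)
        writer.writerow(('序号', '书名'))
        writer.writerow((1, '示例'))


def first_bytes(path, count=3):
    """Return the first bytes of a file so the BOM can be inspected."""
    with open(path, 'rb') as file:
        return file.read(count)


with tempfile.TemporaryDirectory() as folder:
    plain = os.path.join(folder, 'plain.csv')
    with_bom = os.path.join(folder, 'with_bom.csv')
    write_sample(plain, 'utf-8')
    write_sample(with_bom, 'utf-8-sig')
    plain_head = first_bytes(plain)
    bom_head = first_bytes(with_bom)

print(plain_head)  # the UTF-8 bytes of the first header character
print(bom_head)    # b'\xef\xbb\xbf', the UTF-8 byte order mark
```

The rest of the file content is identical in both cases; only the three leading bytes differ.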

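The information for a single book is now read and written successfully. The next step is to read every book on a page. Each page lists 25 books, so the following loop, placed inside the with block, does the job.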

for i in range(25):
    data_name = soup.find_all('div', class_='pl2')[i]
    names = data_name.a.get('title')
    href = data_name.a.get('href')
    data_author = soup.find_all('p', class_='pl')[i]
    authors = data_author.get_text().split('/')[0]
    data_score = soup.find_all('span', class_='rating_nums')[i]
    scores = data_score.get_text()
    data_msg = soup.find_all('span', class_='pl')[i].get_text()
    msgs = re.findall(r'\d+', data_msg)[0]
    # each book uses two td cells; the second one (2 * i + 1) holds the blurb
    data_com = soup.find_all('td', valign='top')[2 * i + 1].find('span', class_='inq')
    if data_com is not None:
        comments = data_com.get_text()
    else:
        comments = '无'
    book_info.writerow((1, names, authors, scores, msgs, comments, href))

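Checking the result shows that all the data on this page has been written to the file. The sequence-number column is filled with 1 for now; the final program handles it properly.

The next task is to scrape every page. All the URLs were stored in a list at the beginning of this article, so it is enough to take one URL at a time from that list and run the code above on it. Once every page has been processed, all the data has been collected.

To do this, I wrote a function get_book_info(url), where url is the page to scrape. Calling it with a different URL each time writes that page's data to the CSV file.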

def get_book_info(url_2):  # fetch one page by URL and write its books to the CSV file
    global ids  # running sequence number shared with the main program
    header = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                      'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                      'Version/13.0.3 Mobile/15E148 Safari/604.1'
    }
    res = requests.get(url_2, headers=header)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    for i in range(25):
        data_name = soup.find_all('div', class_='pl2')[i]
        names = data_name.a.get('title')
        href = data_name.a.get('href')
        data_author = soup.find_all('p', class_='pl')[i]
        authors = data_author.get_text().split('/')[0]
        data_score = soup.find_all('span', class_='rating_nums')[i]
        scores = data_score.get_text()
        data_msg = soup.find_all('span', class_='pl')[i].get_text()
        msgs = re.findall(r'\d+', data_msg)[0]
        data_com = soup.find_all('td', valign='top')[2 * i + 1].find('span', class_='inq')
        if data_com is not None:
            comments = data_com.get_text()
        else:
            comments = '无'
        ids += 1
        book_info.writerow((ids, names, authors, scores, msgs, comments, href))

With the URL function and the per-page function both defined, the main program now needs to call them.

if __name__ == '__main__':  # main program
    try:
        with open('c:/豆瓣读书TOP250.csv', 'w', newline='', encoding='utf-8-sig') as file:
            book_info = csv.writer(file)
            book_info.writerow(('序号', '书名', '作者', '评分', '评论数', '简评', '网址'))
            url_list = get_all_url()
            ids = 0
            for url in url_list:
                get_book_info(url)
    except Exception as e:  # exception handling
        print('Error:', e)
    else:  # report success only when no exception was raised
        print('下载完成！')

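The try statement adds exception handling, so that if something goes wrong the exception message is printed. The with open statement ensures that the file is closed automatically once the block finishes.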
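The main program relies on the module-level names `ids` and `book_info` being visible inside get_book_info. One possible alternative, shown here only as a sketch, is to pass the writer in as a parameter and return the next sequence number. The function name `write_books` and the two made-up "pages" are assumptions for illustration, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io


def write_books(writer, books, first_id):
    """Write numbered rows and return the id to use for the next page."""
    for offset, book in enumerate(books):
        writer.writerow((first_id + offset, *book))
    return first_id + len(books)


# Two invented "pages" stand in for the parsed results of two URLs.
pages = [
    [('Book A', '9.0'), ('Book B', '8.8')],
    [('Book C', '8.7')],
]

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(('id', 'title', 'score'))
next_id = 1
for page in pages:
    next_id = write_books(writer, page, next_id)

print(buffer.getvalue())
```

With this shape the numbering continues across pages without a global statement, and the writing step can be exercised without touching the network or the disk.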

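The complete code follows.

The #coding=gbk comment on the first line is there because of an encoding problem. Python 3 assumes source files are UTF-8 unless told otherwise, and this file was saved as GBK, so running the .py file from the command line without the declaration produces this error: SyntaxError: Non-UTF-8 code starting with '\xb5' in file 1111.py on line 2, but no encoding declared; see /dev/peps/pep-0263/ for details

Running the file directly in PyCharm does not trigger this problem.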

#coding=gbk
import requests  # fetches the web page
import re  # regular expressions for filtering the data
from bs4 import BeautifulSoup  # parses the web page
import csv  # creates the CSV file and writes the data


def get_all_url():  # build the URLs of all ten pages
    urls = []
    for i in range(0, 250, 25):
        url_1 = '/top250?start={}'.format(i)
        urls.append(url_1)
    return urls


def get_book_info(url_2):  # fetch one page by URL and write its books to the CSV file
    global ids  # running sequence number shared with the main program
    header = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                      'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                      'Version/13.0.3 Mobile/15E148 Safari/604.1'
    }
    res = requests.get(url_2, headers=header)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    for i in range(25):
        data_name = soup.find_all('div', class_='pl2')[i]
        names = data_name.a.get('title')
        href = data_name.a.get('href')
        data_author = soup.find_all('p', class_='pl')[i]
        authors = data_author.get_text().split('/')[0]
        data_score = soup.find_all('span', class_='rating_nums')[i]
        scores = data_score.get_text()
        data_msg = soup.find_all('span', class_='pl')[i].get_text()
        msgs = re.findall(r'\d+', data_msg)[0]
        data_com = soup.find_all('td', valign='top')[2 * i + 1].find('span', class_='inq')
        if data_com is not None:
            comments = data_com.get_text()
        else:
            comments = '无'
        ids += 1
        book_info.writerow((ids, names, authors, scores, msgs, comments, href))


if __name__ == '__main__':  # main program
    try:
        with open('c:/豆瓣读书TOP250.csv', 'w', newline='', encoding='utf-8-sig') as file:
            book_info = csv.writer(file)
            book_info.writerow(('序号', '书名', '作者', '评分', '评论数', '简评', '网址'))
            url_list = get_all_url()
            ids = 0
            for url in url_list:
                get_book_info(url)
    except Exception as e:  # runs when an exception occurs
        print('Error:', e)
    else:  # report success only when no exception was raised
        print('下载完成！')
