
Python 3 Web Crawling: Scraping Recruitment Data with the Scrapy Framework

Date: 2021-04-29 21:38:03


Project requirements:

A recruitment site lists job openings posted by various companies. After entering the home page /, a search box is visible, as shown below:

Entering a job title in the search box leads to the page shown below, which lists the matching job postings, with pagination buttons at the bottom of the page.

Clicking one of the postings opens the detail page shown below, which contains the job title, location, posting time, responsibilities, requirements, and other information.

The requirements are as follows:

Build a Scrapy project for the Tencent recruitment site.
Take a job title as input and scrape every posting in the search results, including job title, location, job category, requirements, responsibilities, and posting time.
Store the scraped data in a MySQL database and in a CSV file.
Produce a word cloud from the job requirements.

Final result:

Project steps:

1. Set up the Scrapy project for the Tencent recruitment crawler
2. Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
3. Define the fields to scrape in items.py
4. Write the main spider logic to scrape the data
5. Modify settings.py
6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file
7. Write the word-cloud code and output the word cloud

1. Set up the Scrapy project for the Tencent recruitment crawler

(1) Install Scrapy

pip install scrapy
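
The later steps (MySQL pipeline and word cloud) also rely on several third-party packages. Going by the imports used further down in this article, they can all be installed up front with:

pip install pymysql pandas jieba wordcloud matplotlib pillow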

(2) Create the Scrapy project

scrapy startproject Tencent

(3) After the project is created, change into the project directory

cd Tencent

(4) Generate the spider (scrapy genspider takes the spider name and the site's domain):

scrapy genspider tencent <target-domain>

(5) Run the spider

scrapy crawl tencent

Alternatively, create a run.py launcher at the same level as the spiders folder:

# -*- coding:utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl tencent".split())

2. Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
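
The packet-capture screenshots are not reproduced here, but the structure of the two JSON APIs can be inferred from the parsing code in step 4. A rough sketch of the two responses, showing only the fields the spider actually reads (all values are placeholders):

# search API (one_url): a paged list of postings
search_response = {
    "Data": {
        "Count": 0,                 # total number of matching postings
        "Posts": [
            {"PostId": "..."},      # id used to build the detail-page url
        ],
    },
}

# detail API (two_url): a single posting
detail_response = {
    "Data": {
        "RecruitPostName": "...",   # job title
        "LocationName": "...",      # location
        "Requirement": "...",       # job requirements
        "Responsibility": "...",    # job responsibilities
    },
}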

3. Define the fields to scrape in items.py

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # define the data structure to scrape
    job_name = scrapy.Field()
    job_location = scrapy.Field()
    job_requirement = scrapy.Field()
    job_responsibility = scrapy.Field()

4. Write the main spider logic, tencent.py, to scrape the data

import scrapy
import urllib.parse
import json
import math
from ..items import TencentItem


class TencentSpider(scrapy.Spider):
    # Request the first-level (search) pages to get each posting's post_id, then build the
    # second-level (detail) url from the post_id to reach the posting's detail data.
    # First read the total record count "Count" so that every search-result page can be requested.
    name = 'tencent'
    allowed_domains = ['']
    # start_urls = ['/']
    job = input("请输入你要搜索的工作岗位:")
    # URL-encode the job keyword
    encode_job = urllib.parse.quote(job)
    # first-level page (search API)
    one_url = "/tencentcareer/api/post/Query?timestamp=1632547113170&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # second-level page (posting detail API)
    two_url = "/tencentcareer/api/post/ByPostId?timestamp=1632546677601&postId={}&language=zh-cn"
    # override start_urls
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # parse the response text into a Python dict
        json_dic = json.loads(response.text)
        # read "Count" under the "Data" node to get the total number of matching records
        job_count = int(json_dic["Data"]["Count"])
        # derive the total number of pages; math.ceil() rounds up
        total_pages = math.ceil(job_count / 10)
        # build the url of every search-result page
        for page in range(1, total_pages + 1):
            one_url = self.one_url.format(self.encode_job, page)
            # yield hands the request to the scheduler's queue without ending the function;
            # parse_post_ids is the callback that handles the response
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids)

    # callback for the first-level pages: collect every posting's post_id
    def parse_post_ids(self, response):
        posts_list = json.loads(response.text)["Data"]["Posts"]
        for p in posts_list:
            post_id = p["PostId"]
            # build the second-level (detail) url and hand it to the scheduler
            two_url = self.two_url.format(post_id)
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    # callback for the second-level pages: parse the posting details
    def parse_job(self, response):
        item = TencentItem()
        job = json.loads(response.text)["Data"]
        item['job_name'] = job["RecruitPostName"]
        item['job_location'] = job["LocationName"]
        item['job_requirement'] = job["Requirement"]
        item['job_responsibility'] = job["Responsibility"]
        yield item
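
Note that input() runs at class-definition time, i.e. as soon as the spider module is imported. If an interactive prompt is not wanted, a common alternative is to pass the keyword as a Scrapy spider argument (scrapy crawl tencent -a job=python). A minimal sketch, assuming one_url and the rest of the spider stay as above:

class TencentSpider(scrapy.Spider):
    name = 'tencent'

    def __init__(self, job="python", *args, **kwargs):
        # the value given with -a job=... arrives here as a keyword argument
        super().__init__(*args, **kwargs)
        self.encode_job = urllib.parse.quote(job)
        self.start_urls = [self.one_url.format(self.encode_job, 1)]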

5. Modify settings.py

(1) Whether to obey the robots.txt protocol; the default True means obey, and it is usually changed to False:

ROBOTSTXT_OBEY = False

(2) Maximum number of concurrent requests (default 16):

CONCURRENT_REQUESTS = 1

(3) Download delay, similar to time.sleep(2):

DOWNLOAD_DELAY = 2

(4) Default request headers, adding a User-Agent:

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'

(5) Item pipelines; 300 is the priority value, and the smaller the number, the higher the priority:

ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentMysqlPipeline': 200
}

The full settings.py after these changes:

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     /en/latest/topics/settings.html
#     /en/latest/topics/downloader-middleware.html
#     /en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# set the User-Agent
# USER_AGENT = 'Tencent (+)'

# Obey robots.txt rules
# whether to obey robots.txt; the default True means obey, usually changed to False
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# maximum number of concurrent requests, default 16
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# download delay, similar to time.sleep(2)
DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}

# Enable or disable spider middlewares
# See /en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See /en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See /en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See /en/latest/topics/item-pipeline.html
# pipelines; 300 is the priority value, and the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentMysqlPipeline': 200
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See /en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See /en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class TencentPipeline:
    # item-processing logic
    def process_item(self, item, spider):
        print(item)
        return item


# pipeline that writes to MySQL
class TencentMysqlPipeline:
    # called once when the spider opens; used to connect to the database
    def open_spider(self, spider):
        self.db = pymysql.connect(host="localhost", user="root", password="root",
                                  database="tencent", port=3306, charset="utf8")
        # cursor object used to execute MySQL statements
        self.cursor = self.db.cursor()
        print("开始爬虫")

    def process_item(self, item, spider):
        sql_insert = "insert into tencent(name,location,requirement,responsibility) values(%s,%s,%s,%s)"
        data = [item["job_name"], item["job_location"], item["job_requirement"], item["job_responsibility"]]
        self.cursor.execute(sql_insert, data)
        self.db.commit()
        return item

    # called once when the spider closes
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print("退出爬虫")
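
TencentMysqlPipeline assumes a MySQL database named tencent already exists and contains a matching table. A minimal sketch of such a table, with column names taken from the INSERT statement above and TEXT types assumed:

create table tencent(
    name text,
    location text,
    requirement text,
    responsibility text
);

TencentPipeline as shown only prints each item; the tencent.csv file that the word-cloud step reads is easiest to produce with Scrapy's built-in feed export, for example by running the spider as

scrapy crawl tencent -o tencent.csv

or by adding -o tencent.csv to the command string in run.py.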

7. Write the word-cloud code and output the word cloud

Create a word_cloud folder inside the Tencent project, copy the yinwu.jpg and STHUPO.TTF assets into it, and create a wc.py file.
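
WordCloud expects space-separated tokens, which is why the responsibility text is first segmented with jieba and then re-joined with spaces. A quick illustration (the exact tokens jieba produces may differ):

import jieba

tokens = list(jieba.cut("负责招聘数据的抓取和分析"))
print(tokens)            # e.g. ['负责', '招聘', '数据', '的', '抓取', '和', '分析']
print(" ".join(tokens))  # space-separated text ready for WordCloud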

import numpy as np
import pandas as pd
# jieba performs the word segmentation
import jieba
# WordCloud builds the word cloud; ImageColorGenerator colours it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# read the CSV with pandas; the return value is a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")
# take the job_responsibility column and join it into one string
job_responsibility = df["job_responsibility"].values
job_responsibility_str = "".join(job_responsibility)
# segment the text with jieba, then join the tokens with spaces
jieba_split = list(jieba.cut(job_responsibility_str))
text = " ".join(jieba_split)
# read the word-cloud mask image and convert it to a NumPy array
mask = Image.open("yinwu.jpg")
mask = np.array(mask)
# create the WordCloud object
# mask: mask image, stopwords: words to filter out,
# collocations=False removes duplicated words, background_color: background colour
stopwords = ["的", "和", "技", "品"]
wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords,
               collocations=False, background_color="white")
# generate the word cloud
word_image = wc.generate(text)
# recolour the word cloud from the mask image
image_color = ImageColorGenerator(mask)
wc.recolor(color_func=image_color)
# show the word cloud
mp.imshow(word_image)
# hide the axes
mp.axis("off")
# show the figure
mp.show()

The result looks like this:

To generate word clouds for several columns in one go, the code above can be refactored into a get_cloud_img function, as follows:

import numpy as np
import pandas as pd
# jieba performs the word segmentation
import jieba
# WordCloud builds the word cloud; ImageColorGenerator colours it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# read the CSV with pandas; the return value is a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")


def get_cloud_img(data, label):
    # join the column's values into one string
    job_responsibility_str = "".join(data)
    # segment the text with jieba, then join the tokens with spaces
    jieba_split = list(jieba.cut(job_responsibility_str))
    text = " ".join(jieba_split)
    # read the word-cloud mask image and convert it to a NumPy array
    mask = Image.open("yinwu.jpg")
    mask = np.array(mask)
    # create the WordCloud object
    # mask: mask image, stopwords: words to filter out,
    # collocations=False removes duplicated words, background_color: background colour
    stopwords = ["的", "和", "技", "品"]
    wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords,
                   collocations=False, background_color="white")
    # generate the word cloud
    word_image = wc.generate(text)
    # recolour the word cloud from the mask image
    image_color = ImageColorGenerator(mask)
    wc.recolor(color_func=image_color)
    # show the word cloud
    mp.imshow(word_image)
    # hide the axes
    mp.axis("off")
    # save the figure
    mp.savefig("%s.png" % label)
    # show the figure
    mp.show()


# generate one word cloud per column
get_cloud_img(df["job_responsibility"].values, "腾讯招聘-职责词云图")
get_cloud_img(df["job_requirement"].values, "腾讯招聘-需求词云图")
