
Python 3 Web Crawling: Scraping Recruitment Data with the Scrapy Framework

Date: 2021-04-29 21:38:03


Project requirements:

A recruitment site lists job openings posted by various companies. After entering the home page /, a search box is visible, as shown below:

Entering a job title in the search box leads to the page shown below, which lists the matching job postings, with pagination buttons at the bottom of the page.

Clicking one of the postings opens the detail page shown below, which contains the job title, location, posting time, responsibilities, requirements, and other information.

The requirements are as follows:

Build a Scrapy project for the Tencent recruitment site.
Take a job title as input and scrape every posting in the search results, including job title, location, job category, requirements, responsibilities, and posting time.
Store the scraped data in a MySQL database and in a CSV file.
Produce a word cloud from the job requirements.

Final result:

Project steps:

1. Set up the Scrapy project for the Tencent recruitment crawler
2. Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
3. Define the fields to scrape in items.py
4. Write the main spider logic to scrape the data
5. Modify settings.py
6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file
7. Write the word-cloud code and output the word cloud

1. Set up the Scrapy project for the Tencent recruitment crawler

(1) Install Scrapy

pip install scrapy
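
The later steps (MySQL pipeline and word cloud) also rely on several third-party packages. Going by the imports used further down in this article, they can all be installed up front with:

pip install pymysql pandas jieba wordcloud matplotlib pillow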

(2) Create the Scrapy project

scrapy startproject Tencent

(3) After the project is created, change into the project directory

cd Tencent

(4) Generate the spider (scrapy genspider takes the spider name and the site's domain):

scrapy genspider tencent <target-domain>

(5) Run the spider

scrapy crawl tencent

Alternatively, create a run.py launcher at the same level as the spiders folder:

# -*- coding:utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl tencent".split())

2. Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
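
The packet-capture screenshots are not reproduced here, but the structure of the two JSON APIs can be inferred from the parsing code in step 4. A rough sketch of the two responses, showing only the fields the spider actually reads (all values are placeholders):

# search API (one_url): a paged list of postings
search_response = {
    "Data": {
        "Count": 0,                 # total number of matching postings
        "Posts": [
            {"PostId": "..."},      # id used to build the detail-page url
        ],
    },
}

# detail API (two_url): a single posting
detail_response = {
    "Data": {
        "RecruitPostName": "...",   # job title
        "LocationName": "...",      # location
        "Requirement": "...",       # job requirements
        "Responsibility": "...",    # job responsibilities
    },
}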

3. Define the fields to scrape in items.py

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # define the data structure to scrape
    job_name = scrapy.Field()
    job_location = scrapy.Field()
    job_requirement = scrapy.Field()
    job_responsibility = scrapy.Field()

4. Write the main spider logic, tencent.py, to scrape the data

import scrapy
import urllib.parse
import json
import math
from ..items import TencentItem


class TencentSpider(scrapy.Spider):
    # Request the first-level (search) pages to get each posting's post_id, then build the
    # second-level (detail) url from the post_id to reach the posting's detail data.
    # First read the total record count "Count" so that every search-result page can be requested.
    name = 'tencent'
    allowed_domains = ['']
    # start_urls = ['/']
    job = input("请输入你要搜索的工作岗位:")
    # URL-encode the job keyword
    encode_job = urllib.parse.quote(job)
    # first-level page (search API)
    one_url = "/tencentcareer/api/post/Query?timestamp=1632547113170&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # second-level page (posting detail API)
    two_url = "/tencentcareer/api/post/ByPostId?timestamp=1632546677601&postId={}&language=zh-cn"
    # override start_urls
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # parse the response text into a Python dict
        json_dic = json.loads(response.text)
        # read "Count" under the "Data" node to get the total number of matching records
        job_count = int(json_dic["Data"]["Count"])
        # derive the total number of pages; math.ceil() rounds up
        total_pages = math.ceil(job_count / 10)
        # build the url of every search-result page
        for page in range(1, total_pages + 1):
            one_url = self.one_url.format(self.encode_job, page)
            # yield hands the request to the scheduler's queue without ending the function;
            # parse_post_ids is the callback that handles the response
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids)

    # callback for the first-level pages: collect every posting's post_id
    def parse_post_ids(self, response):
        posts_list = json.loads(response.text)["Data"]["Posts"]
        for p in posts_list:
            post_id = p["PostId"]
            # build the second-level (detail) url and hand it to the scheduler
            two_url = self.two_url.format(post_id)
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    # callback for the second-level pages: parse the posting details
    def parse_job(self, response):
        item = TencentItem()
        job = json.loads(response.text)["Data"]
        item['job_name'] = job["RecruitPostName"]
        item['job_location'] = job["LocationName"]
        item['job_requirement'] = job["Requirement"]
        item['job_responsibility'] = job["Responsibility"]
        yield item
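
Note that input() runs at class-definition time, i.e. as soon as the spider module is imported. If an interactive prompt is not wanted, a common alternative is to pass the keyword as a Scrapy spider argument (scrapy crawl tencent -a job=python). A minimal sketch, assuming one_url and the rest of the spider stay as above:

class TencentSpider(scrapy.Spider):
    name = 'tencent'

    def __init__(self, job="python", *args, **kwargs):
        # the value given with -a job=... arrives here as a keyword argument
        super().__init__(*args, **kwargs)
        self.encode_job = urllib.parse.quote(job)
        self.start_urls = [self.one_url.format(self.encode_job, 1)]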

5. Modify settings.py

(1) Whether to obey the robots.txt protocol; the default True means obey, and it is usually changed to False:

ROBOTSTXT_OBEY = False

(2) Maximum number of concurrent requests (default 16):

CONCURRENT_REQUESTS = 1

(3) Download delay, similar to time.sleep(2):

DOWNLOAD_DELAY = 2

(4) Default request headers, adding a User-Agent:

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'

(5) Item pipelines; 300 is the priority value, and the smaller the number, the higher the priority:

ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentMysqlPipeline': 200
}

The full settings.py after these changes:

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     /en/latest/topics/settings.html
#     /en/latest/topics/downloader-middleware.html
#     /en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# set the User-Agent
# USER_AGENT = 'Tencent (+)'

# Obey robots.txt rules
# whether to obey robots.txt; the default True means obey, usually changed to False
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# maximum number of concurrent requests, default 16
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# download delay, similar to time.sleep(2)
DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}

# Enable or disable spider middlewares
# See /en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See /en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See /en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See /en/latest/topics/item-pipeline.html
# pipelines; 300 is the priority value, and the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentMysqlPipeline': 200
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See /en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See /en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class TencentPipeline:
    # item-processing logic
    def process_item(self, item, spider):
        print(item)
        return item


# pipeline that writes to MySQL
class TencentMysqlPipeline:
    # called once when the spider opens; used to connect to the database
    def open_spider(self, spider):
        self.db = pymysql.connect(host="localhost", user="root", password="root",
                                  database="tencent", port=3306, charset="utf8")
        # cursor object used to execute MySQL statements
        self.cursor = self.db.cursor()
        print("开始爬虫")

    def process_item(self, item, spider):
        sql_insert = "insert into tencent(name,location,requirement,responsibility) values(%s,%s,%s,%s)"
        data = [item["job_name"], item["job_location"], item["job_requirement"], item["job_responsibility"]]
        self.cursor.execute(sql_insert, data)
        self.db.commit()
        return item

    # called once when the spider closes
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print("退出爬虫")
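
TencentMysqlPipeline assumes a MySQL database named tencent already exists and contains a matching table. A minimal sketch of such a table, with column names taken from the INSERT statement above and TEXT types assumed:

create table tencent(
    name text,
    location text,
    requirement text,
    responsibility text
);

TencentPipeline as shown only prints each item; the tencent.csv file that the word-cloud step reads is easiest to produce with Scrapy's built-in feed export, for example by running the spider as

scrapy crawl tencent -o tencent.csv

or by adding -o tencent.csv to the command string in run.py.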

7. Write the word-cloud code and output the word cloud

Create a word_cloud folder inside the Tencent project, copy the yinwu.jpg and STHUPO.TTF assets into it, and create a wc.py file.
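
WordCloud expects space-separated tokens, which is why the responsibility text is first segmented with jieba and then re-joined with spaces. A quick illustration (the exact tokens jieba produces may differ):

import jieba

tokens = list(jieba.cut("负责招聘数据的抓取和分析"))
print(tokens)            # e.g. ['负责', '招聘', '数据', '的', '抓取', '和', '分析']
print(" ".join(tokens))  # space-separated text ready for WordCloud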

import numpy as np
import pandas as pd
# jieba performs the word segmentation
import jieba
# WordCloud builds the word cloud; ImageColorGenerator colours it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# read the CSV with pandas; the return value is a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")
# take the job_responsibility column and join it into one string
job_responsibility = df["job_responsibility"].values
job_responsibility_str = "".join(job_responsibility)
# segment the text with jieba, then join the tokens with spaces
jieba_split = list(jieba.cut(job_responsibility_str))
text = " ".join(jieba_split)
# read the word-cloud mask image and convert it to a NumPy array
mask = Image.open("yinwu.jpg")
mask = np.array(mask)
# create the WordCloud object
# mask: mask image, stopwords: words to filter out,
# collocations=False removes duplicated words, background_color: background colour
stopwords = ["的", "和", "技", "品"]
wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords,
               collocations=False, background_color="white")
# generate the word cloud
word_image = wc.generate(text)
# recolour the word cloud from the mask image
image_color = ImageColorGenerator(mask)
wc.recolor(color_func=image_color)
# show the word cloud
mp.imshow(word_image)
# hide the axes
mp.axis("off")
# show the figure
mp.show()

The result looks like this:

To generate word clouds for several columns in one go, the code above can be refactored into a get_cloud_img function, as follows:

import numpy as np
import pandas as pd
# jieba performs the word segmentation
import jieba
# WordCloud builds the word cloud; ImageColorGenerator colours it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# read the CSV with pandas; the return value is a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")


def get_cloud_img(data, label):
    # join the column's values into one string
    job_responsibility_str = "".join(data)
    # segment the text with jieba, then join the tokens with spaces
    jieba_split = list(jieba.cut(job_responsibility_str))
    text = " ".join(jieba_split)
    # read the word-cloud mask image and convert it to a NumPy array
    mask = Image.open("yinwu.jpg")
    mask = np.array(mask)
    # create the WordCloud object
    # mask: mask image, stopwords: words to filter out,
    # collocations=False removes duplicated words, background_color: background colour
    stopwords = ["的", "和", "技", "品"]
    wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords,
                   collocations=False, background_color="white")
    # generate the word cloud
    word_image = wc.generate(text)
    # recolour the word cloud from the mask image
    image_color = ImageColorGenerator(mask)
    wc.recolor(color_func=image_color)
    # show the word cloud
    mp.imshow(word_image)
    # hide the axes
    mp.axis("off")
    # save the figure
    mp.savefig("%s.png" % label)
    # show the figure
    mp.show()


# generate one word cloud per column
get_cloud_img(df["job_responsibility"].values, "腾讯招聘-职责词云图")
get_cloud_img(df["job_requirement"].values, "腾讯招聘-需求词云图")
