Scraping and Analyzing Housing Prices in Beijing, Shanghai, Guangzhou, and Shenzhen from the Python Command Line

Date: 2019-03-03 17:56:31

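Introduction

Yesterday, while visiting my hometown, I published "Automatically Scraping and Analyzing Housing Price Data with Python: the Anjuke Edition". Section 6 at the end of that post gave the complete code, which runs under Python 3 and scrapes price data automatically once a cookie is passed on the command line. Back in Shenzhen today, it occurred to me that the script can only scrape prices for Xishuangbanna, so readers who do not modify it themselves cannot scrape any other city. I decided to see the job through: this post modifies the script and, using Beijing, Shanghai, Guangzhou, and Shenzhen as examples, provides complete code that can scrape and analyze prices for other cities as well.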

1. Complete Python script

Starting from the script in the previous post and making a few changes, save the following code to the file crawl_anjuke.py.

#!/usr/local/bin/python

import argparse
import os
import sys
import time

import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt


def get_headers(city, page, cookie):
    # Request headers copied from a browser session; only city, page and cookie vary.
    headers = {
        'authority': '{}.'.format(city),
        'method': 'GET',
        'path': '/community/p{}/'.format(page),
        'scheme': 'https',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'no-cache',
        'cookie': cookie,
        'pragma': 'no-cache',
        'referer': 'https://{}./community/p{}/'.format(city, page),
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    return headers


def get_html_by_page(city, page, cookie):
    headers = get_headers(city, page, cookie)
    url = 'https://{}./community/p{}/'.format(city, page)
    res = requests.get(url, headers=headers)
    if res.status_code != 200:
        print('page not found!')
        return None
    return res.text


def extract_data_from_html(html):
    soup = BeautifulSoup(html, features='lxml')
    list_content = soup.find(id="list-content")
    if not list_content:
        print('list-content element not found!')
        return None
    items = list_content.find_all('div', class_='li-itemmod')
    if len(items) == 0:
        print('items is empty!')
        return None
    return [extract_data(item) for item in items]


def extract_data(item):
    name = item.find_all('a')[1].text.strip()
    address = item.address.text.strip()
    if item.strong is not None:
        price = item.strong.text.strip()
    else:
        price = None
    # The page text uses a full-width colon between the label and the date.
    finish_date = item.p.text.strip().split('：')[1]
    # The first two key=value pairs after '#' in the map link are latitude and longitude.
    latitude, longitude = [d.split('=')[1] for d in item.find_all('a')[3].attrs['href'].split('#')[1].split('&')[:2]]
    return name, address, price, finish_date, latitude, longitude


def is_in_notebook():
    return 'ipykernel' in sys.modules


def clear_output():
    os.system('cls' if os.name == 'nt' else 'clear')
    if is_in_notebook():
        from IPython.display import clear_output as clear
        clear()


def crawl_all_page(city, cookie, limit=0):
    page = 1
    data_raw = []
    while True:
        try:
            if limit != 0 and (page - 1 == limit):
                break
            html = get_html_by_page(city, page, cookie)
            # Stop on a non-200 response instead of passing None to the parser.
            if html is None:
                break
            data_page = extract_data_from_html(html)
            if not data_page:
                break
            data_raw += data_page
            clear_output()
            print('crawling {}th page ...'.format(page))
            page += 1
        except Exception:
            print('maybe cookie expired!')
            break
    print('crawl {} pages in total.'.format(page - 1))
    return data_raw


def create_df(data):
    columns = ['name', 'address', 'price', 'finish_date', 'latitude', 'longitude']
    return pd.DataFrame(data, columns=columns)


def clean_data(df):
    # Drop communities without a listed price, then convert the numeric columns.
    df = df.dropna(subset=['price'])
    df = df.astype({'price': 'float64', 'latitude': 'float64', 'longitude': 'float64'})
    return df


def visual(df):
    fig, ax = plt.subplots()
    df.plot(y='price', ax=ax, bins=20, kind='hist', label='house price frequency histogram', legend=False)
    ax.set_title('House price distribution')
    ax.set_xlabel('price')
    ax.set_ylabel('frequency')
    plt.grid()
    plt.show()


def run(city, cookie, limit):
    data = crawl_all_page(city, cookie, limit)
    if len(data) == 0:
        print('empty: crawled nothing!')
        return
    df = create_df(data)
    df = clean_data(df)
    visual(df)

    _price = df['price']
    _max, _min, _average, _median, _std = _price.max(), _price.min(), _price.mean(), _price.median(), _price.std()
    print('\n{} house price statistics\n-------'.format(city))
    print('count:\t{}'.format(df.shape[0]))
    print('max:\t¥{}\nmin:\t¥{}\naverage:\t¥{}\nmedian:\t¥{}\nstd:\t¥{}\n'.format(_max, _min, round(_average, 1), _median, round(_std, 1)))

    df.sort_values('price', inplace=True)
    df.reset_index(drop=True, inplace=True)
    # Save the CSV file.
    dt = time.strftime("%Y-%m-%d", time.localtime())
    csv_file = 'anjuke_{}_community_price_{}.csv'.format(city, dt)
    df.to_csv(csv_file, index=False)


def get_cli_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--city', type=str, required=True, help='city.')
    parser.add_argument('-k', '--cookie', type=str, required=True, help='cookie.')
    parser.add_argument('-l', '--limit', type=int, default=0, help='page limit (30 records per page).')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = get_cli_args()
    run(args.city, args.cookie, args.limit)
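The coordinate extraction in extract_data chains several split calls on the map link's href. As a hedged alternative, the sketch below parses the URL fragment with the standard library instead. The function name parse_coordinates, the sample href, and its parameter names (l1, l2, l3) are made up for illustration; the only thing taken from the script is the assumption that the first two key=value pairs after '#' hold latitude and longitude.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_coordinates(href):
    """Return (latitude, longitude) strings from the fragment of a map link.

    Assumes, like the script, that the first two key=value pairs after '#'
    carry latitude and longitude. Returns (None, None) if they are missing.
    """
    fragment = urlsplit(href).fragment
    pairs = parse_qsl(fragment, keep_blank_values=True)
    if len(pairs) < 2:
        return None, None
    return pairs[0][1], pairs[1][1]


# Hypothetical href; the real parameter names on the site may differ.
sample = 'https://example.com/map#l1=22.54&l2=114.05&l3=18'
print(parse_coordinates(sample))
```

Unlike the split chain, this version returns (None, None) for a link without a fragment rather than raising IndexError, so one malformed listing would not end the crawl.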

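2. New parameters

2.1 city

As the name suggests, city specifies which city the script scrapes. Where does the value come from, and can you pass anything you like? No: the data comes from the website, so it has to be a city the site supports. On Anjuke each city is served from its own subdomain, so for the Beijing site the city value is beijing.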
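Because only certain city values are valid, it can help to reject a typo before any request is sent. The following is a minimal sketch using the choices option of argparse; the SUPPORTED_CITIES list and the build_parser helper are assumptions for illustration and are not part of the original script, which accepts any string.

```python
import argparse

# Assumed list of supported city values; extend it with any other
# city subdomain the site actually serves.
SUPPORTED_CITIES = ['beijing', 'shanghai', 'guangzhou', 'shenzhen']


def build_parser():
    """Build a parser that only accepts known city values."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--city', type=str, required=True,
                        choices=SUPPORTED_CITIES, help='city subdomain.')
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(['--city', 'beijing'])
print(args.city)
```

With this in place, an unsupported value makes argparse print a usage error and exit instead of starting a crawl that is bound to fail.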

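2.2 limit

The maximum number of pages to scrape. This parameter is needed because collecting every residential community in a city means fetching the listing page by page, and observation shows that Anjuke passes the page number in the URL. The natural approach is to start at page 1 and increment the page variable after each successful fetch until no data comes back. While scraping Shenzhen, however, I found that the site's pager only goes up to page 50, and that requesting any page beyond 50 returns the data of page 1. The auto-increment strategy therefore never sees an empty page, and the loop cannot exit. The limit parameter fixes this by setting the maximum page count by hand. To find the right value, open the target city's listing on the site and look up the last page number. For Shenzhen (/community/p50/), set limit to 50.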
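Since the site answers out-of-range pages with page 1 again, another way to end the loop is to notice the repeat. The sketch below illustrates that idea under the assumption that the wrapped response is identical to the first page; crawl_until_repeat, fetch_page, and fake_fetch are hypothetical names, and fake_fetch stands in for the real network call.

```python
def crawl_until_repeat(fetch_page, max_pages=1000):
    """Collect pages until one is empty or repeats the first page.

    fetch_page is a hypothetical callable: page number -> list of records
    (for example, the tuples returned by extract_data_from_html).
    """
    records = []
    first_page = None
    for page in range(1, max_pages + 1):
        data_page = fetch_page(page)
        if not data_page:
            break
        if first_page is None:
            first_page = data_page
        elif data_page == first_page:
            # The site wrapped around to page 1: stop here.
            break
        records.extend(data_page)
    return records


# Fake site with 3 real pages; any later page returns page 1 again.
def fake_fetch(page):
    pages = {1: [('a', 1)], 2: [('b', 2)], 3: [('c', 3)]}
    return pages.get(page, pages[1])


print(len(crawl_until_repeat(fake_fetch)))
```

If the listing order changes between requests the comparison may never match, so an explicit limit remains the safer fallback.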

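Note: the cookie parameter is the same as in the previous post, "Automatically Scraping and Analyzing Housing Price Data with Python: the Anjuke Edition".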

3. Scraping Beijing, Shanghai, Guangzhou, and Shenzhen from the command line

3.1 Scraping Beijing housing prices

python crawl_anjuke.py --city beijing --limit 50 --cookie "sessid=5AACB464..."

3.2 Scraping Shanghai housing prices

python crawl_anjuke.py --city shanghai --limit 50 --cookie "sessid=5AACB464..."

3.3 Scraping Guangzhou housing prices

python crawl_anjuke.py --city guangzhou --limit 50 --cookie "sessid=5AACB464..."

3.4 Scraping Shenzhen housing prices

python crawl_anjuke.py --city shenzhen --limit 50 --cookie "sessid=5AACB464..."
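The four commands differ only in the city, so they can also be generated in a loop. This is a rough sketch, assuming crawl_anjuke.py sits in the current directory and that one cookie is valid for every city; build_command and crawl_cities are hypothetical helpers, and YOUR_COOKIE is a placeholder.

```python
import subprocess
import sys

CITIES = ['beijing', 'shanghai', 'guangzhou', 'shenzhen']


def build_command(city, cookie, limit=50, script='crawl_anjuke.py'):
    """Return the argument list for one run of the crawler."""
    return [sys.executable, script,
            '--city', city, '--limit', str(limit), '--cookie', cookie]


def crawl_cities(cookie, cities=CITIES, limit=50):
    """Run the crawler once per city, stopping at the first failure."""
    for city in cities:
        subprocess.run(build_command(city, cookie, limit), check=True)


# Show the command for one city without running it.
print(build_command('beijing', 'YOUR_COOKIE'))
```

Calling crawl_cities('...') with a real cookie runs the four crawls in sequence. Each run calls plt.show(), so with an interactive matplotlib backend it waits for the histogram window to be closed before the next city starts.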

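4. Data analysis

4.1 Loading the data

After running the commands in section 3, the following four CSV files appear in the current directory. The date suffix is the date on which the commands were run.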

anjuke_beijing_community_price_-09-17.csv
anjuke_shanghai_community_price_-09-17.csv
anjuke_guangzhou_community_price_-09-17.csv
anjuke_shenzhen_community_price_-09-17.csv

import pandas as pd
import time

# The file names use today's date, so run this on the same day as the crawl.
dt = time.strftime("%Y-%m-%d", time.localtime())
cities = ['beijing', 'shanghai', 'guangzhou', 'shenzhen']
csv_files = ['anjuke_{}_community_price_{}.csv'.format(city, dt) for city in cities]
city_cn = {
    'beijing': '北京',
    'shanghai': '上海',
    'guangzhou': '广州',
    'shenzhen': '深圳'
}

dfs = []
for city, f in zip(cities, csv_files):
    df = pd.read_csv(f)
    # Add the city name as the first column.
    df.insert(0, 'city', city_cn[city])
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df.sample(10)
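The loader above builds the file names from today's date, so it only finds files scraped the same day. As a possible workaround, the sketch below discovers the crawler's CSV files in a directory and keeps the most recent one per city. It assumes the file name pattern written by crawl_anjuke.py; parse_csv_name and latest_csv_per_city are hypothetical helpers, and the sample file name uses a made-up date.

```python
import re
from pathlib import Path

# File name pattern written by crawl_anjuke.py:
# anjuke_<city>_community_price_<YYYY-MM-DD>.csv
CSV_PATTERN = re.compile(
    r'^anjuke_(?P<city>[a-z]+)_community_price_(?P<date>\d{4}-\d{2}-\d{2})\.csv$')


def parse_csv_name(name):
    """Return (city, date) for a crawler output file name, else None."""
    match = CSV_PATTERN.match(name)
    if match is None:
        return None
    return match.group('city'), match.group('date')


def latest_csv_per_city(directory='.'):
    """Map each city to the path of its most recent CSV in the directory."""
    latest = {}
    for path in Path(directory).glob('anjuke_*_community_price_*.csv'):
        parsed = parse_csv_name(path.name)
        if parsed is None:
            continue
        city, date = parsed
        # ISO dates sort correctly as plain strings.
        if city not in latest or date > latest[city][0]:
            latest[city] = (date, path)
    return {city: path for city, (date, path) in latest.items()}


# Made-up file name for illustration.
print(parse_csv_name('anjuke_shenzhen_community_price_2020-01-02.csv'))
```

The paths returned by latest_csv_per_city() can be passed to pd.read_csv in place of the date-based file names.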

4.2 Price statistics grouped by city

count: number of records
mean: average
std: standard deviation
min: minimum
25%: first quartile
50%: median
75%: third quartile
max: maximum

df.groupby('city')['price'].describe()

4.3 Data visualization

Grouped by city, plot the price distribution as a violin plot and a box plot.

import seaborn as sb
import matplotlib.pyplot as plt
# %matplotlib inline

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
ax1 = sb.violinplot(data=df, x='city', y='price', inner='quartile')
plt.subplot(1, 2, 2)
sb.boxplot(data=df, x='city', y='price')
# Use the same y-axis range for both plots.
plt.ylim(ax1.get_ylim())
plt.show()
