Come Mine Happiness Together (Complete Edition)
These are my study notes for the Alibaba Cloud Tianchi Longzhu Plan machine learning training camp. Course link: AI训练营机器学习-阿里云天池.
Contents:
- Competition introduction (yes, the link above already includes it, but I copied it here anyway; this is definitely not padding the word count)
- Understanding the data, with initial exploration and visualization
- Feature engineering
- Model building
Note: since I am still early in my studies, I referenced some forum members' work in the model-building section to improve accuracy; my special thanks to them for sharing.
Where I use that code, I have also added comments with my own understanding from when I studied it; if this is your first contact with these modules, they may be helpful.
1. Competition Introduction
Background: In the social sciences, the study of happiness occupies an important place. The topic spans philosophy, psychology, sociology, economics, and other disciplines; it is complex and interesting, and it is closely tied to everyday life, since everyone has their own standard for measuring happiness. If we can discover common factors that influence happiness, life may hold a bit more joy; if we can find the policy factors that influence happiness, resources can be allocated to raise national well-being. Current social-science research emphasizes interpretable variables and actionable policy, relying mainly on linear regression and logistic regression, and has produced a series of findings on socioeconomic and demographic factors such as income, health, occupation, social relationships, and leisure, as well as macro factors such as government public services, the macroeconomic environment, and the tax burden.
This competition takes up the classic problem of happiness prediction, hoping to try algorithmic approaches beyond existing social-science research, combine the strengths of multiple disciplines, mine latent influencing factors, and discover more interpretable, understandable relationships.
Competition Description
The competition uses survey results from public data, selecting several groups of variables: individual variables (gender, age, region, occupation, health, marriage, political affiliation, etc.), family variables (parents, spouse, children, family capital, etc.), and social attitudes (fairness, trust, public services, etc.), to predict each respondent's rating of their own happiness.
Prediction accuracy is not the competition's only goal; the hope is that contestants also gain insight into the relationships between variables and the meaning of variable groups.
Data Description
Since there are many variables and some inter-variable relationships are complex, the data comes in a complete version and an abbreviated version. You can start with the abbreviated version to get familiar with the problem, then use the complete version to mine more information. The complete files contain the full variable set; the abbr files contain the abbreviated set.
The index file gives the questionnaire item behind each variable and the meaning of each variable's values.
The survey file is the original questionnaire, provided as supplementary material to help understand the problem background.
Data source: the data comes from the Chinese General Social Survey (CGSS), led by the National Survey Research Center at Renmin University of China; the organizers thank the institution and its staff for providing the data. CGSS is a multi-stage stratified, cross-sectional, face-to-face survey.
External data: the competition centers on data mining and analysis, so external data is not restricted; public data such as macroeconomic indicators or government redistribution policies is welcome, and contestants are encouraged to share what they use.
Evaluation Metric
The submission is a csv file with two columns: id and the predicted happiness value.
Score formula:
Score = (1/n) * Σ (y_i − y*)^2
where n is the number of test-set samples, y_i is the prediction for the i-th sample, and y* is the true value.
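The score is plain mean squared error, so it is easy to sanity-check locally before submitting. A minimal sketch, with made-up numbers:

```python
import numpy as np

def happiness_score(y_true, y_pred):
    # Mean squared error, matching the score formula above (lower is better)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Made-up example: three samples
print(happiness_score([4, 5, 3], [4.5, 4.0, 3.0]))  # 0.4166...
```
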
The competition data is available from Alibaba Cloud Tianchi:
快来一起挖掘幸福感!赛题与数据-天池大赛-阿里云天池 /competition/entrance/231702/information
2. Data Processing
# Imports for the whole project
import os
import time
import pandas as pd
import numpy as np
import lightgbm as lgb                # gradient-boosting library used for the final model
import matplotlib.pyplot as plt       # needed below for plt.subplots
import seaborn as sns
from sklearn.metrics import roc_auc_score, roc_curve   # area under the ROC curve
from sklearn.model_selection import KFold               # k-fold cross-validation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import mean_squared_error          # mean squared error

# 1. Load the data
train = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
test = pd.read_csv("happiness_test_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
train.head()

# 2. Understand the data
# Look at the distribution of each feature
train.describe()

# Drop training rows whose label is invalid (-8)
train = train.loc[train['happiness'] != -8]

# Class distribution: the classes are clearly imbalanced
f, ax = plt.subplots(1, 2, figsize=(18, 8))
train['happiness'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('happiness')
ax[0].set_ylabel('')
train['happiness'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('happiness')
plt.show()
# Gender vs. happiness
ax = sns.countplot(x='gender', hue='happiness', data=train)
ax.set_title('Sex:happiness')
# Age vs. happiness
train['survey_time'] = train['survey_time'].dt.year
test['survey_time'] = test['survey_time'].dt.year
train['Age'] = train['survey_time'] - train['birth']
test['Age'] = test['survey_time'] - test['birth']
del_list = ['survey_time', 'birth']
figure, ax = plt.subplots(1, 1)
train['Age'].plot.hist(ax=ax, color='blue')
# Bin the ages to reduce the influence of noise and outliers
combine = [train, test]
for dataset in combine:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 64) & (dataset['Age'] <= 80), 'Age'] = 4
    dataset.loc[dataset['Age'] > 80, 'Age'] = 5
sns.countplot(x='Age', hue='happiness', data=train)
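As a side note, the chained .loc assignments can be expressed in a single pd.cut call with the same bin edges; a small sketch on made-up ages:

```python
import pandas as pd

ages = pd.DataFrame({'Age': [10, 20, 40, 50, 70, 90]})
bins = [-float('inf'), 16, 32, 48, 64, 80, float('inf')]  # same edges as above
ages['Age'] = pd.cut(ages['Age'], bins=bins, labels=[0, 1, 2, 3, 4, 5]).astype(int)
print(ages['Age'].tolist())  # [0, 1, 2, 3, 4, 5]
```

pd.cut uses right-closed intervals by default, which matches the `<=` comparisons in the loop.
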
figure1, ax1 = plt.subplots(1, 5, figsize=(18, 4))
train['happiness'][train['Age'] == 1].value_counts().plot.pie(autopct='%1.1f%%', ax=ax1[0], shadow=True)
train['happiness'][train['Age'] == 2].value_counts().plot.pie(autopct='%1.1f%%', ax=ax1[1], shadow=True)
train['happiness'][train['Age'] == 3].value_counts().plot.pie(autopct='%1.1f%%', ax=ax1[2], shadow=True)
train['happiness'][train['Age'] == 4].value_counts().plot.pie(autopct='%1.1f%%', ax=ax1[3], shadow=True)
train['happiness'][train['Age'] == 5].value_counts().plot.pie(autopct='%1.1f%%', ax=ax1[4], shadow=True)
# Feature selection
# For now, select features purely by their correlation with the target
train.corr()['happiness'][abs(train.corr()['happiness']) > 0.05]
happiness 1.000000
edu 0.103048
edu_yr 0.055564
political 0.080986
join_party 0.069007
property_8 -0.051929
weight_jin 0.085841
health 0.250538
health_problem 0.186620
depression 0.304973
hukou 0.072936
media_1 0.095035
media_2 0.084872
media_3 0.091431
media_4 0.098809
media_5 0.065220
media_6 0.059273
leisure_1 -0.077097
leisure_3 -0.070262
leisure_4 -0.095676
leisure_6 -0.107672
leisure_7 -0.07
leisure_8 -0.100313
leisure_9 -0.148888
leisure_12 -0.068778
socialize 0.082206
relax 0.113233
learn 0.108294
social_friend -0.091079
socia_outing 0.059567
...
family_income 0.051506
family_m 0.061062
family_status 0.204702
house 0.089261
car -0.085387
invest_1 -0.055013
invest_2 0.054019
s_edu 0.125679
s_political 0.068802
s_hukou 0.071953
status_peer -0.150246
status_3_before -0.076808
view 0.078986
trust_1 0.069830
trust_2 0.054909
trust_5 0.102110
trust_7 0.060102
trust_8 0.065644
trust_10 0.069740
trust_12 0.057885
neighbor_familiarity 0.054074
public_service_1 0.112537
public_service_2 0.126029
public_service_3 0.134028
public_service_4 0.129880
public_service_5 0.136347
public_service_6 0.162514
public_service_7 0.154029
public_service_8 0.128678
public_service_9 0.129723
Name: happiness, Length: 65, dtype: float64
# Keep the features whose |correlation| with happiness exceeds 0.05, plus
# two features we consider important ('Age', 'work_exper'): 66 features in total
features = (train.corr()['happiness'][abs(train.corr()['happiness']) > 0.05]).index
features = features.values.tolist()
features.extend(['Age', 'work_exper'])
features.remove('happiness')   # the target itself must not be used as a feature
len(features)
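To make the selection rule concrete, here is the same |correlation| > threshold idea on a tiny made-up frame (all column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'happiness': [1, 2, 3, 4, 5],
    'health':    [1, 2, 3, 4, 5],   # perfectly correlated with the target
    'noise':     [3, 1, 4, 1, 5],   # only weakly correlated
})
corr = df.corr()['happiness']
selected = corr[abs(corr) > 0.5].index.tolist()
selected.remove('happiness')        # the target always correlates 1.0 with itself
print(selected)  # ['health']
```

One caveat worth remembering: correlation captures only linear, one-variable-at-a-time relationships, so a threshold like this can discard features that matter in combination.
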
3. Model Building
# Build the feature matrix and labels
target = train['happiness']
train_selected = train[features]
test = test[features]
feature_importance_df = pd.DataFrame()
oof = np.zeros(len(train))          # out-of-fold predictions on the training set
predictions = np.zeros(len(test))   # test predictions averaged over the folds

# LightGBM hyperparameters (mostly the result of tuning)
params = {
    'num_leaves': 9,
    'min_data_in_leaf': 40,
    'objective': 'regression',
    'max_depth': 16,
    'learning_rate': 0.01,
    'boosting': 'gbdt',
    'bagging_freq': 5,
    'bagging_fraction': 0.8,     # fraction of rows used per iteration
    'feature_fraction': 0.8201,  # fraction of features sampled to build each tree
    'bagging_seed': 11,
    'reg_alpha': 1.728910519108444,
    'reg_lambda': 4.9847051755586085,
    'random_state': 42,
    'metric': 'rmse',
    'verbosity': -1,
    'subsample': 0.81,
    'min_gain_to_split': 0.01077313523861969,
    'min_child_weight': 19.428902804238373,
    'num_threads': 4
}

kfolds = KFold(n_splits=5, shuffle=True, random_state=15)
for fold_n, (trn_index, val_index) in enumerate(kfolds.split(train_selected, target)):
    print("fold_n {}".format(fold_n))
    trn_data = lgb.Dataset(train_selected.iloc[trn_index], label=target.iloc[trn_index])
    val_data = lgb.Dataset(train_selected.iloc[val_index], label=target.iloc[val_index])
    num_round = 10000
    clf = lgb.train(params, trn_data, num_round, valid_sets=[trn_data, val_data],
                    verbose_eval=1000, early_stopping_rounds=100)
    oof[val_index] = clf.predict(train_selected.iloc[val_index], num_iteration=clf.best_iteration)
    predictions += clf.predict(test, num_iteration=clf.best_iteration) / 5
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_n + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    # Printed once per fold; folds not yet predicted are still 0 in `oof`,
    # so only the value after the last fold is the real CV RMSE
    print("CV score: {:<8.5f}".format(mean_squared_error(target, oof)**0.5))
fold_n 0
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is:
[835] training's rmse: 0.631985 valid_1's rmse: 0.681313
CV score: 3.54967
fold_n 1
Training until validation scores don't improve for 100 rounds.
[1000] training's rmse: 0.618499 valid_1's rmse: 0.691913
Early stopping, best iteration is:
[1035] training's rmse: 0.616785 valid_1's rmse: 0.691461
CV score: 3.10335
fold_n 2
Training until validation scores don't improve for 100 rounds.
[1000] training's rmse: 0.629692 valid_1's rmse: 0.649479
Early stopping, best iteration is:
[1545] training's rmse: 0.605341 valid_1's rmse: 0.64739
CV score: 2.55891
fold_n 3
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is:
[865] training's rmse: 0.62686 valid_1's rmse: 0.699352
CV score: 1.87169
fold_n 4
Training until validation scores don't improve for 100 rounds.
[1000] training's rmse: 0.62042 valid_1's rmse: 0.685821
Early stopping, best iteration is:
[1460] training's rmse: 0.600172 valid_1's rmse: 0.684194
CV score: 0.68097
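The out-of-fold pattern used in the loop above is not specific to LightGBM. Here is a self-contained sketch of the same pattern with scikit-learn's Ridge as a stand-in model, on synthetic data (all data and parameters here are made up):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

oof = np.zeros(len(X))
kfolds = KFold(n_splits=5, shuffle=True, random_state=15)
for trn_index, val_index in kfolds.split(X):
    model = Ridge(alpha=1.0).fit(X[trn_index], y[trn_index])
    # each sample is predicted exactly once, by a model that never saw it
    oof[val_index] = model.predict(X[val_index])

print(mean_squared_error(y, oof) ** 0.5)  # CV RMSE over all samples
```

Because every training sample receives exactly one prediction from a fold that excluded it, a single RMSE over `oof` is an honest estimate of generalization error, which is what the final "CV score" line above reports.
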
# Average feature importance across folds and plot it
cols = (feature_importance_df[["feature", "importance"]]
        .groupby("feature").mean()
        .sort_values(by="importance", ascending=False)[:1000].index)
best_features = feature_importance_df.loc[feature_importance_df.feature.isin(cols)]
plt.figure(figsize=(14, 26))
sns.barplot(x="importance", y="feature",
            data=best_features.sort_values(by="importance", ascending=False))
plt.title('LightGBM Features (averaged over folds)')
plt.tight_layout()