
Kaggle Employee Attrition Prediction Case Study (1)

Date: 2020-08-11 05:51:22

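Quick links 🚪

For the modeling process, see "Kaggle Employee Attrition Prediction Case Study (2)".

For model evaluation, see "Kaggle Employee Attrition Prediction Case Study (3)".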

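Background

In my WeChat Moments I happened to see an article on model evaluation shared by my teacher, "A Tutorial on Evaluating Machine Learning Models!". It lays out and explains, in an accessible way, how to evaluate a model once it has been built. That article concentrates on evaluation; the data preparation and modeling steps are fairly basic, so it leaves them out. For a machine learning beginner like me, this is a perfect chance to learn from a worked case. So, without further ado, I tracked down the data source given in the article and got started!

Wait... before getting started, a quick word on the background. Put simply, the task is to build a model on an existing dataset that predicts employee attrition. Will the employee leave? (YES OR NO) This is clearly a simple binary classification problem. The dataset contains many dimensions of employee data, such as basic personal information, education level, employee satisfaction and salary. A detailed description of the data follows below.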

Data source

The dataset for this case comes from a dataset published on Kaggle. The download link is:

/pavansubhasht/ibm-hr-analytics-attrition-dataset

Do It !

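Data overview:

Dependent variable (y):

(1) Attrition: whether the employee has left the company. In the raw data, Yes means the employee has left and No means the employee has not; the code below encodes these as 1 and 0. This is the prediction target;

Independent variables (X):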

(1) Age: employee age;

(2) MonthlyRate: monthly rate;

(3) BusinessTravel: frequency of business travel; Non-Travel means no travel, Travel_Rarely means infrequent travel, Travel_Frequently means frequent travel;

(4) Department: the employee's department; Sales, Research & Development, Human Resources;

(5) DistanceFromHome: distance between the company and the employee's home, from 1 to 29; 1 is the nearest, 29 the farthest;

(6) Education: the employee's education level, from 1 to 5; 5 is the highest;

(7) EducationField: the employee's field of study; Life Sciences, Medical, Marketing, Technical Degree, Human Resources, Other;

(8) EmployeeNumber: employee number;

(9) EnvironmentSatisfaction: the employee's satisfaction with the work environment, from 1 to 4; 1 is the lowest, 4 the highest;

(10) Gender: the employee's gender; Male, Female;

(11) JobInvolvement: the employee's job involvement, from 1 to 4; 1 is the lowest, 4 the highest;

(12) JobLevel: job level, from 1 to 5; 1 is the lowest, 5 the highest;

(13) JobRole: job role; Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, Human Resources;

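(14) JobSatisfaction: job satisfaction, from 1 to 4; 1 is the lowest, 4 the highest;

(15) MaritalStatus: the employee's marital status; Single, Married, Divorced;

(16) MonthlyIncome: the employee's monthly income, ranging from 1009 to 19999;

(17) NumCompaniesWorked: number of companies the employee has worked for;

(18) Over18: whether the employee is over 18;

(19) OverTime: whether the employee works overtime; Yes or No;

(20) PercentSalaryHike: percentage salary increase;

(21) PerformanceRating: performance rating;

(22) RelationshipSatisfaction: relationship satisfaction, from 1 to 4; 1 is the lowest, 4 the highest;

(23) StandardHours: standard working hours;

(24) StockOptionLevel: stock option level;

(25) TotalWorkingYears: total years worked;

(26) TrainingTimesLastYear: training in the previous year, from 0 to 6; 0 means no training, 6 the most training;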

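(27) WorkLifeBalance: work-life balance, from 1 to 4; 1 is the lowest, 4 the highest;

(28) YearsAtCompany: years at the current company;

(29) YearsInCurrentRole: years in the current role;

(30) YearsSinceLastPromotion: years since the last promotion;

(31) YearsWithCurrManager: years working with the current manager;

(32) DailyRate: daily rate;

(33) EmployeeCount: employee count;

(34) HourlyRate: hourly rate;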

Data analysis:

Import the required packages. The packages used for modeling are imported later.

import csv
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Ignore warnings
warnings.filterwarnings('ignore')
# Use the ggplot plotting style
plt.style.use('ggplot')
%matplotlib inline

data = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
data.tail()

# Count the employees who stayed (No) and the employees who left (Yes)
a = (data['Attrition'] == 'No').sum()
b = (data['Attrition'] == 'Yes').sum()
labels = ['0', '1']  # 0 = stayed, 1 = left
explo = [0.1, 0]
plt.pie([a, b], labels=labels, explode=explo, shadow=True, startangle=0,
        autopct='%1.2f%%', wedgeprops={'edgecolor': 'black'})
plt.show()

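Analysis:

The data has 1470 rows (sample size) × 35 columns (variables), of which 26 variables are numeric and 9 are text. Four of the 35 variables can be dropped: EmployeeCount, EmployeeNumber, Over18 and StandardHours. Common sense says an employee number has little to do with whether someone leaves (anything else would be superstition, haha), and the other three take the same value for every employee, so they cannot affect the result.

For the numeric variables we need to look at their spread and decide whether to scale them; for the text variables we need to consider one-hot encoding.

The target variable is imbalanced: employees who left make up only 16% of all samples. As a result, a model may mostly learn the characteristics of employees who stayed and learn little about those who left, because there are too few of them. Sad. To deal with this we need to balance the samples with undersampling or oversampling. At this point you will surely ask which method is better. Honestly, I don't know either, so later we try both and see which one gives the model better accuracy.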
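The checks described in this analysis (column types, constant columns, class balance) can be sketched as a small helper. This is only an illustration: the `summarize` function and the toy values are assumptions made up for this example, not part of the original notebook.

```python
import pandas as pd

def summarize(df: pd.DataFrame, target: str) -> dict:
    """Collect the basic facts used to plan preprocessing (hypothetical helper)."""
    numeric_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    text_cols = [c for c in df.columns if c not in numeric_cols]
    # A column with a single distinct value carries no information for the model.
    constant_cols = [c for c in df.columns if df[c].nunique() == 1]
    # Share of each class in the target column.
    class_share = df[target].value_counts(normalize=True).to_dict()
    return {
        "numeric": numeric_cols,
        "text": text_cols,
        "constant": constant_cols,
        "class_share": class_share,
    }

# Toy rows standing in for the real CSV (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "No", "No", "No"],
    "Over18": ["Y", "Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80, 80],
})
summary = summarize(toy, "Attrition")
print(summary)
```

On the real data, the same call would be `summarize(data, 'Attrition')`, run before any columns are dropped.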

# Drop the four uninformative variables listed above
data = data.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis='columns')
# Split the remaining columns by type: text and numeric
category = [f for f in data.columns if data.dtypes[f] == 'object']
numeric = [f for f in data.columns if data.dtypes[f] != 'object']

You can print each list to check. Good: the columns are gone and we have a new data.

# Show how each numeric variable relates to the target Attrition, using box plots
plt.figure(figsize=(30, 30))
for idx, col in enumerate(numeric, start=1):
    plt.subplot(5, 6, idx)
    sns.boxplot(x='Attrition', y=col, data=data)

# Show the distribution of the numeric variables
data.hist(edgecolor='black', linewidth=1.2, figsize=(20, 20));

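# Heat map
# Encode the target as integers: Yes -> 1, No -> 0
data['Attrition'] = data['Attrition'].map({'Yes': 1, 'No': 0}).astype(int)
# Correlate the numeric columns only; the text columns cannot be correlated
corrmat = data.corr(numeric_only=True)
# Keep the features whose absolute correlation with Attrition exceeds 0.04
strong = corrmat['Attrition'].abs() > 0.04
# Names of those features, ordered from most positive to most negative correlation
cols = corrmat.loc[strong, 'Attrition'].sort_values(ascending=False).index.tolist()
plt.figure(figsize=(20, 15))
sns.heatmap(data[cols].corr(), annot=True, square=True)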

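Analysis:

The box plots show, for each variable, which values are associated with more attrition. For Age, for example, younger employees are more likely to leave than older ones. The heat map shows how strongly each feature correlates with Attrition: the closer the value is to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; the closer to 0, the weaker the correlation. The histograms show that features such as Age and Education are roughly normally distributed, whereas features such as DistanceFromHome, MonthlyIncome, YearsAtCompany and YearsInCurrentRole are clearly right-skewed (positively skewed). This can be corrected with the np.log1p() function.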
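To illustrate the effect of `np.log1p()` on a right-skewed feature, here is a minimal sketch. The data is synthetic (a log-normal sample generated for this example, standing in for a column such as MonthlyIncome), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values; an assumption for illustration, not the real column.
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=1000))

skew_before = income.skew()
# log1p(x) = log(1 + x); it compresses the long right tail.
skew_after = np.log1p(income).skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution, so the drop between the two printed values is what the correction is meant to achieve.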

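Data processing:

step 1: Normalize the data.

Here a selection of the numeric variables is normalized so that they share the same scale, for example the range 0 to 1.

(Why do this? See "Skewed Distributions and Data Standardization in Machine Learning".)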
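Min-max normalization maps each value to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below spells this out with NumPy; `min_max_scale` is a hypothetical helper written for this example, and returning zeros for a constant column is a choice made here, not something taken from the original notebook.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no spread; map it to zeros to avoid dividing by zero.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

ages = np.array([18, 30, 39, 60])  # made-up ages for illustration
scaled = min_max_scale(ages)
print(scaled)
```

With its default settings, scikit-learn's MinMaxScaler applies the same formula column by column.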

# Normalize the data with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

data_norm = data.copy()  # copy so that the original data is left unchanged
lst = ['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager']
for i in lst:
    # fit_transform expects a 2-D array; ravel() turns the result back into a 1-D column
    data_norm['{}_norm'.format(i)] = MinMaxScaler().fit_transform(data[i].values.reshape(-1, 1)).ravel()
# Drop the original variables
data_norm = data_norm.drop(lst, axis='columns')
data_norm

step 2: One-hot encode the text variables

# First collect the text variables into a new DataFrame, data1_norm
category = [f for f in data_norm.columns if data_norm.dtypes[f] == 'object']
data1_norm = data_norm[category]
data1_norm

# Check each text variable for unexpected values
for i in category:
    print(data1_norm[i].value_counts())
    print('*' * 40)

# pandas' get_dummies() makes one-hot encoding straightforward
data_dummies = pd.get_dummies(data1_norm)
print(list(data_dummies))
data_dummies

# Merge the encoded columns with the rest of the data and drop the original text variables
data_final_norm = pd.concat((data_norm.drop(list(data1_norm), axis='columns'), data_dummies), axis=1)

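After all this processing, the original 35 columns have become 52. We save the final data to a new CSV file so it does not get lost (the modeling later on uses this final data).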
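The count of 52 can be checked by hand from the variable list: after dropping four columns, 31 remain; 7 of them are text columns with 3 + 3 + 6 + 2 + 9 + 3 + 2 = 28 categories in total, and the other 24 (23 numeric features plus Attrition) stay as they are, giving 24 + 28 = 52. The sketch below shows the same mechanics on a tiny made-up frame; the column values are assumptions for illustration, and `dtype=int` is used only so the dummies print as 0/1.

```python
import pandas as pd

# Toy frame: one already-normalized numeric column and two text columns.
toy = pd.DataFrame({
    "Age_norm": [0.2, 0.5, 0.9],
    "OverTime": ["Yes", "No", "Yes"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})
text_cols = ["OverTime", "MaritalStatus"]

# Each text column becomes one 0/1 column per category.
dummies = pd.get_dummies(toy[text_cols], dtype=int)
encoded = pd.concat([toy.drop(columns=text_cols), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded.shape)
```

Here 2 text columns with 2 + 3 categories turn into 5 dummy columns, so the frame grows from 3 to 6 columns.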

data_final_norm.to_csv('data_final_norm.csv', index=False)  # do not save the row index

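step 3.1: Undersampling

The goal of undersampling is to shrink the majority class (no attrition) to the same size as the minority class (attrition). What we need to do is randomly draw n samples from the (no attrition) samples, where n = the number of (attrition) samples.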

# Indices of all attrition samples
number_Attrition = len(data_final_norm[data_final_norm.Attrition == 1])
Attrition_indices = np.array(data_final_norm[data_final_norm.Attrition == 1].index)
# Indices of all non-attrition samples
normal_indices = data_final_norm[data_final_norm.Attrition == 0].index
# Randomly draw the same number of non-attrition samples; replace=False means no sample is drawn twice
random_normal_indices = np.random.choice(normal_indices, number_Attrition, replace=False)
random_normal_indices = np.array(random_normal_indices)
# Combine the indices of both classes
under_sample_indices = np.concatenate([Attrition_indices, random_normal_indices])
# Select the undersampled rows by index label
under_sample_data = data_final_norm.loc[under_sample_indices, :]
# Class proportions after undersampling
print("Share of non-attrition samples: ", len(under_sample_data[under_sample_data.Attrition == 0]) / len(under_sample_data))
print("Share of attrition samples: ", len(under_sample_data[under_sample_data.Attrition == 1]) / len(under_sample_data))
print("Total samples after undersampling: ", len(under_sample_data))

Print the final result to check: the classes are balanced, no problem.
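The same undersampling idea can be wrapped in a reusable function. This is a minimal sketch under stated assumptions: `undersample` and its `seed` parameter are hypothetical names introduced here, and the toy frame is made up; the seed is added only to make the draw repeatable.

```python
import numpy as np
import pandas as pd

def undersample(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Shrink every class to the size of the smallest one (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    counts = df[label].value_counts()
    n_min = int(counts.min())
    keep = []
    for cls in counts.index:
        idx = df.index[df[label] == cls].to_numpy()
        # Draw without replacement so no row appears twice.
        keep.append(rng.choice(idx, size=n_min, replace=False))
    # .loc selects by index label, matching the labels collected above.
    return df.loc[np.concatenate(keep)]

# Toy data: 2 attrition rows and 8 non-attrition rows.
toy = pd.DataFrame({"x": range(10), "Attrition": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})
balanced = undersample(toy, "Attrition")
print(balanced["Attrition"].value_counts())
```

All minority rows are kept, and the majority class loses the rows that were not drawn, which is the cost of this approach on a small dataset.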

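step 3.2: Oversampling

The goal of oversampling is to grow the minority class (attrition) to the same size as the majority class (no attrition). What we need to do is randomly add n (attrition) samples, where n = the number of (no attrition) samples minus the number of (attrition) samples.

Because the extra samples have to be produced at random, we call the RandomOverSampler (ROSP) method from the imblearn library directly; it draws existing minority-class samples at random with replacement. For details, see 《》.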
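The idea behind random oversampling can be sketched in plain pandas: rows of the smaller class are drawn again at random, with replacement, until the classes are the same size. This is an illustration of the concept only, not imblearn's implementation; `random_oversample` and the toy frame are assumptions made for this example.

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Grow every class to the size of the largest one (hypothetical helper)."""
    counts = df[label].value_counts()
    n_max = int(counts.max())
    parts = []
    for cls, n in counts.items():
        rows = df[df[label] == cls]
        if n < n_max:
            # Duplicate randomly chosen rows of this class, with replacement.
            extra = rows.sample(n=n_max - int(n), replace=True, random_state=seed)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    return pd.concat(parts, ignore_index=True)

# Toy data: 2 attrition rows and 8 non-attrition rows.
toy = pd.DataFrame({"x": range(10), "Attrition": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})
oversampled = random_oversample(toy, "Attrition")
print(oversampled["Attrition"].value_counts())
```

No new feature values are created: every added attrition row is an exact copy of an existing one.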

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

columns = data_final_norm.columns
# Attrition is kept among the features here; it is split off again further below
features_columns = columns
features = data_final_norm[features_columns]
labels = data_final_norm['Attrition']
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)
# Initialize the sampler first, then pass in the features and labels
oversampler = RandomOverSampler(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
# Check how many samples were produced
len(os_labels[os_labels == 1])  # outputs 862

# Visualize the result of oversampling: attrition and non-attrition counts are now equal
sns.countplot(x='Attrition', data=os_features)
c = os_features[os_features['Attrition'] == 0]
d = os_features[os_features['Attrition'] == 1]
print(c.shape[0])
print(d.shape[0])

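If either of these two methods is unclear, see the case-study blog post written by my teacher, "Logistic Regression Case Template: Credit Card Fraud Detection", which explains them in great detail. My code is also based on that case.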

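Splitting into training and test sets:

Because we want to check whether undersampling or oversampling works better, the undersampled data and the oversampled data each need to be split into training and test sets.

1. Undersampled data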

from sklearn.model_selection import train_test_split

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Attrition']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Attrition']
# All features and all labels from the full dataset
X = data_final_norm.loc[:, data_final_norm.columns != 'Attrition']
y = data_final_norm.loc[:, data_final_norm.columns == 'Attrition']
# Split the full dataset; keep random_state the same as for oversampling so the two can be compared
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Split the undersampled data
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)

2. Oversampled data

# Separate the features and labels of the oversampled data, then split into training and test sets
X_oversample = os_features.loc[:, os_features.columns != 'Attrition']
y_oversample = os_features.loc[:, os_features.columns == 'Attrition']
X_train_oversample, X_test_oversample, y_train_oversample, y_test_oversample = train_test_split(
    X_oversample, y_oversample, test_size=0.3, random_state=0)

# Print the size of each training and test set
print('Samples in the original training set: ', len(X_train))
print('Samples in the original test set: ', len(X_test))
print('Total original samples: ', len(X_train) + len(X_test))
print("")
print("Samples in the oversampled training set: ", len(X_train_oversample))
print("Samples in the oversampled test set: ", len(X_test_oversample))
print("Total oversampled samples: ", len(X_train_oversample) + len(X_test_oversample))
print("")
print("Samples in the undersampled training set: ", len(X_train_undersample))
print("Samples in the undersampled test set: ", len(X_test_undersample))
print("Total undersampled samples: ", len(X_train_undersample) + len(X_test_undersample))

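Comparing the undersampled and oversampled data:

At this point the data processing is complete. The next step is to train several different models and pick the best one to evaluate. Please continue with "Kaggle Employee Attrition Prediction Case Study (2)".

Note in particular that this article does not correct the skewed data.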
