
titanic_kaggle

Date: 2021-01-31 00:40:36


Predicting Titanic survival with logistic regression

Contents

1. Ask the question
2. Understand the data: collect the data, import the data, inspect the dataset
3. Data cleaning: data preprocessing, feature engineering
4. Build the model
5. Evaluate the model
6. Deployment: submit the results to Kaggle

1. Ask the Question

What kinds of passengers were more likely to survive the sinking of the Titanic?

2. Understand the Data

2.1 Collect the Data

Download the data from the Kaggle Titanic competition page: /c/titanic

2.2 Import the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the training set
train = pd.read_csv("/Users/qxh/Desktop/titanic/train.csv")

# Import the test set
test = pd.read_csv("/Users/qxh/Desktop/titanic/test.csv")

print('Training set shape:', train.shape)
print('Test set shape:', test.shape)

Training set shape: (891, 12)
Test set shape: (418, 11)

# Combine the training and test sets so they can be processed together
full = train.append(test, ignore_index=True)

print('Combined dataset shape:', full.shape)

Combined dataset shape: (1309, 12)
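Note that DataFrame.append has been deprecated and removed in newer pandas releases; a minimal equivalent sketch using pd.concat, assuming a recent pandas version:

# Equivalent to train.append(test, ignore_index=True) on newer pandas
full = pd.concat([train, test], ignore_index=True)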

2.3 Inspect the Dataset

# Preview the data and the meaning of each feature:
'''
Age:         age
Cabin:       cabin number
Embarked:    port of embarkation
Fare:        ticket fare
Name:        passenger name
Parch:       number of parents/children aboard
PassengerId: passenger ID
Pclass:      ticket class
Sex:         sex
SibSp:       number of siblings/spouses aboard
Survived:    survived or not (target)
Ticket:      ticket number
'''
full.head()

# Summary statistics
full.describe()

# Data type and non-null count for each column
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3. Data Cleaning

3.1 Data Preprocessing

Handling missing values

The combined dataset has 1,309 rows in total.

The columns with missing values are:

Age has 1,046 non-null values, so 263 are missing; fill them with the mean.
Fare has 1,308 non-null values, so 1 is missing; fill it with the mean.
Embarked has 1,307 non-null values, so 2 are missing; fill them with the most frequent value.
Cabin has only 295 non-null values, so 1,014 are missing; since so much is missing, fill with a new "unknown" marker.
These counts can be verified directly, as in the sketch below.
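A quick check of the missing-value counts (a minimal sketch using standard pandas calls; this cell is not part of the original notebook):

# Count missing values per column to confirm the numbers above
full.isnull().sum().sort_values(ascending=False)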

# Age: fill missing values with the mean
full['Age'] = full['Age'].fillna(full['Age'].mean())
# Fare: fill missing values with the mean
full['Fare'] = full['Fare'].fillna(full['Fare'].mean())

# Embarked: find the most frequent value
full['Embarked'].describe()

count     1307
unique       3
top          S
freq       914
Name: Embarked, dtype: object

full['Embarked']=full['Embarked'].fillna('S')

# Cabin: many values are missing, fill with 'U' (unknown)
full['Cabin'] = full['Cabin'].fillna('U')

# Check the dataset after filling the missing values
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3.2 Feature Engineering

Group the 12 columns into three categories by data type:

Numerical: PassengerId, Age, Fare, SibSp (siblings/spouses aboard), Parch (parents/children aboard)
Time series: none
Categorical (directly encodable):
  Sex: male, female
  Embarked: S = Southampton, England (departure port); C = Cherbourg, France (first stop); Q = Queenstown, Ireland (second stop)
  Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class
Categorical (string type), from which further features may be extracted: Name, Cabin, Ticket
A programmatic way to group the columns by dtype is sketched below.
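A minimal illustrative sketch of the dtype grouping (select_dtypes is a standard pandas method; this inspection cell is not part of the original pipeline):

# List columns by broad dtype group
numeric_cols = full.select_dtypes(include=['number']).columns.tolist()
object_cols = full.select_dtypes(include=['object']).columns.tolist()
print('Numeric columns:', numeric_cols)
print('String/object columns:', object_cols)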

full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3.2.1 Categorical Data (Direct Encoding)

For Sex, Embarked, and Pclass, identify the category levels and encode them as 0/1 indicator columns.

Sex

sex_mapDict = {'male': 1, 'female': 0}
# map applies the mapping to every element of the Series
full['Sex'] = full['Sex'].map(sex_mapDict)

Port of embarkation (Embarked)

embarkedDf = pd.DataFrame()
# get_dummies performs one-hot encoding
embarkedDf = pd.get_dummies(full['Embarked'], prefix='Embarked')

embarkedDf.head()

full = pd.concat([full,embarkedDf],axis=1)

full.drop('Embarked',axis=1,inplace=True)

Ticket class (Pclass)

pclassDf = pd.DataFrame()
pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass')

pclassDf.head()

full = pd.concat([full,pclassDf],axis=1)

full.drop('Pclass',axis=1,inplace=True)

3.2.2 Categorical Data (String Type)

Extract the title from the name

full['Name'].head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

# Extract the title from a name such as "Braund, Mr. Owen Harris"
def get_title(name):
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    str3 = str2.strip()   # strip() removes leading/trailing whitespace
    return str3
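As a quick sanity check, apply the function to the first name shown above (illustrative only, not part of the original notebook):

get_title('Braund, Mr. Owen Harris')   # returns 'Mr'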

titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(get_title)

titleDf.groupby('Title').count()

'''
Title categories:
Officer - government or military official
Royalty - royalty/nobility
Mr      - married man
Mrs     - married woman
Miss    - young unmarried woman
Master  - young boy
'''
# Mapping from the title string in the name to the title category
title_mapDict = {
    "Capt":         "Officer",
    "Col":          "Officer",
    "Major":        "Officer",
    "Jonkheer":     "Royalty",
    "Don":          "Royalty",
    "Sir":          "Royalty",
    "Dr":           "Officer",
    "Rev":          "Officer",
    "the Countess": "Royalty",
    "Dona":         "Royalty",
    "Mme":          "Mrs",
    "Mlle":         "Miss",
    "Ms":           "Mrs",
    "Mr":           "Mr",
    "Mrs":          "Mrs",
    "Miss":         "Miss",
    "Master":       "Master",
    "Lady":         "Royalty"
}

titleDf['Title'] = titleDf['Title'].map(title_mapDict)

titleDf = pd.get_dummies(titleDf['Title'])

titleDf.head()

full = pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)

Cabin number

full['Cabin'].head()

0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object

# The first letter of the cabin number gives the cabin category
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c: c[0])

full['Cabin'].head()

0    U
1    C
2    U
3    C
4    U
Name: Cabin, dtype: object

cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )

cabinDf.head()

full = pd.concat([full,cabinDf],axis=1)

full.drop('Cabin',axis=1,inplace= True)

full.head()

5 rows × 27 columns

3.2.3 Numerical Data

Family size and family category

# DataFrame to hold family information
familyDf = pd.DataFrame()
'''
Family size = number of parents/children aboard (Parch)
            + number of siblings/spouses aboard (SibSp)
            + 1 (the passenger themselves)
'''
familyDf['family_size'] = full['Parch'] + full['SibSp'] + 1

familyDf['family_size'].describe()

count    1309.000000
mean        1.883881
std         1.583639
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: family_size, dtype: float64

%matplotlib notebook
familyDf['family_size'].plot()

[Output: line plot of family_size]

'''
Family categories:
family_single: family size == 1
family_small:  2 <= family size <= 4
family_large:  family size >= 5
'''
familyDf['family_single'] = familyDf['family_size'].map(lambda s: 1 if s == 1 else 0)
familyDf['family_small'] = familyDf['family_size'].map(lambda s: 1 if 2 <= s <= 4 else 0)
familyDf['family_large'] = familyDf['family_size'].map(lambda s: 1 if s > 4 else 0)

familyDf.head()

full = pd.concat([full,familyDf],axis=1)

full.drop([ 'Parch','SibSp','family_size' ],axis=1, inplace=True)

full.head()

5 rows × 28 columns

Age and Fare

The values of Age and Fare span a much wider range than the other features, which are all 0/1, so we standardize them (subtract the mean and divide by the standard deviation) to put them on a comparable scale.
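For intuition, the transformation that StandardScaler applies below is equivalent to the following manual calculation (a sketch only, not part of the original notebook; StandardScaler uses the population standard deviation, hence ddof=0):

# z = (x - mean(x)) / std(x)
age_standardized = (full['Age'] - full['Age'].mean()) / full['Age'].std(ddof=0)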

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()

len(full['Age'].values.reshape(-1, 1))

1309

age_scale_param = scaler.fit(full['Age'].values.reshape(-1, 1))

full['Age_scaled'] = age_scale_param.transform(full['Age'].values.reshape(-1, 1))

full.head()

5 rows × 29 columns

full.drop([ 'Age'],axis=1, inplace=True)

full.head()

5 rows × 28 columns

fare_scale_param = scaler.fit(full['Fare'].values.reshape(-1, 1))

full['Fare_scaled'] = fare_scale_param.transform(full['Fare'].values.reshape(-1, 1))

full.head()

5 rows × 29 columns

full.drop([ 'Fare'],axis=1, inplace=True)

full.head()

5 rows × 28 columns

# Feature information after preprocessing
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
PassengerId      1309 non-null int64
Pclass           1309 non-null int64
Sex              1309 non-null int64
Survived         891 non-null float64
Ticket           1309 non-null object
Embarked_C       1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
Master           1309 non-null uint8
Miss             1309 non-null uint8
Mr               1309 non-null uint8
Mrs              1309 non-null uint8
Officer          1309 non-null uint8
Royalty          1309 non-null uint8
Cabin_A          1309 non-null uint8
Cabin_B          1309 non-null uint8
Cabin_C          1309 non-null uint8
Cabin_D          1309 non-null uint8
Cabin_E          1309 non-null uint8
Cabin_F          1309 non-null uint8
Cabin_G          1309 non-null uint8
Cabin_T          1309 non-null uint8
Cabin_U          1309 non-null uint8
family_single    1309 non-null int64
family_small     1309 non-null int64
family_large     1309 non-null int64
Age_scaled       1309 non-null float64
Fare_scaled      1309 non-null float64
dtypes: float64(3), int64(6), object(1), uint8(18)
memory usage: 125.4+ KB

3.3 Feature Selection

Compute the correlation coefficient between each feature and Survived, and keep the features related to survival.

# Feature selection: correlation matrix
corrDf = full.corr()

corrDf

27 rows × 27 columns

'''
View the correlation coefficient between each feature and Survived;
ascending=False sorts the values in descending order.
'''
corrDf['Survived'].sort_values(ascending=False)

Survived         1.000000
Mrs              0.344935
Miss             0.332795
family_small     0.279855
Fare_scaled      0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Master           0.085221
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
Age_scaled      -0.070323
family_large    -0.125147
Embarked_S      -0.149683
family_single   -0.203367
Cabin_U         -0.316912
Pclass          -0.338481
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
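The features below are picked by hand from this ranking. As an illustrative alternative (a sketch only, not the original approach; the 0.1 cutoff is arbitrary), one could filter by absolute correlation:

# Keep features whose absolute correlation with Survived exceeds a threshold
abs_corr = corrDf['Survived'].drop('Survived').abs().sort_values(ascending=False)
print(abs_corr[abs_corr > 0.1].index.tolist())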

full_x = pd.concat([titleDf,              # title
                    pclassDf,             # ticket class
                    familyDf,             # family size and category
                    full['Fare_scaled'],  # ticket fare
                    full['Age_scaled'],   # age
                    cabinDf,              # cabin category
                    embarkedDf,           # port of embarkation
                    full['Sex']           # sex
                    ], axis=1)

full_x.head()

5 rows × 28 columns

4. Build the Model

Use the training data and a machine learning algorithm to fit a model, then use the test data to evaluate it.

4.1 Build the Training and Test Sets

sourceRow = 891
source_x = full_x.loc[0:sourceRow-1, :]
source_y = full.loc[0:sourceRow-1, 'Survived']
pred_x = full_x.loc[sourceRow:, :]

print('Kaggle training set shape:', source_x.shape)

Kaggle training set shape: (891, 28)

print('Kaggle test set shape:', pred_x.shape)

Kaggle test set shape: (418, 28)

'''
Split the original labelled data (source) into a training set (for model training)
and a test set (for model evaluation).
train_test_split randomly splits the samples into train and test subsets:
    the first argument is the feature set to split,
    the second argument is the corresponding labels,
    train_size is the proportion of samples assigned to the training set.
'''
from sklearn.model_selection import train_test_split

# Training and test sets for building and evaluating the model
train_x, test_x, train_y, test_y = train_test_split(source_x, source_y, train_size=.8)

# Print the dataset shapes
print('Source features:', source_x.shape, 'Training features:', train_x.shape, 'Test features:', test_x.shape)
print('Source labels:', source_y.shape, 'Training labels:', train_y.shape, 'Test labels:', test_y.shape)

Source features: (891, 28) Training features: (712, 28) Test features: (179, 28)
Source labels: (891,) Training labels: (712,) Test labels: (179,)

4.2 Choose a Machine Learning Algorithm

# Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

4.3 Train the Model

model.fit(train_x, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

5. Evaluate the Model

model.score(test_x , test_y )

0.82681564245810057
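A single train/test split gives a fairly noisy accuracy estimate. As a hedged sketch (not part of the original notebook), k-fold cross-validation on the labelled data gives a more stable figure, assuming scikit-learn's model_selection module:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation accuracy of logistic regression on all labelled rows
scores = cross_val_score(LogisticRegression(), source_x, source_y, cv=5)
print('CV accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))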

6. Deployment: Submit the Results to Kaggle

# Generate the predictions to upload to Kaggle
pred_y = model.predict(pred_x)
'''
The predicted values are floats (0.0, 1.0),
but Kaggle requires integers (0, 1),
so convert the data type.
'''
pred_y = pred_y.astype(int)

# Passenger IDs
passenger_id = full.loc[sourceRow:, 'PassengerId']
# DataFrame with passenger id and predicted survival
predDf = pd.DataFrame({'PassengerId': passenger_id, 'Survived': pred_y})

predDf.shape

(418, 2)

predDf.head()

predDf.to_csv( '/Users/qxh/Desktop/titanic/titanic_pred.csv' , index = False )
