
titanic_kaggle

Date: 2021-01-31 00:40:36


Predicting Titanic survival with logistic regression

Contents

1. Ask the question
2. Understand the data: collect the data, import the data, inspect the dataset
3. Data cleaning: data preprocessing, feature engineering
4. Build the model
5. Evaluate the model
6. Deployment: submit the results to Kaggle

1. Ask the Question

What kinds of passengers were more likely to survive the sinking of the Titanic?

2. Understand the Data

2.1 Collect the Data

Download the data from the Kaggle Titanic competition page: /c/titanic

2.2 Import the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the training set
train = pd.read_csv("/Users/qxh/Desktop/titanic/train.csv")

# Import the test set
test = pd.read_csv("/Users/qxh/Desktop/titanic/test.csv")

print('Training set shape:', train.shape)
print('Test set shape:', test.shape)

Training set shape: (891, 12)
Test set shape: (418, 11)

# Combine the training and test sets so they can be processed together
full = train.append(test, ignore_index=True)

print('Combined dataset shape:', full.shape)

Combined dataset shape: (1309, 12)
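Note that DataFrame.append has been deprecated and removed in newer pandas releases; a minimal equivalent sketch using pd.concat, assuming a recent pandas version:

# Equivalent to train.append(test, ignore_index=True) on newer pandas
full = pd.concat([train, test], ignore_index=True)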

2.3 Inspect the Dataset

# Preview the data and the meaning of each feature:
'''
Age:         age
Cabin:       cabin number
Embarked:    port of embarkation
Fare:        ticket fare
Name:        passenger name
Parch:       number of parents/children aboard
PassengerId: passenger ID
Pclass:      ticket class
Sex:         sex
SibSp:       number of siblings/spouses aboard
Survived:    survived or not (target)
Ticket:      ticket number
'''
full.head()

# Summary statistics
full.describe()

# Data type and non-null count for each column
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3. Data Cleaning

3.1 Data Preprocessing

Handling missing values

The combined dataset has 1,309 rows in total.

The columns with missing values are:

Age has 1,046 non-null values, so 263 are missing; fill them with the mean.
Fare has 1,308 non-null values, so 1 is missing; fill it with the mean.
Embarked has 1,307 non-null values, so 2 are missing; fill them with the most frequent value.
Cabin has only 295 non-null values, so 1,014 are missing; since so much is missing, fill with a new "unknown" marker.
These counts can be verified directly, as in the sketch below.
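A quick check of the missing-value counts (a minimal sketch using standard pandas calls; this cell is not part of the original notebook):

# Count missing values per column to confirm the numbers above
full.isnull().sum().sort_values(ascending=False)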

# Age: fill missing values with the mean
full['Age'] = full['Age'].fillna(full['Age'].mean())
# Fare: fill missing values with the mean
full['Fare'] = full['Fare'].fillna(full['Fare'].mean())

# Embarked: find the most frequent value
full['Embarked'].describe()

count     1307
unique       3
top          S
freq       914
Name: Embarked, dtype: object

full['Embarked']=full['Embarked'].fillna('S')

# Cabin: many values are missing, fill with 'U' (unknown)
full['Cabin'] = full['Cabin'].fillna('U')

# Check the dataset after filling the missing values
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3.2 Feature Engineering

Group the 12 columns into three categories by data type:

Numerical: PassengerId, Age, Fare, SibSp (siblings/spouses aboard), Parch (parents/children aboard)
Time series: none
Categorical (directly encodable):
  Sex: male, female
  Embarked: S = Southampton, England (departure port); C = Cherbourg, France (first stop); Q = Queenstown, Ireland (second stop)
  Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class
Categorical (string type), from which further features may be extracted: Name, Cabin, Ticket
A programmatic way to group the columns by dtype is sketched below.
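A minimal illustrative sketch of the dtype grouping (select_dtypes is a standard pandas method; this inspection cell is not part of the original pipeline):

# List columns by broad dtype group
numeric_cols = full.select_dtypes(include=['number']).columns.tolist()
object_cols = full.select_dtypes(include=['object']).columns.tolist()
print('Numeric columns:', numeric_cols)
print('String/object columns:', object_cols)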

full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3.2.1 Categorical Data (Direct Encoding)

For Sex, Embarked, and Pclass, identify the category levels and encode them as 0/1 indicator columns.

Sex

sex_mapDict = {'male': 1, 'female': 0}
# map applies the mapping to every element of the Series
full['Sex'] = full['Sex'].map(sex_mapDict)

Port of embarkation (Embarked)

embarkedDf = pd.DataFrame()
# get_dummies performs one-hot encoding
embarkedDf = pd.get_dummies(full['Embarked'], prefix='Embarked')

embarkedDf.head()

full = pd.concat([full,embarkedDf],axis=1)

full.drop('Embarked',axis=1,inplace=True)

Ticket class (Pclass)

pclassDf = pd.DataFrame()
pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass')

pclassDf.head()

full = pd.concat([full,pclassDf],axis=1)

full.drop('Pclass',axis=1,inplace=True)

3.2.2 Categorical Data (String Type)

Extract the title from the name

full['Name'].head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

# Extract the title from a name such as "Braund, Mr. Owen Harris"
def get_title(name):
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    str3 = str2.strip()   # strip() removes leading/trailing whitespace
    return str3
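As a quick sanity check, apply the function to the first name shown above (illustrative only, not part of the original notebook):

get_title('Braund, Mr. Owen Harris')   # returns 'Mr'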

titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(get_title)

titleDf.groupby('Title').count()

'''
Title categories:
Officer - government or military official
Royalty - royalty/nobility
Mr      - married man
Mrs     - married woman
Miss    - young unmarried woman
Master  - young boy
'''
# Mapping from the title string in the name to the title category
title_mapDict = {
    "Capt":         "Officer",
    "Col":          "Officer",
    "Major":        "Officer",
    "Jonkheer":     "Royalty",
    "Don":          "Royalty",
    "Sir":          "Royalty",
    "Dr":           "Officer",
    "Rev":          "Officer",
    "the Countess": "Royalty",
    "Dona":         "Royalty",
    "Mme":          "Mrs",
    "Mlle":         "Miss",
    "Ms":           "Mrs",
    "Mr":           "Mr",
    "Mrs":          "Mrs",
    "Miss":         "Miss",
    "Master":       "Master",
    "Lady":         "Royalty"
}

titleDf['Title'] = titleDf['Title'].map(title_mapDict)

titleDf = pd.get_dummies(titleDf['Title'])

titleDf.head()

full = pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)

Cabin number

full['Cabin'].head()

0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object

# The first letter of the cabin number gives the cabin category
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c: c[0])

full['Cabin'].head()

0    U
1    C
2    U
3    C
4    U
Name: Cabin, dtype: object

cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )

cabinDf.head()

full = pd.concat([full,cabinDf],axis=1)

full.drop('Cabin',axis=1,inplace= True)

full.head()

5 rows × 27 columns

3.2.3 Numerical Data

Family size and family category

# DataFrame to hold family information
familyDf = pd.DataFrame()
'''
Family size = number of parents/children aboard (Parch)
            + number of siblings/spouses aboard (SibSp)
            + 1 (the passenger themselves)
'''
familyDf['family_size'] = full['Parch'] + full['SibSp'] + 1

familyDf['family_size'].describe()

count    1309.000000
mean        1.883881
std         1.583639
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: family_size, dtype: float64

%matplotlib notebook
familyDf['family_size'].plot()

[Output: line plot of family_size]

'''
Family categories:
family_single: family size == 1
family_small:  2 <= family size <= 4
family_large:  family size >= 5
'''
familyDf['family_single'] = familyDf['family_size'].map(lambda s: 1 if s == 1 else 0)
familyDf['family_small'] = familyDf['family_size'].map(lambda s: 1 if 2 <= s <= 4 else 0)
familyDf['family_large'] = familyDf['family_size'].map(lambda s: 1 if s > 4 else 0)

familyDf.head()

full = pd.concat([full,familyDf],axis=1)

full.drop([ 'Parch','SibSp','family_size' ],axis=1, inplace=True)

full.head()

5 rows × 28 columns

Age and Fare

The values of Age and Fare span a much wider range than the other features, which are all 0/1, so we standardize them (subtract the mean and divide by the standard deviation) to put them on a comparable scale.
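For intuition, the transformation that StandardScaler applies below is equivalent to the following manual calculation (a sketch only, not part of the original notebook; StandardScaler uses the population standard deviation, hence ddof=0):

# z = (x - mean(x)) / std(x)
age_standardized = (full['Age'] - full['Age'].mean()) / full['Age'].std(ddof=0)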

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()

len(full['Age'].values.reshape(-1, 1))

1309

age_scale_param = scaler.fit(full['Age'].values.reshape(-1, 1))

full['Age_scaled'] = age_scale_param.transform(full['Age'].values.reshape(-1, 1))

full.head()

5 rows × 29 columns

full.drop([ 'Age'],axis=1, inplace=True)

full.head()

5 rows × 28 columns

fare_scale_param = scaler.fit(full['Fare'].values.reshape(-1, 1))

full['Fare_scaled'] = fare_scale_param.transform(full['Fare'].values.reshape(-1, 1))

full.head()

5 rows × 29 columns

full.drop([ 'Fare'],axis=1, inplace=True)

full.head()

5 rows × 28 columns

# Feature information after preprocessing
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
PassengerId      1309 non-null int64
Pclass           1309 non-null int64
Sex              1309 non-null int64
Survived         891 non-null float64
Ticket           1309 non-null object
Embarked_C       1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
Master           1309 non-null uint8
Miss             1309 non-null uint8
Mr               1309 non-null uint8
Mrs              1309 non-null uint8
Officer          1309 non-null uint8
Royalty          1309 non-null uint8
Cabin_A          1309 non-null uint8
Cabin_B          1309 non-null uint8
Cabin_C          1309 non-null uint8
Cabin_D          1309 non-null uint8
Cabin_E          1309 non-null uint8
Cabin_F          1309 non-null uint8
Cabin_G          1309 non-null uint8
Cabin_T          1309 non-null uint8
Cabin_U          1309 non-null uint8
family_single    1309 non-null int64
family_small     1309 non-null int64
family_large     1309 non-null int64
Age_scaled       1309 non-null float64
Fare_scaled      1309 non-null float64
dtypes: float64(3), int64(6), object(1), uint8(18)
memory usage: 125.4+ KB

3.3 Feature Selection

Compute the correlation coefficient between each feature and Survived, and keep the features related to survival.

# Feature selection: correlation matrix
corrDf = full.corr()

corrDf

27 rows × 27 columns

'''
View the correlation coefficient between each feature and Survived;
ascending=False sorts the values in descending order.
'''
corrDf['Survived'].sort_values(ascending=False)

Survived         1.000000
Mrs              0.344935
Miss             0.332795
family_small     0.279855
Fare_scaled      0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Master           0.085221
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
Age_scaled      -0.070323
family_large    -0.125147
Embarked_S      -0.149683
family_single   -0.203367
Cabin_U         -0.316912
Pclass          -0.338481
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
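The features below are picked by hand from this ranking. As an illustrative alternative (a sketch only, not the original approach; the 0.1 cutoff is arbitrary), one could filter by absolute correlation:

# Keep features whose absolute correlation with Survived exceeds a threshold
abs_corr = corrDf['Survived'].drop('Survived').abs().sort_values(ascending=False)
print(abs_corr[abs_corr > 0.1].index.tolist())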

full_x = pd.concat([titleDf,              # title
                    pclassDf,             # ticket class
                    familyDf,             # family size and category
                    full['Fare_scaled'],  # ticket fare
                    full['Age_scaled'],   # age
                    cabinDf,              # cabin category
                    embarkedDf,           # port of embarkation
                    full['Sex']           # sex
                    ], axis=1)

full_x.head()

5 rows × 28 columns

4. Build the Model

Use the training data and a machine learning algorithm to fit a model, then use the test data to evaluate it.

4.1 Build the Training and Test Sets

sourceRow = 891
source_x = full_x.loc[0:sourceRow-1, :]
source_y = full.loc[0:sourceRow-1, 'Survived']
pred_x = full_x.loc[sourceRow:, :]

print('Kaggle training set shape:', source_x.shape)

Kaggle training set shape: (891, 28)

print('Kaggle test set shape:', pred_x.shape)

Kaggle test set shape: (418, 28)

'''
Split the original labelled data (source) into a training set (for model training)
and a test set (for model evaluation).
train_test_split randomly splits the samples into train and test subsets:
    the first argument is the feature set to split,
    the second argument is the corresponding labels,
    train_size is the proportion of samples assigned to the training set.
'''
from sklearn.model_selection import train_test_split

# Training and test sets for building and evaluating the model
train_x, test_x, train_y, test_y = train_test_split(source_x, source_y, train_size=.8)

# Print the dataset shapes
print('Source features:', source_x.shape, 'Training features:', train_x.shape, 'Test features:', test_x.shape)
print('Source labels:', source_y.shape, 'Training labels:', train_y.shape, 'Test labels:', test_y.shape)

Source features: (891, 28) Training features: (712, 28) Test features: (179, 28)
Source labels: (891,) Training labels: (712,) Test labels: (179,)

4.2 Choose a Machine Learning Algorithm

# Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

4.3 Train the Model

model.fit(train_x, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

5. Evaluate the Model

model.score(test_x , test_y )

0.82681564245810057
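A single train/test split gives a fairly noisy accuracy estimate. As a hedged sketch (not part of the original notebook), k-fold cross-validation on the labelled data gives a more stable figure, assuming scikit-learn's model_selection module:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation accuracy of logistic regression on all labelled rows
scores = cross_val_score(LogisticRegression(), source_x, source_y, cv=5)
print('CV accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))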

6. Deployment: Submit the Results to Kaggle

# Generate the predictions to upload to Kaggle
pred_y = model.predict(pred_x)
'''
The predicted values are floats (0.0, 1.0),
but Kaggle requires integers (0, 1),
so convert the data type.
'''
pred_y = pred_y.astype(int)

# Passenger IDs
passenger_id = full.loc[sourceRow:, 'PassengerId']
# DataFrame with passenger id and predicted survival
predDf = pd.DataFrame({'PassengerId': passenger_id, 'Survived': pred_y})

predDf.shape

(418, 2)

predDf.head()

predDf.to_csv( '/Users/qxh/Desktop/titanic/titanic_pred.csv' , index = False )
