[Kaggle Beginner Problem 1] Titanic: Machine Learning from Disaster

Date: 2019-02-05 15:02:44

Original problem:

Start here if...

You're new to data science and machine learning, or looking for a simple intro to the Kaggle prediction competitions.

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills

Binary classification
Python and R basics

Training data:

Features in the training data:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

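Approach: load the samples -> compute summary statistics (count, mean, standard deviation) -> fill missing values with a central value (the code below uses the median of Age) -> ... -> cross-validation (split the training data into three parts, train on two of them and test on the remaining one, rotate through all three parts, and average the scores; the original test set is not used) -> linear regression -> logistic regression -> random forest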
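The three-way rotation described in the approach can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the helper name `k_fold_indices` and the sample count of 9 are assumptions made for the example, and the folds are contiguous and unshuffled.

```python
def k_fold_indices(n_samples, n_splits=3):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds.

    Hypothetical helper for illustration only.
    """
    # Give every fold the same base size, then spread the remainder
    # over the first folds so that all samples are used exactly once.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        # Everything outside the current test fold is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Nine made-up samples split into three folds.
folds = list(k_fold_indices(9, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test fold exactly once, so averaging the three scores uses every training row for evaluation without touching the separate test set.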

# coding=utf-8
import os
import warnings

import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression

file_root = os.path.realpath('titanic')
file_name_test = os.path.join(file_root, "test.csv")
file_name_train = os.path.join(file_root, "train.csv")

# Show all columns when printing
pd.set_option('display.max_columns', None)
titanic = pd.read_csv(file_name_train)
data = titanic.describe()

# info() shows which columns have missing values
titanic.info()

# Fill the missing Age values with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
data = titanic.describe()
print(data)

# Inspect the values of Sex and encode them as integers
print("Sex, original values:", titanic['Sex'].unique())
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
print("Sex, encoded values:", titanic['Sex'].unique())

# Inspect the values of Embarked, fill the gaps with 'S', and encode
print("Embarked, original values:", titanic['Embarked'].unique())
titanic['Embarked'] = titanic['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})
print("Embarked, encoded values:", titanic['Embarked'].unique())

# --- Linear regression with manual cross-validation ---
# Feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = LinearRegression()

# titanic.shape is the tuple (m, n); shape[0] is the number of samples.
# n_splits (n_folds in older scikit-learn) is the number of cross-validation folds.
print("titanic.shape[0]:", titanic.shape[0])
# Older API: kf = model_selection.KFold(titanic.shape[0], n_folds=3, random_state=1)
# random_state is omitted here because it only has an effect when shuffle=True.
kf = model_selection.KFold(n_splits=3, shuffle=False)

predictions = []
# Loop over the three folds
for train, test in kf.split(titanic['Survived']):
    # Features of the training part
    train_predictors = titanic[predictors].iloc[train, :]
    # Target values of the training part
    train_target = titanic['Survived'].iloc[train]
    # Fit the linear regression model
    alg.fit(train_predictors, train_target)
    # Predict on the held-out part
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

# np.concatenate((a1, a2, ...), axis=0) joins several arrays in one call.
# The folds are unshuffled, so the joined array is in the original row order.
predictions = np.concatenate(predictions, axis=0)
# Regression outputs are thresholded at 0.5 to obtain class labels
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that equal the true label
accuracy = (predictions == titanic['Survived'].to_numpy()).mean()
print("Linear regression: ", accuracy)
# Output reported in the original: 0.78...

# --- Logistic regression ---
warnings.filterwarnings("ignore")
alg = LogisticRegression(random_state=1)
# Score each of the three folds and average
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)
print("Logistic regression: ", scores.mean())

# --- Random forest ---
# Builds many decision trees that decide jointly; the result is averaged over the trees.
# Each tree considers a random subset of the seven features.
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Parameters: random seed, number of trees, minimum samples to split a node,
# minimum samples per leaf
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = model_selection.KFold(n_splits=3, shuffle=False)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print("Random forest: ", scores.mean())
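The step that turns linear regression outputs into survival labels and scores them can be shown on its own. The following is a small sketch with made-up scores and labels; the helper names `to_labels` and `accuracy_score_simple` are hypothetical and not part of any library.

```python
def to_labels(scores, threshold=0.5):
    """Map continuous regression outputs to 0/1 labels (1 if score > threshold)."""
    return [1 if s > threshold else 0 for s in scores]


def accuracy_score_simple(predicted, actual):
    """Return the fraction of positions where predicted equals actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # Count the matches. Summing the matching predicted values instead
    # would count only the correctly predicted 1s and miss the correct 0s.
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Made-up regression outputs and true labels, for illustration only.
scores = [0.91, 0.12, 0.55, 0.40, 0.50]
actual = [1, 0, 0, 0, 1]

labels = to_labels(scores)
acc = accuracy_score_simple(labels, actual)
print(labels, acc)
```

A score of exactly 0.5 falls into class 0 here, matching the `> .5` / `<= .5` split used in the code above.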

Video: /course/courseLearn.htm?courseId=1003551009#/learn/video?lessonId=1004052091&courseId=1003551009
