Titanic Survival Dataset: Classification and Tuning
1. Dataset Introduction
The Titanic survival dataset (https://www.kaggle.com/c/titanic/download/train.csv ) contains records for 891 passengers. By analyzing the following features we look for relationships with survival:
PassengerId: passenger ID
Survived: whether the passenger survived
Pclass: cabin class
Name: passenger name
Sex: sex
Age: age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation
Current prediction accuracy on the Titanic dataset is around 80%+.
2. Initial Dataset Analysis
The first few rows of train.csv:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
import pandas as pd

# header=0 takes the first row as the column header
dataset = pd.read_csv('train.csv', header=0)
print("\n----------------- head -------------------")
print(dataset.head())
print("\n----------------- info -------------------")
print(dataset.info())
print("\n----------------- Sex -------------------")
print(dataset["Sex"].value_counts())
print("\n----------------- Pclass -------------------")
print(dataset["Pclass"].value_counts())
print("\n----------------- SibSp -------------------")
print(dataset["SibSp"].value_counts())
print("\n----------------- Parch -------------------")
print(dataset["Parch"].value_counts())
print("\n----------------- Embarked -------------------")
print(dataset["Embarked"].value_counts())
----------------- head -------------------
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------- info -------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
----------------- Sex -------------------
male 577
female 314
Name: Sex, dtype: int64
----------------- Pclass -------------------
3 491
1 216
2 184
Name: Pclass, dtype: int64
----------------- SibSp -------------------
0 608
1 209
2 28
4 18
3 16
8 7
5 5
Name: SibSp, dtype: int64
----------------- Parch -------------------
0 678
1 118
2 80
5 5
3 5
4 4
6 1
Name: Parch, dtype: int64
----------------- Embarked -------------------
S 644
C 168
Q 77
Name: Embarked, dtype: int64
From the output above we can conclude:
Age is missing about 20% of its values; Cabin is missing 77% (too much, so drop it); Embarked is missing 2 values
PassengerId is certainly unrelated to survival
Survived is the target label and must be excluded from the inputs
Pclass, Sex, Age, SibSp, Parch, Fare and Embarked can be encoded as ordinary features and are plausibly related to survival
Name may hide some implicit signal, but it is hard to extract features from directly
Ticket and Cabin: we will not extract features from these for now
The Sex and Embarked fields need encoding; Ticket and Cabin may need encoding later
Since the dataset is small, classical machine learning algorithms are a good fit
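The missing-value rates quoted above are easy to verify with one line of pandas. A minimal sketch on a tiny synthetic stand-in for train.csv (the values below are made up; the same call on the real file reports Cabin ~0.77, Age ~0.20, Embarked ~0.002):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for train.csv (illustrative values)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Fraction of missing values per column, worst first
missing = df.isnull().mean().sort_values(ascending=False)
print(missing)
```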
import warnings
warnings.filterwarnings("ignore")
import time
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import *
from sklearn.tree import *
from sklearn.neighbors import *
from sklearn.ensemble import *
from sklearn.svm import *
from sklearn.naive_bayes import *
from sklearn.preprocessing import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import *
from sklearn.model_selection import *
clfs = [
[LogisticRegression(max_iter=1000), {}],
[LogisticRegressionCV(), {}],
[SVC(), {'C': [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]}],
[PassiveAggressiveClassifier(), {}],
[RidgeClassifier(), {}],
[RidgeClassifierCV(), {}],
[SGDClassifier(), {}],
[KNeighborsClassifier(n_neighbors=20), {}],
[NearestCentroid(), {}],
[DecisionTreeClassifier(), {}],
[ExtraTreeClassifier(), {}],
[AdaBoostClassifier(), {}],
[BaggingClassifier(), {}],
[ExtraTreeClassifier(), {}],
[GradientBoostingClassifier(), {}],
[RandomForestClassifier(n_estimators=100), {}],
[BernoulliNB(), {}],
[GaussianNB(), {}],
]
pipes = [Pipeline([
    ('sc', StandardScaler()),
    ('clf', GridSearchCV(pair[0], param_grid=pair[1]))
]) for pair in clfs]  # standardize the inputs, then classify
def test_classifier():
    for i in range(0, len(clfs)):
        start = time.time()
        acc_arr = []  # collect accuracy over repeated random splits
        for j in range(0, testnum):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
            pipes[i].fit(X_train, y_train)
            y_pred = pipes[i].predict(X_test)
            acc_arr.append(accuracy_score(y_test, y_pred))
        npacc = np.array(acc_arr)
        end = time.time()
        # Report the mean and worst-case accuracy over the runs
        print('Accuracy:%s meanscore=%.2f minscore=%.2f time=%d'
              % (type(clfs[i][0]), npacc.mean(), npacc.min(), end - start))
dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values; fill with the most common port, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has many missing values; fill with the mean for now
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)
X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values
testnum=100
test_classifier()
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.80 minscore=0.70 time=23
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.83 minscore=0.71 time=34
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.69 minscore=0.36 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.67 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.73 minscore=0.40 time=0
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.80 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.63 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.67 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.73 time=33
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.80 minscore=0.70 time=8
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.62 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.81 minscore=0.71 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.77 minscore=0.61 time=0
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.70 time=0
3. First Round of Parameter Tuning
As the results show, with no tuning at all the classifiers reach roughly 80%+ accuracy on the Titanic data. Below we adjust the classifiers' parameters and the data to push the accuracy higher, this time using a random forest to predict the missing Age values. There are several common ways to handle missing values:
* If a very large fraction of samples are missing the value, we may simply drop the feature; keeping it could inject noise and hurt the final result. Alternatively, treat "has a value" and "has no value" as two classes.
* If a moderate fraction is missing and the attribute is non-continuous (e.g. categorical), treat NaN as a new category and add it to the categorical feature.
* If only a few values are missing, we can try to fit the missing values from the observed ones.
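The three strategies above can be sketched in a few lines of pandas. This is a minimal example on a synthetic frame; the column names mirror Cabin, Embarked and Age, but the values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cabin":    [np.nan, "C85", np.nan, np.nan, np.nan],  # mostly missing
    "Embarked": ["S", "C", np.nan, "S", "S"],             # categorical, few missing
    "Age":      [22.0, np.nan, 26.0, 35.0, 35.0],         # numeric, few missing
})

# 1) mostly-missing column: drop it rather than inject noise
df = df.drop(columns=["Cabin"])
# 2) categorical: make NaN its own category
df["Embarked"] = df["Embarked"].fillna("Unknown")
# 3) numeric with few gaps: fill from the observed values (here, the mean)
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)
```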
from sklearn.ensemble import RandomForestRegressor

dataset = pd.read_csv('train.csv', header=0)
dataset.drop("PassengerId", inplace=True, axis=1)
dataset.drop("Name", inplace=True, axis=1)
dataset.drop("Ticket", inplace=True, axis=1)
dataset.drop("Cabin", inplace=True, axis=1)
# Embarked has only 2 missing values; fill with the most common port, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Predict the missing values with a random forest, with Age as the target
datasetxx = dataset[["Age", "Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]]
known_age = datasetxx[datasetxx["Age"].notnull()].values
unknown_age = datasetxx[datasetxx["Age"].isnull()].values
yy = known_age[:, 0]
XX = known_age[:, 1:]
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(XX, yy)
# Write the predictions back into the dataset (.loc avoids chained assignment)
dataset.loc[dataset["Age"].isnull(), "Age"] = rfr.predict(unknown_age[:, 1:])
X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values
testnum=100
test_classifier()
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.81 minscore=0.67 time=24
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.72 time=33
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.73 minscore=0.31 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.68 time=2
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.42 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.67 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.79 minscore=0.70 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.80 minscore=0.70 time=32
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=9
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.78 minscore=0.66 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.82 minscore=0.70 time=65
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.78 minscore=0.66 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.79 minscore=0.69 time=1
4. Second Round of Parameter Tuning
Using a random forest to estimate the missing data gives roughly the same results as filling with the mean. This time we extract feature values from the Ticket and Cabin fields.
* A Ticket value is roughly a string followed by a number, e.g. "SOTON/OQ 392076"
import re

# Extract the trailing number from a ticket string, or 0 if there is none
def get_ticket(ticket):
    out = re.compile("[0-9]+$").search(ticket)
    if out is None:
        return 0
    else:
        return int(out.group(0))
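As a quick sanity check, here is the same trailing-digits pattern applied to a few sample strings ("SOTON/OQ 392076" is from the text above, "A/5 21171" from the CSV excerpt; "LINE" is an illustrative ticket with no trailing number):

```python
import re

# The same trailing-digits pattern used by get_ticket
pattern = re.compile(r"[0-9]+$")

for s in ["SOTON/OQ 392076", "A/5 21171", "LINE"]:
    m = pattern.search(s)
    print(s, "->", int(m.group(0)) if m else 0)
# SOTON/OQ 392076 -> 392076, A/5 21171 -> 21171, LINE -> 0
```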
dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values; fill with the most common port, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has many missing values; fill with the mean again
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)
# Replace each ticket with its trailing number
dataset['Ticket'] = dataset['Ticket'].apply(get_ticket)
X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Ticket"]].values
y = dataset["Survived"].values
testnum=100
test_classifier()
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.79 minscore=0.63 time=25
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.76 time=41
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.72 minscore=0.43 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.79 minscore=0.71 time=3
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.57 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.71 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.76 minscore=0.64 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.78 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.69 time=36
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=14
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.63 time=1
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.84 minscore=0.72 time=31
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.84 minscore=0.76 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.75 minscore=0.61 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.69 time=1
5. Summary
Parameter tuning has little effect on the Titanic dataset. The better-performing algorithms were:
* RandomForestClassifier
* GradientBoostingClassifier
* AdaBoostClassifier / SVC, etc.
* Neural networks
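For further tuning, the same GridSearchCV mechanism used inside the pipelines above can simply be given wider parameter grids. A minimal sketch on synthetic data (make_classification stands in for the seven Titanic feature columns; the grid values are illustrative, not the grids used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: 7 features, like the Pclass..Embarked matrix above
X, y = make_classification(n_samples=500, n_features=7, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 2))
```

grid.best_estimator_ then holds the refit model with the winning combination, ready for prediction.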