Kaggle Titanic Data Processing: A Simple Machine Learning Tutorial

Contents
  1. Reading the data
  2. Data processing
  3. Training the model and predicting
  4. Writing the results
  5. Complete code

Titanic: Machine Learning from Disaster

I drew on the following two public notebooks:
Titanic Data Science Solutions
Introduction to Ensembling/Stacking in Python

View the code for this post on Kaggle:

https://www.kaggle.com/lsxwxs/simple-decisiontree-solution

Reading the data

import pandas as pd
train_df = pd.read_csv('../input/titanic/train.csv')
test_df = pd.read_csv('../input/titanic/test.csv')

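The data is split into a training set (891 rows) and a test set (418 rows). The training set has 12 columns and the test set has 11 (the test set does not include the survival column). The columns are: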

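Column | Meaning | Values | Type | Missing values (train / test)
PassengerId | Passenger ID | train: 1 to 891, test: 892 to 1309 | int | 0 / 0
Survived | Whether the passenger survived | 0 = did not survive, 1 = survived | int | (this is the value to predict)
Pclass | Ticket class | 1 = first class, 2 = second class, 3 = third class | int | 0 / 0
Name | Name | | str | 0 / 0
Sex | Sex | | str | 0 / 0
Age | Age (years) | | float | 177 / 86
SibSp | Number of siblings/spouses aboard | | int | 0 / 0
Parch | Number of parents/children aboard | | int | 0 / 0
Ticket | Ticket number | | str | 0 / 0
Fare | Fare | | float | 0 / 1
Cabin | Cabin number | | str | 687 / 327
Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | str | 2 / 0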
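
The missing-value counts in the table can be reproduced with `isnull().sum()`. Below is a minimal sketch on a few made-up rows (the values are invented for illustration and are not real passenger data); running the same call on `train_df` and `test_df` should give the counts listed above.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the layout of train.csv; not real passenger data.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing entries in each column, as a Series indexed by column name.
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```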

  1. Check missing values and column types

    train_df.info()

  2. View summary statistics

    train_df.describe()
    train_df.describe(include=['O'])

  3. View the relationship between survival rate and a given feature

    train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
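
To look at several features in turn, the same group-and-average pattern can be wrapped in a small helper. This is only a sketch: `survival_rate_by` is a hypothetical name of my own, and the rows are made up so the block runs without the Kaggle files.

```python
import pandas as pd

# Made-up rows for illustration only; not real passenger data.
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})


def survival_rate_by(df, column):
    """Mean of Survived for each value of `column`, highest rate first."""
    return (
        df[[column, "Survived"]]
        .groupby(column, as_index=False)
        .mean()
        .sort_values(by="Survived", ascending=False)
    )


rates = survival_rate_by(toy_df, "Sex")
print(rates)
```

On the real data the call would look like `survival_rate_by(train_df, "Pclass")`.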

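Data processing

There are two things to handle here:

  1. Non-numeric fields need to be converted to numbers
  2. Missing values need to be filled in

The Ticket column carries little useful information and Cabin has too many missing values, so neither is worth much and both can simply be dropped.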

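  1. PassengerId

    No processing needed

  2. Survived

    No processing needed

  3. Pclass

  4. Name

    The name itself is of little value, but the title can be extracted from it: some titles, such as Lady and Dr, are rare and may correspond to a higher social status.

  5. Sex

    Map to numbers

  6. Age

    Fill in missing values

  7. SibSp

  8. Parch

  9. Ticket

  10. Fare

  11. Cabin

  12. Embarked

    Fill in missing values, map to numbers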
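
As a small illustration of the Name step, here is a sketch of pulling the title out of a name with a regular expression. The names are made up but follow the "Surname, Title. Given names" layout used in the dataset, and the list of rare titles is shortened for the example.

```python
import pandas as pd

# Made-up names in the dataset's "Surname, Title. Given names" layout.
names = pd.Series([
    "Smith, Mr. John",
    "Doe, Mrs. Jane",
    "Brown, Miss. Anna",
    "Lee, Dr. Sam",
])

# The title is the run of letters that follows a space and ends with a dot.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Group infrequent titles under a single label (shortened list for the example).
titles = titles.replace(["Dr", "Rev", "Col"], "Rare")
print(titles.tolist())
```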

A simple approach:

# Drop Ticket, Cabin and Name. PassengerId is dropped from the training set only;
# the test set keeps it because the submission file needs it.
train_df = train_df.drop(['Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin', 'Name'], axis=1)

combine = [train_df, test_df]

# Map Sex to numbers
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

# Fill missing ages with the mean age over both datasets
age_mean = (train_df['Age'].sum() + test_df['Age'].sum()) / (train_df['Age'].count() + test_df['Age'].count())
for dataset in combine:
    # Assign the result back instead of calling fillna(inplace=True) on a column
    dataset['Age'] = dataset['Age'].fillna(age_mean)

# Fill the missing Fare value with the median
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].dropna().median())

# Fill missing Embarked values with the most common port of embarkation
freq_port = train_df.Embarked.dropna().mode()[0]
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

# Map Embarked to numbers
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
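
The pooled age mean above is the sum of all known ages divided by the number of known ages. As a sanity check, the sketch below (on made-up ages) shows that it matches what `mean()` gives on the concatenated column, since `sum()`, `count()` and `mean()` all skip missing values by default.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; NaN marks a missing value.
train_age = pd.Series([20.0, np.nan, 40.0])
test_age = pd.Series([30.0, np.nan])

# Formula used in the post: sum of known ages / count of known ages.
pooled_mean = (train_age.sum() + test_age.sum()) / (train_age.count() + test_age.count())

# Equivalent: concatenate both columns and let mean() skip the NaN entries.
concat_mean = pd.concat([train_age, test_age]).mean()

# Fill the gaps and assign the result to a new Series.
filled = train_age.fillna(pooled_mean)
print(pooled_mean, concat_mean, filled.tolist())
```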

A more elaborate approach:

import numpy as np

# Drop Cabin and Ticket
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

# Extract the title from the name
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
# Clean up the titles: group rare ones and merge spelling variants
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
# Map each title to a number. To see the survival rate of each title, run
# train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

# Sex must already be numeric (0/1) here, because the age step compares Sex to i
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

# Fill in missing ages based on sex and passenger class
guess_ages = np.zeros((2, 3))
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) &
                               (dataset['Pclass'] == j+1)]['Age'].dropna()
            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
            age_guess = guess_df.median()
            # Round the guessed age to the nearest 0.5
            guess_ages[i, j] = int(age_guess/0.5 + 0.5) * 0.5

    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1), 'Age'] = guess_ages[i, j]
    dataset['Age'] = dataset['Age'].astype(int)

# Name, Fare and Embarked still need handling (as in the simple approach) before training
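
The nested loops above fill each missing age with the median age of passengers who share the same sex and class. The same idea can be sketched more compactly with `groupby(...).transform('median')`. This is an alternative formulation rather than the code used in this post, it skips the rounding to the nearest 0.5 and the cast to int, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; Sex is already mapped to 0/1.
toy_df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# For every row, the median Age of its (Sex, Pclass) group, ignoring NaN.
group_median = toy_df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Keep known ages, use the group median only where Age is missing.
toy_df["Age"] = toy_df["Age"].fillna(group_median)
print(toy_df["Age"].tolist())
```

Note that a group with no known ages at all would still be left as NaN, so a fallback such as the overall median may be needed on other data.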

Appendix:

  1. Cross-tabulation

    pd.crosstab(train_df['Title'], train_df['Sex'])

Training the model and predicting

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
# Accuracy on the training data itself, in percent (not a held-out estimate)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
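
The score above is measured on the same rows the tree was trained on, so it tends to be optimistic. One common way to get a fairer estimate is k-fold cross-validation with scikit-learn's `cross_val_score`. The sketch below is an addition of mine, not part of the original notebook, and it uses randomly generated features so it runs on its own; with the real data, `X_train` and `Y_train` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 4 numeric features (assumption, not Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# The label depends on the first feature plus noise, so it is learnable but not trivially.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Accuracy on the training rows themselves.
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)

# Accuracy on held-out folds: 5 fits, each scored on the fifth of the data it did not see.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("training accuracy:", train_acc)
print("cross-validated accuracy:", cv_scores.mean())
```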

Writing the results

Submission = pd.DataFrame({'PassengerId': test_df['PassengerId'],
                           'Survived': Y_pred})
Submission.to_csv('submission.csv', index=False)
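
Before uploading, it can help to check that the file has the expected shape: a header row with `PassengerId` and `Survived`, then one row per test passenger. The sketch below writes a tiny made-up submission to an in-memory buffer instead of a file, purely to show what `to_csv(..., index=False)` produces; the predictions are invented.

```python
import io

import pandas as pd

# Made-up predictions for three test-set IDs; illustration only.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

header = csv_lines[0]
print(header)
print(len(csv_lines) - 1, "data rows")
```

With the real test set, the number of data rows should be 418.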

Complete code

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train_df = pd.read_csv('../input/titanic/train.csv')
test_df = pd.read_csv('../input/titanic/test.csv')

# Drop Ticket, Cabin and Name. PassengerId is dropped from the training set only;
# the test set keeps it because the submission file needs it.
train_df = train_df.drop(['Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin', 'Name'], axis=1)

combine = [train_df, test_df]

# Map Sex to 1 and 0
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

# Fill missing ages with the mean age over both datasets
age_mean = (train_df['Age'].sum() + test_df['Age'].sum()) / (train_df['Age'].count() + test_df['Age'].count())
for dataset in combine:
    dataset['Age'] = dataset['Age'].fillna(age_mean)

# Fill the missing Fare value with the median
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].dropna().median())

# Fill missing Embarked values with the most common port of embarkation
freq_port = train_df.Embarked.dropna().mode()[0]
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

# Map Embarked to numbers
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
# Accuracy on the training data itself, in percent (not a held-out estimate)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
print(acc_decision_tree)

# Save the submission
Submission = pd.DataFrame({'PassengerId': test_df['PassengerId'],
                           'Survived': Y_pred})
Submission.to_csv('submission.csv', index=False)
