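1. Introduction
Titanic: Machine Learning from Disaster is Kaggle's introductory competition; see the competition page for a detailed description. The data can be downloaded from the official site, but you need to register and log in first. The training set is in train.csv and the test set is in test.csv. The feature processing here mainly follows Sina's kernel, Titanic best working Classifier.
Start by inspecting the training set. The output of train.info() shows 891 samples and 12 columns. Leaving aside PassengerId and the target Survived, there are 10 features: passenger name (Name), ticket class (Pclass), sex (Sex), age (Age), number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), ticket number (Ticket), fare (Fare), cabin number (Cabin), and port of embarkation (Embarked). Some of these features have missing values (Age, Cabin and Embarked), and the data mixes discrete, continuous and string values, all of which need handling. Below we analyse each feature and then extract the ones we want.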
    import re

    import numpy as np
    import pandas as pd

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    train.info()
Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 66.2+ KB
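The non-null counts above can be turned into a direct missing-value count per column. Here is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from train.csv); the same two calls should work on the real `train` frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count the missing entries in each column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```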
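2. Feature analysis
With the overall picture in hand, we now look at each feature in turn and check how it affects the survival rate.
(1) Ticket class: Pclass
Ticket class takes three values and is already discrete, so it needs no special handling. The survival rate differs substantially between classes.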
    print(train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())

Output:

       Pclass  Survived
    0       1  0.629630
    1       2  0.472826
    2       3  0.242363
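Because Survived is coded as 0/1, the mean within a group is simply the survival rate of that group. The sketch below uses a made-up toy frame (not the real data) and also reports the group size next to the rate, which helps judge how much to trust each number.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column = fraction of ones = survival rate per class.
summary = toy.groupby("Pclass")["Survived"].agg(["mean", "count"])
print(summary)
```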
(2) Sex: Sex
Sex has a strong effect on the outcome and is an important feature. Its values are strings, so they need to be mapped to 0 and 1 later.
    print(train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean())

Output:

          Sex  Survived
    0  female  0.742038
    1    male  0.188908
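(3) Family size: FamilySize
Here SibSp and Parch are merged into a single feature, family size (the passenger counts as one member), and an IsAlone flag is derived from it.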
    train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
    print(train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())

    train['IsAlone'] = 0
    train.loc[train['FamilySize'] == 1, 'IsAlone'] = 1
    print(train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean())

Output:

       FamilySize  Survived
    0           1  0.303538
    1           2  0.552795
    2           3  0.578431
    3           4  0.724138
    4           5  0.200000
    5           6  0.136364
    6           7  0.333333
    7           8  0.000000
    8          11  0.000000
       IsAlone  Survived
    0        0  0.505650
    1        1  0.303538
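The IsAlone flag can also be written as a single boolean comparison cast to int, which should give the same result as the two-step assignment. A small sketch on made-up SibSp/Parch values:

```python
import pandas as pd

# Toy values, invented for illustration.
toy = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 2, 1]})

# The passenger counts as one family member, hence the +1.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# True/False -> 1/0 in one step.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```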
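(4) Port of embarkation: Embarked
Embarked records where the passenger boarded and takes three values: C, Q and S (Cherbourg, Queenstown and Southampton). It has missing values, which are filled with S. It also needs to be mapped to 0, 1, 2.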
    train['Embarked'] = train['Embarked'].fillna('S')
    print(train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())

Output:

      Embarked  Survived
    0        C  0.553571
    1        Q  0.389610
    2        S  0.339009
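The text fills missing ports with a hard-coded 'S'. One way to arrive at such a fill value without hard-coding it is to use the most frequent value (the mode). This is a sketch of that alternative on made-up data, not necessarily how the original kernel chose 'S'; the 0/1/2 mapping is the one used later in the cleaning step.

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration.
embarked = pd.Series(["S", "C", np.nan, "S", "Q"], name="Embarked")

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

# Map the port letters to integer codes.
codes = filled.map({"S": 0, "C": 1, "Q": 2}).astype(int)
print(most_common, codes.tolist())
```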
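(5) Fare: Fare
Fare is a continuous feature. We split it into four quantile-based bins with pd.qcut, so each bin holds roughly a quarter of the passengers; the bins are later mapped to 0, 1, 2, 3.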
    train['Fare'] = train['Fare'].fillna(train['Fare'].median())
    train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
    print(train[['CategoricalFare', 'Survived']].groupby(['CategoricalFare'], as_index=False).mean())

Output:

      CategoricalFare  Survived
    0       [0, 7.91]  0.197309
    1  (7.91, 14.454]  0.303571
    2    (14.454, 31]  0.454955
    3   (31, 512.329]  0.581081
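To make the choice of pd.qcut concrete, the sketch below bins the same made-up fares two ways: pd.qcut (equal-frequency bins) and pd.cut (equal-width bins). Passing labels=False returns the integer bin code directly. With skewed values like fares, equal-width bins tend to pile most passengers into the lowest bin.

```python
import pandas as pd

# Toy fares, invented for illustration (skewed, like real fares).
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08])

# Equal-frequency: each of the 4 bins gets about the same number of rows.
freq_bin = pd.qcut(fares, 4, labels=False)

# Equal-width: the value range is cut into 4 intervals of the same length.
width_bin = pd.cut(fares, 4, labels=False)

print(freq_bin.value_counts().sort_index().tolist())
print(width_bin.value_counts().sort_index().tolist())
```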
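(6) Age: Age
Age is also continuous and has many missing values. We can treat the missing values as a category of their own and split the remaining ages into five equal-width bins, which gives six categories in total. The actual encoding is done later, in the data cleaning step.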
    age_null_count = train['Age'].isnull().sum()
    # Survival rate among passengers whose age is missing
    print(train['Survived'][train['Age'].isnull()].mean())
    train['CategoricalAge'] = pd.cut(train['Age'], 5)
    print(train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())

Output:

    0.293785310734
         CategoricalAge  Survived
    0    (0.34, 16.336]  0.550000
    1  (16.336, 32.252]  0.369942
    2  (32.252, 48.168]  0.404255
    3  (48.168, 64.084]  0.434783
    4      (64.084, 80]  0.090909
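The "five bins plus one category for missing" idea can be sketched with pd.cut and explicit edges. The edges 16, 32, 48, 64 are the rounded boundaries used later in the cleaning code; the ages themselves are made up, and include_lowest=True is my own addition so that an age of exactly 0 would still land in the first bin.

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration; NaN marks a missing age.
ages = pd.Series([4.0, 22.0, np.nan, 38.0, 54.0, 80.0, np.nan])

# Right-closed bins: [0, 16], (16, 32], (32, 48], (48, 64], (64, 80].
edges = [0, 16, 32, 48, 64, 80]
age_bin = pd.cut(ages, bins=edges, labels=False, include_lowest=True)

# pd.cut leaves missing ages as NaN; give them their own category, 5.
age_bin = age_bin.fillna(5).astype(int)
print(age_bin.tolist())
```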
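(7) Name: Name
Name is string data, and it is hard to mine features from it directly. The approach here is to extract the title from each name and group the rare ones, such as 'Lady', 'Countess', 'Capt' and 'Col', into a single class, which leaves five classes in total.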
    def get_title(name):
        # A title is a run of letters preceded by a space and followed by a period
        title_search = re.search(r' ([A-Za-z]+)\.', name)
        # If a title is found, return it
        if title_search:
            return title_search.group(1)
        return ""

    train['Title'] = train['Name'].apply(get_title)
    print(pd.crosstab(train['Title'], train['Sex']))

    # Group the rare titles together and merge spelling variants
    train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                             'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    train['Title'] = train['Title'].replace('Mlle', 'Miss')
    train['Title'] = train['Title'].replace('Ms', 'Miss')
    train['Title'] = train['Title'].replace('Mme', 'Mrs')
    print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())

Output:

    Sex       female  male
    Title
    Capt           0     1
    Col            0     2
    Countess       1     0
    Don            0     1
    Dr             1     6
    Jonkheer       0     1
    Lady           1     0
    Major          0     2
    Master         0    40
    Miss         182     0
    Mlle           2     0
    Mme            1     0
    Mr             0   517
    Mrs          125     0
    Ms             1     0
    Rev            0     6
    Sir            0     1
        Title  Survived
    0  Master  0.575000
    1    Miss  0.702703
    2      Mr  0.156673
    3     Mrs  0.793651
    4    Rare  0.347826
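The period in the title pattern should be escaped (`\.`). An unescaped `.` matches any character, so the first word after a space can be picked up instead of the title when the surname has more than one word. A small sketch that compares the two patterns; `get_title_unescaped` is a hypothetical helper added only for the comparison, and the names are written in the dataset's "Surname, Title. Given names" style.

```python
import re

def get_title(name):
    # Escaped dot: the letters must be followed by a literal period.
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ""

def get_title_unescaped(name):
    # Hypothetical variant for comparison: '.' matches ANY character.
    match = re.search(r' ([A-Za-z]+).', name)
    return match.group(1) if match else ""

name = "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)"
print(get_title(name))            # title followed by a real period
print(get_title_unescaped(name))  # first word after a space, here part of the surname
```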
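(8) Other features
Ticket and Cabin are not mined here. Ticket values are largely specific to each passenger, so it is hard to extract information from them, and Cabin has too many missing values, so it is left unprocessed.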
    print(train['Ticket'].head(5))

Output:

    0           A/5 21171
    1            PC 17599
    2    STON/O2. 3101282
    3              113803
    4              373450
    Name: Ticket, dtype: object
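3. Data cleaning
Put train and test into one list so that both sets go through the same processing, following the analysis above. The main steps are mapping string features to integers and binning continuous features into integer categories.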
    full_data = [train, test]
    for dataset in full_data:
        # Sex: map to 0/1
        dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)

        # Merge SibSp and Parch into FamilySize and derive the IsAlone flag
        dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
        dataset['IsAlone'] = 0
        dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

        # Title: mapped to 1-5; 0 means no recognised title
        dataset['Title'] = dataset['Name'].apply(get_title)
        dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                     'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
        dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
        title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
        dataset['Title'] = dataset['Title'].map(title_mapping)
        dataset['Title'] = dataset['Title'].fillna(0)

        # Port of embarkation: fill missing values with S, then map the three ports
        dataset['Embarked'] = dataset['Embarked'].fillna('S')
        dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

        # Fare: four categories 0, 1, 2, 3 (the order of these assignments matters)
        dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
        dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
        dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
        dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
        dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
        dataset['Fare'] = dataset['Fare'].astype(int)

        # Age: five bins 0-4; missing values become category 5
        dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
        dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
        dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
        dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
        dataset.loc[dataset['Age'] > 64, 'Age'] = 4
        dataset.loc[dataset['Age'].isnull(), 'Age'] = 5
        dataset['Age'] = dataset['Age'].astype(int)
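The chained .loc assignments for Fare overwrite the column in place, so they only work because they run in ascending order and the codes 0 to 3 are all below the first threshold. As an alternative sketch (not the original kernel's code), the same right-closed bins can be expressed in one pd.cut call with open-ended outer edges. The thresholds 7.91, 14.454 and 31 come from the quartiles found earlier; the fares are made up.

```python
import numpy as np
import pandas as pd

# Toy fares, invented for illustration; several sit exactly on a threshold.
fare = pd.Series([7.25, 7.91, 8.05, 14.454, 20.0, 31.0, 80.0])

# Right-closed bins: (-inf, 7.91], (7.91, 14.454], (14.454, 31], (31, inf).
edges = [-np.inf, 7.91, 14.454, 31, np.inf]
fare_code = pd.cut(fare, bins=edges, labels=False).astype(int)
print(fare_code.tolist())
```

Writing the codes to a new Series, rather than back into the raw column, also keeps the original fares available for later checks.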
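4. Feature selection
Drop the redundant columns and keep only the features we want. The final features, shown below, are Pclass, Sex, Age, Fare, Embarked, FamilySize, IsAlone and Title, eight in total. For training, the Survived column has to be split off as the target.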
    drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch']
    train = train.drop(drop_elements, axis=1)
    # These two helper columns were only added to train during the analysis
    train = train.drop(['CategoricalAge', 'CategoricalFare'], axis=1)
    test = test.drop(drop_elements, axis=1)

    print(train.head(5))

    # Convert to NumPy arrays for model training
    train = train.values
    test = test.values

Output:

       Survived  Pclass  Sex  Age  Fare  Embarked  FamilySize  IsAlone  Title
    0         0       3    1    1     0         0           2        0      1
    1         1       1    0    2     3         1           2        0      3
    2         1       3    0    1     1         0           1        1      2
    3         1       1    0    2     3         0           2        0      3
    4         0       3    1    2     1         0           1        1      1
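Once the frame is a NumPy array, the target can be split off by column position, since Survived is the first column. A minimal sketch using the first three rows of the table above as a toy array; the names `X` and `y` are my own choice.

```python
import numpy as np

# First three rows of the table above: Survived first, then the eight features.
data = np.array([
    [0, 3, 1, 1, 0, 0, 2, 0, 1],
    [1, 1, 0, 2, 3, 1, 2, 0, 3],
    [1, 3, 0, 1, 1, 0, 1, 1, 2],
])

y = data[:, 0]   # target: Survived
X = data[:, 1:]  # features: Pclass ... Title
print(X.shape, y.tolist())
```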
5. Summary
That completes the data processing. The resulting features can now be used to train a model.