• Getting Started with Kaggle's Titanic: Feature Engineering


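    1. Introduction

    Titanic: Machine Learning from Disaster is Kaggle's introductory competition. See the competition page for details; the data can be downloaded from the official site, but you need to register and log in first. The training set is in train.csv and the test set is in test.csv. The feature processing here mostly follows Sina's kernel, Titanic best working Classifier.

    Start by inspecting the training set. It has 891 samples and 12 columns. Leaving out PassengerId and the target Survived, there are 10 features: passenger name, ticket class, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, fare, cabin number, and port of embarkation. Some of these features have missing values (Age, Cabin, and Embarked), and the data mixes discrete, continuous, and string types, so each needs its own handling. The sections below analyze the features one by one and then extract the ones we want.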

    import numpy as np
    import pandas as pd
    import re
    
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    train.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 66.2+ KB
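
    The non-null counts above imply the missing values, but it can be easier to count them directly. The following is a minimal sketch on a small hand-made frame (the values are made up for illustration and are not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing entries per column, then express them as a fraction of rows.
missing_count = toy.isnull().sum()
missing_ratio = missing_count / len(toy)
print(pd.DataFrame({"missing": missing_count, "ratio": missing_ratio}))
```

    Applied to the real training set, the same call should report 177 missing values for Age, 687 for Cabin, and 2 for Embarked, which follows from the non-null counts printed by train.info().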

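    2. Feature analysis

    With the overall picture in hand, we look at each feature in turn and at how it relates to the survival rate.

    (1) Ticket class: Pclass

    There are three ticket classes. The feature is already discrete, so it needs no special processing, and the survival rate differs sharply between classes.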

    print (train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())
    
     Pclass  Survived
    0       1  0.629630
    1       2  0.472826
    2       3  0.242363
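
    Because Survived is coded as 0/1, the group mean is the survival rate of that group. A small sketch, on made-up toy data, of printing the group size next to the rate so that rates based on very few passengers are easy to spot:

```python
import pandas as pd

# Toy data for illustration only; not the real Titanic records.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# mean of a 0/1 column = survival rate; count = number of passengers in the group.
summary = toy.groupby("Pclass")["Survived"].agg(["mean", "count"])
print(summary)
```

    The same pattern applies to the other groupings in this section by swapping the column name.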

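    (2) Sex

    Sex has a strong effect on the outcome (about 74% of women survived versus about 19% of men), so it is an important feature. Its values are strings, which are mapped to 0 and 1 in the cleaning step.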

    print (train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean())
    
          Sex  Survived
    0  female  0.742038
    1    male  0.188908

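    (3) Family size: FamilySize

    Here SibSp and Parch are merged into a single feature, family size (the +1 counts the passenger), and from it a second feature is derived: whether the passenger is traveling alone.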

    train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
    print (train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())
    train['IsAlone'] = 0
    train.loc[train['FamilySize'] == 1, 'IsAlone'] = 1
    print (train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean())
    
    FamilySize  Survived
    0           1  0.303538
    1           2  0.552795
    2           3  0.578431
    3           4  0.724138
    4           5  0.200000
    5           6  0.136364
    6           7  0.333333
    7           8  0.000000
    8          11  0.000000
       IsAlone  Survived
    0        0  0.505650
    1        1  0.303538
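
    The two derived columns can also be wrapped in a small helper so that the same logic is applied to any frame. This is a sketch only; the function name add_family_features is hypothetical and does not appear in the original kernel:

```python
import pandas as pd

def add_family_features(df):
    """Return a copy of df with FamilySize and IsAlone columns added."""
    out = df.copy()
    # Siblings/spouses + parents/children + the passenger themselves.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # 1 when the passenger has no relatives aboard, otherwise 0.
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out

# Toy rows for illustration.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})
result = add_family_features(toy)
print(result)
```

    Returning a copy leaves the input frame untouched, which makes the helper safe to call on both train and test.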

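    (4) Port of embarkation: Embarked

    Embarked is the port where the passenger boarded and takes three values: C, Q, and S (Cherbourg, Queenstown, Southampton). It has a few missing values, which are filled with S. It also needs to be mapped to 0, 1, 2 later.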

    train['Embarked'] = train['Embarked'].fillna('S')
    print (train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())
    
      Embarked  Survived
    0        C  0.553571
    1        Q  0.389610
    2        S  0.339009
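
    Filling with 'S' is hardcoded above. An alternative, sketched here on toy data as an assumption rather than as the original author's method, is to fill with whatever value is most frequent:

```python
import pandas as pd

# Toy column for illustration.
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})

# mode() ignores missing values and returns the most frequent value(s); take the first.
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)
print(most_common_port, toy["Embarked"].tolist())
```

    If the fill value is computed this way, it is usually computed on the training set and reused for the test set.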

    (5) Fare

    Fare is continuous. We split it into four equal-frequency bins (quartiles) with pd.qcut, and later map the bins to 0, 1, 2, 3.

    train['Fare'] = train['Fare'].fillna(train['Fare'].median())
    train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
    print (train[['CategoricalFare', 'Survived']].groupby(['CategoricalFare'], as_index=False).mean())
    
      CategoricalFare  Survived
    0       [0, 7.91]  0.197309
    1  (7.91, 14.454]  0.303571
    2    (14.454, 31]  0.454955
    3   (31, 512.329]  0.581081
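
    The cleaning step later copies the bin edges (7.91, 14.454, 31) by hand. As a sketch of an alternative, assuming you would rather let pandas carry the edges over, pd.qcut can return them with retbins=True and pd.cut can apply them to another series. The fares below are made up:

```python
import numpy as np
import pandas as pd

# Toy fares for illustration.
train_fare = pd.Series([5.0, 7.5, 8.0, 10.0, 14.0, 20.0, 30.0, 80.0])
test_fare = pd.Series([6.0, 11.0, 15.0, 500.0])

# Learn quartile edges from the training fares; labels=False yields codes 0..3.
train_codes, edges = pd.qcut(train_fare, 4, labels=False, retbins=True)

# Keep the inner edges but open the outer ones, so test fares outside
# the training range still fall into the first or last bin.
bins = [-np.inf, *edges[1:-1], np.inf]
test_codes = pd.cut(test_fare, bins=bins, labels=False)

print(train_codes.tolist())
print(test_codes.tolist())
```

    Reusing the training edges keeps a given fare in the same category in both sets.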

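    (6) Age

    Age is also continuous and has many missing values (177 of 891). We treat a missing age as a category of its own and cut the remaining ages into five equal-width intervals with pd.cut, which gives six categories in total. The actual encoding is done in the data cleaning step.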

    age_null_count = train['Age'].isnull().sum()
    print(train['Survived'][train['Age'].isnull()].mean())
    train['CategoricalAge'] = pd.cut(train['Age'], 5)
    print (train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())
    
    0.293785310734
         CategoricalAge  Survived
    0    (0.34, 16.336]  0.550000
    1  (16.336, 32.252]  0.369942
    2  (32.252, 48.168]  0.404255
    3  (48.168, 64.084]  0.434783
    4      (64.084, 80]  0.090909
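
    The six categories can also be produced in one pass. The sketch below assumes the rounded cut points 16, 32, 48, 64 that the cleaning code in section 3 uses, and runs on made-up ages:

```python
import numpy as np
import pandas as pd

# Toy ages for illustration, including one missing value.
age = pd.Series([4.0, 22.0, 38.0, 54.0, 70.0, np.nan])

# Right-closed intervals: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf).
bins = [-np.inf, 16, 32, 48, 64, np.inf]
age_code = pd.cut(age, bins=bins, labels=False)

# pd.cut leaves NaN for missing ages; give them their own category, 5.
age_code = age_code.fillna(5).astype(int)
print(age_code.tolist())
```

    Keeping missing ages as a separate category avoids guessing a value for them, at the cost of one more category.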

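    (7) Name

    Name is a string, and it is hard to mine features from it directly. The approach here is to extract the title from each name and to group the rare ones, such as 'Lady', 'Countess', 'Capt', and 'Col', into a single category, which leaves five categories in total.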

    def get_title(name):
        # The title is the word that ends with a period, e.g. "Mr." or "Miss."
        title_search = re.search(r' ([A-Za-z]+)\.', name)
        # If a title is found, return it
        if title_search:
            return title_search.group(1)
        return ""
    train['Title'] = train['Name'].apply(get_title)
    print(pd.crosstab(train['Title'], train['Sex']))
    train['Title'] = train['Title'].replace(['Lady', 'Countess','Capt', 'Col',
         'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    train['Title'] = train['Title'].replace('Mlle', 'Miss')
    train['Title'] = train['Title'].replace('Ms', 'Miss')
    train['Title'] = train['Title'].replace('Mme', 'Mrs')
    print (train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
    
    Sex       female  male
    Title                 
    Capt           0     1
    Col            0     2
    Countess       1     0
    Don            0     1
    Dr             1     6
    Jonkheer       0     1
    Lady           1     0
    Major          0     2
    Master         0    40
    Miss         182     0
    Mlle           2     0
    Mme            1     0
    Mr             0   517
    Mrs          125     0
    Ms             1     0
    Rev            0     6
    Sir            0     1
        Title  Survived
    0  Master  0.575000
    1    Miss  0.702703
    2      Mr  0.156673
    3     Mrs  0.793651
    4    Rare  0.347826
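
    To see what the title pattern does, here is a small standalone sketch run on a few names written in the dataset's "Surname, Title. Given names" format. The helper name extract_title is hypothetical, chosen to keep it separate from the function above:

```python
import re

def extract_title(name):
    """Return the word that precedes the first period, or "" if none is found."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "NoTitleHere",
]
titles = [extract_title(n) for n in samples]
print(titles)
```

    The period in the pattern has to be escaped. With an unescaped `.`, which matches any character, the third name would yield "the" instead of "Countess".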

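    (8) Others

    Ticket and Cabin are not mined here. Ticket values are largely specific to each passenger or travel group and follow no consistent format, so it is hard to extract information from them. Cabin has too many missing values (only 204 of 891 are present), so it is not processed either.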

    train['Ticket'].head(5)
    
    0           A/5 21171
    1            PC 17599
    2    STON/O2. 3101282
    3              113803
    4              373450
    Name: Ticket, dtype: object

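    3. Data cleaning

    Put train and test into a list and process both in the same way, following the analysis above. The work consists mainly of mapping string features to integers and binning continuous features into integer categories.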

    full_data = [train, test]
    for dataset in full_data:
        # Map Sex to 0/1
        dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
        
        # Merge SibSp and Parch into FamilySize, then derive IsAlone from it
        dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
        dataset['IsAlone'] = 0
        dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
        
        # Title codes are 1-5 (Mr, Miss, Mrs, Master, Rare); 0 means no recognized title
        dataset['Title'] = dataset['Name'].apply(get_title)
        dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',
         'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
        dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
        title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
        dataset['Title'] = dataset['Title'].map(title_mapping)
        dataset['Title'] = dataset['Title'].fillna(0)
        
        # Port of embarkation: fill missing values with S, then map the three ports to 0, 1, 2
        dataset['Embarked'] = dataset['Embarked'].fillna('S')
        dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
        
        # Fare: four categories 0, 1, 2, 3 (quartile edges taken from the training set)
        dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
        dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
        dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
        dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
        dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
        dataset['Fare'] = dataset['Fare'].astype(int)
        
        # Age: five bins coded 0-4; missing values become category 5
        dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
        dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
        dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
        dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
        dataset.loc[dataset['Age'] > 64, 'Age'] = 4
        dataset.loc[dataset['Age'].isnull(), 'Age'] = 5
        dataset['Age'] = dataset['Age'].astype(int)
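
    The string-to-integer mappings above all follow the same pattern: map() with a dict, then a fallback for values the dict does not cover. A minimal sketch of that pattern on a toy series, using the title mapping from the loop above:

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

# Toy titles for illustration; "" and "Unknown" are not keys of the mapping.
titles = pd.Series(["Mr", "Mrs", "Rare", "", "Unknown"])

# map() yields NaN for values missing from the dict; fillna(0) turns those into code 0.
codes = titles.map(title_mapping).fillna(0).astype(int)
print(codes.tolist())
```

    Note that Sex and Embarked are mapped without a fallback, so calling astype(int) on them would fail if a value outside the dict ever appeared.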

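    4. Feature selection

    Drop the redundant columns and keep only the features we want. The final set, shown below, has eight features: Pclass, Sex, Age, Fare, Embarked, FamilySize, IsAlone, and Title. For training, the Survived column has to be split off and used as the target.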

    drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp','Parch']
    train = train.drop(drop_elements, axis = 1)
    train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
    test  = test.drop(drop_elements, axis = 1)
    print(train.head(5))
    train = train.values
    test  = test.values
    
       Survived  Pclass  Sex  Age  Fare  Embarked  FamilySize  IsAlone  Title
    0         0       3    1    1     0         0           2        0      1
    1         1       1    0    2     3         1           2        0      3
    2         1       3    0    1     1         0           1        1      2
    3         1       1    0    2     3         0           2        0      3
    4         0       3    1    2     1         0           1        1      1
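
    Since Survived is the first column of the resulting array, splitting off the target is a matter of slicing. A short sketch, using the first three rows printed above as a stand-in for the full array:

```python
import numpy as np

# Same column order as the printed frame: Survived first, the eight features after.
train_array = np.array([
    [0, 3, 1, 1, 0, 0, 2, 0, 1],
    [1, 1, 0, 2, 3, 1, 2, 0, 3],
    [1, 3, 0, 1, 1, 0, 1, 1, 2],
])

y = train_array[:, 0]    # target: Survived
X = train_array[:, 1:]   # feature matrix: the eight remaining columns
print(X.shape, y)
```

    The test array has no Survived column, so it can be passed to a fitted model as is.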

    5. Summary

    That completes the data processing. The resulting features can now be used to train a model.

  • Original post: https://www.cnblogs.com/linhao-0204/p/6524106.html