• 5-2 Feature Columns (feature_column), from eat_tensorflow2_in_30_days


    5-2 Feature Columns (feature_column)

    Feature columns are typically used for feature engineering on structured (tabular) data; image and text data generally do not need them.

    Overview of feature columns

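    Feature columns can convert categorical features into one-hot encodings, turn continuous features into bucketized features, and build crossed features from several features.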
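
To make the first two transformations concrete, here is a minimal pure-Python sketch of what one-hot encoding and bucketizing produce. It does not call TensorFlow; the helper names `one_hot` and `bucketize` are hypothetical, and the vocabulary and boundaries are borrowed from the Titanic example in this article. The bucket convention (a value equal to a boundary falls into the bucket starting at that boundary) is assumed to match the one documented for bucketized_column.

```python
from bisect import bisect_right

def one_hot(value, vocabulary):
    """Return a one-hot list for `value`; all zeros if it is out of vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]

def bucketize(value, boundaries):
    """Return the bucket index of `value` for sorted `boundaries`.

    Buckets are (-inf, b0), [b0, b1), ..., [b_last, +inf), so a value equal
    to a boundary falls into the bucket that starts at that boundary.
    """
    return bisect_right(boundaries, value)

sex_vocab = ['male', 'female']
age_boundaries = [18, 25, 30, 40, 45, 50, 55, 60, 65]

sex_one_hot = one_hot('female', sex_vocab)
age_bucket = bucketize(22.0, age_boundaries)
# 9 boundaries define 10 buckets, so the one-hot vector has 10 entries.
age_one_hot = one_hot(age_bucket, range(len(age_boundaries) + 1))

print(sex_one_hot)   # [0, 1]
print(age_bucket)    # 1
print(age_one_hot)   # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

A single continuous value such as age therefore becomes a vector with as many entries as there are buckets, which lets a model learn a separate weight for each age range.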

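    To create a feature column, call a function in the tf.feature_column module. The nine most commonly used functions are listed below. Each of them returns either a CategoricalColumn or a DenseColumn object, except bucketized_column, whose return type inherits from both classes.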

    Note: every Categorical Column must ultimately be converted to a Dense Column, via indicator_column or embedding_column, before it can be passed to the model!
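
The difference between the two conversions can be sketched in plain Python. This is an illustration of the idea only, not TensorFlow's implementation: the helpers `indicator` and `embed` are hypothetical, the embedding table holds random placeholder weights (in a real model they are learned), and averaging the looked-up rows is assumed to correspond to a "mean" combiner.

```python
import random

def indicator(ids, num_categories):
    """Multi-hot encode a list of category ids into a dense vector of counts."""
    dense = [0.0] * num_categories
    for i in ids:
        dense[i] += 1.0
    return dense

def embed(ids, table):
    """Look up each id in an embedding table and average the rows."""
    dim = len(table[0])
    rows = [table[i] for i in ids]
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

num_categories, embedding_dim = 3, 2
rng = random.Random(0)
# Placeholder weights; a real embedding table is trained with the model.
table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
         for _ in range(num_categories)]

ids = [2]  # a categorical column yields sparse integer ids like this
print(indicator(ids, num_categories))  # [0.0, 0.0, 1.0]
print(len(embed(ids, table)))          # 2
```

The indicator output grows with the number of categories, whereas the embedding output has a fixed, usually much smaller, width.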

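    • numeric_column: numeric column, the most commonly used.
    • bucketized_column: bucketized column, built from a numeric column; turns one numeric column into several one-hot encoded features.
    • categorical_column_with_identity: categorical identity column, one-hot encoded; equivalent to a bucketized column in which each bucket is a single integer.
    • categorical_column_with_vocabulary_list: categorical vocabulary column, one-hot encoded; the vocabulary is given as a list.
    • categorical_column_with_vocabulary_file: categorical vocabulary column; the vocabulary is given as a file.
    • categorical_column_with_hash_bucket: hashed column, used when the range of integers or the vocabulary is large.
    • indicator_column: indicator column, built from a Categorical Column; one-hot encoded.
    • embedding_column: embedding column, built from a Categorical Column; the embedding vectors are parameters that must be learned. A suggested embedding dimension is the fourth root of the number of categories.
    • crossed_column: crossed column, which can be built from any categorical column except categorical_column_with_hash_bucket.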
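
The hashing, crossing, and embedding-size ideas in the list above can be sketched as follows. This is an assumption-laden illustration: the helpers `hash_bucket`, `cross`, and `suggested_embedding_dim` are hypothetical, and SHA-256 stands in for TensorFlow's own internal hash function, so the bucket ids computed here will not match the ones TensorFlow assigns.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map any value to a stable bucket id in [0, num_buckets)."""
    digest = hashlib.sha256(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

def cross(values, num_buckets):
    """Combine several categorical values into one hashed bucket id."""
    return hash_bucket('_X_'.join(str(v) for v in values), num_buckets)

def suggested_embedding_dim(num_categories):
    """Rule of thumb from the list above: fourth root of the category count."""
    return max(1, round(num_categories ** 0.25))

cabin_bucket = hash_bucket('C85', 32)   # a cabin string hashed into 32 buckets
age_pclass_bucket = cross([1, 3], 15)   # age bucket 1 crossed with pclass 3

print(0 <= cabin_bucket < 32)          # True
print(0 <= age_pclass_bucket < 15)     # True
print(suggested_embedding_dim(32))     # 2
print(suggested_embedding_dim(10000))  # 10
```

Because different values can collide in the same bucket, the bucket count trades memory against the risk of merging unrelated categories.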

    Feature column usage example

    import datetime
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import tensorflow as tf
    from tensorflow.keras import layers, models
    
    # Print a timestamped log message
    def printlog(info):
        nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print('\n' + '=========='*8 + '%s' % nowtime)
        print(info + '...\n\n')
    
    # 1. Build the data pipeline
    printlog('step1: prepare dataset...')
    
    dftrain_raw = pd.read_csv('./data/titanic/train.csv')
    dftest_raw = pd.read_csv('./data/titanic/test.csv')
    
    df_raw = pd.concat((dftrain_raw, dftest_raw))
    
    def prepare_dfdata(df_raw):
        dfdata = df_raw.copy()
        dfdata.columns =  [x.lower() for x in dfdata.columns]
        dfdata = dfdata.rename(columns={'survived': 'label'})
        dfdata = dfdata.drop(['passengerid', 'name'], axis=1)
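        for col, dtype in dict(dfdata.dtypes).items():
            # Check whether the column contains missing values
            if dfdata[col].hasnans:
                # Add an indicator column that flags missing entries
                dfdata[col + '_nan'] = pd.isna(dfdata[col]).astype('int32')
                # Fill numeric columns with the mean and all other columns with ''
                # (np.object, np.str and np.unicode were removed from recent NumPy releases)
                if pd.api.types.is_numeric_dtype(dtype):
                    dfdata[col] = dfdata[col].fillna(dfdata[col].mean())
                else:
                    dfdata[col] = dfdata[col].fillna('')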
        return dfdata
    
    dfdata = prepare_dfdata(df_raw)
    dftrain = dfdata.iloc[0:len(dftrain_raw), :]
    dftest = dfdata.iloc[len(dftrain_raw):, :]
    
    # Build a tf.data.Dataset from a DataFrame
    def df_to_dataset(df, shuffle=True, batch_size=32):
        dfdata = df.copy()
        if 'label' not in dfdata.columns:
            ds = tf.data.Dataset.from_tensor_slices(dfdata.to_dict(orient='list'))
        else:
            labels = dfdata.pop('label')
            ds = tf.data.Dataset.from_tensor_slices((dfdata.to_dict(orient='list'), labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=len(dfdata))
        ds = ds.batch(batch_size)
        return ds
    
    ds_train = df_to_dataset(dftrain)
    ds_test = df_to_dataset(dftest)
    
    """
    ================================================================================2022-06-26 14:43:16
    step1: prepare dataset......
    """
    
    # 2. Define the feature columns
    printlog('step2: make feature columns...')
    
    feature_columns = []
    # Numeric columns
    for col in ['age', 'fare', 'parch', 'sibsp'] + [c for c in dfdata.columns if c.endswith('_nan')]:
        feature_columns.append(tf.feature_column.numeric_column(col))
        
    # Bucketized column
    age = tf.feature_column.numeric_column('age')
    age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 30, 40, 45, 50, 55, 60, 65])
    feature_columns.append(age_buckets)
    
    # Categorical columns
    # Note: every Categorical Column must be converted to a Dense Column (via indicator_column or embedding_column) before it is passed to the model
    sex = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key='sex',
            vocabulary_list=['male', 'female']
        )
    )
    feature_columns.append(sex)
    
    pclass = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key='pclass', vocabulary_list=[1, 2, 3]
        )
    )
    feature_columns.append(pclass)
    
    ticket = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket('ticket', 3)  # 3 hash buckets
    )
    feature_columns.append(ticket)
    
    embarked = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key='embarked',
            vocabulary_list=['S', 'C', 'Q']  # Titanic ports: Southampton, Cherbourg, Queenstown
        )
    )
    feature_columns.append(embarked)
    
    # Embedding column
    cabin = tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket('cabin', 32),
        2
    )
    feature_columns.append(cabin)
    
    # Crossed column
    pclass_cate = tf.feature_column.categorical_column_with_vocabulary_list(
        key='pclass',
        vocabulary_list=[1, 2, 3]
    )
    crossed_feature = tf.feature_column.indicator_column(
        tf.feature_column.crossed_column([age_buckets, pclass_cate], hash_bucket_size=15)
    )
    
    feature_columns.append(crossed_feature)
    
    """
    ================================================================================2022-06-26 14:43:19
    step2: make feature columns......
    """
    
    # 3. Define the model
    printlog('step3: define model...')
    
    tf.keras.backend.clear_session()
    model = tf.keras.Sequential([
        layers.DenseFeatures(feature_columns),  # wrap the feature columns in tf.keras.layers.DenseFeatures
        layers.Dense(64, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    
    """
    ================================================================================2022-06-26 14:43:21
    step3: define model......
    """
    
    # 4. Train the model
    printlog('step4: train model...')
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    history = model.fit(ds_train, validation_data=ds_test, epochs=10)
    
    """
    ================================================================================2022-06-26 14:50:42
    step4: train model......
    
    
    Epoch 1/10
    WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'dict'> input: {'pclass': <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=int32>, 'sex': <tf.Tensor 'ExpandDims_9:0' shape=(None, 1) dtype=string>, 'age': <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=float32>, 'sibsp': <tf.Tensor 'ExpandDims_10:0' shape=(None, 1) dtype=int32>, 'parch': <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int32>, 'ticket': <tf.Tensor 'ExpandDims_11:0' shape=(None, 1) dtype=string>, 'fare': <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=float32>, 'cabin': <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>, 'embarked': <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>, 'age_nan': <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=int32>, 'cabin_nan': <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=int32>, 'embarked_nan': <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=int32>}
    Consider rewriting this model with the Functional API.
    WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'dict'> input: {'pclass': <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=int32>, 'sex': <tf.Tensor 'ExpandDims_9:0' shape=(None, 1) dtype=string>, 'age': <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=float32>, 'sibsp': <tf.Tensor 'ExpandDims_10:0' shape=(None, 1) dtype=int32>, 'parch': <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int32>, 'ticket': <tf.Tensor 'ExpandDims_11:0' shape=(None, 1) dtype=string>, 'fare': <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=float32>, 'cabin': <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>, 'embarked': <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>, 'age_nan': <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=int32>, 'cabin_nan': <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=int32>, 'embarked_nan': <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=int32>}
    Consider rewriting this model with the Functional API.
    18/23 [======================>.......] - ETA: 0s - loss: 0.5058 - accuracy: 0.7882WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'dict'> input: {'pclass': <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=int32>, 'sex': <tf.Tensor 'ExpandDims_9:0' shape=(None, 1) dtype=string>, 'age': <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=float32>, 'sibsp': <tf.Tensor 'ExpandDims_10:0' shape=(None, 1) dtype=int32>, 'parch': <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int32>, 'ticket': <tf.Tensor 'ExpandDims_11:0' shape=(None, 1) dtype=string>, 'fare': <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=float32>, 'cabin': <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>, 'embarked': <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>, 'age_nan': <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=int32>, 'cabin_nan': <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=int32>, 'embarked_nan': <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=int32>}
    Consider rewriting this model with the Functional API.
    23/23 [==============================] - 1s 10ms/step - loss: 0.5035 - accuracy: 0.7823 - val_loss: 0.4723 - val_accuracy: 0.7709
    Epoch 2/10
    23/23 [==============================] - 0s 7ms/step - loss: 0.4636 - accuracy: 0.8006 - val_loss: 0.4668 - val_accuracy: 0.7654
    Epoch 3/10
    23/23 [==============================] - 0s 10ms/step - loss: 0.4410 - accuracy: 0.8160 - val_loss: 0.4457 - val_accuracy: 0.7821
    Epoch 4/10
    23/23 [==============================] - 0s 11ms/step - loss: 0.4418 - accuracy: 0.8104 - val_loss: 0.4493 - val_accuracy: 0.7654
    Epoch 5/10
    23/23 [==============================] - 0s 12ms/step - loss: 0.4266 - accuracy: 0.8343 - val_loss: 0.4597 - val_accuracy: 0.7765
    Epoch 6/10
    23/23 [==============================] - 0s 12ms/step - loss: 0.4232 - accuracy: 0.8272 - val_loss: 0.4441 - val_accuracy: 0.7821
    Epoch 7/10
    23/23 [==============================] - 0s 10ms/step - loss: 0.4201 - accuracy: 0.8315 - val_loss: 0.4567 - val_accuracy: 0.7654
    Epoch 8/10
    23/23 [==============================] - 0s 12ms/step - loss: 0.4178 - accuracy: 0.8244 - val_loss: 0.4493 - val_accuracy: 0.7821
    Epoch 9/10
    23/23 [==============================] - 0s 8ms/step - loss: 0.4201 - accuracy: 0.8188 - val_loss: 0.4399 - val_accuracy: 0.7765
    Epoch 10/10
    23/23 [==============================] - 0s 10ms/step - loss: 0.4190 - accuracy: 0.8202 - val_loss: 0.4649 - val_accuracy: 0.7821
    """
    
    # 5. Evaluate the model
    printlog('step5: eval model...')
    
    model.summary()
    
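    # Jupyter/IPython magics: these two lines only work inside a notebook
    # (matplotlib.pyplot is already imported as plt at the top of the script)
    %matplotlib inline
    %config InlineBackend.figure_format='svg'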
    
    def plot_metric(history, metric):
        train_metrics = history.history[metric]
        val_metrics = history.history['val_' + metric]
        epochs = range(1, len(train_metrics) + 1)
        plt.plot(epochs, train_metrics, 'bo--')
        plt.plot(epochs, val_metrics, 'ro-')
        plt.title('Training and validation ' + metric)
        plt.xlabel('Epochs')
        plt.ylabel(metric)
        plt.legend(['train_'+metric, 'val_'+metric])
        plt.show()
    plot_metric(history, 'accuracy')
    
    """
    ================================================================================2022-06-26 14:57:28
    step5: eval model......
    
    
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_features (DenseFeature multiple                  64        
    _________________________________________________________________
    dense (Dense)                multiple                  2944      
    _________________________________________________________________
    dense_1 (Dense)              multiple                  4160      
    _________________________________________________________________
    dense_2 (Dense)              multiple                  65        
    =================================================================
    Total params: 7,233
    Trainable params: 7,233
    Non-trainable params: 0
    _________________________________________________________________
    """
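
As a sanity check, the parameter counts in the summary above can be reproduced by adding up the width each feature column contributes to the DenseFeatures output. The sketch below is plain arithmetic based on the column definitions in the example; the three `*_nan` flag columns are taken from the input dictionary shown in the training log, and the label strings are only descriptive.

```python
# Width contributed by each feature column in the Titanic example.
feature_widths = {
    'numeric (age, fare, parch, sibsp)': 4,
    'numeric *_nan flags (age, cabin, embarked)': 3,
    'age_buckets (9 boundaries -> 10 buckets)': 10,
    'sex indicator': 2,
    'pclass indicator': 3,
    'ticket hashed indicator': 3,
    'embarked indicator': 3,
    'cabin embedding': 2,
    'age x pclass crossed indicator': 15,
}
input_width = sum(feature_widths.values())

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding_params = 32 * 2  # 32 hash buckets x 2 embedding dimensions
layer_params = [
    embedding_params,               # dense_features
    dense_params(input_width, 64),  # dense
    dense_params(64, 64),           # dense_1
    dense_params(64, 1),            # dense_2
]

print(input_width)        # 45
print(layer_params)       # [64, 2944, 4160, 65]
print(sum(layer_params))  # 7233
```

The only trainable parameters inside DenseFeatures are the cabin embedding weights; all other columns are fixed transformations of the input.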
    

  • Original article: https://www.cnblogs.com/lotuslaw/p/16413547.html