Classify structured data using Keras Preprocessing Layers

对于既有数值特征，又有类别特征的输入情况，使用 keras的预处理层进行转换。

https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb?hl=ar-bh#scrollTo=sMYQvJuBi7MS

This tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). You will use Keras to define the model, and preprocessing layers as a bridge to map from columns in a CSV to features used to train the model. This tutorial contains complete code to:

Load a CSV file using Pandas.

Build an input pipeline to batch and shuffle the rows using tf.data.

Map from columns in the CSV to features used to train the model using Keras Preprocessing layers.

Build, train, and evaluate a model using Keras.

Note: This tutorial is similar to Classify structured data with feature columns. This version uses new experimental Keras Preprocessing Layers instead of tf.feature_column. Keras Preprocessing Layers are more intuitive, and can be easily included inside your model to simplify deployment.

处理效果

数值型，定义输入，进行正规化变换。

类别性，转换为整数，对整数进行类别编码， one-hot-code

`CategoryEncoding`

https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/CategoryEncoding

This layer provides options for condensing data into a categorical encoding. It accepts integer values as inputs and outputs a dense representation (one sample = 1-index tensor of float values representing data about the sample's tokens) of those inputs.

类别编码对应的CODE

index -- 字符查询表，根据字符，查到index

encoder-- 根据index查询到的index，生成one-hot码

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a StringLookup layer which will turn strings into integer indices
  if dtype == 'string':
    index = preprocessing.StringLookup(max_tokens=max_tokens)
  else:
    index = preprocessing.IntegerLookup(max_values=max_tokens)

  # Prepare a Dataset that only yields our feature
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Create a Discretization for our integer indices.
  encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())

  # Apply one-hot encoding to our indices. The lambda function captures the
  # layer so we can use them, or include them in the functional model later.
  return lambda feature: encoder(index(feature))

测试输出

type_col = train_features['Type']
layer = get_category_encoding_layer('Type', train_ds, 'string')
layer(type_col)

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

数值型处理

生成一个正则转换器。

def get_normalization_layer(name, dataset):
  # Create a Normalization layer for our feature.
  normalizer = preprocessing.Normalization()

  # Prepare a Dataset that only yields our feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer

例子，将对应特征，做正规化处理。

photo_count_col = train_features['PhotoAmt']
layer = get_normalization_layer('PhotoAmt', train_ds)
layer(photo_count_col)

<tf.Tensor: shape=(5, 1), dtype=float32, numpy=
array([[ 1.3705449 ],
       [ 0.74395925],
       [-0.19591942],
       [-0.8225052 ],
       [-0.8225052 ]], dtype=float32)>

预处理管线定义

all_input 存储所有的输入层单元

encoded_features 存储所有的输入层单元，经过预处理管线后的逻辑单元，例如正规化对象，和 ONE-HOT 编码对象。

all_inputs = []
encoded_features = []

# Numeric features.
for header in ['PhotoAmt', 'Fee']:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)




# Categorical features encoded as string.
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']
for header in categorical_cols:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(header, train_ds, dtype='string',
                                               max_tokens=5)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)

使用 concatenate 接口，将所有的编码特征，连接起来。

构成统一的特征向量输出。

然后，将统一的特征向量输出，连接到第一层稠密层单元，然后连接到第二层稠密层单元。

第二层稠密层单元，就是作为模型输出。

tf.keras.Model 模型的一个参数，是对所有输入输出的规约；

第二个参数，是经过一些列预处理和模型本身的管线输出，模型输出的规约。

all_features = tf.keras.layers.concatenate(encoded_features)


x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])

tf.keras.layers.concatenate

https://www.tensorflow.org/api_docs/python/tf/keras/layers/concatenate

tensor

https://www.tensorflow.org/guide/tensor

print("Type of every element:", rank_4_tensor.dtype)
print("Number of axes:", rank_4_tensor.ndim)
print("Shape of tensor:", rank_4_tensor.shape)
print("Elements along axis 0 of tensor:", rank_4_tensor.shape[0])
print("Elements along the last axis of tensor:", rank_4_tensor.shape[-1])
print("Total number of elements (3*2*4*5): ", tf.size(rank_4_tensor).numpy())

Type of every element: <dtype: 'float32'>
Number of axes: 4
Shape of tensor: (3, 2, 4, 5)
Elements along axis 0 of tensor: 3
Elements along the last axis of tensor: 5
Total number of elements (3*2*4*5):  120

A rank-4 tensor, shape: `[3, 2, 4, 5]`

出处：http://www.cnblogs.com/lightsong/ 本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接。

相关阅读:
$("").append无反应
 go 客户端、服务端
 go mysql insert变量到数据库
 .gvfs: Permission denied
go笔记
 java socket通信笔记
 （转）linux中top命令下显示出的PRNIRESSHRS\%MEM TIME+都代表什么
 adb Android Debug Bridge 安卓调试桥
 一阶段冲刺（四）
一阶段冲刺（三）
原文地址：https://www.cnblogs.com/lightsong/p/14750273.html