Using the TensorFlow Dataset API

My main interest is natural language processing, so the usage of the Dataset API shown below leans toward handling variable-length data.

1. Feeding data from a generator

import numpy as np
import tensorflow as tf


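def gen():
    """Yield 10 (sentence, label) pairs with random sentence lengths."""
    for _ in range(10):
        # Sentence length in [3, 20); word IDs in [1, 100); label in [0, 10).
        sz = np.random.randint(3, 20, 1)[0]
        yield np.random.randint(1, 100, sz), np.random.randint(0, 10, 1)[0]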


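# Build the pipeline: generator -> repeat twice -> shuffle -> pad and batch.
# padded_shapes=([None], []) pads each x to the longest sequence in its batch
# and leaves the scalar label y unchanged.
dataset = (
    tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32))
    .repeat(2)
    .shuffle(buffer_size=100)
    .padded_batch(3, padded_shapes=([None], []))
)
# TF 1.x API; named `iterator` to avoid shadowing the built-in iter().
iterator = dataset.make_one_shot_iterator()
x, y = iterator.get_next()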


with tf.Session() as sess:
    try:
        while True:
            _x, _y = sess.run([x, y])
            print("x is :\n", _x)
            print("y is :\n", _y)
            print("*" * 50)

    except tf.errors.OutOfRangeError:
        # The one-shot iterator is exhausted after two passes over gen().
        print("done")
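
As a rough sanity check on what the loop should print, the batch sizes can be worked out by hand. gen() yields 10 examples, repeat(2) makes that 20, and padded_batch(3) keeps the final partial batch by default, so one would expect six batches of 3 followed by one batch of 2. A minimal plain-Python sketch of that arithmetic follows; the helper name batch_sizes is hypothetical and not part of TensorFlow.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of the batches when the final partial batch is kept."""
    full, remainder = divmod(num_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


# 10 examples per pass over gen(), 2 passes, batches of 3.
sizes = batch_sizes(num_examples=10 * 2, batch_size=3)
print(sizes)       # [3, 3, 3, 3, 3, 3, 2]
print(len(sizes))  # 7
```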

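The output is shown below. We can think of x as a sentence stored as a sequence of word IDs, and y as the class label of that sentence. Because sentences differ in length, each batch is padded with 0 up to the longest sentence in that batch, so every sentence within a batch has the same length while the width can still vary from one batch to the next.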

x is :
 [[41 57 68 84 40 72 98 71 95 50 94 17 78 60 69 29 77]
 [55 44 11 70 39 39 97 86 71 20  0  0  0  0  0  0  0]
 [12 36 75 49 86  0  0  0  0  0  0  0  0  0  0  0  0]]
y is :
 [4 1 9]
**************************************************
x is :
 [[59 33 64 47 20 53 93 68 73 57 68 59 34]
 [69 39 12 83 54 11 92 89 60 21 30 30 31]
 [19 32 62  9 66 34 85 86 22 33 19 79 28]]
y is :
 [8 1 5]
**************************************************
x is :
 [[47 24 96 38 21 53 78 52 74 15 87 37 21 29 45 61 19 56 73]
 [ 1 24 32  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [73 52 14 11 83 77 83 24 34  0  0  0  0  0  0  0  0  0  0]]
y is :
 [9 4 4]
**************************************************
x is :
 [[34 21 36 17 90 96 19  3 28 60 87 93  4 41 22 89 70 83 58]
 [70 25 84 42 45 29 40  0  0  0  0  0  0  0  0  0  0  0  0]
 [97 72 19 73  7  9 83 46 72 64 98 13 78 94 66 10 30 46 13]]
y is :
 [9 9 4]
**************************************************
x is :
 [[33 27 59 45 79 21 57 17 46 24 67 64 83 95 59 65  7 26 82]
 [84 31 48 91  7 51 14 71 17 40 89 44 25 17 42 13 99  0  0]
 [63 97 45 49 68 70 79 28 90  4 68 77 27  0  0  0  0  0  0]]
y is :
 [8 1 8]
**************************************************
x is :
 [[62 19 42 88  3 16 20 38  5 59]
 [99 84 87 10  8 13  0  0  0  0]
 [44 45 45 58 34 53  8 54  0  0]]
y is :
 [1 1 4]
**************************************************
x is :
 [[77 51 44 51  2 38 60 46 12 78 20 15 23 57]
 [25 81 23 22  0  0  0  0  0  0  0  0  0  0]]
y is :
 [4 5]
**************************************************
done
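
To make the padding behaviour concrete, here is a minimal plain-Python sketch of what happens to one batch. It only illustrates the idea and is not how TensorFlow implements padded_batch; the helper names pad_batch and true_lengths are hypothetical. The three sentences are the unpadded rows of the second-to-last batch in the output above. Recovering the lengths this way assumes 0 never occurs as a real word ID, which holds here because gen() draws IDs from 1 to 99.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


def true_lengths(padded, pad_value=0):
    """Recover original lengths, assuming pad_value is never a real word ID."""
    return [sum(1 for token in row if token != pad_value) for row in padded]


# Unpadded word IDs of one batch from the sample output.
batch = [
    [62, 19, 42, 88, 3, 16, 20, 38, 5, 59],
    [99, 84, 87, 10, 8, 13],
    [44, 45, 45, 58, 34, 53, 8, 54],
]
padded = pad_batch(batch)
lengths = true_lengths(padded)
for row in padded:
    print(row)
print(lengths)  # [10, 6, 8]
```

The recovered lengths are what a downstream model would typically use to build a mask, so that the padded positions do not contribute to the loss.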

Original article: https://www.cnblogs.com/crackpotisback/p/10131431.html