• How can I use Dataset to shuffle an entire large dataset?


    The Dataset.shuffle() implementation is designed for data that can be shuffled in memory; we're considering whether to add support for external-memory shuffles, but this is in the early stages. In case it works for you, here's the usual approach we use when the data are too large to fit in memory:

    Randomly shuffle the entire dataset once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").
    In each epoch:

    1. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).
    2. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.
    3. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard.
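
The three per-epoch steps above can be sketched as a single pipeline. This is a minimal sketch, not a definitive implementation: it assumes the shards are plain-text files with one record per line, and the function name make_shuffled_dataset and its parameter names are hypothetical, chosen only for illustration.

```python
import tensorflow as tf


def make_shuffled_dataset(file_pattern, num_shards, cycle_length, buffer_size):
    """Builds a pipeline that shuffles shard order, then records (hypothetical helper)."""
    # Step 1: list the shard files, then shuffle the filenames. A buffer as
    # large as the number of shards yields a uniform shuffle of the file order.
    # list_files' own shuffling is turned off so the explicit shuffle is the
    # only source of randomness here.
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    files = files.shuffle(num_shards)

    # Step 2: read from cycle_length shards at a time and mix their lines.
    records = files.interleave(
        lambda filename: tf.data.TextLineDataset(filename),
        cycle_length=cycle_length,
    )

    # Step 3: shuffle records within a sliding buffer of buffer_size elements
    # (the value called B in the text).
    return records.shuffle(buffer_size)
```

Because Dataset.shuffle() reshuffles on each iteration by default, iterating over the returned dataset once per epoch should produce a different shard order and record order each time. Only buffer_size records are held in the shuffle buffer at once, which is why the result is an approximation of a full shuffle rather than an exact one.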
  • Original article: https://www.cnblogs.com/crackpotisback/p/9227523.html