• How to use Datasets and DataLoader in PyTorch for custom text data


    ref:

    https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00

    https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

    https://sparrow.dev/pytorch-dataloader/

    Creating a PyTorch Dataset and managing it with Dataloader keeps your data manageable and helps to simplify your machine learning pipeline. a Dataset stores all your data, and Dataloader is can be used to iterate through the data, manage batches, transform the data, and much more.

    Import libraries

    import pandas as pd
    import torch
    from torch.utils.data import Dataset, DataLoader

    Create a custom Dataset class

    If the original data are as follows:

    in numbers.cvs:

    torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit Dataset and override the following methods:

    • __len__ so that len(dataset) returns the size of the dataset.
    • __getitem__ to support the indexing such that dataset[i] can be used to get iith sample.

    We will read the csv in __init__ but leave the reading of images to __getitem__. This is memory efficient because all the images are not stored in the memory at once but read as required.

    class SeqDataset(Dataset):
        def __init__(self, file_root, max_length) -> None:
            super(SeqDataset).__init__()
    
            self.sentences = pd.read_csv(file_root)
            self.max_length = max_length
    
        def __len__(self):
            return len(self.sentences)
        
        def __getitem__(self, index):
            # 字符串处理
            sentence_a = self.sentences.sentence_a[index][1:-1].split(",")
            sentence_b = self.sentences.sentence_b[index][1:-1].split(",")
            # ['3', '4', '5']
            # ['6', '7', '8']
    
            # listz转array
            sentence_a = np.array([int(x) for x in sentence_a])
            sentence_b = np.array([int(x) for x in sentence_b])
            # array([3, 4, 5])
            # array([6, 7, 8])
    
            # 补齐
            sentence_a = np.pad(sentence_a, (0, self.max_length-sentence_a.shape[0]), 'constant', constant_values=(0,0))
            sentence_b = np.pad(sentence_b, (0, self.max_length-sentence_b.shape[0]), 'constant', constant_values=(0,0))
            # array([3, 4, 5, 0, 0, 0, 0, 0, 0, 0])
            # array([6, 7, 8, 0, 0, 0, 0, 0, 0, 0])
    
            return sentence_a, sentence_b

    Iterating through the dataset

    We can iterate over the created dataset with a for in range loop as before.

    However, we are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing out on:

    • Batching the data
    • Shuffling the data
    • Load the data in parallel using multiprocessing workers.

    torch.utils.data.DataLoader is an iterator which provides all these features. Parameters used below should be clear. One parameter of interest is collate_fn. You can specify how exactly the samples need to be batched using collate_fn. However, default collate should work fine for most use cases.

        dataloader = DataLoader(dataset, batch_size=4,
                            shuffle=False, num_workers=0,  collate_fn=None)
    
        for batch_idx, batch in enumerate(dataloader):
            src, trg = batch
            print(src.shape)
            print(trg.shape)

    Output:

    (deeplearning) ➜  TransformerScratch python generate_data.py
    torch.Size([4, 10]) torch.Size([4, 10])
    tensor([[3, 4, 5, 0, 0, 0, 0, 0, 0, 0],
            [2, 3, 4, 0, 0, 0, 0, 0, 0, 0],
            [3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [5, 6, 7, 8, 9, 0, 0, 0, 0, 0]])
    tensor([[ 6,  7,  8,  0,  0,  0,  0,  0,  0,  0],
            [ 5,  6,  7,  0,  0,  0,  0,  0,  0,  0],
            [ 4,  0,  0,  0,  0,  0,  0,  0,  0,  0],
            [10, 11, 12, 13, 14,  0,  0,  0,  0,  0]])
    torch.Size([4, 10]) torch.Size([4, 10])
    tensor([[4, 5, 0, 0, 0, 0, 0, 0, 0, 0],
            [5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 2, 3, 4, 5, 0, 0, 0, 0, 0],
            [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]])
    tensor([[ 6,  7,  0,  0,  0,  0,  0,  0,  0,  0],
            [ 6,  0,  0,  0,  0,  0,  0,  0,  0,  0],
            [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0],
            [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0]])
    torch.Size([2, 10]) torch.Size([2, 10])
    tensor([[4, 5, 6, 0, 0, 0, 0, 0, 0, 0],
            [3, 4, 5, 0, 0, 0, 0, 0, 0, 0]])
    tensor([[7, 8, 9, 0, 0, 0, 0, 0, 0, 0],
            [6, 7, 8, 0, 0, 0, 0, 0, 0, 0]])

    Full Code

    import torch
    from torch.utils.data import Dataset, DataLoader
    import pandas as pd
    import numpy as np
    import ipdb
    
    
    class SeqDataset(Dataset):
        def __init__(self, file_root, max_length) -> None:
            super(SeqDataset).__init__()
    
            self.sentences = pd.read_csv(file_root)
            self.max_length = max_length
    
        def __len__(self):
            return len(self.sentences)
        
        def __getitem__(self, index):
            # 字符串处理
            sentence_a = self.sentences.sentence_a[index][1:-1].split(",")
            sentence_b = self.sentences.sentence_b[index][1:-1].split(",")
            # ['3', '4', '5']
            # ['6', '7', '8']
    
            # listz转array
            sentence_a = np.array([int(x) for x in sentence_a])
            sentence_b = np.array([int(x) for x in sentence_b])
            # array([3, 4, 5])
            # array([6, 7, 8])
    
            # 补齐
            sentence_a = np.pad(sentence_a, (0, self.max_length-sentence_a.shape[0]), 'constant', constant_values=(0,0))
            sentence_b = np.pad(sentence_b, (0, self.max_length-sentence_b.shape[0]), 'constant', constant_values=(0,0))
            # array([3, 4, 5, 0, 0, 0, 0, 0, 0, 0])
            # array([6, 7, 8, 0, 0, 0, 0, 0, 0, 0])
    
            return sentence_a, sentence_b
    
    
    
    if __name__ == "__main__":
        dataset = SeqDataset("./numbers.csv", 10)
        # print(dataset.__len__())
        # print(dataset.__getitem__(0))
        # print(dataset.__getitem__(6))
    
    
        dataloader = DataLoader(dataset, batch_size=4,
                            shuffle=False, num_workers=0,  collate_fn=None)
    
        for batch_idx, batch in enumerate(dataloader):
            src, trg = batch
            print(src.shape, trg.shape)
            print(src)
            print(trg)
            # ipdb.set_trace()
    个性签名:时间会解决一切
  • 相关阅读:
    解决Redis之MISCONF Redis is configured to save RDB snapshots
    lnmp一键安装redis以及php组件
    多版本PHP使用composer
    composer设置成阿里云镜像
    Yaf 简介
    Ubuntu配置php7.2开机启动
    Ubuntu查看开机日志
    ssh生成公钥
    Ubuntu系统如何分区
    初始化一个yaf项目
  • 原文地址:https://www.cnblogs.com/lfri/p/15479166.html
Copyright © 2020-2023  润新知