[PyTorch] Datasets & DataLoaders

Datasets & DataLoaders

PyTorch provides two data primitives (基本資料型態) that allow you to use pre-loaded datasets as well as your own data, as below:

  • torch.utils.data.Dataset

    Stores the samples and their corresponding labels

    Dataset 定義資料的結構並且將其包起來,利如:

    • 一張影像和一個標籤
    • 一張影像和多個標籤
    • 一張影像和 Bounding box 的座標與長寬等
  • torch.utils.data.DataLoader

    DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

    DataLoader 將資料集 (Dataset) 進行包裝,定義如何讀取資料,以及每一個 batch 要讀取多少資料 (batch size)。

一定要先有 Dataset 才可以用 DataLoader 操作

Built-in Datasets

Torchvision provides many built-in datasets in the torchvision.datasets module, as well as utility classes for building your own datasets.

內建的 Dataset 如下:

Example

Fashion-MNIST,詳細: built_in_dataset

# Loading a Dataset
training_data = datasets.FashionMNIST(
        root="data",
        train=True,
        download=True,
        transform=ToTensor()
    )

其他 Built-in Dataset 使用方式請查看官網

Custom Dataset

A custom Dataset class must implement three functions: init, len, and getitem.

建立自己的 Dataset 會需要繼承 torch.utils.data.Dataset,且需要實現三個function:

  • __init__(self): 初始化,進行資料定義,如: self.data, self.label。

    We initialize the directory containing the images, the annotations file, and both transforms (covered in more detail in the next section).

The labels.csv file looks like:

  • __len__(self): 獲取資料長度。

    The __len__ function returns the number of samples in our dataset.

  • __getitem__(self, index): 進行資料前處理(如: Transform)與相關讀取方式。

    接收一個索引 (index),然後返回影像資料和相關標簽。 其中 index 是根據 __len__返回值,如:

    __len__ 返回: 4
    index = 0, 1, 2, 3
    

建立完 Dataset 後用 DataLoader 進行包裝,以方便我們進行 training。如下所示:

1. Creating a Custom Dataset for your files

import os

import numpy as pd
from torch.utils.data import Dataset
from torchvision.io import read_image


class CustomImageDataset(Dataset):
    def __init__(self, img_dir, annotations_file,
                 transform=None, target_transform=None):
        self.img_dir = img_dir
        self.img_labels = pd.read_csv(annotations_file)
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

2. Preparing your data for training with DataLoaders

from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

# Iterate through the DataLoader

Example: Fashion-MNIST

  1. Creating a Custom Dataset for your files,詳細: custom_dataset.py

    import numpy as np
    from torch.utils.data import Dataset
    from torchvision.io import read_image
    
    
    class CustomDatasetFromFile(Dataset):
        def __init__(self, image_file, label_file, transform=None, target_transform=None):
            self.image_file = image_file
            self.label_file = label_file
            self.transform = transform
            self.target_transform = target_transform
    
            with open(self.label_file, 'rb') as lbpath:
                self.labels = np.fromfile(lbpath, dtype=np.uint8)
    
            with open(self.image_file, 'rb') as imgpath:
                self.images = np.fromfile(imgpath, dtype=np.uint8).reshape(
                    len(self.labels), 28, 28)
    
        def __len__(self):
            return len(self.labels)
    
        def __getitem__(self, idx):
            image = self.images[idx]
            label = self.labels[idx]
            if self.transform:
                image = self.transform(image)
            if self.target_transform:
                label = self.target_transform(label)
            return image, label
    
  2. Preparing your data for training with DataLoaders,詳細: main.py

    # Creating a Custom Dataset for your files
    training_data = CustomDatasetFromFile(
        image_file="data/train-images-idx3-ubyte",
        label_file="data/train-labels-idx1-ubyte",
        transform=ToTensor()
    )
    
    test_data = CustomDatasetFromFile(
        image_file="data/t10k-images-idx3-ubyte",
        label_file="data/t10k-labels-idx1-ubyte",
        transform=ToTensor()
    )
    
    # Creating DataLoader
    train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
    
    # Iterate through the DataLoader
    

Reference