Datasets & DataLoaders
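PyTorch provides two data primitives, torch.utils.data.Dataset and torch.utils.data.DataLoader, that allow you to use pre-loaded datasets as well as your own data:
Dataset stores the samples and their corresponding labels.
A Dataset defines the structure of the data and wraps it, for example:
- one image and one label
- one image and multiple labels
- one image and the coordinates, width, and height of a bounding box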
DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
A DataLoader wraps a Dataset and defines how the data is read and how many samples each batch contains (the batch size).
A Dataset must exist before a DataLoader can operate on it.
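The relationship between the two primitives can be sketched with toy tensors. This is a minimal illustration, not part of the original notes: the tensor shapes and the batch size of 4 are arbitrary assumptions, and TensorDataset stands in for any Dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 3 features each, plus integer labels.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)

# The Dataset comes first: it pairs each sample with its label.
dataset = TensorDataset(features, labels)

# The DataLoader then decides how samples are grouped into batches.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batch_sizes = [len(batch_labels) for _, batch_labels in loader]
print(batch_sizes)  # 10 samples in batches of 4 -> [4, 4, 2]
```

The last batch is smaller because 10 is not a multiple of 4 and DataLoader keeps the incomplete batch by default.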
Built-in Datasets
Torchvision provides many built-in datasets in the torchvision.datasets module, as well as utility classes for building your own datasets.
The built-in Datasets are listed in the torchvision.datasets documentation.
Example
Fashion-MNIST (details: built_in_dataset)
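# Loading a Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor

# Training split of Fashion-MNIST, downloaded into ./data if not already present
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)
# Test split (train=False), used by the DataLoader step further down
test_data = datasets.FashionMNIST(root="data", train=False, download=True, transform=ToTensor())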
For how to use the other built-in Datasets, see the official website.
Custom Dataset
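A custom Dataset class must implement three functions: __init__, __len__, and __getitem__.
To build your own Dataset, subclass torch.utils.data.Dataset and implement these three functions:
- __init__(self): initialization. Define the data here, e.g. self.data and self.label. In the example, we initialize the directory containing the images, the annotations file, and both transforms (covered in more detail in the next section). The labels.csv annotations file holds one row per image, with the image file name in the first column and the label in the second.
- __len__(self): returns the length of the data, i.e. the number of samples in our dataset.
- __getitem__(self, index): performs data preprocessing (e.g. a Transform) and the associated loading. It receives an index and returns the image data and its label. The valid values of index follow from the return value of __len__: if __len__ returns 4, index is 0, 1, 2, or 3.
After the Dataset is created, wrap it in a DataLoader to make training convenient, as shown below: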
1. Creating a Custom Dataset for your files
import os
import pandas as pd
from torch.utils.data import Dataset
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, img_dir, annotations_file,
                 transform=None, target_transform=None):
        self.img_dir = img_dir
        # One row per image: file name in column 0, label in column 1
        self.img_labels = pd.read_csv(annotations_file)
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        # Number of samples = number of rows in the annotations file
        return len(self.img_labels)

    def __getitem__(self, idx):
        # Build the image path from the file name stored in column 0
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)  # loads the image as a tensor
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
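The index contract described earlier (index runs from 0 to the return value of __len__ minus one) can be checked without any image files. The sketch below uses a hypothetical in-memory ToyDataset with made-up values; it is only an illustration of the three required functions, not a replacement for CustomImageDataset.

```python
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset, used only to illustrate the index contract."""

    def __init__(self, transform=None, target_transform=None):
        self.data = [10, 20, 30, 40]   # four made-up samples
        self.label = [0, 1, 0, 1]      # one label per sample
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)          # 4 samples

    def __getitem__(self, index):
        sample, label = self.data[index], self.label[index]
        if self.transform:
            sample = self.transform(sample)
        if self.target_transform:
            label = self.target_transform(label)
        return sample, label


toy = ToyDataset(transform=lambda x: x / 10)
valid_indices = list(range(len(toy)))  # [0, 1, 2, 3]
items = [toy[i] for i in valid_indices]
print(valid_indices, items)
```

The transform is applied inside __getitem__, so each sample is processed only when it is requested.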
2. Preparing your data for training with DataLoaders
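from torch.utils.data import DataLoader
# training_data and test_data are Dataset objects, e.g. the train=True and train=False splits of Fashion-MNIST
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
# Iterate through the DataLoader
train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")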
Example: Fashion-MNIST
Creating a Custom Dataset for your files (details: custom_dataset.py)
import numpy as np
from torch.utils.data import Dataset

class CustomDatasetFromFile(Dataset):
    def __init__(self, image_file, label_file,
                 transform=None, target_transform=None):
        self.image_file = image_file
        self.label_file = label_file
        self.transform = transform
        self.target_transform = target_transform
        with open(self.label_file, 'rb') as lbpath:
            # Skip the 8-byte IDX header (magic number, item count)
            self.labels = np.fromfile(lbpath, dtype=np.uint8, offset=8)
        with open(self.image_file, 'rb') as imgpath:
            # Skip the 16-byte IDX header (magic number, count, rows, columns)
            self.images = np.fromfile(imgpath, dtype=np.uint8, offset=16).reshape(
                len(self.labels), 28, 28)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = self.images[idx]
        label = self.labels[idx]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
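The raw Fashion-MNIST files use the IDX layout, in which a small header precedes the data bytes: 8 bytes for a label file and 16 bytes for an image file. The sketch below writes a tiny label file and reads it back to show why that header has to be skipped. The file name and the three label values are hypothetical, and the offset argument of np.fromfile assumes NumPy 1.17 or newer.

```python
import os
import struct
import tempfile

import numpy as np

# Build a tiny, hypothetical IDX label file: 8-byte header + 3 label bytes.
tmp_dir = tempfile.mkdtemp()
label_path = os.path.join(tmp_dir, "toy-labels-idx1-ubyte")
with open(label_path, "wb") as f:
    f.write(struct.pack(">II", 2049, 3))  # magic number 2049, item count 3
    f.write(bytes([7, 2, 9]))             # the labels themselves

# Parse the big-endian header, then load only the payload after it.
with open(label_path, "rb") as f:
    magic, count = struct.unpack(">II", f.read(8))
labels = np.fromfile(label_path, dtype=np.uint8, offset=8)

# Reading without the offset also picks up the 8 header bytes.
raw = np.fromfile(label_path, dtype=np.uint8)
print(magic, count, labels.tolist(), len(raw))  # 2049 3 [7, 2, 9] 11
```

Comparing count with len(labels) is a cheap sanity check before reshaping the image array.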
Preparing your data for training with DataLoaders (details: main.py)
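from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from custom_dataset import CustomDatasetFromFile  # assumes the class above is saved as custom_dataset.py

# Creating a Custom Dataset for your files
training_data = CustomDatasetFromFile(
    image_file="data/train-images-idx3-ubyte",
    label_file="data/train-labels-idx1-ubyte",
    transform=ToTensor()
)
test_data = CustomDatasetFromFile(
    image_file="data/t10k-images-idx3-ubyte",
    label_file="data/t10k-labels-idx1-ubyte",
    transform=ToTensor()
)

# Creating DataLoader
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

# Iterate through the DataLoader
for images, labels in train_dataloader:
    print(images.shape, labels.shape)  # shapes of one batch of images and labels
    break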