dataset
orchard.data_handler.dataset
PyTorch Dataset Definition Module.
This module contains the custom Dataset class for NPZ-based vision datasets, handling the conversion from NumPy arrays to PyTorch tensors and applying image transformations for training and inference.
It supports two loading strategies via classmethod factories:
- `from_npz`: Eager loading into RAM with transforms, subsampling, and PIL conversion.
- `lazy`: Memory-mapped loading for large datasets or lightweight health checks.
Key Components:
VisionDataset: Full-featured dataset with eager and lazy loading modes.
VisionDataset(images, labels, *, transform=None)
Bases: Dataset[tuple[Tensor, Tensor]]
PyTorch Dataset for NPZ-based vision data.
The constructor accepts raw NumPy arrays directly (no I/O). Use the classmethod factories to load from disk:
- `VisionDataset.from_npz(...)`: eager, loads the full split into RAM.
- `VisionDataset.lazy(...)`: memory-mapped, pages loaded on demand.
Initializes the dataset from pre-loaded arrays.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `NDArray[Any]` | Image array with shape | required |
| `labels` | `NDArray[Any]` | Label array, any shape that flattens to | required |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
Source code in orchard/data_handler/dataset.py
from_npz(path, split='train', *, transform=None, max_samples=None, seed=42)
classmethod
Eagerly load a split from an NPZ archive into RAM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the dataset | required |
| `split` | `str` | Dataset split to load ( | `'train'` |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
| `max_samples` | `int \| None` | If set, limits the number of samples (subsampling). | `None` |
| `seed` | `int` | Random seed for deterministic subsampling. | `42` |
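The eager path described above (read a whole split into RAM, then optionally subsample with a fixed seed) can be sketched with NumPy alone. This is an illustrative sketch, not orchard's implementation: the helper `load_split_eager` and the archive keys `train_images` / `train_labels` are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_split_eager(path, split="train", max_samples=None, seed=42):
    """Read one split fully into RAM and optionally subsample it."""
    # Key names such as "train_images" are an assumed layout, not orchard's.
    with np.load(path) as archive:
        images = archive[f"{split}_images"]
        labels = archive[f"{split}_labels"].reshape(-1)

    if max_samples is not None and max_samples < len(images):
        rng = np.random.default_rng(seed)
        # Sampling without replacement; sorting keeps the original sample order.
        keep = np.sort(rng.choice(len(images), size=max_samples, replace=False))
        images, labels = images[keep], labels[keep]
    return images, labels


# Build a tiny synthetic archive to exercise the helper.
tmp_dir = Path(tempfile.mkdtemp())
npz_path = tmp_dir / "toy.npz"
np.savez(
    npz_path,
    train_images=np.zeros((10, 8, 8, 3), dtype=np.uint8),
    train_labels=np.arange(10).reshape(10, 1),
)

first = load_split_eager(npz_path, max_samples=4, seed=42)
second = load_split_eager(npz_path, max_samples=4, seed=42)
```

Calling the helper twice with the same `seed` selects the same samples, which is the property the `seed` parameter is documented to provide.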
lazy(path, split='train', *, transform=None, max_samples=None, seed=42)
classmethod
Memory-mapped load from an NPZ archive (no full RAM copy).
Images are loaded page-by-page on demand. Suitable for large datasets that do not fit in RAM and for lightweight health checks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the | required |
| `split` | `str` | Dataset split to load (default | `'train'` |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
| `max_samples` | `int \| None` | If set, limits the number of samples (subsampling). | `None` |
| `seed` | `int` | Random seed for deterministic subsampling. | `42` |
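How `lazy` maps an NPZ archive is not shown on this page. As a general illustration of on-demand paging, the sketch below memory-maps a plain `.npy` file with NumPy's `mmap_mode`; the file name and array contents are made up for the example.

```python
import tempfile
from pathlib import Path

import numpy as np

# Write a small image array as a plain .npy file (hypothetical file name).
tmp_dir = Path(tempfile.mkdtemp())
npy_path = tmp_dir / "train_images.npy"
np.save(npy_path, np.arange(4 * 2 * 2 * 3, dtype=np.uint8).reshape(4, 2, 2, 3))

# mmap_mode="r" returns a read-only np.memmap; the data is not copied up front.
images = np.load(npy_path, mmap_mode="r")

# Indexing one sample touches only the pages that back it;
# np.array(...) then copies just that sample into ordinary memory.
sample = np.array(images[2])
```

Because only the indexed pages are read, opening the file stays cheap even when the array is far larger than available RAM, which is why this style of loading suits quick health checks.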
__len__()
__getitem__(idx)
Retrieves a standardized sample-label pair.
The image is first converted to a PIL image for compatibility with Torchvision V2 transforms, and the transformed result is returned as a PyTorch tensor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Sample index. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | A pair of (image, label) where image is a tensor and label is a scalar long tensor. |
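The conversion that `__getitem__` describes (NumPy array to PIL image, then to a tensor, with a scalar long label) might look roughly like the sketch below. The helper `to_sample` and its fallback conversion are assumptions for illustration; the actual method may differ, for example in how it handles grayscale images or a missing transform.

```python
import numpy as np
import torch
from PIL import Image


def to_sample(image, label, transform=None):
    """Convert one HWC uint8 image and its label into a tensor pair."""
    pil_image = Image.fromarray(image)  # NumPy array -> PIL image
    if transform is not None:
        image_tensor = transform(pil_image)
    else:
        # Fallback for this sketch only: HWC uint8 -> CHW float in [0, 1].
        array = np.array(pil_image)  # writable copy that torch can share
        image_tensor = torch.from_numpy(array).permute(2, 0, 1).float() / 255.0
    # Flatten first so labels of shape () or (1,) both work.
    flat_label = np.asarray(label).reshape(-1)
    label_tensor = torch.tensor(int(flat_label[0]), dtype=torch.long)
    return image_tensor, label_tensor


image = np.full((8, 8, 3), 255, dtype=np.uint8)
image_tensor, label_tensor = to_sample(image, np.array([3]))
```

In practice a Torchvision `Compose` pipeline would be passed as `transform`; the fallback branch only exists so the sketch runs without torchvision installed.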