# data_handler

`orchard.data_handler`

Data Handler Package.

This package manages the end-to-end data pipeline, from downloading raw NPZ files via the Dataset Registry to providing fully configured PyTorch DataLoaders.
## `VisionDataset(images, labels, *, transform=None)`

Bases: `Dataset[tuple[Tensor, Tensor]]`

PyTorch Dataset for NPZ-based vision data.

The constructor accepts raw NumPy arrays directly (no I/O). Use the classmethod factories to load from disk:

- `VisionDataset.from_npz(...)` — eager, loads the full split into RAM.
- `VisionDataset.lazy(...)` — memory-mapped, pages loaded on demand.
Initializes the dataset from pre-loaded arrays.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `NDArray[Any]` | Image array; the first dimension indexes samples. | required |
| `labels` | `NDArray[Any]` | Label array; any shape that flattens to `(N,)`. | required |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
Source code in orchard/data_handler/dataset.py
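As a rough sketch of the eager loading path, the snippet below builds a tiny NPZ archive and reads one split with NumPy. The `train_images`/`train_labels` key layout is an assumption for illustration, not necessarily the archive format `orchard` uses.

```python
import os
import tempfile

import numpy as np

# Build a tiny archive in an assumed MedMNIST-style key layout.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "toy.npz")
np.savez(
    path,
    train_images=np.zeros((10, 28, 28, 3), dtype=np.uint8),
    train_labels=np.arange(10, dtype=np.int64),
)

# Eager path (from_npz-style): materialize the whole split in RAM.
with np.load(path) as archive:
    images = archive["train_images"]       # read only on key access
    labels = archive["train_labels"].reshape(-1)

print(images.shape, labels.shape)  # (10, 28, 28, 3) (10,)
```

Note that `NpzFile` already reads member arrays only when a key is accessed, which is what makes a lazy, health-check-friendly variant possible.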
### `from_npz(path, split='train', *, transform=None, max_samples=None, seed=42)` (classmethod)
Eagerly load a split from an NPZ archive into RAM.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the dataset `.npz` file. | required |
| `split` | `str` | Dataset split to load. | `'train'` |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
| `max_samples` | `int \| None` | If set, limits the number of samples (subsampling). | `None` |
| `seed` | `int` | Random seed for deterministic subsampling. | `42` |
Source code in orchard/data_handler/dataset.py
### `lazy(path, split='train', *, transform=None, max_samples=None, seed=42)` (classmethod)
Memory-mapped load from an NPZ archive (no full RAM copy).
Images are loaded page-by-page on demand. Suitable for large datasets that do not fit in RAM and for lightweight health checks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the dataset `.npz` file. | required |
| `split` | `str` | Dataset split to load. | `'train'` |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
| `max_samples` | `int \| None` | If set, limits the number of samples (subsampling). | `None` |
| `seed` | `int` | Random seed for deterministic subsampling. | `42` |
Source code in orchard/data_handler/dataset.py
### `__len__()`

Returns the number of samples in the dataset.

### `__getitem__(idx)`
Retrieves a standardized sample-label pair.
The image is converted to a PIL object to ensure compatibility with Torchvision V2 transforms before being returned as a PyTorch Tensor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Sample index. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | A pair of `(image, label)`, where `image` is a tensor and `label` is a scalar long tensor. |
Source code in orchard/data_handler/dataset.py
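The numpy-to-PIL step described above can be sketched as follows; the grayscale squeeze is an assumption about how single-channel samples are handled before handing them to Torchvision v2 transforms.

```python
import numpy as np
from PIL import Image

def to_pil(sample: np.ndarray) -> Image.Image:
    # PIL expects HxW (no trailing channel axis) for grayscale images.
    if sample.ndim == 3 and sample.shape[-1] == 1:
        sample = sample.squeeze(-1)
    return Image.fromarray(sample.astype(np.uint8))

img = to_pil(np.zeros((28, 28, 1), dtype=np.uint8))
print(img.mode, img.size)  # L (28, 28)
```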
## `DatasetData(path, name, is_rgb, num_classes)` (dataclass)
Metadata container for a loaded dataset.
Stores path and format info instead of raw arrays to save RAM.
## `DataLoaderFactory(dataset_cfg, training_cfg, aug_cfg, num_workers, metadata)`
Orchestrates the creation of optimized PyTorch DataLoaders.
This factory centralizes the configuration of training, validation, and testing pipelines. It ensures that data transformations, class balancing, and hardware settings are synchronized across all splits.
Attributes:

| Name | Type | Description |
|---|---|---|
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config. |
| `training_cfg` | `TrainingConfig` | Training sub-config. |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config. |
| `num_workers` | `int` | Resolved worker count from the hardware config. |
| `metadata` | `DatasetData` | Data path and raw format information. |
| `ds_meta` | `DatasetMetadata` | Official dataset registry specifications. |
| `logger` | `Logger` | Module-specific logger. |
Initializes the factory with environment and dataset metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config (splits, classes, resolution). | required |
| `training_cfg` | `TrainingConfig` | Training sub-config (batch size, seed). | required |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config (transforms pipeline). | required |
| `num_workers` | `int` | Resolved worker count from the hardware config. | required |
| `metadata` | `DatasetData` | Metadata from the data fetcher/downloader. | required |
Source code in orchard/data_handler/loader.py
### `build(is_optuna=False)`
Constructs and returns the full suite of DataLoaders.
Assembles train/val/test splits with transforms, optional class balancing, and hardware-aware infrastructure settings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `is_optuna` | `bool` | If True, use memory-conservative settings for hyperparameter tuning (fewer workers, no persistent workers). | `False` |

Returns:

| Type | Description |
|---|---|
| `tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]` | A tuple of `(train_loader, val_loader, test_loader)`. |
Source code in orchard/data_handler/loader.py
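The memory-conservative switch can be pictured as a small kwargs selector of the following shape. The specific values (worker cap of 2, `pin_memory=True`) are illustrative assumptions, not the factory's real settings.

```python
def loader_kwargs(num_workers: int, *, is_optuna: bool = False) -> dict:
    # Hypothetical sketch: Optuna trials run many loaders concurrently,
    # so cap workers and skip persistent workers to bound RAM usage.
    workers = min(num_workers, 2) if is_optuna else num_workers
    return {
        "num_workers": workers,
        "persistent_workers": workers > 0 and not is_optuna,
        "pin_memory": True,
    }

print(loader_kwargs(8, is_optuna=True))
# {'num_workers': 2, 'persistent_workers': False, 'pin_memory': True}
```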
## `show_sample_images(loader, save_path, *, mean=None, std=None, arch_name='Model', fig_dpi=_DEFAULT_DPI, num_samples=16, title_prefix=None)`
Extract a batch from the DataLoader and save a grid of sample images.
Saves images with their corresponding labels to verify data integrity and augmentations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `loader` | `DataLoader[Any]` | The PyTorch DataLoader to sample from. | required |
| `save_path` | `Path` | Full path (including filename) for the saved image. | required |
| `mean` | `tuple[float, ...] \| None` | Per-channel mean for denormalization. | `None` |
| `std` | `tuple[float, ...] \| None` | Per-channel std for denormalization. | `None` |
| `arch_name` | `str` | Architecture name for the figure title. | `'Model'` |
| `fig_dpi` | `int` | DPI for the saved figure. | `_DEFAULT_DPI` |
| `num_samples` | `int` | Number of images to display in the grid. | `16` |
| `title_prefix` | `str \| None` | Optional string to prepend to the figure title. | `None` |
Source code in orchard/data_handler/data_explorer.py
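The `mean`/`std` denormalization mentioned above inverts Torchvision's `Normalize` before plotting. A minimal sketch, assuming CHW layout and values clipped into `[0, 1]` for display:

```python
import numpy as np

def denormalize(img: np.ndarray, mean, std) -> np.ndarray:
    # Invert Normalize for display: x * std + mean, applied per channel.
    mean = np.asarray(mean).reshape(-1, 1, 1)
    std = np.asarray(std).reshape(-1, 1, 1)
    return np.clip(img * std + mean, 0.0, 1.0)

normalized = np.full((3, 2, 2), -1.0)
restored = denormalize(normalized, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
print(restored[0, 0, 0])  # 0.0
```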
## `show_samples_for_dataset(loader, dataset_name, run_paths, *, mean=None, std=None, arch_name='Model', fig_dpi=_DEFAULT_DPI, num_samples=16, resolution=None)`
Generate a grid of sample images from a dataset and save to the figures directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `loader` | `DataLoader[Any]` | PyTorch DataLoader to sample images from. | required |
| `dataset_name` | `str` | Name of the dataset, used in the filename and title. | required |
| `run_paths` | `RunPaths` | RunPaths instance used to resolve the figure saving path. | required |
| `mean` | `tuple[float, ...] \| None` | Per-channel mean for denormalization. | `None` |
| `std` | `tuple[float, ...] \| None` | Per-channel std for denormalization. | `None` |
| `arch_name` | `str` | Architecture name for the figure title. | `'Model'` |
| `fig_dpi` | `int` | DPI for the saved figure. | `_DEFAULT_DPI` |
| `num_samples` | `int` | Number of images to include in the grid. | `16` |
| `resolution` | `int \| None` | Resolution, included in the filename to avoid overwriting. | `None` |
Source code in orchard/data_handler/data_explorer.py
## `create_synthetic_dataset(num_classes=8, samples=100, resolution=28, channels=3, name='syntheticmnist')`
Create a synthetic NPZ-compatible dataset for testing.
This function generates random image data and labels, saves them to a temporary .npz file, and returns a DatasetData object that can be used with the existing data pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `num_classes` | `int` | Number of classification categories. | `8` |
| `samples` | `int` | Number of training samples. | `100` |
| `resolution` | `int` | Image resolution (HxW). | `28` |
| `channels` | `int` | Number of color channels (3 for RGB). | `3` |
| `name` | `str` | Dataset name for identification. | `'syntheticmnist'` |
Returns:

| Name | Type | Description |
|---|---|---|
| `DatasetData` | `DatasetData` | A data object compatible with the existing pipeline. |
Example:

```python
>>> data = create_synthetic_dataset(num_classes=8, samples=100)
>>> train_loader, val_loader, test_loader = get_dataloaders(
...     data, cfg.dataset, cfg.training, cfg.augmentation, cfg.num_workers
... )
```
Source code in orchard/data_handler/diagnostic/synthetic.py
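A minimal stand-in for the generator described above can be written with NumPy alone. The key names (`train_images`, `val_labels`, ...) and the tiny val/test splits are assumptions about the NPZ layout, made only for illustration.

```python
import os
import tempfile

import numpy as np

def make_synthetic_npz(num_classes=8, samples=100, resolution=28, channels=3):
    # Random images and labels, saved in an assumed split-keyed NPZ layout.
    rng = np.random.default_rng(0)
    images = rng.integers(
        0, 256, size=(samples, resolution, resolution, channels), dtype=np.uint8
    )
    labels = rng.integers(0, num_classes, size=(samples,), dtype=np.int64)
    path = os.path.join(tempfile.mkdtemp(), "synthetic.npz")
    np.savez(
        path,
        train_images=images, train_labels=labels,
        val_images=images[:10], val_labels=labels[:10],
        test_images=images[:10], test_labels=labels[:10],
    )
    return path

path = make_synthetic_npz()
with np.load(path) as archive:
    print(sorted(archive.files))
```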
## `create_synthetic_grayscale_dataset(num_classes=8, samples=100, resolution=28)`
Create a synthetic grayscale NPZ dataset for testing.
Convenience function for creating single-channel (grayscale) synthetic data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `num_classes` | `int` | Number of classification categories. | `8` |
| `samples` | `int` | Number of training samples. | `100` |
| `resolution` | `int` | Image resolution (HxW). | `28` |
Returns:

| Name | Type | Description |
|---|---|---|
| `DatasetData` | `DatasetData` | A grayscale data object compatible with the pipeline. |
Source code in orchard/data_handler/diagnostic/synthetic.py
## `create_temp_loader(dataset_path, batch_size=_DEFAULT_HEALTHCHECK_BATCH_SIZE)`
Load an NPZ dataset lazily and return a DataLoader for health checks.
This avoids loading the entire dataset into RAM at once, which is critical for large datasets (e.g., 224x224 images).
Source code in orchard/data_handler/diagnostic/temp_loader.py
## `ensure_dataset_npz(metadata, retries=5, delay=5.0)`
Dispatcher that routes each dataset to its dedicated fetch pipeline.
Automatically detects the dataset type from `metadata.name` and delegates to the appropriate download/conversion module. Adding a new domain (e.g. a new resolution or source) only requires a new branch here and a corresponding fetch module.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `DatasetMetadata` | Metadata containing URL, MD5, name, and target path. | required |
| `retries` | `int` | Max number of download attempts (NPZ fetcher only). | `5` |
| `delay` | `float` | Delay in seconds between retries (NPZ fetcher only). | `5.0` |
Returns:

| Name | Type | Description |
|---|---|---|
| `Path` | `Path` | Path to the successfully validated `.npz` file. |
Source code in orchard/data_handler/fetcher.py
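The retry behavior described above can be sketched as a generic retry loop. This is an illustration of the pattern only; the real download and MD5-validation logic lives in orchard's fetch modules, and `fetch_with_retries` is a hypothetical helper name.

```python
import time

def fetch_with_retries(fetch, retries=5, delay=5.0):
    # Call fetch() up to `retries` times, sleeping `delay` seconds
    # between failed attempts; re-raise after the final failure.
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError as err:
            last_err = err
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"download failed after {retries} attempts") from last_err
```

Catching only `OSError` keeps genuine bugs (e.g. a `TypeError`) from being silently retried.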
## `load_dataset(metadata)`
Ensures the dataset is present and returns its metadata container.
## `get_dataloaders(metadata, dataset_cfg, training_cfg, aug_cfg, num_workers, is_optuna=False)`
Convenience function for creating train/val/test DataLoaders.
Wraps DataLoaderFactory for streamlined loader construction with automatic class balancing, hardware optimization, and Optuna support.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `DatasetData` | Dataset metadata from `load_dataset` (paths, splits). | required |
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config (splits, classes, resolution). | required |
| `training_cfg` | `TrainingConfig` | Training sub-config (batch size, seed). | required |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config (transforms pipeline). | required |
| `num_workers` | `int` | Resolved worker count from the hardware config. | required |
| `is_optuna` | `bool` | If True, use memory-conservative settings for hyperparameter tuning. | `False` |
Returns:

| Type | Description |
|---|---|
| `tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]` | A 3-tuple of `(train_loader, val_loader, test_loader)`. |
Example:

```python
>>> data = load_dataset(ds_meta)
>>> loaders = get_dataloaders(
...     data, cfg.dataset, cfg.training, cfg.augmentation, cfg.num_workers
... )
```
Source code in orchard/data_handler/loader.py
## `get_augmentations_description(aug_cfg, img_size, mixup_alpha, ds_meta=None)`

Generates a descriptive string of the configured augmentations for logging.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-configuration. | required |
| `img_size` | `int` | Target image size for the resized crop. | required |
| `mixup_alpha` | `float` | MixUp alpha (0.0 to disable). | required |
| `ds_meta` | `DatasetMetadata \| None` | Dataset metadata (if provided, domain flags are respected). | `None` |
Returns:

| Type | Description |
|---|---|
| `str` | Human-readable augmentation summary. |
Source code in orchard/data_handler/transforms.py
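A summary builder of this kind can be sketched as follows. The parameter names (`hflip`, `jitter`) are assumptions for illustration; the real function reads them from `AugmentationConfig` and the dataset metadata.

```python
def describe_augmentations(hflip: bool, jitter: float, mixup_alpha: float) -> str:
    # Collect one short token per active augmentation, then join for logging.
    parts = []
    if hflip:
        parts.append("hflip")
    if jitter > 0:
        parts.append(f"jitter={jitter}")
    parts.append(
        "mixup=off" if mixup_alpha == 0.0 else f"mixup(alpha={mixup_alpha})"
    )
    return ", ".join(parts)

print(describe_augmentations(True, 0.4, 0.2))
# hflip, jitter=0.4, mixup(alpha=0.2)
```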
## `get_pipeline_transforms(aug_cfg, img_size, ds_meta, *, force_rgb=True, norm_mean=None, norm_std=None)`
Constructs training and validation transformation pipelines.
Dynamically adapts to dataset characteristics (RGB vs Grayscale) and optionally promotes grayscale to 3-channel for pretrained-weight compatibility. Uses torchvision v2 transforms for improved CPU/GPU performance.
Pipeline Logic

- Convert to tensor format (`ToImage` + `ToDtype`)
- Promote 1-channel to 3-channel when `force_rgb` is True and the dataset is native grayscale
- Apply domain-aware augmentations (training only): geometric transforms disabled for anatomical datasets, color jitter reduced for texture-based datasets
- Normalize with dataset-specific statistics
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-configuration. | required |
| `img_size` | `int` | Target image size. | required |
| `ds_meta` | `DatasetMetadata` | Dataset metadata (channels, domain flags). | required |
| `force_rgb` | `bool` | Promote grayscale datasets to 3-channel RGB. | `True` |
| `norm_mean` | `tuple[float, ...] \| None` | Pre-computed normalization mean (from DatasetConfig); when None, computed from `ds_meta` + `force_rgb`. | `None` |
| `norm_std` | `tuple[float, ...] \| None` | Pre-computed normalization std (from DatasetConfig); when None, computed from `ds_meta` + `force_rgb`. | `None` |
Returns:

| Type | Description |
|---|---|
| `tuple[Compose, Compose]` | `(train_transform, val_transform)` as torchvision v2 `Compose` pipelines. |
Source code in orchard/data_handler/transforms.py
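The `norm_mean`/`norm_std` fallback described above can be pictured as a small resolver: pre-computed stats win, otherwise the channel count follows from the dataset and `force_rgb`. The 0.5 fallback values here are illustrative assumptions, not the library's actual defaults.

```python
def resolve_norm_stats(is_rgb: bool, force_rgb: bool, mean=None, std=None):
    # Pre-computed stats from the config take precedence.
    if mean is not None and std is not None:
        return mean, std
    # Otherwise derive the channel count: grayscale promoted to RGB
    # when force_rgb is set.
    channels = 3 if (is_rgb or force_rgb) else 1
    return (0.5,) * channels, (0.5,) * channels

print(resolve_norm_stats(is_rgb=False, force_rgb=True))
# ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
```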