# orchard.data_handler

Data Handler Package.

This package manages the end-to-end data pipeline, from downloading raw NPZ files using the Dataset Registry to providing fully configured PyTorch DataLoaders.
`VisionDataset(images, labels, *, transform=None)`

Bases: `Dataset[tuple[Tensor, Tensor]]`

PyTorch Dataset for NPZ-based image data.

The constructor accepts raw NumPy arrays directly (no I/O). Use the classmethod factories to load from disk:

- `VisionDataset.from_npz(...)`: eager, loads the full split into RAM.
- `VisionDataset.lazy(...)`: memory-mapped, pages loaded on demand.

Initializes the dataset from pre-loaded arrays.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `NDArray[Any]` | Image array. | required |
| `labels` | `NDArray[Any]` | Label array; any shape that flattens to a 1-D vector. | required |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |

Source code in `orchard/data_handler/dataset.py`
`from_npz(path, split='train', *, transform=None, max_samples=None, seed=DEFAULT_SEED)` (classmethod)

Eagerly load a split from an NPZ archive into RAM.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the dataset NPZ archive. | required |
| `split` | `str` | Dataset split to load. | `'train'` |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
| `max_samples` | `int \| None` | If set, limits the number of samples (subsampling). | `None` |
| `seed` | `int` | Random seed for deterministic subsampling. | `DEFAULT_SEED` |

Source code in `orchard/data_handler/dataset.py`
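The deterministic subsampling that `max_samples` and `seed` provide can be sketched as follows. This is a minimal illustration, not the package's actual implementation; the use of `numpy.random.default_rng` is an assumption.

```python
import numpy as np

def subsample(images: np.ndarray, labels: np.ndarray, max_samples: int, seed: int):
    """Pick a reproducible random subset of the data.

    The same seed always yields the same subset, which keeps
    experiments comparable across runs.
    """
    rng = np.random.default_rng(seed)  # seeded generator -> deterministic choice
    n = min(max_samples, len(images))
    idx = rng.choice(len(images), size=n, replace=False)
    return images[idx], labels[idx]

images = np.arange(100).reshape(100, 1)
labels = np.arange(100)
a_img, a_lab = subsample(images, labels, max_samples=10, seed=42)
b_img, b_lab = subsample(images, labels, max_samples=10, seed=42)
assert (a_img == b_img).all() and (a_lab == b_lab).all()  # same seed, same subset
```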
`lazy(path, split='train', *, transform=None, max_samples=None, seed=DEFAULT_SEED)` (classmethod)

Memory-mapped load from an NPZ archive (no full RAM copy).

Images are loaded page-by-page on demand. Suitable for large datasets that do not fit in RAM and for lightweight health checks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the dataset NPZ archive. | required |
| `split` | `str` | Dataset split to load. | `'train'` |
| `transform` | `Compose \| None` | Pipeline of Torchvision transforms. | `None` |
| `max_samples` | `int \| None` | If set, limits the number of samples (subsampling). | `None` |
| `seed` | `int` | Random seed for deterministic subsampling. | `DEFAULT_SEED` |

Source code in `orchard/data_handler/dataset.py`
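The exact paging mechanism used by `lazy` is not reproduced here, but the general memory-mapping idea can be shown with plain NumPy. Note that compressed `.npz` members cannot be memory-mapped directly; memory mapping applies to uncompressed `.npy` data, which is the assumption in this sketch (the file name is illustrative).

```python
import tempfile
from pathlib import Path
import numpy as np

# Write an array to disk as an uncompressed .npy file (mmap-able).
tmp = Path(tempfile.mkdtemp())
arr = np.random.default_rng(0).integers(0, 255, size=(1000, 28, 28), dtype=np.uint8)
np.save(tmp / "train_images.npy", arr)

# Memory-map instead of reading everything: pages are faulted in
# lazily as individual samples are sliced.
mm = np.load(tmp / "train_images.npy", mmap_mode="r")
sample = np.asarray(mm[42])  # materialize just one sample in RAM

assert isinstance(mm, np.memmap)
assert sample.shape == (28, 28)
```

This is why lazy loading suits large datasets and quick health checks: only the samples actually touched are ever paged into memory.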
`__len__()`

`__getitem__(idx)`

Retrieves a standardized sample-label pair.

The image is converted to a PIL object to ensure compatibility with Torchvision v2 transforms before being returned as a PyTorch Tensor.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Sample index. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | A pair of `(image, label)` where `image` is a `Tensor` and `label` is a scalar long tensor. |

Source code in `orchard/data_handler/dataset.py`
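The NumPy-to-PIL conversion step can be illustrated with Pillow alone (the tensor conversion and transform pipeline are omitted; the sample values are synthetic):

```python
import numpy as np
from PIL import Image

# A fake 28x28 grayscale sample, as it might come out of an NPZ array.
raw = np.random.default_rng(0).integers(0, 255, size=(28, 28), dtype=np.uint8)

# Torchvision v2 transforms accept PIL images, so the raw array is
# wrapped in a PIL object before the transform pipeline runs.
pil = Image.fromarray(raw)

# Round-trip back to numpy: pixel values survive unchanged.
back = np.asarray(pil)
assert (back == raw).all()
assert pil.mode == "L"  # single-channel grayscale
```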
`DetectionDataset(images, annotations, *, transform=None)`

Bases: `Dataset[tuple[Tensor, dict[str, Tensor]]]`

PyTorch Dataset for detection tasks with bounding-box annotations.

Each sample returns an `(image, target)` pair where `target` is a dict with `boxes` (N, 4) in `[x1, y1, x2, y2]` format and `labels` (N,) as int64 class indices.

Initialize from pre-loaded arrays and an annotation list.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `NDArray[Any]` | Image array. | required |
| `annotations` | `list[dict[str, NDArray[Any]]]` | Per-image annotation dicts with `boxes` and `labels`. | required |
| `transform` | `Compose \| None` | Torchvision transform pipeline for images. | `None` |

Source code in `orchard/data_handler/detection_dataset.py`
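The expected annotation layout can be made concrete with a small validity check. NumPy arrays stand in for the tensors here; the helper and its values are illustrative, not part of the package.

```python
import numpy as np

def validate_annotation(ann: dict) -> None:
    """Sanity-check one per-image annotation dict.

    boxes: (N, 4) in [x1, y1, x2, y2] order; labels: (N,) integer class ids.
    """
    boxes, labels = ann["boxes"], ann["labels"]
    assert boxes.ndim == 2 and boxes.shape[1] == 4          # (N, 4)
    assert labels.shape == (boxes.shape[0],)                # one label per box
    # Every box must satisfy x2 > x1 and y2 > y1.
    assert (boxes[:, 2] > boxes[:, 0]).all()
    assert (boxes[:, 3] > boxes[:, 1]).all()

ann = {
    "boxes": np.array([[10, 20, 50, 60], [0, 0, 5, 5]], dtype=np.float32),
    "labels": np.array([3, 1], dtype=np.int64),
}
validate_annotation(ann)  # passes for a well-formed annotation
```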
`from_arrays(images, annotations, *, transform=None, max_samples=None, seed=DEFAULT_SEED)` (classmethod)

Build a DetectionDataset with optional subsampling.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `NDArray[Any]` | Image array. | required |
| `annotations` | `list[dict[str, NDArray[Any]]]` | Per-image annotation dicts. | required |
| `transform` | `Compose \| None` | Transform pipeline. | `None` |
| `max_samples` | `int \| None` | Limit on the number of samples. | `None` |
| `seed` | `int` | Random seed for deterministic subsampling. | `DEFAULT_SEED` |

Source code in `orchard/data_handler/detection_dataset.py`
`from_npz(image_path, annotation_path, split='train', *, transform=None, max_samples=None, seed=DEFAULT_SEED)` (classmethod)

Load a detection dataset from two NPZ archives: one for images, one for annotations.

The image NPZ has key `{split}_images`. The annotation NPZ has keys `{split}_boxes` (list of `(N_i, 4)` arrays) and `{split}_labels` (list of `(N_i,)` arrays), stored as object arrays.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `image_path` | `Path` | Path to the images NPZ. | required |
| `annotation_path` | `Path` | Path to the annotations NPZ. | required |
| `split` | `str` | Dataset split to load. | `'train'` |
| `transform` | `Compose \| None` | Transform pipeline. | `None` |
| `max_samples` | `int \| None` | Limit on the number of samples. | `None` |
| `seed` | `int` | Random seed. | `DEFAULT_SEED` |

Source code in `orchard/data_handler/detection_dataset.py`
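The object-array storage described above can be reproduced with NumPy directly. The key names follow the documented `{split}_boxes` / `{split}_labels` convention; the file name and box values are illustrative.

```python
import tempfile
from pathlib import Path
import numpy as np

tmp = Path(tempfile.mkdtemp())

# Ragged per-image boxes/labels are stored as object arrays so each
# image can carry a different number of boxes (N_i varies).
boxes = np.empty(2, dtype=object)
boxes[0] = np.array([[0, 0, 10, 10]], dtype=np.float32)                 # 1 box
boxes[1] = np.array([[5, 5, 20, 20], [1, 2, 3, 4]], dtype=np.float32)   # 2 boxes
labels = np.empty(2, dtype=object)
labels[0] = np.array([1], dtype=np.int64)
labels[1] = np.array([0, 2], dtype=np.int64)

np.savez(tmp / "annotations.npz", train_boxes=boxes, train_labels=labels)

# Object arrays are pickled inside the archive, so loading them back
# requires allow_pickle=True.
with np.load(tmp / "annotations.npz", allow_pickle=True) as npz:
    loaded_boxes = npz["train_boxes"]
    loaded_labels = npz["train_labels"]

assert loaded_boxes[1].shape == (2, 4)
assert loaded_labels[1].shape == (2,)
```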
`__len__()`

`__getitem__(idx)`

Retrieve an image and its bounding-box annotations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Sample index. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, dict[str, Tensor]]` | Tuple of `(image_tensor, target_dict)` where `target_dict` holds the `boxes` and `labels` tensors. |

Source code in `orchard/data_handler/detection_dataset.py`
`DatasetData(path, name, is_rgb, num_classes, annotation_path=None)` (dataclass)

Metadata container for a loaded dataset.

Stores path and format info instead of raw arrays to save RAM.

Attributes:

| Name | Type | Description |
|---|---|---|
| `path` | `Path` | Path to the images NPZ. |
| `name` | `str` | Dataset identifier. |
| `is_rgb` | `bool` | Whether images are RGB (3 channels). |
| `num_classes` | `int` | Number of categories. |
| `annotation_path` | `Path \| None` | Path to the annotations NPZ (detection datasets only; `None` for classification). |
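Based on the attributes above, the container can be pictured as a plain dataclass. This is a sketch matching the documented fields, not necessarily the exact source; the path and dataset name in the usage line are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

@dataclass
class DatasetData:
    path: Path                            # images NPZ on disk
    name: str                             # dataset identifier
    is_rgb: bool                          # 3-channel RGB vs grayscale
    num_classes: int                      # number of categories
    annotation_path: Path | None = None   # detection datasets only

# Classification datasets leave annotation_path as None.
meta = DatasetData(Path("/tmp/example.npz"), "example", True, 8)
assert meta.annotation_path is None
```

Keeping only paths and format flags here (rather than the arrays themselves) is what lets the pipeline defer the heavy I/O to the dataset factories.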
`DataLoaderFactory(dataset_cfg, training_cfg, aug_cfg, num_workers, metadata, task_type='classification')`

Orchestrates the creation of optimized PyTorch DataLoaders.

This factory centralizes the configuration of the training, validation, and testing pipelines. It ensures that data transformations, class balancing, and hardware settings are synchronized across all splits.

Attributes:

| Name | Type | Description |
|---|---|---|
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config. |
| `training_cfg` | `TrainingConfig` | Training sub-config. |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config. |
| `num_workers` | `int` | Resolved worker count from the hardware config. |
| `metadata` | `DatasetData` | Data path and raw format information. |
| `ds_meta` | `DatasetMetadata` | Official dataset registry specifications. |
| `logger` | `Logger` | Module-specific logger. |

Initializes the factory with environment and dataset metadata.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config (splits, classes, resolution). | required |
| `training_cfg` | `TrainingConfig` | Training sub-config (batch size, seed). | required |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config (transforms pipeline). | required |
| `num_workers` | `int` | Resolved worker count from the hardware config. | required |
| `metadata` | `DatasetData` | Metadata from the data fetcher/downloader. | required |
| `task_type` | `str` | Task type (`'classification'` or `'detection'`). | `'classification'` |

Source code in `orchard/data_handler/loader.py`
`build(is_optuna=False)`

Constructs and returns the full suite of DataLoaders.

Assembles train/val/test splits with transforms, optional class balancing, and hardware-aware infrastructure settings.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `is_optuna` | `bool` | If True, use memory-conservative settings for hyperparameter tuning (fewer workers, no persistent workers). | `False` |

Returns:

| Type | Description |
|---|---|
| `tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]` | A tuple of `(train_loader, val_loader, test_loader)`. |

Source code in `orchard/data_handler/loader.py`
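The memory-conservative switch can be sketched as a small kwargs selector. The cap of two workers is a hypothetical policy for illustration; the exact settings chosen inside `build` are not reproduced here.

```python
def loader_kwargs(num_workers: int, is_optuna: bool) -> dict:
    """Choose DataLoader infrastructure settings.

    During Optuna trials many loaders exist concurrently, so worker
    processes are capped and not kept alive between epochs.
    """
    workers = min(num_workers, 2) if is_optuna else num_workers
    return {
        "num_workers": workers,
        # Persistent workers only pay off outside short-lived trials.
        "persistent_workers": workers > 0 and not is_optuna,
        "pin_memory": True,
    }

full = loader_kwargs(8, is_optuna=False)
trial = loader_kwargs(8, is_optuna=True)
assert full["num_workers"] == 8 and full["persistent_workers"]
assert trial["num_workers"] == 2 and not trial["persistent_workers"]
```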
`detection_collate_fn(batch)`

Collate detection samples into list-based batches.

Unlike the default PyTorch collate (which stacks tensors), detection requires list-based batching because each image can have a different number of bounding boxes.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch` | `list[tuple[Tensor, dict[str, Any]]]` | List of `(image, target_dict)` tuples from the dataset. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[list[Tensor], list[dict[str, Any]]]` | Tuple of (list of image tensors, list of target dicts). |

Source code in `orchard/data_handler/collate.py`
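The mechanics of list-based batching fit in a few lines. This sketch uses plain strings and dicts in place of tensors so the ragged-batch behavior is visible; it illustrates the idea rather than reproducing the package's function.

```python
def detection_collate(batch):
    """Keep images and targets as lists instead of stacking.

    Stacking would fail here: each target's 'boxes' entry has a
    different first dimension (variable box count per image).
    """
    images, targets = zip(*batch)
    return list(images), list(targets)

batch = [
    ("img0", {"boxes": [[0, 0, 1, 1]], "labels": [1]}),
    ("img1", {"boxes": [[0, 0, 2, 2], [1, 1, 3, 3]], "labels": [0, 2]}),
]
images, targets = detection_collate(batch)
assert images == ["img0", "img1"]
assert len(targets[1]["boxes"]) == 2  # ragged box counts preserved
```

Passing such a function as `collate_fn` to a DataLoader is the standard way to batch variable-length detection targets.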
`show_sample_images(loader, save_path, *, mean=None, std=None, arch_name='Model', fig_dpi=_DEFAULT_DPI, num_samples=16, title_prefix=None)`

Extract a batch from the DataLoader and save a grid of sample images.

Saves images with their corresponding labels to verify data integrity and augmentations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `loader` | `DataLoader[Any]` | The PyTorch DataLoader to sample from. | required |
| `save_path` | `Path` | Full path (including filename) for the resulting image. | required |
| `mean` | `tuple[float, ...] \| None` | Per-channel mean for denormalization. | `None` |
| `std` | `tuple[float, ...] \| None` | Per-channel std for denormalization. | `None` |
| `arch_name` | `str` | Architecture name for the figure title. | `'Model'` |
| `fig_dpi` | `int` | DPI for the saved figure. | `_DEFAULT_DPI` |
| `num_samples` | `int` | Number of images to display in the grid. | `16` |
| `title_prefix` | `str \| None` | Optional string to prepend to the figure title. | `None` |

Source code in `orchard/data_handler/data_explorer.py`
`show_samples_for_dataset(loader, dataset_name, run_paths, *, mean=None, std=None, arch_name='Model', fig_dpi=_DEFAULT_DPI, num_samples=16, resolution=None)`

Generate a grid of sample images from a dataset and save it to the figures directory.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `loader` | `DataLoader[Any]` | PyTorch DataLoader to sample images from. | required |
| `dataset_name` | `str` | Name of the dataset, used in the filename and title. | required |
| `run_paths` | `RunPaths` | RunPaths instance to resolve the figure saving path. | required |
| `mean` | `tuple[float, ...] \| None` | Per-channel mean for denormalization. | `None` |
| `std` | `tuple[float, ...] \| None` | Per-channel std for denormalization. | `None` |
| `arch_name` | `str` | Architecture name for the figure title. | `'Model'` |
| `fig_dpi` | `int` | DPI for the saved figure. | `_DEFAULT_DPI` |
| `num_samples` | `int` | Number of images to include in the grid. | `16` |
| `resolution` | `int \| None` | Resolution to include in the filename to avoid overwriting. | `None` |

Source code in `orchard/data_handler/data_explorer.py`
`create_synthetic_dataset(num_classes=8, samples=100, resolution=28, channels=3, name='syntheticmnist')`

Create a synthetic NPZ-compatible dataset for testing.

This function generates random image data and labels, saves them to a temporary `.npz` file, and returns a `DatasetData` object that can be used with the existing data pipeline.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `num_classes` | `int` | Number of target categories. | `8` |
| `samples` | `int` | Number of training samples. | `100` |
| `resolution` | `int` | Image resolution (HxW). | `28` |
| `channels` | `int` | Number of color channels (3 for RGB). | `3` |
| `name` | `str` | Dataset name for identification. | `'syntheticmnist'` |

Returns:

| Name | Type | Description |
|---|---|---|
| `DatasetData` | `DatasetData` | A data object compatible with the existing pipeline. |

Example:

```python
data = create_synthetic_dataset(num_classes=8, samples=100)
train_loader, val_loader, test_loader = get_dataloaders(
    data, cfg.dataset, cfg.training, cfg.augmentation, cfg.num_workers
)
```

Source code in `orchard/data_handler/diagnostic/synthetic.py`
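The core of such a generator can be sketched with NumPy alone. The `train_images` / `train_labels` key names follow the `{split}_*` convention documented for the detection loader and are an assumption here; the file name is illustrative.

```python
import tempfile
from pathlib import Path
import numpy as np

def make_synthetic_npz(num_classes=8, samples=100, resolution=28, channels=3) -> Path:
    """Write a random classification dataset to a temporary NPZ file."""
    rng = np.random.default_rng(0)
    images = rng.integers(
        0, 256, size=(samples, resolution, resolution, channels), dtype=np.uint8
    )
    labels = rng.integers(0, num_classes, size=samples, dtype=np.int64)
    path = Path(tempfile.mkdtemp()) / "synthetic.npz"
    # Key names assumed to follow the {split}_images / {split}_labels layout.
    np.savez(path, train_images=images, train_labels=labels)
    return path

path = make_synthetic_npz()
with np.load(path) as npz:
    img_shape = npz["train_images"].shape
    max_label = int(npz["train_labels"].max())
assert img_shape == (100, 28, 28, 3)
assert max_label < 8
```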
`create_synthetic_grayscale_dataset(num_classes=8, samples=100, resolution=28)`

Create a synthetic grayscale NPZ dataset for testing.

Convenience function for creating single-channel (grayscale) synthetic data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `num_classes` | `int` | Number of target categories. | `8` |
| `samples` | `int` | Number of training samples. | `100` |
| `resolution` | `int` | Image resolution (HxW). | `28` |

Returns:

| Name | Type | Description |
|---|---|---|
| `DatasetData` | `DatasetData` | A grayscale data object compatible with the pipeline. |

Source code in `orchard/data_handler/diagnostic/synthetic.py`
`create_temp_loader(dataset_path, batch_size=_DEFAULT_HEALTHCHECK_BATCH_SIZE)`

Load an NPZ dataset lazily and return a DataLoader for health checks.

This avoids loading the entire dataset into RAM at once, which is critical for large datasets (e.g., 224x224 images).

Source code in `orchard/data_handler/diagnostic/temp_loader.py`
`ensure_dataset_npz(metadata, retries=5, delay=5.0)`

Dispatcher that routes each dataset to its dedicated fetch pipeline.

Automatically detects the dataset type from `metadata.name` and delegates to the appropriate download/conversion module. Adding a new domain (e.g., a new resolution or source) only requires a new branch here and a corresponding fetch module.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `DatasetMetadata` | Metadata containing URL, MD5, name, and target path. | required |
| `retries` | `int` | Maximum number of download attempts (NPZ fetcher only). | `5` |
| `delay` | `float` | Delay in seconds between retries (NPZ fetcher only). | `5.0` |

Returns:

| Name | Type | Description |
|---|---|---|
| `Path` | `Path` | Path to the successfully validated `.npz` file. |

Source code in `orchard/data_handler/dispatcher.py`
`load_dataset(metadata)`

Ensure the dataset is present on disk and return its inspection results.

Downloads the dataset if missing, then inspects all samples to determine format properties (color mode, class count).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `DatasetMetadata` | Dataset metadata (URL, MD5, name, path). | required |

Returns:

| Type | Description |
|---|---|
| `DatasetData` | Inspection results with format info derived from the full dataset. |

Source code in `orchard/data_handler/dispatcher.py`
`get_dataloaders(metadata, dataset_cfg, training_cfg, aug_cfg, num_workers, is_optuna=False, task_type='classification')`

Convenience function for creating train/val/test DataLoaders.

Wraps DataLoaderFactory for streamlined loader construction with automatic class balancing, hardware optimization, and Optuna support.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `DatasetData` | Dataset metadata from `load_dataset` (paths, splits). | required |
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config (splits, classes, resolution). | required |
| `training_cfg` | `TrainingConfig` | Training sub-config (batch size, seed). | required |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config (transforms pipeline). | required |
| `num_workers` | `int` | Resolved worker count from the hardware config. | required |
| `is_optuna` | `bool` | If True, use memory-conservative settings for hyperparameter tuning. | `False` |
| `task_type` | `str` | Task type (`'classification'` or `'detection'`). | `'classification'` |

Returns:

| Type | Description |
|---|---|
| `tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]` | A 3-tuple of `(train_loader, val_loader, test_loader)`. |

Example:

```python
data = load_dataset(ds_meta)
loaders = get_dataloaders(
    data, cfg.dataset, cfg.training, cfg.augmentation, cfg.num_workers
)
```

Source code in `orchard/data_handler/loader.py`
`get_augmentations_description(aug_cfg, img_size, mixup_alpha, ds_meta=None)`

Generates a descriptive string of the active augmentations for logging.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-configuration. | required |
| `img_size` | `int` | Target image size for the resized crop. | required |
| `mixup_alpha` | `float` | MixUp alpha (0.0 to disable). | required |
| `ds_meta` | `DatasetMetadata \| None` | Dataset metadata (if provided, domain flags are respected). | `None` |

Returns:

| Type | Description |
|---|---|
| `str` | Human-readable augmentation summary. |

Source code in `orchard/data_handler/transforms.py`
`get_pipeline_transforms(aug_cfg, img_size, ds_meta, *, force_rgb=True, norm_mean=None, norm_std=None)`

Constructs training and validation transformation pipelines.

Dynamically adapts to dataset characteristics (RGB vs. grayscale) and optionally promotes grayscale to 3-channel for pretrained-weight compatibility. Uses torchvision v2 transforms for improved CPU/GPU performance.

Pipeline logic:

- Convert to tensor format (`ToImage` + `ToDtype`)
- Promote 1-channel to 3-channel when `force_rgb` is True and the dataset is native grayscale
- Apply domain-aware augmentations (training only): geometric transforms disabled for anatomical datasets, color jitter reduced for texture-based datasets
- Normalize with dataset-specific statistics

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-configuration. | required |
| `img_size` | `int` | Target image size. | required |
| `ds_meta` | `DatasetMetadata` | Dataset metadata (channels, domain flags). | required |
| `force_rgb` | `bool` | Promote grayscale datasets to 3-channel RGB. | `True` |
| `norm_mean` | `tuple[float, ...] \| None` | Pre-computed normalization mean (from DatasetConfig). When None, computed from `ds_meta` + `force_rgb`. | `None` |
| `norm_std` | `tuple[float, ...] \| None` | Pre-computed normalization std (from DatasetConfig). When None, computed from `ds_meta` + `force_rgb`. | `None` |

Returns:

| Type | Description |
|---|---|
| `tuple[Compose, Compose]` | `(train_transform, val_transform)` as `v2.Compose` pipelines. |

Source code in `orchard/data_handler/transforms.py`
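The grayscale-to-RGB promotion step can be illustrated in NumPy. Inside the actual pipeline this is done by a torchvision v2 transform on tensors; the channel-replication below shows the same effect on a raw array and is only a sketch.

```python
import numpy as np

def promote_to_rgb(img: np.ndarray) -> np.ndarray:
    """Replicate a single channel three times: (H, W, 1) -> (H, W, 3).

    Pretrained backbones expect 3-channel input, so native grayscale
    datasets are promoted when force_rgb is enabled.
    """
    assert img.ndim == 3 and img.shape[-1] == 1
    return np.repeat(img, 3, axis=-1)

gray = np.random.default_rng(0).integers(0, 255, size=(28, 28, 1), dtype=np.uint8)
rgb = promote_to_rgb(gray)
assert rgb.shape == (28, 28, 3)
assert (rgb[..., 0] == rgb[..., 1]).all()  # all three channels identical
```

Because the three channels are identical copies, per-channel normalization statistics must also be expanded to three values when a grayscale dataset is promoted, which is why `norm_mean` / `norm_std` depend on both `ds_meta` and `force_rgb`.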