loader

orchard.data_handler.loader

Data Loader Orchestration Module.

Provides the DataLoaderFactory for building PyTorch DataLoaders with advanced features: class balancing via WeightedRandomSampler, hardware-aware configuration (workers, pinned memory), and Optuna-compatible resource management.

Architecture:

  • Factory Pattern: Centralizes DataLoader construction logic
  • Hardware Optimization: Adaptive workers and memory pinning (CUDA/MPS)
  • Class Balancing: WeightedRandomSampler for imbalanced datasets
  • Optuna Integration: Resource-conservative settings for hyperparameter tuning
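The class-balancing point above typically means deriving per-sample weights inversely proportional to class frequency and handing them to `torch.utils.data.WeightedRandomSampler`. A minimal sketch of that weighting logic (the helper name `balancing_weights` is an illustration, not this module's actual API):

```python
from collections import Counter

def balancing_weights(labels: list[int]) -> list[float]:
    """Per-sample weights inversely proportional to class frequency.

    The result can be passed to
    torch.utils.data.WeightedRandomSampler(weights, num_samples=len(labels)).
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# A 3:1 imbalanced toy label set: the minority sample gets 3x the weight,
# so both classes are drawn with roughly equal probability.
weights = balancing_weights([0, 0, 0, 1])
```

With these weights, each class contributes an expected total weight of 1.0 per epoch regardless of its raw count.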

Key Components:

  • DataLoaderFactory: Main orchestrator for train/val/test loader creation
  • get_dataloaders: Convenience function for direct loader retrieval

Example:

    >>> from orchard.data_handler import get_dataloaders, load_dataset
    >>> data = load_dataset(ds_meta)
    >>> train_loader, val_loader, test_loader = get_dataloaders(
    ...     data, cfg.dataset, cfg.training, cfg.augmentation, cfg.num_workers
    ... )
    >>> print(f"Batches: {len(train_loader)}")

DataLoaderFactory(dataset_cfg, training_cfg, aug_cfg, num_workers, metadata)

Orchestrates the creation of optimized PyTorch DataLoaders.

This factory centralizes the configuration of training, validation, and testing pipelines. It ensures that data transformations, class balancing, and hardware settings are synchronized across all splits.
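Synchronizing hardware settings across splits usually reduces to one shared kwargs dict passed to every `DataLoader`. A hypothetical sketch of what a helper like `_get_infrastructure_kwargs` might compute (the exact policy, thresholds, and signature here are assumptions):

```python
def infrastructure_kwargs(num_workers: int, cuda_available: bool, is_optuna: bool) -> dict:
    """Build DataLoader kwargs shared by train/val/test loaders.

    Optuna trials get resource-conservative settings: capped workers and
    no persistent workers, since many trials may run back to back.
    """
    workers = min(num_workers, 2) if is_optuna else num_workers  # assumed cap of 2
    return {
        "num_workers": workers,
        # Pinned host memory only pays off when copying batches to a CUDA device.
        "pin_memory": cuda_available,
        # Persistent workers avoid respawn cost between epochs but hold memory,
        # so they are disabled for Optuna trials (and meaningless with 0 workers).
        "persistent_workers": workers > 0 and not is_optuna,
    }

kwargs = infrastructure_kwargs(num_workers=8, cuda_available=True, is_optuna=True)
```

The same dict is then splatted into all three `DataLoader(...)` calls, which is what keeps the splits' infrastructure settings in lockstep.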

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config. |
| `training_cfg` | `TrainingConfig` | Training sub-config. |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config. |
| `num_workers` | `int` | Resolved worker count from hardware config. |
| `metadata` | `DatasetData` | Data path and raw format information. |
| `ds_meta` | `DatasetMetadata` | Official dataset registry specifications. |
| `logger` | `Logger` | Module-specific logger. |

Initializes the factory with environment and dataset metadata.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config (splits, classes, resolution). | required |
| `training_cfg` | `TrainingConfig` | Training sub-config (batch size, seed). | required |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config (transforms pipeline). | required |
| `num_workers` | `int` | Resolved worker count from hardware config. | required |
| `metadata` | `DatasetData` | Metadata from the data fetcher/downloader. | required |
Source code in orchard/data_handler/loader.py
def __init__(
    self,
    dataset_cfg: DatasetConfig,
    training_cfg: TrainingConfig,
    aug_cfg: AugmentationConfig,
    num_workers: int,
    metadata: DatasetData,
) -> None:
    """
    Initializes the factory with environment and dataset metadata.

    Args:
        dataset_cfg: Dataset sub-config (splits, classes, resolution).
        training_cfg: Training sub-config (batch size, seed).
        aug_cfg: Augmentation sub-config (transforms pipeline).
        num_workers: Resolved worker count from hardware config.
        metadata: Metadata from the data fetcher/downloader.
    """
    self.dataset_cfg = dataset_cfg
    self.training_cfg = training_cfg
    self.aug_cfg = aug_cfg
    self._num_workers = num_workers
    self.metadata = metadata

    wrapper = DatasetRegistryWrapper(resolution=dataset_cfg.resolution)
    self.ds_meta = wrapper.get_dataset(dataset_cfg.dataset_name)
    self.logger = logging.getLogger(LOGGER_NAME)

build(is_optuna=False)

Constructs and returns the full suite of DataLoaders.

Assembles train/val/test splits with transforms, optional class balancing, and hardware-aware infrastructure settings.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `is_optuna` | `bool` | If True, use memory-conservative settings for hyperparameter tuning (fewer workers, no persistent workers). | `False` |

Returns:

| Type | Description |
|------|-------------|
| `tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]` | A tuple of (train_loader, val_loader, test_loader). |

Source code in orchard/data_handler/loader.py
def build(
    self, is_optuna: bool = False
) -> tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]:
    """
    Constructs and returns the full suite of DataLoaders.

    Assembles train/val/test splits with transforms, optional class
    balancing, and hardware-aware infrastructure settings.

    Args:
        is_optuna: If True, use memory-conservative settings for
            hyperparameter tuning (fewer workers, no persistent workers).

    Returns:
        A tuple of (train_loader, val_loader, test_loader).
    """
    # 1. Setup transforms
    train_trans, val_trans = self._get_transformation_pipelines()

    # 2. Instantiate Dataset splits (lazy=mmap, eager=full RAM copy)
    _build = VisionDataset.lazy if self.dataset_cfg.lazy_loading else VisionDataset.from_npz
    ds_params = {"path": self.metadata.path, "seed": self.training_cfg.seed}

    train_ds = _build(
        **ds_params,
        split="train",
        transform=train_trans,
        max_samples=self.dataset_cfg.max_samples,
    )

    # Proportional downsizing for validation/testing if max_samples is set
    sub_samples = None
    if self.dataset_cfg.max_samples:
        sub_samples = max(
            _MIN_SUBSAMPLED_SPLIT,
            int(self.dataset_cfg.max_samples * self.dataset_cfg.val_ratio),
        )

    val_ds = _build(**ds_params, split="val", transform=val_trans, max_samples=sub_samples)
    test_ds = _build(**ds_params, split="test", transform=val_trans, max_samples=sub_samples)

    # 3. Resolve Sampler and Infrastructure
    sampler = self._get_balancing_sampler(train_ds)
    infra_kwargs = self._get_infrastructure_kwargs(is_optuna=is_optuna)

    # 4. Construct DataLoaders
    train_loader = DataLoader(
        train_ds,
        batch_size=self.training_cfg.batch_size,
        shuffle=(sampler is None),
        sampler=sampler,
        drop_last=True,
        **infra_kwargs,
    )

    val_loader = DataLoader(
        val_ds, batch_size=self.training_cfg.batch_size, shuffle=False, **infra_kwargs
    )

    test_loader = DataLoader(
        test_ds, batch_size=self.training_cfg.batch_size, shuffle=False, **infra_kwargs
    )

    optuna_str = " (Optuna)" if is_optuna else ""
    self.logger.info(
        "%s%s %-18s: (%s)%s → Train:[%d] Val:[%d] Test:[%d]",
        LogStyle.INDENT,
        LogStyle.ARROW,
        "DataLoaders",
        self.dataset_cfg.processing_mode,
        optuna_str,
        len(train_ds),
        len(val_ds),
        len(test_ds),
    )

    return train_loader, val_loader, test_loader

get_dataloaders(metadata, dataset_cfg, training_cfg, aug_cfg, num_workers, is_optuna=False)

Convenience function for creating train/val/test DataLoaders.

Wraps DataLoaderFactory for streamlined loader construction with automatic class balancing, hardware optimization, and Optuna support.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `metadata` | `DatasetData` | Dataset metadata from load_dataset (paths, splits). | required |
| `dataset_cfg` | `DatasetConfig` | Dataset sub-config (splits, classes, resolution). | required |
| `training_cfg` | `TrainingConfig` | Training sub-config (batch size, seed). | required |
| `aug_cfg` | `AugmentationConfig` | Augmentation sub-config (transforms pipeline). | required |
| `num_workers` | `int` | Resolved worker count from hardware config. | required |
| `is_optuna` | `bool` | If True, use memory-conservative settings for hyperparameter tuning. | `False` |

Returns:

| Type | Description |
|------|-------------|
| `tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]` | A 3-tuple of (train_loader, val_loader, test_loader). |

Example:

    >>> data = load_dataset(ds_meta)
    >>> loaders = get_dataloaders(
    ...     data, cfg.dataset, cfg.training, cfg.augmentation, cfg.num_workers
    ... )

Source code in orchard/data_handler/loader.py
def get_dataloaders(
    metadata: DatasetData,
    dataset_cfg: DatasetConfig,
    training_cfg: TrainingConfig,
    aug_cfg: AugmentationConfig,
    num_workers: int,
    is_optuna: bool = False,
) -> tuple[DataLoader[Any], DataLoader[Any], DataLoader[Any]]:
    """
    Convenience function for creating train/val/test DataLoaders.

    Wraps DataLoaderFactory for streamlined loader construction with
    automatic class balancing, hardware optimization, and Optuna support.

    Args:
        metadata: Dataset metadata from load_dataset (paths, splits).
        dataset_cfg: Dataset sub-config (splits, classes, resolution).
        training_cfg: Training sub-config (batch size, seed).
        aug_cfg: Augmentation sub-config (transforms pipeline).
        num_workers: Resolved worker count from hardware config.
        is_optuna: If True, use memory-conservative settings for
            hyperparameter tuning.

    Returns:
        A 3-tuple of (train_loader, val_loader, test_loader).

    Example:
        >>> data = load_dataset(ds_meta)
        >>> loaders = get_dataloaders(
        ...     data, cfg.dataset, cfg.training, cfg.augmentation, cfg.num_workers
        ... )
    """
    factory = DataLoaderFactory(dataset_cfg, training_cfg, aug_cfg, num_workers, metadata)
    return factory.build(is_optuna=is_optuna)