fetcher

orchard.data_handler.fetcher

Dataset Fetching Dispatcher and Loading Interface.

Central entry point for dataset retrieval. Routes each dataset to its dedicated fetch module inside the fetchers/ sub-package and exposes the loading functions that return DatasetData containers.

Adding a new domain only requires a new branch in ensure_dataset_npz and a corresponding module in fetchers/.

DatasetData(path, name, is_rgb, num_classes) dataclass

Metadata container for a loaded dataset.

Stores path and format info instead of raw arrays to save RAM.
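
As an illustration, a container with this signature could be declared as follows. This is only a sketch: the field types, the meaning attached to each field, and the `frozen=True` setting are assumptions, since the source shows only the four constructor arguments.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)  # frozen is an assumption; the docs only say "dataclass"
class DatasetData:
    """Describes a dataset stored on disk without holding its arrays (illustrative)."""

    path: Path        # assumed: location of the validated .npz file
    name: str         # assumed: dataset identifier, e.g. "cifar10"
    is_rgb: bool      # assumed: True when images have three colour channels
    num_classes: int  # assumed: number of target classes


# Only a path and a few scalars are kept in memory; arrays stay on disk.
data = DatasetData(path=Path("data/cifar10.npz"), name="cifar10", is_rgb=True, num_classes=10)
```

Keeping only the path means a consumer can open the file lazily when the arrays are actually needed.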

ensure_dataset_npz(metadata, retries=5, delay=5.0)

Dispatcher that routes each dataset to its dedicated fetch pipeline.

Automatically detects dataset type from metadata.name and delegates to the appropriate download/conversion module. Adding a new domain (e.g. a new resolution or source) only requires a new branch here and a corresponding fetch module.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| metadata | DatasetMetadata | Metadata containing URL, MD5, name and target path. | required |
| retries | int | Max number of download attempts (NPZ fetcher only). | 5 |
| delay | float | Delay (seconds) between retries (NPZ fetcher only). | 5.0 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Path | Path | Path to the successfully validated .npz file. |
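
The `retries` and `delay` parameters suggest a retry loop around a download followed by an MD5 check. The sketch below shows one way such a loop can be written with the standard library only. It is not the code of the NPZ fetcher: `md5_of`, `fetch_with_retries` and the injected `download` callable are hypothetical names introduced for illustration.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional


def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def fetch_with_retries(
    download: Callable[[Path], None],  # hypothetical: writes the file to the given path
    target: Path,
    expected_md5: str,
    retries: int = 5,
    delay: float = 5.0,
) -> Path:
    """Download `target` until its MD5 matches, trying at most `retries` times."""
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            # Skip the download when a valid copy is already on disk.
            if not (target.exists() and md5_of(target) == expected_md5):
                download(target)
            if md5_of(target) == expected_md5:
                return target
            target.unlink()  # discard the corrupted file before the next attempt
            last_error = ValueError("MD5 mismatch")
        except OSError as exc:
            last_error = exc
        if attempt < retries:
            time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"could not fetch {target} after {retries} attempts") from last_error
```

Passing the download step in as a callable keeps the retry and validation logic independent of the transport, which also makes it easy to exercise with a fake downloader.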

Source code in orchard/data_handler/fetcher.py
def ensure_dataset_npz(
    metadata: DatasetMetadata,
    retries: int = 5,
    delay: float = 5.0,
) -> Path:
    """
    Dispatcher that routes each dataset to its dedicated fetch pipeline.

    Automatically detects dataset type from ``metadata.name`` and delegates
    to the appropriate download/conversion module. Adding a new domain
    (e.g. a new resolution or source) only requires a new branch here and
    a corresponding fetch module.

    Args:
        metadata (DatasetMetadata): Metadata containing URL, MD5, name and target path.
        retries (int): Max number of download attempts (NPZ fetcher only).
        delay (float): Delay (seconds) between retries (NPZ fetcher only).

    Returns:
        Path: Path to the successfully validated .npz file.
    """
    # Galaxy10 requires HDF5 download and conversion to NPZ
    if metadata.name == "galaxy10":
        from .fetchers import ensure_galaxy10_npz

        return ensure_galaxy10_npz(metadata)

    # CIFAR-10/100 via torchvision download and NPZ conversion
    if metadata.name in ("cifar10", "cifar100"):
        from .fetchers import ensure_cifar_npz

        return ensure_cifar_npz(metadata)

    # Default: standard NPZ download with retries and MD5 check
    from .fetchers import ensure_medmnist_npz

    return ensure_medmnist_npz(metadata, retries=retries, delay=delay)
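
To make the "one new branch plus one fetch module" rule concrete, the following self-contained sketch replays the dispatcher with stub fetchers and adds a branch for an invented dataset. The dataset name `"mydomain"`, the function `ensure_mydomain_npz` and the `_stub` helper are hypothetical; the stubs only report which pipeline was selected and do not download anything.

```python
from types import SimpleNamespace


def _stub(pipeline):
    """Build a stand-in fetcher that reports which pipeline handled the dataset."""
    def fetch(metadata, **kwargs):
        return f"{pipeline}:{metadata.name}"
    return fetch


# Stand-ins for the functions exported by the fetchers/ sub-package.
ensure_galaxy10_npz = _stub("galaxy10")
ensure_cifar_npz = _stub("cifar")
ensure_medmnist_npz = _stub("medmnist")
ensure_mydomain_npz = _stub("mydomain")  # hypothetical new fetch module


def ensure_dataset_npz(metadata, retries=5, delay=5.0):
    if metadata.name == "galaxy10":
        return ensure_galaxy10_npz(metadata)
    if metadata.name in ("cifar10", "cifar100"):
        return ensure_cifar_npz(metadata)
    # New branch: the only dispatcher change needed for the new domain.
    if metadata.name == "mydomain":
        return ensure_mydomain_npz(metadata)
    # Anything else falls through to the default NPZ download.
    return ensure_medmnist_npz(metadata, retries=retries, delay=delay)


names = ("galaxy10", "cifar100", "mydomain", "pathmnist")
routes = [ensure_dataset_npz(SimpleNamespace(name=n)) for n in names]
```

Because the default branch comes last, a new branch has to be placed before it; otherwise the new dataset would be handled by the standard NPZ download.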

load_dataset(metadata)

Ensures the dataset is present and returns its metadata container.

Source code in orchard/data_handler/fetcher.py
def load_dataset(metadata: DatasetMetadata) -> DatasetData:
    """
    Ensures the dataset is present and returns its metadata container.
    """
    return _load_and_inspect(metadata)

load_dataset_health_check(metadata, chunk_size=100)

Quick health-check: inspects only the first chunk_size samples.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| metadata | DatasetMetadata | Dataset metadata (URL, MD5, name, path). | required |
| chunk_size | int | Number of samples to inspect. | 100 |

Returns:

| Type | Description |
| --- | --- |
| DatasetData | DatasetData with format info derived from the chunk. |

Source code in orchard/data_handler/fetcher.py
def load_dataset_health_check(metadata: DatasetMetadata, chunk_size: int = 100) -> DatasetData:
    """
    Quick health-check: inspects only the first *chunk_size* samples.

    Args:
        metadata: Dataset metadata (URL, MD5, name, path).
        chunk_size: Number of samples to inspect (default 100).

    Returns:
        DatasetData with format info derived from the chunk.
    """
    return _load_and_inspect(metadata, chunk_size=chunk_size)
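
The body of `_load_and_inspect` is not shown on this page. As a rough idea of what inspecting only the first `chunk_size` samples can look like, here is a sketch using NumPy. The function name `inspect_first_chunk`, the returned dictionary, the array keys `train_images` and `train_labels`, and the rule used to detect RGB data are all assumptions rather than the project's implementation.

```python
import io

import numpy as np


def inspect_first_chunk(npz_file, chunk_size=100,
                        images_key="train_images", labels_key="train_labels"):
    """Derive basic format info from the first `chunk_size` samples (illustrative)."""
    with np.load(npz_file) as data:
        # An .npz member is read in full when accessed; the slice only limits
        # how many samples are inspected afterwards.
        images = data[images_key][:chunk_size]
        labels = data[labels_key][:chunk_size]
    # Assumed convention: RGB images are stored as (N, H, W, 3).
    is_rgb = images.ndim == 4 and images.shape[-1] == 3
    return {
        "n_inspected": int(images.shape[0]),
        "is_rgb": bool(is_rgb),
        "labels_seen": sorted(int(v) for v in np.unique(labels)),
    }


# Build a small in-memory archive so the sketch runs without any download.
buffer = io.BytesIO()
np.savez(
    buffer,
    train_images=np.zeros((250, 28, 28, 3), dtype=np.uint8),
    train_labels=np.arange(250) % 4,
)
buffer.seek(0)
report = inspect_first_chunk(buffer, chunk_size=100)
```

A chunk may not contain every class, so the labels seen in it are a lower bound on the class count and not a reliable substitute for a declared value.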