dataset_config

orchard.core.config.dataset_config

Dataset Registry Orchestration & Metadata Resolution.

Bridges static dataset metadata with runtime execution requirements. Normalizes datasets regardless of native format (Grayscale/RGB) so they meet the input specification of the model architecture. Supports multiple resolutions (28x28 through 224x224) and honors YAML overrides, while the resulting config stays frozen (immutable).

Key Responsibilities
  • Adaptive normalization: Expands mean/std to match the effective channel count
  • Feature promotion: Promotes grayscale inputs to RGB for ImageNet-pretrained weights
  • Resource budgeting: Enforces sampling limits and class balancing
  • Multi-resolution support: Resolves metadata by selected resolution

DatasetConfig

Bases: BaseModel

Validated manifest for dataset execution context.

Bridges static registry metadata with runtime preferences. Resolves channel promotion and sampling policies with multi-resolution support. Auto-syncs img_size with resolution when not explicitly set.

Attributes:

name (str): Dataset identifier from registry (e.g., 'bloodmnist', 'organcmnist').

metadata (DatasetMetadata | None): DatasetMetadata object (excluded from serialization).

data_root (ValidatedPath): Root directory containing dataset files.

use_weighted_sampler (bool): Enable class-balanced sampling for imbalanced datasets.

max_samples (PositiveInt | None): Maximum samples to load (None=all).

val_ratio (Probability): Fraction of max_samples for val/test splits (default 0.10).

img_size (ImageSize | None): Target square resolution for model input (auto-synced).

force_rgb (bool): Convert grayscale to RGB for pretrained ImageNet weights.

resolution (int): Target resolution variant (28, 32, 64, 128, or 224).

lazy_loading (bool): Use memory-mapped loading instead of eagerly loading into RAM.

dataset_name property

Get dataset identifier from metadata.

Returns:

Type Description
str

Dataset name string (e.g., 'bloodmnist').

num_classes property

Get number of target classes.

Returns:

Type Description
int

Integer count of classification classes.

in_channels property

Get native input channels from metadata.

Returns:

Type Description
int

1 for grayscale datasets, 3 for RGB.

effective_in_channels property

Get actual channels the model receives after promotion.

Returns:

Type Description
int

3 if force_rgb enabled, otherwise native in_channels.
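The rule above can be sketched as a standalone function. This is a minimal illustration of the documented behavior, not the library's source; the function name `resolve_effective_channels` is hypothetical.

```python
def resolve_effective_channels(in_channels: int, force_rgb: bool) -> int:
    """Return the channel count the model receives after optional promotion."""
    # Documented rule: 3 if force_rgb is enabled, otherwise the native count.
    return 3 if force_rgb else in_channels


promoted = resolve_effective_channels(1, force_rgb=True)      # grayscale, promoted
native_gray = resolve_effective_channels(1, force_rgb=False)  # grayscale, untouched
native_rgb = resolve_effective_channels(3, force_rgb=False)   # already RGB
print(promoted, native_gray, native_rgb)  # 3 1 3
```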

mean property

Get channel-wise normalization mean.

Expands single-channel mean to 3 channels if force_rgb is enabled on a grayscale dataset.

Returns:

Type Description
tuple[float, ...]

tuple of mean values per channel.

std property

Get channel-wise normalization standard deviation.

Expands single-channel std to 3 channels if force_rgb is enabled on a grayscale dataset.

Returns:

Type Description
tuple[float, ...]

tuple of std values per channel.
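A rough sketch of how the expansion described for mean and std could work, assuming the single statistic is simply replicated across three channels. The helper name `expand_stats` and the sample values are hypothetical.

```python
def expand_stats(
    stats: tuple[float, ...], in_channels: int, force_rgb: bool
) -> tuple[float, ...]:
    """Replicate a single-channel statistic to 3 channels when promoting to RGB."""
    # Only a grayscale dataset with force_rgb enabled needs expansion.
    if force_rgb and in_channels == 1 and len(stats) == 1:
        return stats * 3  # (m,) -> (m, m, m)
    # RGB datasets, or grayscale without promotion, keep their stats as-is.
    return stats


gray_mean = (0.5,)  # illustrative value, not taken from any registry entry
print(expand_stats(gray_mean, in_channels=1, force_rgb=True))   # (0.5, 0.5, 0.5)
print(expand_stats(gray_mean, in_channels=1, force_rgb=False))  # (0.5,)
```

Replicating the statistic keeps normalization identical on each promoted channel, which matches the fact that the three channels carry the same grayscale signal.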

effective_is_anatomical property

Whether the dataset is anatomical (conservative default: True).

Returns:

Type Description
bool

Metadata flag if available, else True (safest for augmentation).

effective_is_texture_based property

Whether the dataset is texture-based (conservative default: True).

Returns:

Type Description
bool

Metadata flag if available, else True (safest for augmentation).

processing_mode property

Get description of channel processing mode.

Returns:

Type Description
str

'NATIVE-RGB' for RGB datasets, 'RGB-PROMOTED' for grayscale with force_rgb, or 'NATIVE-GRAY' for grayscale without promotion.
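The three documented modes can be sketched as a small decision function. The name `describe_processing_mode` is hypothetical, and the branch order is an assumption consistent with the description above.

```python
def describe_processing_mode(in_channels: int, force_rgb: bool) -> str:
    """Map native channels and the promotion flag to a mode label."""
    if in_channels == 3:
        return "NATIVE-RGB"    # already RGB, promotion is irrelevant
    if force_rgb:
        return "RGB-PROMOTED"  # grayscale expanded to 3 channels
    return "NATIVE-GRAY"       # grayscale passed through unchanged


for channels, rgb in [(3, False), (1, True), (1, False)]:
    print(channels, rgb, describe_processing_mode(channels, rgb))
```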

validate_resolution(v) classmethod

Enforce resolution against supported registry values.

Source code in orchard/core/config/dataset_config.py
@field_validator("resolution")
@classmethod
def validate_resolution(cls, v: int) -> int:
    """
    Enforce resolution against supported registry values.
    """
    if v not in SUPPORTED_RESOLUTIONS:
        raise OrchardConfigError(
            f"resolution={v} is not supported. Choose from {sorted(SUPPORTED_RESOLUTIONS)}."
        )
    return v
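To show what a caller sees when the check fails, here is a self-contained sketch that mirrors the validator outside of Pydantic. Both `SUPPORTED_RESOLUTIONS` (values assumed from the attribute documentation) and `OrchardConfigError` (modeled here as a ValueError subclass) are stand-ins, and `check_resolution` is a hypothetical name.

```python
# Stand-ins: the real constant and exception live in the orchard package.
SUPPORTED_RESOLUTIONS = frozenset({28, 32, 64, 128, 224})  # assumed values


class OrchardConfigError(ValueError):
    """Stand-in for the library's configuration error type."""


def check_resolution(v: int) -> int:
    """Reject any resolution that is not in the supported set."""
    if v not in SUPPORTED_RESOLUTIONS:
        raise OrchardConfigError(
            f"resolution={v} is not supported. "
            f"Choose from {sorted(SUPPORTED_RESOLUTIONS)}."
        )
    return v


try:
    check_resolution(96)
except OrchardConfigError as exc:
    message = str(exc)

print(message)
# resolution=96 is not supported. Choose from [28, 32, 64, 128, 224].
```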

validate_min_samples(v) classmethod

Enforce minimum sample count for meaningful train/val/test splits.

Source code in orchard/core/config/dataset_config.py
@field_validator("max_samples")
@classmethod
def validate_min_samples(cls, v: int | None) -> int | None:
    """
    Enforce minimum sample count for meaningful train/val/test splits.
    """
    if v is not None and v < 20:
        raise OrchardConfigError(
            f"max_samples={v} is too small for meaningful train/val/test splits. "
            f"Use max_samples >= 20 or None to load all samples."
        )
    return v

sync_img_size_with_resolution(values) classmethod

Auto-sync img_size with resolution if not explicitly set.

This runs BEFORE frozen instantiation, allowing us to modify values.

Logic:

1. If img_size is explicitly set in YAML/args → keep it
2. Else, if metadata exists → use metadata.native_resolution
3. Otherwise → use resolution (default 28)

Parameters:

Name Type Description Default
values dict[str, Any]

Raw input dict before Pydantic validation

required

Returns:

Type Description
dict[str, Any]

Modified values dict with synced img_size

Source code in orchard/core/config/dataset_config.py
@model_validator(mode="before")
@classmethod
def sync_img_size_with_resolution(cls, values: dict[str, Any]) -> dict[str, Any]:
    """
    Auto-sync img_size with resolution if not explicitly set.

    This runs BEFORE frozen instantiation, allowing us to modify values.

    Logic:
    1. If img_size is explicitly set in YAML/args → keep it
    2. Else, if metadata exists → use metadata.native_resolution
    3. Otherwise → use resolution (default 28)

    Args:
        values (dict[str, Any]): Raw input dict before Pydantic validation

    Returns:
        dict[str, Any]: Modified values dict with synced img_size
    """
    img_size = values.get("img_size")
    resolution = values.get("resolution", 28)
    metadata = values.get("metadata")

    if img_size is None:
        if metadata is not None:
            # Use metadata's native resolution
            img_size = metadata.native_resolution
        else:
            # Use resolution parameter
            img_size = resolution

        values["img_size"] = img_size

    return values
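The precedence implemented by this validator (explicit img_size, then metadata.native_resolution, then resolution) can be checked in isolation. The sketch below reproduces the same branching on a plain dict; `sync_img_size` is a hypothetical name, and `SimpleNamespace` stands in for a DatasetMetadata object.

```python
from types import SimpleNamespace
from typing import Any


def sync_img_size(values: dict[str, Any]) -> dict[str, Any]:
    """Fill in img_size when it is missing, mirroring the validator's branches."""
    if values.get("img_size") is None:
        metadata = values.get("metadata")
        if metadata is not None:
            # Metadata takes precedence over the resolution parameter.
            values["img_size"] = metadata.native_resolution
        else:
            # Fall back to resolution, defaulting to 28 when absent.
            values["img_size"] = values.get("resolution", 28)
    return values


fake_meta = SimpleNamespace(native_resolution=224)  # stand-in for DatasetMetadata

explicit = sync_img_size({"img_size": 64, "resolution": 224})
from_meta = sync_img_size({"resolution": 28, "metadata": fake_meta})
from_res = sync_img_size({"resolution": 128})
default = sync_img_size({})

print(explicit["img_size"], from_meta["img_size"],
      from_res["img_size"], default["img_size"])  # 64 224 128 28
```

Like the validator, the sketch mutates and returns the input dict, which is acceptable here because it runs on raw input before the frozen model is built.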