medmnist_fetcher

orchard.data_handler.fetchers.medmnist_fetcher

MedMNIST Dataset Fetcher.

Downloads MedMNIST NPZ files with robust retry logic, MD5 verification, and atomic file operations. Follows the same pattern as galaxy10_converter to keep each domain's fetch logic self-contained.

ensure_medmnist_npz(metadata, retries=5, delay=5.0)

Downloads a MedMNIST NPZ file with retries and MD5 validation.

Implements a three-phase strategy:
  1. Return immediately if a valid local copy already exists.
  2. Delete any corrupted local copy.
  3. Stream-download with retry loop and atomic file replacement.
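The three phases can be sketched in isolation. This is a minimal stand-alone version of the pattern, not the library's implementation: `fetch` is a stand-in for the real streaming download, and the error type is a generic placeholder:

```python
import hashlib
from pathlib import Path


def ensure_file(target: Path, expected_md5: str, fetch) -> Path:
    """Validate-or-(re)download pattern; `fetch` is any callable that writes to a path."""
    # Phase 1: trust a local copy that passes checksum validation.
    if target.exists() and hashlib.md5(target.read_bytes()).hexdigest() == expected_md5:
        return target
    # Phase 2: a file that exists but failed validation is corrupt; remove it.
    if target.exists():
        target.unlink()
    # Phase 3: download to a sibling .tmp file, then atomically promote it.
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    fetch(tmp)
    if hashlib.md5(tmp.read_bytes()).hexdigest() != expected_md5:
        tmp.unlink()
        raise RuntimeError("downloaded file failed MD5 validation")
    tmp.replace(target)  # atomic rename: readers never observe a partial file
    return target
```

Writing to a temporary file and promoting it with `Path.replace` means a crash mid-download leaves only a stale `.tmp` behind, never a truncated `.npz` that would pass an existence check.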

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metadata` | `DatasetMetadata` | Metadata containing URL, MD5, name and target path. | *required* |
| `retries` | `int` | Max number of download attempts. | `5` |
| `delay` | `float` | Base delay (seconds) between retries (quadratic backoff on 429). | `5.0` |
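The docstring says the backoff on HTTP 429 is quadratic in the attempt number but does not show the formula; `_retry_delay` is private and not listed here. A plausible sketch, assuming `delay * attempt**2` when rate-limited and a constant delay otherwise:

```python
def retry_delay(is_rate_limited: bool, base_delay: float, attempt: int) -> float:
    """Hypothetical backoff: grow quadratically only when the server signals 429."""
    if is_rate_limited:  # HTTP 429 Too Many Requests
        return base_delay * attempt**2
    return base_delay
```

Under this assumption, the default `delay=5.0` would give rate-limited waits of 5, 20, 45, ... seconds across successive attempts.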

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Path` | `Path` | Path to the successfully validated `.npz` file. |

Raises:

| Type | Description |
| --- | --- |
| `OrchardDatasetError` | If all download attempts fail. |

Source code in orchard/data_handler/fetchers/medmnist_fetcher.py
```python
def ensure_medmnist_npz(
    metadata: DatasetMetadata,
    retries: int = 5,
    delay: float = 5.0,
) -> Path:
    """
    Downloads a MedMNIST NPZ file with retries and MD5 validation.

    Implements a three-phase strategy:
        1. Return immediately if a valid local copy already exists.
        2. Delete any corrupted local copy.
        3. Stream-download with retry loop and atomic file replacement.

    Args:
        metadata (DatasetMetadata): Metadata containing URL, MD5, name and target path.
        retries (int): Max number of download attempts.
        delay (float): Base delay (seconds) between retries (quadratic backoff on 429).

    Returns:
        Path: Path to the successfully validated .npz file.

    Raises:
        OrchardDatasetError: If all download attempts fail.
    """
    target_npz = metadata.path

    # 1. Validation of existing file
    if _is_valid_npz(target_npz, metadata.md5_checksum):
        logger.debug(
            "%s%s %-18s: '%s' found at %s",
            LogStyle.INDENT,
            LogStyle.ARROW,
            "Dataset",
            metadata.name,
            target_npz.name,
        )
        return target_npz

    # 2. Cleanup corrupted file
    if target_npz.exists():
        logger.warning("Corrupted dataset found, deleting: %s", target_npz)
        target_npz.unlink()

    # 3. Download logic with retries
    logger.info("%s%s %-18s: %s", LogStyle.INDENT, LogStyle.ARROW, "Downloading", metadata.name)
    target_npz.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = target_npz.with_suffix(".tmp")

    for attempt in range(1, retries + 1):
        try:
            _stream_download(metadata.url, tmp_path)

            if not _is_valid_npz(tmp_path, metadata.md5_checksum):
                actual_md5 = md5_checksum(tmp_path)
                logger.error("MD5 mismatch: expected %s, got %s", metadata.md5_checksum, actual_md5)
                raise OrchardDatasetError("Downloaded file failed MD5 or header validation")

            # Atomic move
            tmp_path.replace(target_npz)
            logger.info(
                "%s%s %-18s: %s", LogStyle.INDENT, LogStyle.SUCCESS, "Verified", metadata.name
            )
            return target_npz

        except (OrchardDatasetError, OSError) as e:
            if tmp_path.exists():
                tmp_path.unlink()

            if attempt == retries:
                logger.error("Download failed after %d attempts", retries)
                raise OrchardDatasetError(f"Could not download {metadata.name}") from e

            actual_delay = _retry_delay(e, delay, attempt)
            logger.warning(
                "Attempt %d/%d failed: %s. Retrying in %ss...", attempt, retries, e, actual_delay
            )
            time.sleep(actual_delay)

    raise OrchardDatasetError("Unexpected error in dataset download logic.")  # pragma: no cover
```
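The helper `_is_valid_npz` is referenced but not shown above. A plausible sketch, assuming it combines the MD5 check with a cheap header check (NPZ files are ZIP archives, so a valid file starts with the `PK` magic bytes); the name and exact checks here are assumptions, not the library's code:

```python
import hashlib
from pathlib import Path


def is_valid_npz(path: Path, expected_md5: str) -> bool:
    """Hypothetical validator: file must exist, look like a ZIP, and match the checksum."""
    if not path.is_file():
        return False
    data = path.read_bytes()
    if not data.startswith(b"PK"):  # ZIP local-file-header magic; NPZ is a ZIP archive
        return False
    return hashlib.md5(data).hexdigest() == expected_md5
```

Checking the magic bytes first rejects obviously truncated or HTML error-page downloads without the cost of hashing, while the MD5 comparison catches subtler corruption.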