orchard.evaluation

Evaluation and Reporting Package.

This package coordinates model inference, performance visualization, and structured experiment reporting. It provides a unified interface to assess model generalization through standard metrics and Test-Time Augmentation (TTA), while automating the generation of artifacts (plots, reports) for experimental tracking.

PlotContext(arch_name, resolution, fig_dpi, plot_style, cmap_confusion, grid_cols, n_samples, fig_size_predictions, mean=None, std=None, use_tta=False, is_anatomical=True, is_texture_based=True) dataclass

Immutable snapshot of all configuration fields needed by visualization functions.

from_config(cfg) classmethod

Build a PlotContext from the full Config object.

Source code in orchard/evaluation/plot_context.py
@classmethod
def from_config(cls, cfg: Config) -> PlotContext:
    """
    Build a PlotContext from the full Config object.
    """
    return cls(
        arch_name=cfg.architecture.name,
        resolution=cfg.dataset.resolution,
        fig_dpi=cfg.evaluation.fig_dpi,
        plot_style=cfg.evaluation.plot_style,
        cmap_confusion=cfg.evaluation.cmap_confusion,
        grid_cols=cfg.evaluation.grid_cols,
        n_samples=cfg.evaluation.n_samples,
        fig_size_predictions=cfg.evaluation.fig_size_predictions,
        mean=cfg.dataset.mean,
        std=cfg.dataset.std,
        use_tta=cfg.training.use_tta,
        is_anatomical=cfg.dataset.effective_is_anatomical,
        is_texture_based=cfg.dataset.effective_is_texture_based,
    )
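
Example (a minimal sketch; cfg is assumed to be a fully validated Config instance loaded by orchard's configuration layer):

from orchard.evaluation.plot_context import PlotContext

# Snapshot the fields the plotting helpers need; the result is immutable.
ctx = PlotContext.from_config(cfg)
print(ctx.arch_name, ctx.resolution, ctx.use_tta)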

TrainingReport

Bases: BaseModel

Validated data container for summarizing a complete training experiment.

This model serves as a schema for the final experimental metadata. It captures hardware, hyperparameter, and performance details to ensure full reproducibility and traceability of the training pipeline.

Attributes:

  • timestamp (str): ISO formatted execution time.
  • architecture (str): Identifier of the architecture used.
  • dataset (str): Name of the dataset.
  • best_val_accuracy (float): Peak accuracy achieved on validation set.
  • best_val_auc (float): Peak ROC-AUC achieved on validation set.
  • best_val_f1 (float): Peak macro-averaged F1 on validation set.
  • test_accuracy (float): Final accuracy on the unseen test set.
  • test_auc (float): Final ROC-AUC on the unseen test set.
  • test_macro_f1 (float): Macro-averaged F1 score (key for imbalanced data).
  • is_texture_based (bool): Whether texture-preserving logic was applied.
  • is_anatomical (bool): Whether anatomical orientation constraints were enforced.
  • use_tta (bool): Indicates if Test-Time Augmentation was active.
  • epochs_trained (int): Total number of optimization cycles completed.
  • learning_rate (float): Initial learning rate used by the optimizer.
  • batch_size (int): Samples processed per iteration.
  • seed (int): Global RNG seed for experiment replication.
  • augmentations (str): Descriptive string of the transformation pipeline.
  • normalization (str): Mean/Std statistics applied to the input tensors.
  • model_path (str): Absolute path to the best saved checkpoint.
  • log_path (str): Absolute path to the session execution log.

to_vertical_df()

Converts the Pydantic model into a vertical pandas DataFrame.

Returns:

  pd.DataFrame: A two-column DataFrame (Parameter, Value) for Excel export.

Source code in orchard/evaluation/reporting.py
def to_vertical_df(self) -> pd.DataFrame:
    """
    Converts the Pydantic model into a vertical pandas DataFrame.

    Returns:
        pd.DataFrame: A two-column DataFrame (Parameter, Value) for Excel export.
    """
    data = self.model_dump()
    return pd.DataFrame(list(data.items()), columns=["Parameter", "Value"])
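
Example (sketch; report is assumed to be an existing TrainingReport instance):

df = report.to_vertical_df()
print(df.head())   # columns: Parameter, Value (one row per report field)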

save(path, fmt='xlsx')

Saves the report to disk in the requested format.

Supported formats
  • xlsx: Professional Excel with conditional formatting.
  • csv: Flat CSV (two columns: Parameter, Value).
  • json: Pretty-printed JSON array.

Parameters:

  • path (Path, required): Base file path (suffix is replaced to match fmt).
  • fmt (str, default 'xlsx'): Output format — one of "xlsx", "csv", "json".
Source code in orchard/evaluation/reporting.py
def save(
    self, path: Path, fmt: str = "xlsx"  # pragma: no mutate  # .lower() normalizes any case
) -> None:
    """
    Saves the report to disk in the requested format.

    Supported formats:
        - ``xlsx``: Professional Excel with conditional formatting.
        - ``csv``: Flat CSV (two columns: Parameter, Value).
        - ``json``: Pretty-printed JSON array.

    Args:
        path: Base file path (suffix is replaced to match *fmt*).
        fmt: Output format — one of ``"xlsx"``, ``"csv"``, ``"json"``.
    """
    fmt = fmt.lower()
    path.parent.mkdir(parents=True, exist_ok=True)

    try:
        df = self.to_vertical_df()
        if fmt == "csv":
            path = path.with_suffix(".csv")
            df.to_csv(path, index=False)  # pragma: no mutate  # None ≡ False in pandas
        elif fmt == "json":
            path = path.with_suffix(".json")
            df.to_json(path, orient="records", indent=2)
        else:
            path = path.with_suffix(".xlsx")
            with pd.ExcelWriter(  # pragma: no mutate
                path,
                engine="xlsxwriter",  # pragma: no mutate
                engine_kwargs={"options": {"nan_inf_to_errors": True}},  # pragma: no mutate
            ) as writer:
                # fmt: off
                df.to_excel(writer, sheet_name="Detailed Report", index=False)  # pragma: no mutate
                # fmt: on
                self._apply_excel_formatting(writer, df)

        logger.info(
            "%s%s %-18s: %s", LogStyle.INDENT, LogStyle.ARROW, "Summary Report", path.name
        )
    except Exception as e:  # xlsxwriter raises non-standard exceptions
        logger.error("Failed to generate report: %s", e)
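
Example (sketch; report is an existing TrainingReport, and the parent directory is created automatically by save()):

from pathlib import Path

report.save(Path("results/summary"), fmt="xlsx")   # writes results/summary.xlsx
report.save(Path("results/summary"), fmt="json")   # writes results/summary.json
report.save(Path("results/summary"), fmt="csv")    # writes results/summary.csv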

run_final_evaluation(model, test_loader, train_losses, val_metrics_history, class_names, paths, training, dataset, augmentation, evaluation, arch_name, aug_info='N/A', tracker=None)

Execute the complete evaluation pipeline.

Coordinates full-set inference (with TTA support), visualizes metrics, and generates the final structured report.

Parameters:

  • model (Module, required): Trained model for evaluation (already on target device).
  • test_loader (DataLoader[Any], required): DataLoader for test set.
  • train_losses (list[float], required): Training loss history per epoch.
  • val_metrics_history (list[Mapping[str, float]], required): Validation metrics history per epoch.
  • class_names (list[str], required): List of class label strings.
  • paths (RunPaths, required): RunPaths for artifact output.
  • training (TrainingConfig, required): Training sub-config (use_tta, hyperparameters for report).
  • dataset (DatasetConfig, required): Dataset sub-config (resolution, metadata, normalization).
  • augmentation (AugmentationConfig, required): Augmentation sub-config (TTA transforms).
  • evaluation (EvaluationConfig, required): Evaluation sub-config (plot flags, report format).
  • arch_name (str, required): Architecture identifier (e.g. "resnet_18").
  • aug_info (str, default 'N/A'): Augmentation description string for report.
  • tracker (TrackerProtocol | None, default None): Optional experiment tracker for final metrics.

Returns:

  tuple[float, float, float]: A 3-tuple of:

    • macro_f1 -- Macro-averaged F1 score
    • test_acc -- Test set accuracy
    • test_auc -- Test set AUC (NaN if computation failed)
Source code in orchard/evaluation/evaluation_pipeline.py
def run_final_evaluation(
    model: nn.Module,
    test_loader: DataLoader[Any],
    train_losses: list[float],
    val_metrics_history: list[Mapping[str, float]],
    class_names: list[str],
    paths: RunPaths,
    training: TrainingConfig,
    dataset: DatasetConfig,
    augmentation: AugmentationConfig,
    evaluation: EvaluationConfig,
    arch_name: str,
    aug_info: str = "N/A",  # pragma: no mutate
    tracker: TrackerProtocol | None = None,
) -> tuple[float, float, float]:
    """
    Execute the complete evaluation pipeline.

    Coordinates full-set inference (with TTA support), visualizes metrics,
    and generates the final structured report.

    Args:
        model: Trained model for evaluation (already on target device).
        test_loader: DataLoader for test set.
        train_losses: Training loss history per epoch.
        val_metrics_history: Validation metrics history per epoch.
        class_names: List of class label strings.
        paths: RunPaths for artifact output.
        training: Training sub-config (use_tta, hyperparameters for report).
        dataset: Dataset sub-config (resolution, metadata, normalization).
        augmentation: Augmentation sub-config (TTA transforms).
        evaluation: Evaluation sub-config (plot flags, report format).
        arch_name: Architecture identifier (e.g. ``"resnet_18"``).
        aug_info: Augmentation description string for report.
        tracker: Optional experiment tracker for final metrics.

    Returns:
        tuple[float, float, float]: A 3-tuple of:

            - **macro_f1** -- Macro-averaged F1 score
            - **test_acc** -- Test set accuracy
            - **test_auc** -- Test set AUC (NaN if computation failed)
    """
    # Resolve device from model (already placed on the correct device by the trainer)
    device = next(model.parameters()).device

    # Filesystem-safe architecture tag (e.g. "timm/model" → "timm_model")
    arch_tag = arch_name.replace("/", "_")

    # --- 1) Inference & Metrics ---
    # Performance on the full test set
    all_preds, all_labels, test_metrics, macro_f1 = evaluate_model(
        model,
        test_loader,
        device=device,
        use_tta=training.use_tta,
        is_anatomical=dataset.effective_is_anatomical,
        is_texture_based=dataset.effective_is_texture_based,
        aug_cfg=augmentation,
        resolution=dataset.resolution,
    )

    # --- 2) Visualizations ---
    ctx = PlotContext(
        arch_name=arch_name,
        resolution=dataset.resolution,
        fig_dpi=evaluation.fig_dpi,
        plot_style=evaluation.plot_style,
        cmap_confusion=evaluation.cmap_confusion,
        grid_cols=evaluation.grid_cols,
        n_samples=evaluation.n_samples,
        fig_size_predictions=evaluation.fig_size_predictions,
        mean=dataset.mean,
        std=dataset.std,
        use_tta=training.use_tta,
        is_anatomical=dataset.effective_is_anatomical,
        is_texture_based=dataset.effective_is_texture_based,
    )

    # Diagnostic Confusion Matrix
    if evaluation.save_confusion_matrix:
        plot_confusion_matrix(
            all_labels=all_labels,
            all_preds=all_preds,
            classes=class_names,
            out_path=paths.get_fig_path(f"confusion_matrix_{arch_tag}_{dataset.resolution}.png"),
            ctx=ctx,
        )

    # Historical Training Curves
    val_acc_list = [m[METRIC_ACCURACY] for m in val_metrics_history]
    plot_training_curves(
        train_losses=train_losses,
        val_accuracies=val_acc_list,
        out_path=paths.get_fig_path(f"training_curves_{arch_tag}_{dataset.resolution}.png"),
        ctx=ctx,
    )

    # Lazy-loaded prediction grid (samples from loader)
    if evaluation.save_predictions_grid:
        show_predictions(
            model=model,
            loader=test_loader,
            device=device,
            classes=class_names,
            save_path=paths.get_fig_path(f"sample_predictions_{arch_tag}_{dataset.resolution}.png"),
            ctx=ctx,
        )

    # --- 3) Structured Reporting ---
    # Aggregates everything into a formatted report (xlsx/csv/json)
    final_log = paths.logs / "session.log"

    report = create_structured_report(
        val_metrics=val_metrics_history,
        test_metrics=test_metrics,
        macro_f1=macro_f1,
        train_losses=train_losses,
        best_path=paths.best_model_path,
        log_path=final_log,
        arch_name=arch_name,
        dataset=dataset,
        training=training,
        aug_info=aug_info,
    )
    report.save(paths.final_report_path, fmt=evaluation.report_format)

    test_acc = test_metrics[METRIC_ACCURACY]
    test_auc = test_metrics[METRIC_AUC]

    # Log test metrics to experiment tracker
    if tracker is not None:
        tracker.log_test_metrics(test_acc=test_acc, macro_f1=macro_f1)

    logger.info("%s%s Final Evaluation Phase Complete.", LogStyle.INDENT, LogStyle.SUCCESS)

    return macro_f1, test_acc, test_auc
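
Typical call at the end of a training run (a sketch; model, the loaders, the per-epoch histories, paths, and cfg are assumed to come from the preceding pipeline, and cfg.augmentation is assumed to hold the AugmentationConfig sub-config):

macro_f1, test_acc, test_auc = run_final_evaluation(
    model=model,
    test_loader=test_loader,
    train_losses=train_losses,
    val_metrics_history=val_metrics_history,
    class_names=class_names,
    paths=paths,
    training=cfg.training,
    dataset=cfg.dataset,
    augmentation=cfg.augmentation,
    evaluation=cfg.evaluation,
    arch_name=cfg.architecture.name,
    aug_info="hflip + rotation",   # free-form description stored in the report
)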

evaluate_model(model, test_loader, device, use_tta=False, is_anatomical=False, is_texture_based=False, aug_cfg=None, resolution=28)

Performs full-set evaluation and coordinates metric calculation.

Parameters:

  • model (Module, required): The trained neural network.
  • test_loader (DataLoader[Any], required): DataLoader for the evaluation set.
  • device (device, required): Hardware target (CPU/CUDA/MPS).
  • use_tta (bool, default False): Flag to enable Test-Time Augmentation.
  • is_anatomical (bool, default False): Dataset-specific orientation constraint.
  • is_texture_based (bool, default False): Dataset-specific texture preservation flag.
  • aug_cfg (AugmentationConfig | None, default None): Augmentation sub-configuration (required for TTA).
  • resolution (int, default 28): Dataset resolution for TTA intensity scaling.

Returns:

  tuple[NDArray[Any], NDArray[Any], dict[str, float], float]: A 4-tuple of:

    • all_preds -- Predicted class indices, shape (N,)
    • all_labels -- Ground truth labels, shape (N,)
    • metrics -- dict[str, float] with keys accuracy, auc, f1
    • macro_f1 -- Macro-averaged F1 score (convenience shortcut)
Source code in orchard/evaluation/evaluator.py
def evaluate_model(
    model: nn.Module,
    test_loader: DataLoader[Any],
    device: torch.device,
    use_tta: bool = False,
    is_anatomical: bool = False,
    is_texture_based: bool = False,
    aug_cfg: AugmentationConfig | None = None,
    resolution: int = 28,
) -> tuple[npt.NDArray[Any], npt.NDArray[Any], dict[str, float], float]:
    """
    Performs full-set evaluation and coordinates metric calculation.

    Args:
        model: The trained neural network.
        test_loader: DataLoader for the evaluation set.
        device: Hardware target (CPU/CUDA/MPS).
        use_tta: Flag to enable Test-Time Augmentation.
        is_anatomical: Dataset-specific orientation constraint.
        is_texture_based: Dataset-specific texture preservation flag.
        aug_cfg: Augmentation sub-configuration (required for TTA).
        resolution: Dataset resolution for TTA intensity scaling.

    Returns:
        tuple[np.ndarray, np.ndarray, dict, float]: A 4-tuple of:

            - **all_preds** -- Predicted class indices, shape ``(N,)``
            - **all_labels** -- Ground truth labels, shape ``(N,)``
            - **metrics** -- dict[str, float] with keys ``accuracy``, ``auc``, ``f1``
            - **macro_f1** -- Macro-averaged F1 score (convenience shortcut)
    """
    model.eval()
    all_probs_list: list[npt.NDArray[Any]] = []
    all_labels_list: list[npt.NDArray[Any]] = []

    actual_tta = use_tta and (aug_cfg is not None)

    with torch.no_grad():
        for inputs, targets in test_loader:
            if actual_tta:
                # TTA logic handles its own device placement and softmax
                probs = adaptive_tta_predict(
                    model, inputs, device, is_anatomical, is_texture_based, aug_cfg, resolution
                )
            else:
                # Standard forward pass
                inputs = inputs.to(device)
                logits = model(inputs)
                probs = torch.softmax(logits, dim=1)

            all_probs_list.append(probs.cpu().numpy())
            all_labels_list.append(targets.numpy())

    # Consolidate batch results into global arrays
    all_probs = np.concatenate(all_probs_list)
    all_labels = np.concatenate(all_labels_list)
    all_preds = all_probs.argmax(axis=1)

    # Delegate statistical analysis to the metrics module
    metrics = compute_classification_metrics(all_labels, all_preds, all_probs)

    # Performance logging
    log_msg = "%s%s %-18s: Acc: %.4f | AUC: %.4f | F1: %.4f" % (
        LogStyle.INDENT,
        LogStyle.ARROW,
        "Test Metrics",
        metrics[METRIC_ACCURACY],
        metrics[METRIC_AUC],
        metrics[METRIC_F1],
    )
    if actual_tta and aug_cfg is not None:
        mode = aug_cfg.tta_mode.upper()
        log_msg += " | TTA ENABLED (Mode: %s)" % mode

    logger.info(log_msg)

    return all_preds, all_labels, metrics, metrics[METRIC_F1]
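
Example (sketch of a plain, non-TTA evaluation; the metric key names follow the docstring above):

device = next(model.parameters()).device
all_preds, all_labels, metrics, macro_f1 = evaluate_model(model, test_loader, device=device)
print(f"acc={metrics['accuracy']:.4f}  auc={metrics['auc']:.4f}  f1={metrics['f1']:.4f}")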

compute_auc(y_true, y_score)

Compute macro-averaged ROC-AUC with graceful fallback.

Handles binary (positive class probability) and multiclass (OvR) cases. Returns NaN on failure so callers can distinguish "computation impossible" from "genuinely zero AUC".

Parameters:

  • y_true (NDArray[Any], required): Ground truth class indices, shape (N,).
  • y_score (NDArray[Any], required): Probability distributions, shape (N, C) (softmax output).

Returns:

  float: ROC-AUC score, or NaN if computation fails.

Source code in orchard/trainer/engine.py
def compute_auc(y_true: npt.NDArray[Any], y_score: npt.NDArray[Any]) -> float:
    """
    Compute macro-averaged ROC-AUC with graceful fallback.

    Handles binary (positive class probability) and multiclass (OvR)
    cases. Returns ``NaN`` on failure so callers can distinguish
    "computation impossible" from "genuinely zero AUC".

    Args:
        y_true: Ground truth class indices, shape ``(N,)``.
        y_score: Probability distributions, shape ``(N, C)`` (softmax output).

    Returns:
        ROC-AUC score, or ``NaN`` if computation fails.
    """
    try:
        n_classes = y_score.shape[1] if y_score.ndim == 2 else 1
        if n_classes <= 2:
            auc = roc_auc_score(y_true, y_score[:, 1] if y_score.ndim == 2 else y_score)
        else:
            auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
    except (ValueError, TypeError, IndexError) as e:
        logger.warning("ROC-AUC calculation failed: %s. Returning NaN.", e)
        return float("nan")

    if np.isnan(auc):
        return float("nan")
    return float(auc)
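
Example (binary case; the positive-class column of the softmax output is used):

import numpy as np
from orchard.trainer.engine import compute_auc

y_true = np.array([0, 1, 1, 0])
y_score = np.array([[0.9, 0.1],   # per-sample probabilities: P(class 0), P(class 1)
                    [0.2, 0.8],
                    [0.3, 0.7],
                    [0.6, 0.4]])
auc = compute_auc(y_true, y_score)   # returns NaN instead of raising on failure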

compute_classification_metrics(labels, preds, probs)

Computes accuracy, macro-averaged F1, and macro-averaged ROC-AUC.

Parameters:

  • labels (NDArray[Any], required): Ground truth class indices.
  • preds (NDArray[Any], required): Predicted class indices.
  • probs (NDArray[Any], required): Softmax probability distributions.

Returns:

  dict[str, float]: Metric dictionary with keys:

    • accuracy -- Overall classification accuracy
    • auc -- Macro-averaged ROC-AUC (NaN if computation fails)
    • f1 -- Macro-averaged F1 score
Source code in orchard/evaluation/metrics.py
def compute_classification_metrics(
    labels: npt.NDArray[Any], preds: npt.NDArray[Any], probs: npt.NDArray[Any]
) -> dict[str, float]:
    """
    Computes accuracy, macro-averaged F1, and macro-averaged ROC-AUC.

    Args:
        labels: Ground truth class indices.
        preds: Predicted class indices.
        probs: Softmax probability distributions.

    Returns:
        dict[str, float]: Metric dictionary with keys:

            - ``accuracy`` -- Overall classification accuracy
            - ``auc`` -- Macro-averaged ROC-AUC (NaN if computation fails)
            - ``f1`` -- Macro-averaged F1 score
    """
    accuracy = np.mean(preds == labels)
    # fmt: off
    # Equivalent mutants: average="macro"→"MACRO"/None is caught by sklearn;
    # zero_division=0.0→1.0 or removal is runtime-equiv in sklearn ≥1.8.
    macro_f1 = f1_score(labels, preds, average="macro", zero_division=0.0)  # pragma: no mutate
    # fmt: on
    auc = compute_auc(labels, probs)

    return {METRIC_ACCURACY: float(accuracy), METRIC_AUC: float(auc), METRIC_F1: float(macro_f1)}
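
Example (a small synthetic 3-class batch):

import numpy as np
from orchard.evaluation.metrics import compute_classification_metrics

labels = np.array([0, 1, 2, 1])
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.2, 0.7],
                  [0.5, 0.3, 0.2]])   # softmax output, shape (N, C)
preds = probs.argmax(axis=1)
metrics = compute_classification_metrics(labels, preds, probs)
# metrics holds the keys "accuracy", "auc", "f1"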

create_structured_report(val_metrics, test_metrics, macro_f1, train_losses, best_path, log_path, arch_name, dataset, training, aug_info=None)

Constructs a TrainingReport object using final metrics and configuration.

This factory method aggregates disparate pipeline results into a single validated container, resolving paths and extracting augmentation summaries.

Parameters:

  • val_metrics (Sequence[Mapping[str, float]], required): History of per-epoch validation metric dicts.
  • test_metrics (dict[str, float], required): Final test-set metric dict (accuracy, auc, etc.).
  • macro_f1 (float, required): Final Macro F1 score on test set.
  • train_losses (Sequence[float], required): History of per-epoch training losses.
  • best_path (Path, required): Path to the saved model weights.
  • log_path (Path, required): Path to the run log file.
  • arch_name (str, required): Architecture identifier (e.g. "resnet_18").
  • dataset (DatasetConfig, required): Dataset sub-config with metadata, name, and normalization info.
  • training (TrainingConfig, required): Training sub-config with hyperparameters and flags.
  • aug_info (str | None, default None): Pre-formatted augmentation string.

Returns:

  TrainingReport: A validated Pydantic model ready for export.

Source code in orchard/evaluation/reporting.py
def create_structured_report(
    val_metrics: Sequence[Mapping[str, float]],
    test_metrics: dict[str, float],
    macro_f1: float,
    train_losses: Sequence[float],
    best_path: Path,
    log_path: Path,
    arch_name: str,
    dataset: DatasetConfig,
    training: TrainingConfig,
    aug_info: str | None = None,
) -> TrainingReport:
    """
    Constructs a TrainingReport object using final metrics and configuration.

    This factory method aggregates disparate pipeline results into a single
    validated container, resolving paths and extracting augmentation summaries.

    Args:
        val_metrics: History of per-epoch validation metric dicts.
        test_metrics: Final test-set metric dict (accuracy, auc, etc.).
        macro_f1: Final Macro F1 score on test set.
        train_losses: History of per-epoch training losses.
        best_path: Path to the saved model weights.
        log_path: Path to the run log file.
        arch_name: Architecture identifier (e.g. ``"resnet_18"``).
        dataset: Dataset sub-config with metadata, name, and normalization info.
        training: Training sub-config with hyperparameters and flags.
        aug_info: Pre-formatted augmentation string.

    Returns:
        TrainingReport: A validated Pydantic model ready for export.
    """
    # Augmentation info is expected from caller; fallback to "N/A"
    aug_info = aug_info or "N/A"

    def _safe_max(key: str) -> float:
        """Return the best non-NaN value for *key* across validation epochs."""
        values = [m[key] for m in val_metrics if not math.isnan(m[key])]
        return max(values, default=0.0)

    best_val_acc = _safe_max(METRIC_ACCURACY)
    best_val_auc = _safe_max(METRIC_AUC)
    best_val_f1 = _safe_max(METRIC_F1)

    return TrainingReport(
        architecture=arch_name,
        dataset=dataset.dataset_name,
        best_val_accuracy=best_val_acc,
        best_val_auc=best_val_auc,
        best_val_f1=best_val_f1,
        test_accuracy=test_metrics[METRIC_ACCURACY],
        test_auc=test_metrics[METRIC_AUC],
        test_macro_f1=macro_f1,
        is_texture_based=dataset.metadata.is_texture_based,
        is_anatomical=dataset.metadata.is_anatomical,
        use_tta=training.use_tta,
        epochs_trained=len(train_losses),
        learning_rate=training.learning_rate,
        batch_size=training.batch_size,
        augmentations=aug_info,
        normalization=dataset.metadata.normalization_info,
        model_path=str(best_path.resolve()),
        log_path=str(log_path.resolve()),
        seed=training.seed,
    )
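
Example (sketch; the histories, metrics, paths, and sub-configs are assumed to come from the surrounding pipeline, mirroring the call in run_final_evaluation above):

report = create_structured_report(
    val_metrics=val_metrics_history,
    test_metrics=test_metrics,
    macro_f1=macro_f1,
    train_losses=train_losses,
    best_path=paths.best_model_path,
    log_path=paths.logs / "session.log",
    arch_name="resnet_18",
    dataset=cfg.dataset,
    training=cfg.training,
    aug_info="hflip + rotation",
)
report.save(paths.final_report_path, fmt="xlsx")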

adaptive_tta_predict(model, inputs, device, is_anatomical, is_texture_based, aug_cfg, resolution)

Performs Test-Time Augmentation (TTA) inference on a batch of inputs.

Applies a set of standard augmentations in addition to the original input. Predictions from all augmented versions are averaged in the probability space. If is_anatomical is True, it restricts augmentations to orientation-preserving transforms. If is_texture_based is True, it disables destructive pixel-level noise/blur to preserve local patterns. The tta_mode config field controls ensemble complexity (full vs light) independently of hardware.

Parameters:

  • model (Module, required): The trained PyTorch model.
  • inputs (Tensor, required): The batch of test images.
  • device (device, required): The device to run the inference on.
  • is_anatomical (bool, required): Whether the dataset has fixed anatomical orientation.
  • is_texture_based (bool, required): Whether the dataset relies on high-frequency textures.
  • aug_cfg (AugmentationConfig, required): Augmentation sub-configuration with TTA parameters.
  • resolution (int, required): Dataset resolution for TTA intensity scaling.

Returns:

  Tensor: The averaged softmax probability predictions (mean ensemble).

Source code in orchard/evaluation/tta.py
def adaptive_tta_predict(
    model: nn.Module,
    inputs: torch.Tensor,
    device: torch.device,
    is_anatomical: bool,
    is_texture_based: bool,
    aug_cfg: AugmentationConfig,
    resolution: int,
) -> torch.Tensor:
    """
    Performs Test-Time Augmentation (TTA) inference on a batch of inputs.

    Applies a set of standard augmentations in addition to the original input.
    Predictions from all augmented versions are averaged in the probability space.
    If is_anatomical is True, it restricts augmentations to orientation-preserving
    transforms. If is_texture_based is True, it disables destructive pixel-level
    noise/blur to preserve local patterns. The ``tta_mode`` config field controls
    ensemble complexity (full vs light) independently of hardware.

    Args:
        model: The trained PyTorch model.
        inputs: The batch of test images.
        device: The device to run the inference on.
        is_anatomical: Whether the dataset has fixed anatomical orientation.
        is_texture_based: Whether the dataset relies on high-frequency textures.
        aug_cfg: Augmentation sub-configuration with TTA parameters.
        resolution: Dataset resolution for TTA intensity scaling.

    Returns:
        The averaged softmax probability predictions (mean ensemble).
    """
    model.eval()
    inputs = inputs.to(device)

    # Generate the suite of transforms via module-level factory
    transforms = _get_tta_transforms(is_anatomical, is_texture_based, aug_cfg, resolution)

    # ENSEMBLE EXECUTION: Iterative probability accumulation to save VRAM
    ensemble_probs = None

    with torch.no_grad():
        for t in transforms:
            aug_input = t(inputs)
            logits = model(aug_input)
            probs = F.softmax(logits, dim=1)

            if ensemble_probs is None:
                ensemble_probs = probs
            else:
                ensemble_probs += probs

    # Calculate the mean probability across all augmentation passes
    if ensemble_probs is None:
        raise ValueError("TTA transforms list cannot be empty")
    return ensemble_probs / len(transforms)
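
Example (sketch over a single batch; cfg.augmentation is assumed to hold the AugmentationConfig sub-config):

inputs, targets = next(iter(test_loader))
probs = adaptive_tta_predict(
    model,
    inputs,
    device,
    is_anatomical=True,        # restrict to orientation-preserving transforms
    is_texture_based=False,
    aug_cfg=cfg.augmentation,
    resolution=cfg.dataset.resolution,
)
preds = probs.argmax(dim=1)    # averaged softmax probabilities -> class indices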

plot_confusion_matrix(all_labels, all_preds, classes, out_path, ctx)

Generate and save a row-normalized confusion matrix plot.

Parameters:

  • all_labels (NDArray[Any], required): Ground-truth label array.
  • all_preds (NDArray[Any], required): Predicted label array.
  • classes (list[str], required): Human-readable class label names.
  • out_path (Path, required): Destination file path for the saved figure.
  • ctx (PlotContext, required): PlotContext with architecture and evaluation settings.
Source code in orchard/evaluation/visualization.py
def plot_confusion_matrix(
    all_labels: npt.NDArray[Any],
    all_preds: npt.NDArray[Any],
    classes: list[str],
    out_path: Path,
    ctx: PlotContext,
) -> None:
    """
    Generate and save a row-normalized confusion matrix plot.

    Args:
        all_labels: Ground-truth label array.
        all_preds: Predicted label array.
        classes: Human-readable class label names.
        out_path: Destination file path for the saved figure.
        ctx: PlotContext with architecture and evaluation settings.
    """
    # matplotlib cosmetic — confusion matrix rendering and styling
    with plt.style.context(ctx.plot_style):  # pragma: no mutate
        cm = confusion_matrix(  # pragma: no mutate
            all_labels,
            all_preds,
            labels=np.arange(len(classes)),
            normalize="true",  # pragma: no mutate
        )
        cm = np.nan_to_num(cm)

        disp = ConfusionMatrixDisplay(
            confusion_matrix=cm, display_labels=classes  # pragma: no mutate
        )
        fig, ax = plt.subplots(figsize=(11, 9))  # pragma: no mutate

        disp.plot(  # pragma: no mutate
            ax=ax,
            cmap=ctx.cmap_confusion,
            xticks_rotation=45,  # pragma: no mutate
            values_format=".3f",  # pragma: no mutate
        )  # pragma: no mutate
        plt.title(  # pragma: no mutate
            f"Confusion Matrix — {ctx.arch_name} | Resolution — {ctx.resolution}",
            fontsize=12,  # pragma: no mutate
            pad=20,  # pragma: no mutate
        )

        plt.tight_layout()  # pragma: no mutate

        fig.savefig(out_path, dpi=ctx.fig_dpi, bbox_inches="tight")  # pragma: no mutate
        plt.close()
        logger.info(
            "%s%s %-18s: %s", LogStyle.INDENT, LogStyle.ARROW, "Confusion Matrix", out_path.name
        )
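
Example (sketch using the arrays returned by evaluate_model; the output path is illustrative):

from pathlib import Path

plot_confusion_matrix(
    all_labels=all_labels,
    all_preds=all_preds,
    classes=class_names,
    out_path=Path("figures/confusion_matrix.png"),
    ctx=ctx,
)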

plot_training_curves(train_losses, val_accuracies, out_path, ctx)

Plot training loss and validation accuracy on a dual-axis chart.

Saves the figure to disk and exports raw numerical data as .npz for reproducibility.

Parameters:

  • train_losses (Sequence[float], required): Per-epoch training loss values.
  • val_accuracies (Sequence[float], required): Per-epoch validation accuracy values.
  • out_path (Path, required): Destination file path for the saved figure.
  • ctx (PlotContext, required): PlotContext with architecture and evaluation settings.
Source code in orchard/evaluation/visualization.py
def plot_training_curves(
    train_losses: Sequence[float], val_accuracies: Sequence[float], out_path: Path, ctx: PlotContext
) -> None:
    """
    Plot training loss and validation accuracy on a dual-axis chart.

    Saves the figure to disk and exports raw numerical data as ``.npz``
    for reproducibility.

    Args:
        train_losses: Per-epoch training loss values.
        val_accuracies: Per-epoch validation accuracy values.
        out_path: Destination file path for the saved figure.
        ctx: PlotContext with architecture and evaluation settings.
    """
    # matplotlib cosmetic — colors, fonts, sizes, layout
    with plt.style.context(ctx.plot_style):  # pragma: no mutate
        fig, ax1 = plt.subplots(figsize=(9, 6))  # pragma: no mutate

        # Left Axis: Training Loss
        ax1.plot(train_losses, color="#e74c3c", lw=2, label="Training Loss")  # pragma: no mutate
        ax1.set_xlabel("Epoch")  # pragma: no mutate
        ax1.set_ylabel("Loss", color="#e74c3c", fontweight="bold")  # pragma: no mutate
        ax1.tick_params(axis="y", labelcolor="#e74c3c")  # pragma: no mutate
        ax1.grid(True, linestyle="--", alpha=0.4)  # pragma: no mutate

        # Right Axis: Validation Accuracy
        ax2 = ax1.twinx()  # pragma: no mutate
        ax2.plot(  # pragma: no mutate
            val_accuracies, color="#3498db", lw=2, label="Validation Accuracy"  # pragma: no mutate
        )  # pragma: no mutate
        ax2.set_ylabel("Accuracy", color="#3498db", fontweight="bold")  # pragma: no mutate
        ax2.tick_params(axis="y", labelcolor="#3498db")  # pragma: no mutate

        fig.suptitle(  # pragma: no mutate
            f"Training Metrics — {ctx.arch_name} | Resolution — {ctx.resolution}",
            fontsize=14,  # pragma: no mutate
            y=1.02,  # pragma: no mutate
        )

        fig.tight_layout()  # pragma: no mutate

        plt.savefig(out_path, dpi=ctx.fig_dpi, bbox_inches="tight")  # pragma: no mutate
        logger.info(
            "%s%s %-18s: %s", LogStyle.INDENT, LogStyle.ARROW, "Training Curves", out_path.name
        )

        # Export raw data for post-run analysis
        npz_path = out_path.with_suffix(".npz")
        np.savez(npz_path, train_losses=train_losses, val_accuracies=val_accuracies)
        plt.close()
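
Because the raw curve data is exported alongside the figure, it can be reloaded for later analysis, e.g.:

import numpy as np

data = np.load(out_path.with_suffix(".npz"))
train_losses = data["train_losses"]
val_accuracies = data["val_accuracies"]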

show_predictions(model, loader, device, classes, save_path=None, ctx=None, n=None)

Visualize model predictions on a sample batch.

Coordinates data extraction, model inference, grid layout generation, and image post-processing. Highlights correct (green) vs. incorrect (red) predictions.

Parameters:

  • model (Module, required): Trained model to evaluate.
  • loader (DataLoader[Any], required): DataLoader providing evaluation samples.
  • device (device, required): Target device for inference.
  • classes (list[str], required): Human-readable class label names.
  • save_path (Path | None, default None): Output file path. If None, displays interactively.
  • ctx (PlotContext | None, default None): PlotContext with layout and normalization settings.
  • n (int | None, default None): Number of samples to display. Defaults to ctx.n_samples.
Source code in orchard/evaluation/visualization.py
def show_predictions(
    model: nn.Module,
    loader: DataLoader[Any],
    device: torch.device,
    classes: list[str],
    save_path: Path | None = None,
    ctx: PlotContext | None = None,
    n: int | None = None,
) -> None:
    """
    Visualize model predictions on a sample batch.

    Coordinates data extraction, model inference, grid layout generation,
    and image post-processing. Highlights correct (green) vs. incorrect
    (red) predictions.

    Args:
        model: Trained model to evaluate.
        loader: DataLoader providing evaluation samples.
        device: Target device for inference.
        classes: Human-readable class label names.
        save_path: Output file path. If None, displays interactively.
        ctx: PlotContext with layout and normalization settings.
        n: Number of samples to display. Defaults to ``ctx.n_samples``.
    """
    model.eval()
    # cosmetic fallback
    style = ctx.plot_style if ctx else "seaborn-v0_8-muted"  # pragma: no mutate

    with plt.style.context(style):  # pragma: no mutate
        # 1. Parameter Resolution & Batch Inference
        # cosmetic fallback
        num_samples = n or (ctx.n_samples if ctx else 12)  # pragma: no mutate
        images, labels, preds = _get_predictions_batch(
            model, loader, device, num_samples  # pragma: no mutate
        )

        # 2. Grid & Figure Setup
        # cosmetic fallback
        grid_cols = ctx.grid_cols if ctx else 4  # pragma: no mutate
        _, axes = _setup_prediction_grid(len(images), grid_cols, ctx)  # pragma: no mutate

        # 3. Plotting Loop
        for i, ax in enumerate(axes):
            # guard for extra grid cells beyond actual images
            if i < len(images):  # pragma: no mutate
                _plot_single_prediction(
                    ax, images[i], labels[i], preds[i], classes, ctx  # pragma: no mutate
                )
            ax.axis("off")  # pragma: no mutate

        # 4. Suptitle
        if ctx:
            plt.suptitle(_build_suptitle(ctx), fontsize=14)  # pragma: no mutate

        # 5. Export and Cleanup
        # forwarding; tested in _finalize_figure
        _finalize_figure(plt, save_path, ctx)  # pragma: no mutate
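
Example (sketch; passing save_path writes the grid to disk instead of displaying it interactively):

from pathlib import Path

show_predictions(
    model=model,
    loader=test_loader,
    device=device,
    classes=class_names,
    save_path=Path("figures/sample_predictions.png"),
    ctx=ctx,
    n=8,
)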