orchard.evaluation
Evaluation and Reporting Package.
This package coordinates model inference, performance visualization, and structured experiment reporting. It provides a unified interface to assess model generalization through standard metrics and Test-Time Augmentation (TTA), while automating the generation of artifacts (plots, reports) for experimental tracking.
PlotContext(arch_name, resolution, fig_dpi, plot_style, cmap_confusion, grid_cols, n_samples, fig_size_predictions, mean=None, std=None, use_tta=False, is_anatomical=True, is_texture_based=True)
dataclass
Immutable snapshot of all configuration fields needed by visualization functions.
from_config(cfg)
classmethod
Build a PlotContext from the full Config object.
Source code in orchard/evaluation/plot_context.py
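The pattern is a frozen dataclass built from the global config. A minimal sketch, with the field list abridged and the `cfg` attribute names assumed (the real `Config` object may nest them differently):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlotContext:
    # Abridged sketch: only three of the documented fields are shown;
    # frozen=True provides the "immutable snapshot" behaviour.
    arch_name: str
    resolution: int
    use_tta: bool = False

    @classmethod
    def from_config(cls, cfg):
        # The attribute names on cfg are assumptions, not the real layout.
        return cls(
            arch_name=cfg.arch_name,
            resolution=cfg.resolution,
            use_tta=getattr(cfg, "use_tta", False),
        )
```

Because the context is a plain dataclass, tests can build it from any object exposing the right attributes (e.g. a `SimpleNamespace`) without constructing a full `Config`.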
TrainingReport
Bases: BaseModel
Validated data container for summarizing a complete training experiment.
This model serves as a schema for the final experimental metadata. It stores hardware, hyperparameter, and performance states to ensure full reproducibility and traceability of the training pipeline.
Attributes:
| Name | Type | Description |
|---|---|---|
| timestamp | str | ISO formatted execution time. |
| architecture | str | Identifier of the architecture used. |
| dataset | str | Name of the dataset. |
| best_val_accuracy | float | Peak accuracy achieved on the validation set. |
| best_val_auc | float | Peak ROC-AUC achieved on the validation set. |
| best_val_f1 | float | Peak macro-averaged F1 on the validation set. |
| test_accuracy | float | Final accuracy on the unseen test set. |
| test_auc | float | Final ROC-AUC on the unseen test set. |
| test_macro_f1 | float | Macro-averaged F1 score (key for imbalanced data). |
| is_texture_based | bool | Whether texture-preserving logic was applied. |
| is_anatomical | bool | Whether anatomical orientation constraints were enforced. |
| use_tta | bool | Indicates if Test-Time Augmentation was active. |
| epochs_trained | int | Total number of optimization cycles completed. |
| learning_rate | float | Initial learning rate used by the optimizer. |
| batch_size | int | Samples processed per iteration. |
| seed | int | Global RNG seed for experiment replication. |
| augmentations | str | Descriptive string of the transformation pipeline. |
| normalization | str | Mean/Std statistics applied to the input tensors. |
| model_path | str | Absolute path to the best saved checkpoint. |
| log_path | str | Absolute path to the session execution log. |
to_vertical_df()
Converts the Pydantic model into a vertical pandas DataFrame.
Returns:
| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: A two-column DataFrame (Parameter, Value) for Excel export. |
Source code in orchard/evaluation/reporting.py
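The conversion is simple to picture. A standalone sketch over a plain dict (the real method iterates the Pydantic model's fields instead):

```python
import pandas as pd


def to_vertical_df(fields: dict) -> pd.DataFrame:
    # One row per report field: a (Parameter, Value) pair, the layout
    # the Excel exporter expects.
    return pd.DataFrame(
        {"Parameter": list(fields), "Value": list(fields.values())}
    )
```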
save(path, fmt='xlsx')
Saves the report to disk in the requested format.
Supported formats:
| Format | Description |
|---|---|
| xlsx | Professional Excel with conditional formatting. |
| csv | Flat CSV (two columns: Parameter, Value). |
| json | Pretty-printed JSON array. |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base file path (suffix is replaced to match fmt). | required |
| fmt | str | Output format: one of 'xlsx', 'csv', or 'json'. | 'xlsx' |
Source code in orchard/evaluation/reporting.py
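The format dispatch can be sketched as a standalone function over a plain dict. This is an assumption-laden simplification: the real method serializes the validated model, defaults to 'xlsx', and adds Excel conditional formatting, all omitted here:

```python
import csv
import json
from pathlib import Path


def save_report(fields: dict, path: Path, fmt: str = "json") -> Path:
    # The suffix is replaced to match fmt, mirroring the documented behaviour.
    out = Path(path).with_suffix(f".{fmt}")
    if fmt == "json":
        out.write_text(json.dumps(fields, indent=2))
    elif fmt == "csv":
        with out.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["Parameter", "Value"])
            writer.writerows(fields.items())
    else:  # xlsx handling (openpyxl styling) is omitted in this sketch
        raise ValueError(f"unsupported format in sketch: {fmt}")
    return out
```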
run_final_evaluation(model, test_loader, train_losses, val_metrics_history, class_names, paths, training, dataset, augmentation, evaluation, arch_name, aug_info='N/A', tracker=None)
Execute the complete evaluation pipeline.
Coordinates full-set inference (with TTA support), visualizes metrics, and generates the final structured report.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Module | Trained model for evaluation (already on target device). | required |
| test_loader | DataLoader[Any] | DataLoader for test set. | required |
| train_losses | list[float] | Training loss history per epoch. | required |
| val_metrics_history | list[Mapping[str, float]] | Validation metrics history per epoch. | required |
| class_names | list[str] | List of class label strings. | required |
| paths | RunPaths | RunPaths for artifact output. | required |
| training | TrainingConfig | Training sub-config (use_tta, hyperparameters for report). | required |
| dataset | DatasetConfig | Dataset sub-config (resolution, metadata, normalization). | required |
| augmentation | AugmentationConfig | Augmentation sub-config (TTA transforms). | required |
| evaluation | EvaluationConfig | Evaluation sub-config (plot flags, report format). | required |
| arch_name | str | Architecture identifier (e.g. …). | required |
| aug_info | str | Augmentation description string for report. | 'N/A' |
| tracker | TrackerProtocol \| None | Optional experiment tracker for final metrics. | None |

Returns:
| Type | Description |
|---|---|
| tuple[float, float, float] | tuple[float, float, float]: A 3-tuple of: … |
Source code in orchard/evaluation/evaluation_pipeline.py
evaluate_model(model, test_loader, device, use_tta=False, is_anatomical=False, is_texture_based=False, aug_cfg=None, resolution=28)
Performs full-set evaluation and coordinates metric calculation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Module | The trained neural network. | required |
| test_loader | DataLoader[Any] | DataLoader for the evaluation set. | required |
| device | device | Hardware target (CPU/CUDA/MPS). | required |
| use_tta | bool | Flag to enable Test-Time Augmentation. | False |
| is_anatomical | bool | Dataset-specific orientation constraint. | False |
| is_texture_based | bool | Dataset-specific texture preservation flag. | False |
| aug_cfg | AugmentationConfig \| None | Augmentation sub-configuration (required for TTA). | None |
| resolution | int | Dataset resolution for TTA intensity scaling. | 28 |

Returns:
| Type | Description |
|---|---|
| tuple[NDArray[Any], NDArray[Any], dict[str, float], float] | tuple[np.ndarray, np.ndarray, dict, float]: A 4-tuple of: … |
Source code in orchard/evaluation/evaluator.py
compute_auc(y_true, y_score)
Compute macro-averaged ROC-AUC with graceful fallback.
Handles binary (positive class probability) and multiclass (OvR)
cases. Returns NaN on failure so callers can distinguish
"computation impossible" from "genuinely zero AUC".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_true | NDArray[Any] | Ground truth class indices, shape … | required |
| y_score | NDArray[Any] | Probability distributions, shape … | required |
Returns:
| Type | Description |
|---|---|
| float | ROC-AUC score, or NaN when the computation is impossible. |
Source code in orchard/trainer/engine.py
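The documented behaviour can be sketched with a rank-based (Mann–Whitney) AUC in plain NumPy. This is illustrative only: the real implementation likely delegates to scikit-learn, and tie handling here is simplified (no midranks):

```python
import numpy as np


def compute_auc(y_true, y_score) -> float:
    """Macro-averaged ROC-AUC with NaN fallback (NumPy-only sketch)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)

    def binary_auc(labels, scores):
        pos, neg = scores[labels == 1], scores[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            return float("nan")  # "computation impossible", not zero AUC
        # Rank-sum formulation of AUC (ties broken arbitrarily).
        ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
        u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
        return u / (len(pos) * len(neg))

    if y_score.ndim == 1:  # binary: positive-class probability
        return float(binary_auc(y_true, y_score))
    # Multiclass: one-vs-rest per class, then unweighted (macro) average.
    aucs = [
        binary_auc((y_true == k).astype(int), y_score[:, k])
        for k in range(y_score.shape[1])
    ]
    if np.all(np.isnan(aucs)):
        return float("nan")
    return float(np.nanmean(aucs))
```

The NaN fallback matters downstream: a report showing NaN signals a degenerate evaluation set, whereas 0.0 would silently read as a catastrophically bad model.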
compute_classification_metrics(labels, preds, probs)
Computes accuracy, macro-averaged F1, and macro-averaged ROC-AUC.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | NDArray[Any] | Ground truth class indices. | required |
| preds | NDArray[Any] | Predicted class indices. | required |
| probs | NDArray[Any] | Softmax probability distributions. | required |

Returns:
| Type | Description |
|---|---|
| dict[str, float] | dict[str, float]: Metric dictionary with keys: … |
Source code in orchard/evaluation/metrics.py
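Accuracy and macro F1 are straightforward to sketch in NumPy. The dict key names below are assumptions, and the ROC-AUC entry (delegated to compute_auc in the real module) is omitted:

```python
import numpy as np


def classification_metrics(labels, preds) -> dict:
    # NumPy-only sketch of the accuracy + macro-F1 portion.
    labels, preds = np.asarray(labels), np.asarray(preds)
    accuracy = float((labels == preds).mean())
    f1_scores = []
    for k in np.unique(labels):
        tp = int(np.sum((preds == k) & (labels == k)))
        fp = int(np.sum((preds == k) & (labels != k)))
        fn = int(np.sum((preds != k) & (labels == k)))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro averaging weights every class equally, which is why it is
    # the headline metric for imbalanced datasets.
    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}
```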
create_structured_report(val_metrics, test_metrics, macro_f1, train_losses, best_path, log_path, arch_name, dataset, training, aug_info=None)
Constructs a TrainingReport object using final metrics and configuration.
This factory method aggregates disparate pipeline results into a single validated container, resolving paths and extracting augmentation summaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| val_metrics | Sequence[Mapping[str, float]] | History of per-epoch validation metric dicts. | required |
| test_metrics | dict[str, float] | Final test-set metric dict (accuracy, auc, etc.). | required |
| macro_f1 | float | Final macro F1 score on test set. | required |
| train_losses | Sequence[float] | History of per-epoch training losses. | required |
| best_path | Path | Path to the saved model weights. | required |
| log_path | Path | Path to the run log file. | required |
| arch_name | str | Architecture identifier (e.g. …). | required |
| dataset | DatasetConfig | Dataset sub-config with metadata, name, and normalization info. | required |
| training | TrainingConfig | Training sub-config with hyperparameters and flags. | required |
| aug_info | str \| None | Pre-formatted augmentation string. | None |

Returns:
| Name | Type | Description |
|---|---|---|
| TrainingReport | TrainingReport | A validated Pydantic model ready for export. |
Source code in orchard/evaluation/reporting.py
adaptive_tta_predict(model, inputs, device, is_anatomical, is_texture_based, aug_cfg, resolution)
Performs Test-Time Augmentation (TTA) inference on a batch of inputs.
Applies a set of standard augmentations in addition to the original input.
Predictions from all augmented versions are averaged in the probability space.
If is_anatomical is True, it restricts augmentations to orientation-preserving
transforms. If is_texture_based is True, it disables destructive pixel-level
noise/blur to preserve local patterns. The tta_mode config field controls
ensemble complexity (full vs light) independently of hardware.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Module | The trained PyTorch model. | required |
| inputs | Tensor | The batch of test images. | required |
| device | device | The device to run the inference on. | required |
| is_anatomical | bool | Whether the dataset has fixed anatomical orientation. | required |
| is_texture_based | bool | Whether the dataset relies on high-frequency textures. | required |
| aug_cfg | AugmentationConfig | Augmentation sub-configuration with TTA parameters. | required |
| resolution | int | Dataset resolution for TTA intensity scaling. | required |

Returns:
| Type | Description |
|---|---|
| Tensor | The averaged softmax probability predictions (mean ensemble). |
Source code in orchard/evaluation/tta.py
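The core of the scheme, view selection gated by the two dataset flags followed by probability-space averaging, can be sketched without PyTorch. The helper names and the specific transforms below are illustrative, not the real transform pipeline:

```python
import numpy as np


def select_views(x, is_anatomical: bool, is_texture_based: bool):
    # The original input is always part of the ensemble.
    views = [x]
    if not is_anatomical:
        # Flips are only safe when the dataset has no fixed orientation.
        views.append(np.flip(x, axis=-1))
    if not is_texture_based:
        # A mild deterministic intensity shift stands in here for the
        # noise/blur transforms that texture-based datasets must skip.
        views.append(np.clip(x + 0.01, 0.0, 1.0))
    return views


def tta_mean_ensemble(predict_fn, views):
    # Average in probability space: softmax outputs, not raw logits.
    return np.mean([predict_fn(v) for v in views], axis=0)
```

Averaging probabilities rather than logits keeps each view's contribution bounded, so one overconfident augmented view cannot dominate the ensemble.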
plot_confusion_matrix(all_labels, all_preds, classes, out_path, ctx)
Generate and save a row-normalized confusion matrix plot.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_labels | NDArray[Any] | Ground-truth label array. | required |
| all_preds | NDArray[Any] | Predicted label array. | required |
| classes | list[str] | Human-readable class label names. | required |
| out_path | Path | Destination file path for the saved figure. | required |
| ctx | PlotContext | PlotContext with architecture and evaluation settings. | required |
Source code in orchard/evaluation/visualization.py
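The matrix behind the plot is raw counts with each row divided by its class total, so every row of the rendered heatmap sums to 1. A NumPy sketch, without the matplotlib rendering:

```python
import numpy as np


def row_normalized_confusion(labels, preds, n_classes: int):
    labels, preds = np.asarray(labels), np.asarray(preds)
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (labels, preds), 1)  # count (true, predicted) pairs
    row_totals = cm.sum(axis=1, keepdims=True)
    # Guard against classes absent from the test set: empty rows stay zero
    # instead of producing a division-by-zero warning.
    return np.divide(cm, row_totals, out=np.zeros_like(cm), where=row_totals > 0)
```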
plot_training_curves(train_losses, val_accuracies, out_path, ctx)
Plot training loss and validation accuracy on a dual-axis chart.
Saves the figure to disk and exports raw numerical data as .npz
for reproducibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| train_losses | Sequence[float] | Per-epoch training loss values. | required |
| val_accuracies | Sequence[float] | Per-epoch validation accuracy values. | required |
| out_path | Path | Destination file path for the saved figure. | required |
| ctx | PlotContext | PlotContext with architecture and evaluation settings. | required |
Source code in orchard/evaluation/visualization.py
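The .npz side channel can be sketched on its own; the array key names below are assumptions, not the ones the real helper writes:

```python
import numpy as np
from pathlib import Path


def export_curve_data(train_losses, val_accuracies, out_path) -> Path:
    # Save the raw curves next to the figure so results can be replotted
    # later without rerunning training.
    npz_path = Path(out_path).with_suffix(".npz")
    np.savez(
        npz_path,
        train_loss=np.asarray(train_losses),
        val_accuracy=np.asarray(val_accuracies),
    )
    return npz_path
```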
show_predictions(model, loader, device, classes, save_path=None, ctx=None, n=None)
Visualize model predictions on a sample batch.
Coordinates data extraction, model inference, grid layout generation, and image post-processing. Highlights correct (green) vs. incorrect (red) predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Module | Trained model to evaluate. | required |
| loader | DataLoader[Any] | DataLoader providing evaluation samples. | required |
| device | device | Target device for inference. | required |
| classes | list[str] | Human-readable class label names. | required |
| save_path | Path \| None | Output file path. If None, displays interactively. | None |
| ctx | PlotContext \| None | PlotContext with layout and normalization settings. | None |
| n | int \| None | Number of samples to display. Defaults to … | None |