Framework Design

Usage

# Install
pip install orchard-ml

# Generate a default recipe (all config options with defaults)
orchard init

# Run training
orchard run recipe.yaml

# Override any config value on the fly
orchard run recipe.yaml --set training.epochs=20 --set training.seed=99

# Use a built-in recipe
orchard run recipes/config_mini_cnn.yaml
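
The `--set` overrides above use dotted keys to address nested config values. A minimal sketch of how such overrides could be applied to a nested dictionary is shown below. The helper names `parse_value` and `apply_overrides` are hypothetical and are not taken from the Orchard codebase; the real CLI presumably routes values through the Pydantic models instead.

```python
from typing import Any


def parse_value(raw: str) -> Any:
    """Best-effort conversion of a CLI string to bool, int, float, or str."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply 'a.b.c=value' strings to a nested dict (hypothetical helper)."""
    for item in overrides:
        dotted_key, sep, raw_value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return config


# Mirrors: orchard run recipe.yaml --set training.epochs=20 --set training.seed=99
recipe = {"training": {"epochs": 5, "seed": 42}}
apply_overrides(recipe, ["training.epochs=20", "training.seed=99"])
```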

Core Features

RootOrchestrator -- Lifecycle Management

The RootOrchestrator (orchard/core/orchestrator.py) is the central coordinator for every ML experiment. It implements a 7-phase initialization protocol inside a context manager: phases 1-6 run eagerly on entry, and phase 7 is deferred to the CLI. This guarantees deterministic setup and automatic resource cleanup:

| Phase | Responsibility | Key Detail |
|---|---|---|
| 1. Determinism | Global RNG seeding | Python, NumPy, PyTorch (strict mode optional) |
| 2. Runtime Configuration | CPU thread affinity, system libraries | matplotlib backend, library silencing |
| 3. Filesystem Provisioning | Dynamic workspace via `RunPaths` | BLAKE2b-hashed directories |
| 4. Logging Initialization | File-based persistent logging | Hot-swap from STDOUT to RotatingFileHandler |
| 5. Config Persistence | YAML manifest export | Full audit trail in workspace |
| 6. Infrastructure Guarding | OS-level resource locks (`flock`) | Prevents concurrent run collisions |
| 7. Environment Reporting | Comprehensive telemetry | Hardware, dataset metadata, policies |
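
To illustrate phase 3, the sketch below derives a short, stable directory name from a config using BLAKE2b. It is only an assumption about how hashed workspaces might be built: the function names, the `run_` prefix, the digest length, and the `logs`/`checkpoints`/`reports` subdirectories are all hypothetical, and the actual `RunPaths` layout may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_dir_name(config: dict, prefix: str = "run", digest_bytes: int = 6) -> str:
    """Derive a short, stable directory name from the config contents."""
    # Canonical JSON so that key order does not change the hash.
    payload = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=digest_bytes).hexdigest()
    return f"{prefix}_{digest}"


def provision_workspace(base: Path, config: dict) -> Path:
    """Create the run directory and its (hypothetical) subdirectories."""
    root = base / run_dir_name(config)
    for sub in ("logs", "checkpoints", "reports"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


# Same content, different key order: both map to the same directory name.
config_a = {"model": "mini_cnn", "training": {"epochs": 20, "seed": 42}}
config_b = {"training": {"seed": 42, "epochs": 20}, "model": "mini_cnn"}

with tempfile.TemporaryDirectory() as tmp:
    workspace = provision_workspace(Path(tmp), config_a)
    created = sorted(p.name for p in workspace.iterdir())
```

Hashing a canonical serialization means two runs with identical settings resolve to the same name, while any changed value yields a different one.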

Design qualities:

- Context Manager pattern: `with RootOrchestrator(cfg) as orch:` guarantees cleanup even on failure -- lock release, handler flush, resource teardown
- Full Dependency Injection: Every external dependency (infra manager, reporter, seed setter, device resolver, ...) is injectable, enabling complete testability without side effects
- Protocol-Based Abstractions: InfraManagerProtocol, ReporterProtocol, TimeTrackerProtocol provide type-safe interfaces for mocking
- Idempotent Initialization: Guarded by _initialized flag -- safe to call multiple times without orphaned directories or lock leaks
- Device Caching: get_device() resolves and caches the optimal compute device (CUDA/CPU/MPS) once, avoiding repeated detection overhead
- Rank-Aware Phase Gating: Injectable rank parameter enables DDP/torchrun awareness -- rank 0 executes phases 1-6 plus device resolution, non-main ranks execute only phases 1-2 (seeding + threads), skipping filesystem provisioning, logging, and infrastructure locks

from pathlib import Path
from orchard import Config, RootOrchestrator

cfg = Config.from_recipe(Path("recipes/config_mini_cnn.yaml"))
with RootOrchestrator(cfg) as orch:
    device = orch.get_device()
    paths = orch.paths
    logger = orch.run_logger
    # Pipeline execution with guaranteed cleanup
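
The design qualities listed above (context-manager cleanup, injected dependencies, the idempotency guard, and rank gating) can be shown together in a toy model. This is a sketch under stated assumptions, not the real RootOrchestrator: `MiniOrchestrator`, `InfraManager`, `RecordingInfra`, and the phase labels are hypothetical names invented for illustration.

```python
from typing import Protocol


class InfraManager(Protocol):
    """Structural interface: any object with these methods can be injected."""

    def acquire(self) -> None: ...
    def release(self) -> None: ...


class RecordingInfra:
    """Test double that records calls instead of touching the OS."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def acquire(self) -> None:
        self.calls.append("acquire")

    def release(self) -> None:
        self.calls.append("release")


class MiniOrchestrator:
    def __init__(self, infra: InfraManager, rank: int = 0) -> None:
        self.infra = infra
        self.rank = rank
        self.phases_run: list[str] = []
        self._initialized = False

    def initialize(self) -> None:
        if self._initialized:  # idempotency guard: repeat calls are no-ops
            return
        self.phases_run += ["seed", "runtime"]  # phases 1-2 run on every rank
        if self.rank == 0:  # phases 3-6 run on the main rank only
            self.phases_run += ["filesystem", "logging", "persist_config"]
            self.infra.acquire()
            self.phases_run.append("guard")
        self._initialized = True

    def __enter__(self) -> "MiniOrchestrator":
        self.initialize()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if self.rank == 0 and self._initialized:
            self.infra.release()
        return False  # never swallow the caller's exception


infra = RecordingInfra()
try:
    with MiniOrchestrator(infra) as orch:
        orch.initialize()  # second call does nothing
        raise RuntimeError("simulated training failure")
except RuntimeError:
    pass  # the lock was still released by __exit__

worker_infra = RecordingInfra()
with MiniOrchestrator(worker_infra, rank=1) as worker:
    pass
```

Because the infrastructure manager is injected, the test double verifies the acquire/release ordering without creating any lock file.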

Distributed Training (Phase 0 -- Preparatory): The orchestrator, guards, and infrastructure manager are rank-aware via orchard.core.environment.distributed. Single-instance locks and duplicate process cleanup are automatically skipped on non-main ranks and in distributed environments. Full DDP support (barrier synchronization, path broadcasting, per-rank device assignment) is planned for a future release and has not yet been validated in multi-GPU training.
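
As a rough illustration of rank awareness, the sketch below reads the `RANK` and `WORLD_SIZE` environment variables that torchrun sets for each worker. The function names are hypothetical and do not necessarily match those in orchard.core.environment.distributed; the gating rule simply mirrors the description above (guards are skipped on non-main ranks and in distributed runs).

```python
import os
from typing import Mapping, Optional


def _env(env: Optional[Mapping[str, str]]) -> Mapping[str, str]:
    """Use the supplied mapping (handy in tests) or the process environment."""
    return os.environ if env is None else env


def get_rank(env: Optional[Mapping[str, str]] = None) -> int:
    # torchrun exports RANK; a plain single-process run has no such variable.
    return int(_env(env).get("RANK", "0"))


def get_world_size(env: Optional[Mapping[str, str]] = None) -> int:
    return int(_env(env).get("WORLD_SIZE", "1"))


def is_main_process(env: Optional[Mapping[str, str]] = None) -> bool:
    return get_rank(env) == 0


def is_distributed(env: Optional[Mapping[str, str]] = None) -> bool:
    return get_world_size(env) > 1


def should_guard_workspace(env: Optional[Mapping[str, str]] = None) -> bool:
    """Locks and process cleanup only make sense for a lone main process."""
    return is_main_process(env) and not is_distributed(env)
```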

Configuration Engine (SSOT)

Built on Pydantic V2, the configuration system acts as a Single Source of Truth, transforming raw inputs (CLI/YAML) into an immutable, type-safe execution blueprint:

Late-Binding Metadata Injection: Dataset specifications (normalization constants, class mappings) are resolved from a centralized registry at instantiation time
Cross-Domain Validation: Post-construction logic guards prevent unstable states (e.g., enforcing RGB input for pretrained weights, validating AMP compatibility)
Path Portability: Automatic serialization converts absolute paths to environment-agnostic anchors for cross-platform reproducibility
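
The validation and path-portability ideas can be sketched with the standard library alone. Note the substitution: this example uses a frozen dataclass as a stand-in for a Pydantic V2 model so that it runs without third-party packages. The field names, the validation rules, and the `${PROJECT_ROOT}` anchor token are illustrative assumptions, not Orchard's actual schema.

```python
from dataclasses import dataclass
from pathlib import Path

PROJECT_ROOT_ANCHOR = "${PROJECT_ROOT}"  # hypothetical placeholder token


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class ModelSettings:
    in_channels: int = 3
    pretrained: bool = False
    use_amp: bool = False
    device: str = "cpu"

    def __post_init__(self) -> None:
        # Cross-field checks run once, right after construction.
        if self.pretrained and self.in_channels != 3:
            raise ValueError("pretrained weights expect 3-channel (RGB) input")
        if self.use_amp and self.device == "cpu":
            # Illustrative rule for this sketch only.
            raise ValueError("this sketch only allows AMP on an accelerator")


def to_portable(path: Path, project_root: Path) -> str:
    """Replace the project-root prefix with an environment-agnostic anchor."""
    try:
        relative = path.relative_to(project_root)
    except ValueError:
        return path.as_posix()  # outside the project: leave untouched
    return f"{PROJECT_ROOT_ANCHOR}/{relative.as_posix()}"


valid = ModelSettings(in_channels=3, pretrained=True)

rejected = False
try:
    ModelSettings(in_channels=1, pretrained=True)
except ValueError:
    rejected = True

portable = to_portable(Path("/proj/data/x.npz"), Path("/proj"))
external = to_portable(Path("/other/file.npz"), Path("/proj"))
```

In a Pydantic V2 model the same checks would typically live in an after-construction validator, with immutability enabled through the model configuration.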

Infrastructure Guard Layer

The InfrastructureManager bridges declarative configs with physical hardware (used by RootOrchestrator in Phase 6):

Mutual Exclusion via flock: Kernel-level advisory locking ensures only one training instance per workspace (prevents VRAM race conditions)
Process Sanitization: psutil wrapper identifies and terminates ghost Python processes
HPC-Aware Safety: Auto-detects cluster schedulers (SLURM/PBS/LSF) and suspends aggressive process cleanup to preserve multi-user stability
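
The locking and scheduler-detection behavior described above can be approximated as follows. This is a POSIX-only sketch: `try_lock`, `unlock`, `detect_scheduler`, and the `orchard.lock` file name are hypothetical, and the environment variables checked (`SLURM_JOB_ID`, `PBS_JOBID`, `LSB_JOBID`) are the ones those schedulers commonly export, which may not be exactly what InfrastructureManager inspects.

```python
import fcntl
import os
import tempfile
from pathlib import Path
from typing import IO, Mapping, Optional

SCHEDULER_ENV_VARS = {"SLURM_JOB_ID": "slurm", "PBS_JOBID": "pbs", "LSB_JOBID": "lsf"}


def detect_scheduler(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the scheduler name if a job ID variable is present, else None."""
    env = os.environ if env is None else env
    for var, name in SCHEDULER_ENV_VARS.items():
        if env.get(var):
            return name
    return None


def try_lock(lock_path: Path) -> Optional[IO[str]]:
    """Return an open handle holding an exclusive lock, or None if it is taken."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (BlockingIOError, PermissionError):
        handle.close()
        return None
    return handle


def unlock(handle: IO[str]) -> None:
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


with tempfile.TemporaryDirectory() as tmp:
    lock_file = Path(tmp) / "orchard.lock"
    first = try_lock(lock_file)
    # A second open() creates a separate open file description, so it contends.
    second_blocked = try_lock(lock_file) is None
    unlock(first)
    third = try_lock(lock_file)
    reacquired = third is not None
    unlock(third)
```

The kernel drops a `flock` lock when the holding process exits, so a crashed run does not leave a stale lock behind.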

Reproducibility Design

Dual-Layer Strategy (enforced by RootOrchestrator Phase 1):

1. Standard Mode: Global seeding (seed 42) with performance-optimized algorithms
2. Strict Mode: Bit-perfect reproducibility via:
   - torch.use_deterministic_algorithms(True)
   - worker_init_fn for multi-process RNG synchronization
   - Automatic fallback to num_workers=0 when determinism is critical
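
A minimal sketch of the two modes follows, assuming hypothetical helper names (`seed_everything`, `seed_worker`, `effective_num_workers`). NumPy and PyTorch are imported lazily so the snippet also runs where they are not installed; the worker seeding follows the recipe from PyTorch's reproducibility notes. Orchard's own seed setter may be organized differently.

```python
import os
import random


def seed_everything(seed: int = 42, strict: bool = False) -> None:
    """Seed Python, NumPy, and PyTorch; optionally enforce deterministic ops."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    if strict:
        # Some cuBLAS routines need this setting to behave deterministically.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False


def seed_worker(worker_id: int) -> None:
    """Candidate worker_init_fn: derive per-worker seeds from torch's base seed."""
    import numpy as np
    import torch
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def effective_num_workers(requested: int, strict: bool) -> int:
    """In strict mode, fall back to single-process data loading."""
    return 0 if strict else requested


seed_everything(123)
first_draw = [random.random() for _ in range(3)]
seed_everything(123)
second_draw = [random.random() for _ in range(3)]
```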

Data Integrity Validation:

- MD5 checksum verification for dataset downloads
- validate_npz_keys structural integrity checks before memory allocation
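
Both checks can be sketched with the standard library. An `.npz` file is a zip archive whose members are named `<key>.npy`, so the keys can be listed without loading any array. The helpers `md5_of` and `missing_npz_keys` and the key names in the demo are hypothetical; the signature of the real `validate_npz_keys` is not shown in this document.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large downloads are never fully in memory."""
    digest = hashlib.md5(usedforsecurity=False)  # integrity check, not security
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_npz_keys(path: Path, required: set[str]) -> set[str]:
    """Return required array names absent from the archive, without loading data."""
    with zipfile.ZipFile(path) as archive:
        present = {name[:-4] for name in archive.namelist() if name.endswith(".npy")}
    return required - present


with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / "blob.bin"
    blob.write_bytes(b"hello")
    checksum = md5_of(blob)

    # Build a stand-in archive with the same member naming as np.savez.
    npz_path = Path(tmp) / "dataset.npz"
    with zipfile.ZipFile(npz_path, "w") as archive:
        archive.writestr("train_images.npy", b"placeholder")
        archive.writestr("train_labels.npy", b"placeholder")

    missing = missing_npz_keys(npz_path, {"train_images", "train_labels", "test_images"})
```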

Performance Optimization

Hybrid RAM Management:

- Small Datasets: Full RAM caching for maximum throughput
- Large Datasets: Indexed slicing to prevent OOM errors

Dynamic Path Anchoring:

- "Search-up" logic locates project root via markers (.git, README.md)
- Ensures absolute path stability regardless of invocation directory
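
The "search-up" logic can be sketched as a walk from a starting directory through its parents until a marker is found. The function name `find_project_root` is hypothetical, and the error handling when no marker exists is an assumption.

```python
import tempfile
from pathlib import Path

ROOT_MARKERS = (".git", "README.md")


def find_project_root(start: Path, markers: tuple[str, ...] = ROOT_MARKERS) -> Path:
    """Return the nearest ancestor of `start` (inclusive) containing a marker."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no project root marker found above {start}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "project"
    nested = root / "src" / "pkg"
    nested.mkdir(parents=True)
    (root / "README.md").write_text("demo")

    found_from_nested = find_project_root(nested)
    found_from_root = find_project_root(root)
    expected_root = root
```

Resolving every path against this root, rather than the current working directory, is what keeps results stable regardless of where the command is invoked.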

Intelligent Hyperparameter Search

Optuna Integration Features:

- TPE Sampling: Tree-structured Parzen Estimator for efficient search space exploration
- Median Pruning: Early stopping of underperforming trials (30-50% time savings)
- Persistent Studies: SQLite storage enables resume-from-checkpoint
- Type-Safe Constraints: All search spaces respect Pydantic validation bounds
- Auto-Visualization: Parameter importance plots, optimization history, parallel coordinates
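
To make the median-pruning idea concrete, here is a simplified pure-Python version of the rule. It is written in the spirit of Optuna's median pruner but is not its implementation: it compares only the current intermediate value against the median of earlier trials at the same step. The function name, the `history` format, and the startup threshold are assumptions for illustration.

```python
from statistics import median


def should_prune(
    current_value: float,
    step: int,
    history: list[list[float]],
    n_startup_trials: int = 5,
    direction: str = "minimize",
) -> bool:
    """Decide whether a trial looks worse than the typical earlier trial.

    `history` holds one list of intermediate values per finished trial,
    indexed by step.
    """
    peers = [trial[step] for trial in history if len(trial) > step]
    if len(peers) < n_startup_trials:
        return False  # not enough evidence yet, let the trial continue
    threshold = median(peers)
    if direction == "minimize":
        return current_value > threshold
    return current_value < threshold


# Validation loss per epoch for three earlier trials.
history = [[0.9, 0.5], [0.8, 0.4], [0.7, 0.6]]

prune_bad = should_prune(0.55, step=1, history=history, n_startup_trials=3)
prune_good = should_prune(0.45, step=1, history=history, n_startup_trials=3)
prune_early = should_prune(0.99, step=1, history=history)  # default needs 5 peers
```

At step 1 the earlier trials reported 0.5, 0.4, and 0.6, so the median is 0.5: a loss of 0.55 is pruned while 0.45 continues.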

System Design

The framework implements Separation of Concerns (SoC) with six core layers:

┌─────────────────────────────────────────────────────────────────┐
│                      RootOrchestrator                           │
│              (Lifecycle Manager & Context)                      │
│                                                                 │
│  Responsibilities:                                              │
│  • Phase 1-7 initialization sequence                            │
│  • Resource acquisition & cleanup (Context Manager)             │
│  • Device resolution & caching                                  │
└────────────┬─────────────────────────┬──────────────────────────┘
             │                         │
             │ uses                    │ uses
             │                         │
    ┌────────▼──────────┐     ┌────────▼───────────────┐
    │                   │     │                        │
    │  Config Engine    │     │  InfrastructureManager │
    │  (Pydantic V2)    │     │  (flock/psutil)        │
    │                   │     │                        │
    │  • Type safety    │     │  • Process cleanup     │
    │  • Validation     │     │  • Kernel locks        │
    │  • Metadata       │     │  • HPC detection       │
    │    injection      │     │                        │
    └───────────────────┘     └────────────────────────┘
             │
             │ provides config to
             │
    ┌────────▼───────────────────────────────────────────────┐
    │                                                        │
    │              Execution Pipeline                        │
    │                                                        │
    │  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
    │  │   Data   │  │  Arch.   │  │ Trainer  │              │
    │  │ Handler  │→ │ Factory  │→ │  Engine  │              │
    │  └──────────┘  └──────────┘  └────┬─────┘              │
    │                                   │                    │
    │                             ┌─────▼──────┐             │
    │                             │ Evaluation │             │
    │                             │  Pipeline  │             │
    │                             └────────────┘             │
    │                                                        │
    └──────────────────────────┬─────────────────────────────┘
                               │
            ┌──────────────────┼──────────────────┐
            │                  │                  │
            ▼ optional         ▼ alternative      ▼ alternative
    ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
    │ Tracking Engine  │  │  Optimization    │  │  Export Engine   │
    │    (MLflow)      │  │  Engine (Optuna) │  │      (ONNX)      │
    │                  │  │                  │  │                  │
    │ • Metric logging │  │ • Study mgmt     │  │ • Checkpoint load│
    │ • Param capture  │  │ • Trial exec     │  │ • ONNX convert   │
    │ • Run comparison │  │ • Pruning        │  │ • Validation     │
    │ • Local SQLite   │  │ • Visualization  │  │ • Benchmarking   │
    └──────────────────┘  └──────────────────┘  └──────────────────┘

Key Design Principles:

Orchestrator owns both Config and InfrastructureManager
Config is the SSOT: all modules receive it as a dependency
InfrastructureManager is a stateless utility for OS-level operations
Execution pipeline is linear: Data → Model → Training → Eval
Optimization wraps the entire pipeline for each trial
Export consumes saved checkpoints for production deployment
Tracking is opt-in and side-effect-free: disabled by default, logs to local SQLite via MLflow when enabled

Dependency Graph

System Dependency Graph

Generated via pydeps. Highlights the centralized Config hub and modular design.

Regenerate:

pydeps orchard \
    --cluster \
    --max-bacon=0 \
    --max-module-depth=5 \
    --only orchard \
    --exclude-exact orchard.data_handler.diagnostic \
    --noshow \
    -T svg \
    -o docs/framework_map.svg

Requirements: pydeps + Graphviz (sudo apt install graphviz or brew install graphviz)