infrastructure_config

`orchard.core.config.infrastructure_config`

Infrastructure & Resource Lifecycle Management.

Operational bridge between the declarative configuration and the physical execution environment. Manages the 'clean-start' and 'graceful-stop' sequences: it terminates stray processes, takes a filesystem lock so that concurrent runs cannot collide, and frees GPU/MPS caches when the run ends.

Key Tasks

Process sanitization: Guards against ghost processes and multi-process collisions in local environments
Environment locking: Mutex strategy for synchronized experimental output access
Resource deallocation: GPU/MPS cache flushing and temporary artifact cleanup

`InfraManagerProtocol`

Bases: Protocol

Protocol defining infrastructure management interface.

Enables dependency injection and mocking in tests while ensuring consistent lifecycle management across implementations.
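
As a rough illustration of the dependency-injection point, the sketch below restates the protocol locally so it runs on its own, then passes a test double to a caller that is typed against the protocol only. `RecordingInfraManager` and `run_experiment` are hypothetical names invented for this example and are not part of the module.

```python
import logging
from typing import Any, Protocol


class InfraManagerProtocol(Protocol):
    """Local restatement of the documented interface, so the sketch runs alone."""

    def prepare_environment(self, cfg: Any, logger: logging.Logger) -> None: ...

    def release_resources(self, cfg: Any, logger: logging.Logger) -> None: ...


class RecordingInfraManager:
    """Hypothetical test double: records calls, touches no processes or locks."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def prepare_environment(self, cfg: Any, logger: logging.Logger) -> None:
        self.calls.append("prepare")

    def release_resources(self, cfg: Any, logger: logging.Logger) -> None:
        self.calls.append("release")


def run_experiment(infra: InfraManagerProtocol, cfg: Any) -> str:
    """Hypothetical caller that depends only on the protocol."""
    log = logging.getLogger("example")
    infra.prepare_environment(cfg, log)
    # ... the experiment body would run here ...
    infra.release_resources(cfg, log)
    return "done"


fake = RecordingInfraManager()
outcome = run_experiment(fake, cfg=None)
```

Because the protocol is structural, the test double does not need to inherit from anything; matching method signatures are enough for a type checker to accept it.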

`prepare_environment(cfg, logger)`

Prepare execution environment before experiment run.

Parameters:

Name	Type	Description	Default
`cfg`	`'HardwareAwareConfig'`	Configuration with hardware manifest access.	required
`logger`	`Logger`	Logger instance for status reporting.	required

Source code in orchard/core/config/infrastructure_config.py

def prepare_environment(self, cfg: "HardwareAwareConfig", logger: logging.Logger) -> None:
    """
    Prepare execution environment before experiment run.

    Args:
        cfg: Configuration with hardware manifest access.
        logger: Logger instance for status reporting.
    """
    ...  # pragma: no cover

`release_resources(cfg, logger)`

Release resources allocated during environment preparation.

Parameters:

Name	Type	Description	Default
`cfg`	`'HardwareAwareConfig'`	Configuration used during resource allocation.	required
`logger`	`Logger`	Logger instance for status reporting.	required

Source code in orchard/core/config/infrastructure_config.py

def release_resources(self, cfg: "HardwareAwareConfig", logger: logging.Logger) -> None:
    """
    Release resources allocated during environment preparation.

    Args:
        cfg: Configuration used during resource allocation.
        logger: Logger instance for status reporting.
    """
    ...  # pragma: no cover

`HardwareAwareConfig`

Bases: Protocol

Structural contract for configurations exposing hardware manifest.

Decouples infrastructure management from concrete Config implementations, enabling type-safe access to hardware execution policies.

Attributes:

Name	Type	Description
`hardware`	`Any`	HardwareConfig instance with device and lock settings.
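
A minimal sketch of what satisfying this contract might look like. The protocol is restated locally and marked `@runtime_checkable` purely so the example can use `isinstance`; whether the real protocol carries that decorator is not stated here. `StubHardware`, `StubConfig`, and the lock path are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable  # assumption: added here only to allow the isinstance check
class HardwareAwareConfig(Protocol):
    hardware: Any


@dataclass
class StubHardware:
    """Hypothetical stand-in carrying the two attributes the manager reads."""

    allow_process_kill: bool = False
    lock_file_path: Path = Path("/tmp/example.lock")  # illustrative path


@dataclass
class StubConfig:
    """Any object with a `hardware` attribute satisfies the contract."""

    hardware: StubHardware = field(default_factory=StubHardware)


cfg = StubConfig()
conforms = isinstance(cfg, HardwareAwareConfig)
```

The runtime check only confirms that a `hardware` attribute exists; it does not validate the attribute's type or its fields.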

`InfrastructureManager`

Bases: BaseModel

Environment safeguarding and resource management executor.

Ensures clean execution environment before runs and proper resource release after, preventing concurrent experiment collisions and GPU memory leaks.

Lifecycle

prepare_environment(): Kill zombies, acquire lock
[Experiment runs]
release_resources(): Release lock, flush caches
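
One way to make sure the release step always follows the experiment is to wrap the lifecycle in a context manager. The sketch below uses a hypothetical `StubManager` in place of the real class so that it runs without the package installed; `managed_environment` is likewise an invented helper, not part of the module.

```python
from contextlib import contextmanager


class StubManager:
    """Hypothetical stand-in exposing the two lifecycle methods."""

    def __init__(self):
        self.events = []

    def prepare_environment(self, cfg, logger=None):
        self.events.append("prepare")

    def release_resources(self, cfg, logger=None):
        self.events.append("release")


@contextmanager
def managed_environment(manager, cfg, logger=None):
    # Prepare outside the try block: if preparation raises, nothing was
    # acquired, so there is nothing to release.
    manager.prepare_environment(cfg, logger)
    try:
        yield manager
    finally:
        # Runs whether the experiment body finished or raised.
        manager.release_resources(cfg, logger)


manager = StubManager()
try:
    with managed_environment(manager, cfg=None):
        manager.events.append("experiment")
        raise RuntimeError("simulated training failure")
except RuntimeError:
    manager.events.append("error propagated")
```

The release step runs before the exception reaches the caller, so a crashed experiment does not leave the lock behind.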

`prepare_environment(cfg, logger=None)`

Prepare execution environment for experiment run.

Performs pre-run cleanup and resource acquisition to ensure isolated, collision-free experiment execution.

Steps

1. Terminate duplicate/zombie processes (only if allow_process_kill=True and the run is not under a shared scheduler such as SLURM/PBS/LSF or a distributed launcher such as torchrun)
2. Acquire filesystem lock to prevent concurrent runs using the same project name
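
The internals of the locking helper are not shown on this page. As a sketch of how a filesystem lock of this kind can work, the example below uses an advisory `fcntl.flock` lock, which is available on Unix-like systems only. `try_acquire_lock` and `release_lock` are hypothetical names and may differ from what the module actually does.

```python
from __future__ import annotations

import fcntl
import os
from pathlib import Path


def try_acquire_lock(lock_file: Path) -> int | None:
    """Return an open file descriptor holding the lock, or None if it is taken."""
    fd = os.open(lock_file, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_lock(fd: int) -> None:
    """Drop the advisory lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

An advisory lock is released by the operating system when the holding process exits, so a crashed run does not leave a stale lock that blocks the next one.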

Parameters:

Name	Type	Description	Default
`cfg`	`HardwareAwareConfig`	Configuration object with hardware.allow_process_kill and hardware.lock_file_path attributes.	required
`logger`	`Logger | None`	Logger for status messages. When None, the 'Infrastructure' logger is used.	`None`

Source code in orchard/core/config/infrastructure_config.py

def prepare_environment(
    self, cfg: HardwareAwareConfig, logger: logging.Logger | None = None
) -> None:
    """
    Prepare execution environment for experiment run.

    Performs pre-run cleanup and resource acquisition to ensure
    isolated, collision-free experiment execution.

    Steps:
        1. Terminate duplicate/zombie processes (if allow_process_kill=True
           and not under a shared scheduler such as SLURM/PBS/LSF or a
           distributed launcher such as torchrun)
        2. Acquire filesystem lock to prevent concurrent runs using
           the same project name

    Args:
        cfg: Configuration object with hardware.allow_process_kill and
            hardware.lock_file_path attributes.
        logger: Logger for status messages. Defaults to 'Infrastructure'.
    """
    log = logger or logging.getLogger(LOGGER_NAME)

    # Process sanitization
    if cfg.hardware.allow_process_kill:
        cleaner = DuplicateProcessCleaner()

        # Skip on shared compute (SLURM, PBS, LSF) or distributed launchers (torchrun)
        is_shared = any(
            env in os.environ
            for env in ("SLURM_JOB_ID", "PBS_JOBID", "LSB_JOBID", "RANK", "LOCAL_RANK")
        )

        if not is_shared:
            num_zombies = cleaner.terminate_duplicates(logger=log)
            log.debug(" %s Duplicate processes terminated: %d.", LogStyle.ARROW, num_zombies)
        else:
            log.debug(" %s Shared environment detected: skipping process kill.", LogStyle.ARROW)

    # Concurrency guard
    ensure_single_instance(lock_file=cfg.hardware.lock_file_path, logger=log)
    log.debug(" %s Lock acquired at %s", LogStyle.ARROW, cfg.hardware.lock_file_path)
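
The environment check in the listing above can be pulled out into a pure function, which makes it straightforward to unit test without modifying `os.environ`. This is a sketch only: `is_shared_environment` and `SHARED_ENV_MARKERS` are hypothetical names, while the variable names themselves are copied from the listing.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Scheduler and launcher markers, as listed in the source above.
SHARED_ENV_MARKERS = ("SLURM_JOB_ID", "PBS_JOBID", "LSB_JOBID", "RANK", "LOCAL_RANK")


def is_shared_environment(environ: Mapping[str, str] | None = None) -> bool:
    """Return True if any scheduler or distributed-launcher marker is set."""
    # Compare against None explicitly so an empty mapping is honored in tests.
    env = os.environ if environ is None else environ
    return any(marker in env for marker in SHARED_ENV_MARKERS)


under_slurm = is_shared_environment({"SLURM_JOB_ID": "12345"})
on_laptop = is_shared_environment({})
```

Only the presence of a variable is checked, not its value, which matches the `env in os.environ` test in the listing.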

`release_resources(cfg, logger=None)`

Release system and hardware resources gracefully after experiment.

Performs cleanup so that resources are freed and available for subsequent runs. A failure to release the lock is logged and then re-raised as OSError, in which case the cache flush is skipped.

Steps

1. Release filesystem lock at cfg.hardware.lock_file_path
2. Flush GPU/MPS memory caches to prevent VRAM fragmentation

Parameters:

Name	Type	Description	Default
`cfg`	`HardwareAwareConfig`	Configuration object with hardware.lock_file_path attribute.	required
`logger`	`Logger | None`	Logger for status messages. When None, the 'Infrastructure' logger is used.	`None`

Raises:

Type	Description
`OSError`	If lock file cannot be released (e.g. permission denied).

Source code in orchard/core/config/infrastructure_config.py

def release_resources(
    self, cfg: HardwareAwareConfig, logger: logging.Logger | None = None
) -> None:
    """
    Release system and hardware resources gracefully after experiment.

    Performs cleanup to ensure resources are properly freed and available
    for subsequent runs. A lock-release failure is logged and re-raised,
    which skips the cache flush.

    Steps:
        1. Release filesystem lock at cfg.hardware.lock_file_path
        2. Flush GPU/MPS memory caches to prevent VRAM fragmentation

    Args:
        cfg: Configuration object with hardware.lock_file_path attribute.
        logger: Logger for status messages. Defaults to 'Infrastructure'.

    Raises:
        OSError: If lock file cannot be released (e.g. permission denied).
    """
    log = logger or logging.getLogger(LOGGER_NAME)

    # Release lock
    try:
        release_single_instance(cfg.hardware.lock_file_path)
        log.info("  %s System lock released", LogStyle.ARROW)
    except OSError as e:
        log.error(" %s Failed to release lock: %s", LogStyle.ARROW, e)
        raise

    # Flush caches
    self._flush_compute_cache(log=log)
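
The body of `_flush_compute_cache` is not shown on this page. Assuming the caches in question are PyTorch's, a flush along these lines is one plausible shape; `flush_compute_cache` and its return values are hypothetical and chosen only to make the sketch easy to inspect.

```python
import logging


def flush_compute_cache(log: logging.Logger) -> str:
    """Free cached accelerator memory if PyTorch and a device are present."""
    try:
        import torch  # assumption: the caches in question are PyTorch's
    except ImportError:
        log.debug("PyTorch not installed: nothing to flush.")
        return "none"

    if torch.cuda.is_available():
        # Returns unused cached blocks to the CUDA driver.
        torch.cuda.empty_cache()
        return "cuda"

    # Older PyTorch builds may lack the MPS backend or the torch.mps module.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and hasattr(torch, "mps"):
        torch.mps.empty_cache()
        return "mps"

    return "cpu"


flushed = flush_compute_cache(logging.getLogger("example"))
```

The import is guarded so that the function degrades to a no-op on machines without PyTorch rather than failing during cleanup.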
