infrastructure_config

orchard.core.config.infrastructure_config

Infrastructure & Resource Lifecycle Management.

Operational bridge between the declarative configuration and the physical execution environment. Manages the 'clean-start' and 'graceful-stop' sequences: it frees hardware resources between runs and prevents concurrent-run collisions via filesystem locks.

Key Tasks
  • Process sanitization: Guards against ghost processes and multi-process collisions in local environments
  • Environment locking: Mutex strategy for synchronized experimental output access
  • Resource deallocation: GPU/MPS cache flushing and temporary artifact cleanup

InfraManagerProtocol

Bases: Protocol

Protocol defining infrastructure management interface.

Enables dependency injection and mocking in tests while ensuring consistent lifecycle management across implementations.
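As a minimal sketch of the dependency-injection and mocking use case, the test double below records calls instead of touching processes or locks. `InfraManagerLike`, `RecordingInfraManager`, and `train` are hypothetical names introduced for illustration; only the two method names and their `(cfg, logger)` parameters come from the protocol documented here.

```python
import logging
from typing import Any, Protocol


class InfraManagerLike(Protocol):
    """Local stand-in mirroring the two methods of InfraManagerProtocol."""

    def prepare_environment(self, cfg: Any, logger: logging.Logger) -> None: ...

    def release_resources(self, cfg: Any, logger: logging.Logger) -> None: ...


class RecordingInfraManager:
    """Hypothetical test double: records calls, performs no real work."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def prepare_environment(self, cfg: Any, logger: logging.Logger) -> None:
        self.calls.append("prepare")

    def release_resources(self, cfg: Any, logger: logging.Logger) -> None:
        self.calls.append("release")


def train(infra: InfraManagerLike, cfg: Any, logger: logging.Logger) -> str:
    """Hypothetical caller that depends only on the protocol, not a concrete class."""
    infra.prepare_environment(cfg, logger)
    outcome = "done"  # placeholder for the actual experiment
    infra.release_resources(cfg, logger)
    return outcome


fake = RecordingInfraManager()
result = train(fake, cfg=object(), logger=logging.getLogger("test"))
```

Because the protocol is structural, `RecordingInfraManager` does not need to inherit from anything; matching method signatures is enough for a type checker to accept it.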

prepare_environment(cfg, logger)

Prepare execution environment before experiment run.

Parameters:

Name    Type                   Description                                   Default
cfg     'HardwareAwareConfig'  Configuration with hardware manifest access.  required
logger  Logger                 Logger instance for status reporting.         required

Source code in orchard/core/config/infrastructure_config.py
def prepare_environment(self, cfg: "HardwareAwareConfig", logger: logging.Logger) -> None:
    """
    Prepare execution environment before experiment run.

    Args:
        cfg: Configuration with hardware manifest access.
        logger: Logger instance for status reporting.
    """
    ...  # pragma: no cover

release_resources(cfg, logger)

Release resources allocated during environment preparation.

Parameters:

Name    Type                   Description                                     Default
cfg     'HardwareAwareConfig'  Configuration used during resource allocation.  required
logger  Logger                 Logger instance for status reporting.           required

Source code in orchard/core/config/infrastructure_config.py
def release_resources(self, cfg: "HardwareAwareConfig", logger: logging.Logger) -> None:
    """
    Release resources allocated during environment preparation.

    Args:
        cfg: Configuration used during resource allocation.
        logger: Logger instance for status reporting.
    """
    ...  # pragma: no cover

HardwareAwareConfig

Bases: Protocol

Structural contract for configurations exposing hardware manifest.

Decouples infrastructure management from concrete Config implementations, enabling type-safe access to hardware execution policies.

Attributes:

Name      Type  Description
hardware  Any   HardwareConfig instance with device and lock settings.
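To illustrate the structural contract, the sketch below shows a plain dataclass satisfying a `hardware`-attribute protocol without inheriting from it. `HardwareAwareLike`, `FakeHardware`, and `FakeConfig` are hypothetical stand-ins; the field names `allow_process_kill` and `lock_file_path` are taken from this page, while their types and defaults are assumptions. The `runtime_checkable` decorator is added here only to make the check visible at runtime; whether the real protocol uses it is not stated.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class HardwareAwareLike(Protocol):
    """Local stand-in for HardwareAwareConfig: anything exposing `hardware`."""

    hardware: Any


@dataclass
class FakeHardware:
    # Field names follow the attributes referenced on this page;
    # the types and default values are assumptions for the example.
    allow_process_kill: bool = False
    lock_file_path: Path = Path("example.lock")


@dataclass
class FakeConfig:
    hardware: FakeHardware


cfg = FakeConfig(hardware=FakeHardware())

# Structural match: FakeConfig never subclasses the protocol.
matches = isinstance(cfg, HardwareAwareLike)
# An object without a `hardware` attribute does not match.
rejects = not isinstance(object(), HardwareAwareLike)
```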

InfrastructureManager

Bases: BaseModel

Environment safeguarding and resource management executor.

Ensures clean execution environment before runs and proper resource release after, preventing concurrent experiment collisions and GPU memory leaks.

Lifecycle
  1. prepare_environment(): Kill zombies, acquire lock
  2. [Experiment runs]
  3. release_resources(): Release lock, flush caches
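One way to guarantee that step 3 runs even when the experiment fails is to wrap the lifecycle in a context manager. This is a sketch under assumptions: `managed_environment` and `NoOpInfra` are hypothetical helpers, not part of the module, and `NoOpInfra` only mimics the two method signatures of InfrastructureManager.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class NoOpInfra:
    """Hypothetical stand-in exposing the same two lifecycle methods."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def prepare_environment(self, cfg: Any, logger: Any = None) -> None:
        self.events.append("prepare")

    def release_resources(self, cfg: Any, logger: Any = None) -> None:
        self.events.append("release")


@contextmanager
def managed_environment(infra: Any, cfg: Any, logger: Any = None) -> Iterator[None]:
    """Run prepare before the block and release after it, even on error."""
    infra.prepare_environment(cfg, logger)
    try:
        yield
    finally:
        infra.release_resources(cfg, logger)


infra = NoOpInfra()
try:
    with managed_environment(infra, cfg=object()):
        infra.events.append("run")
        raise RuntimeError("simulated experiment failure")
except RuntimeError:
    pass  # the failure propagates, but release has already happened
```

With a real manager, the same wrapper would release the lock and flush caches on both the success and failure paths.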

prepare_environment(cfg, logger=None)

Prepare execution environment for experiment run.

Performs pre-run cleanup and resource acquisition to ensure isolated, collision-free experiment execution.

Steps
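  1. Terminate duplicate/zombie processes (only if allow_process_kill=True and the run is neither in a shared compute environment such as SLURM/PBS/LSF nor under a distributed launcher that sets RANK or LOCAL_RANK, e.g. torchrun)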
  2. Acquire filesystem lock to prevent concurrent runs using the same project name

Parameters:

Name    Type                 Description                                                                                    Default
cfg     HardwareAwareConfig  Configuration object with hardware.allow_process_kill and hardware.lock_file_path attributes.  required
logger  Logger | None        Logger for status messages. Defaults to 'Infrastructure'.                                      None

Source code in orchard/core/config/infrastructure_config.py
def prepare_environment(
    self, cfg: HardwareAwareConfig, logger: logging.Logger | None = None
) -> None:
    """
    Prepare execution environment for experiment run.

    Performs pre-run cleanup and resource acquisition to ensure
    isolated, collision-free experiment execution.

    Steps:
        1. Terminate duplicate/zombie processes (if allow_process_kill=True
           and not in shared compute environment like SLURM/PBS/LSF)
        2. Acquire filesystem lock to prevent concurrent runs using
           the same project name

    Args:
        cfg: Configuration object with hardware.allow_process_kill and
            hardware.lock_file_path attributes.
        logger: Logger for status messages. Defaults to 'Infrastructure'.
    """
    log = logger or logging.getLogger(LOGGER_NAME)

    # Process sanitization
    if cfg.hardware.allow_process_kill:
        cleaner = DuplicateProcessCleaner()

        # Skip on shared compute (SLURM, PBS, LSF) or distributed launchers (torchrun)
        is_shared = any(
            env in os.environ
            for env in ("SLURM_JOB_ID", "PBS_JOBID", "LSB_JOBID", "RANK", "LOCAL_RANK")
        )

        if not is_shared:
            num_zombies = cleaner.terminate_duplicates(logger=log)
            log.debug(" %s Duplicate processes terminated: %d.", LogStyle.ARROW, num_zombies)
        else:
            log.debug(" %s Shared environment detected: skipping process kill.", LogStyle.ARROW)

    # Concurrency guard
    ensure_single_instance(lock_file=cfg.hardware.lock_file_path, logger=log)
    log.debug(" %s Lock acquired at %s", LogStyle.ARROW, cfg.hardware.lock_file_path)
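The shared-environment check in the listing above can be isolated into a small, testable function. The sketch below uses the same five environment variable names as the source; `SHARED_ENV_MARKERS` and `is_shared_environment` are hypothetical names, and the optional `environ` parameter is an addition that lets tests pass a plain dict instead of mutating `os.environ`.

```python
import os
from typing import Mapping, Optional

# Scheduler job IDs (SLURM, PBS, LSF) plus the rank variables
# that distributed launchers such as torchrun export.
SHARED_ENV_MARKERS = ("SLURM_JOB_ID", "PBS_JOBID", "LSB_JOBID", "RANK", "LOCAL_RANK")


def is_shared_environment(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if any marker variable is present, regardless of its value."""
    env = os.environ if environ is None else environ
    return any(marker in env for marker in SHARED_ENV_MARKERS)


on_slurm = is_shared_environment({"SLURM_JOB_ID": "12345"})
under_torchrun = is_shared_environment({"LOCAL_RANK": "0"})
on_laptop = is_shared_environment({"HOME": "/home/user"})
```

The check tests for presence only, so an empty-string value still counts as a shared environment.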

release_resources(cfg, logger=None)

Release system and hardware resources gracefully after experiment.

Performs cleanup so that resources are freed and available for subsequent runs. A failure to release the lock is logged and re-raised as OSError; in that case the cache flush in step 2 is not reached.

Steps
  1. Release filesystem lock at cfg.hardware.lock_file_path
  2. Flush GPU/MPS memory caches to prevent VRAM fragmentation
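Step 2 could look roughly like the sketch below. This is not the module's `_flush_compute_cache`, whose body is not shown on this page; `flush_compute_cache` is a hypothetical function that assumes PyTorch is the compute backend and degrades to a no-op when it is not installed.

```python
import gc


def flush_compute_cache() -> str:
    """Release cached accelerator memory and report which backend was flushed."""
    # Collect unreachable Python objects first so the tensors they hold can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return "none"  # assumption: without PyTorch there is nothing to flush

    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"

    # Older PyTorch builds may lack the MPS backend or torch.mps, so probe defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and hasattr(torch, "mps"):
        torch.mps.empty_cache()
        return "mps"

    return "none"


backend = flush_compute_cache()
```

Emptying the cache returns unused cached blocks to the device; memory still referenced by live tensors is unaffected.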

Parameters:

Name    Type                 Description                                                   Default
cfg     HardwareAwareConfig  Configuration object with hardware.lock_file_path attribute.  required
logger  Logger | None        Logger for status messages. Defaults to 'Infrastructure'.     None

Raises:

Type     Description
OSError  If lock file cannot be released (e.g. permission denied).

Source code in orchard/core/config/infrastructure_config.py
def release_resources(
    self, cfg: HardwareAwareConfig, logger: logging.Logger | None = None
) -> None:
    """
    Release system and hardware resources gracefully after experiment.

    Performs cleanup to ensure resources are properly freed and available
    for subsequent runs. A failure to release the lock is logged and
    re-raised, so the cache flush is skipped in that case.

    Steps:
        1. Release filesystem lock at cfg.hardware.lock_file_path
        2. Flush GPU/MPS memory caches to prevent VRAM fragmentation

    Args:
        cfg: Configuration object with hardware.lock_file_path attribute.
        logger: Logger for status messages. Defaults to 'Infrastructure'.

    Raises:
        OSError: If lock file cannot be released (e.g. permission denied).
    """
    log = logger or logging.getLogger(LOGGER_NAME)

    # Release lock
    try:
        release_single_instance(cfg.hardware.lock_file_path)
        log.info("  %s System lock released", LogStyle.ARROW)
    except OSError as e:
        log.error(" %s Failed to release lock: %s", LogStyle.ARROW, e)
        raise

    # Flush caches
    self._flush_compute_cache(log=log)
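For readers unfamiliar with filesystem locks, the sketch below shows one common single-instance technique: atomic exclusive file creation. It is an illustration only and not necessarily how `ensure_single_instance` and `release_single_instance` work, since their implementations are not shown here; `acquire_lock` and `release_lock` are hypothetical names.

```python
import os
import tempfile
from pathlib import Path


def acquire_lock(lock_file: Path) -> bool:
    """Create the lock file atomically; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one caller can win.
        fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(lock_file: Path) -> None:
    """Remove the lock file; a missing file is treated as already released."""
    lock_file.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "experiment.lock"
    first = acquire_lock(lock)   # no lock yet: succeeds
    second = acquire_lock(lock)  # lock still held: refused
    release_lock(lock)
    third = acquire_lock(lock)   # released: succeeds again
    release_lock(lock)
```

A lock file created this way survives a crash of its owner, which is one reason a pre-run cleanup step is useful alongside it.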