2.5.1

TorchRun

Package: flyte.clustered

TorchRun launcher configuration for a ClusteredTaskEnvironment.

Parameters

class TorchRun(
    rdzv_backend: Literal['static', 'c10d'],
    max_restarts: int,
)
| Parameter | Type | Description |
|---|---|---|
| rdzv_backend | Literal['static', 'c10d'] | Rendezvous backend. “static” (default) relies on JobSet-level restarts; “c10d” enables in-job elastic recovery via a TCPStore on rank-0. |
| max_restarts | int | Number of in-pod torchrun restarts allowed before the pod itself fails. Distinct from JobSet-level max_restarts on ClusterFailurePolicy. |
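
To make the two parameters more concrete, here is a minimal sketch of how they might correspond to the `--rdzv-backend` and `--max-restarts` options of the `torchrun` command line. It does not import `flyte.clustered`. `TorchRunSketch` and `torchrun_args` are hypothetical names made up for this illustration, and the mapping itself is an assumption, not a description of how Flyte builds the launcher command. A real invocation would also need node and process counts and, for “c10d”, a rendezvous endpoint.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class TorchRunSketch:
    """Hypothetical local stand-in mirroring the two documented fields.

    This is NOT the flyte.clustered.TorchRun class; it only reuses the
    parameter names and types from the signature above.
    """

    rdzv_backend: Literal["static", "c10d"]
    max_restarts: int

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so check the value explicitly.
        if self.rdzv_backend not in ("static", "c10d"):
            raise ValueError(f"unsupported rdzv_backend: {self.rdzv_backend!r}")


def torchrun_args(cfg: TorchRunSketch, script: str) -> List[str]:
    """Build a partial torchrun argv from the sketch config (assumed mapping)."""
    return [
        "torchrun",
        f"--rdzv-backend={cfg.rdzv_backend}",
        f"--max-restarts={cfg.max_restarts}",
        script,
    ]


if __name__ == "__main__":
    # "c10d" with three in-pod restarts before the pod itself fails.
    cfg = TorchRunSketch(rdzv_backend="c10d", max_restarts=3)
    print(" ".join(torchrun_args(cfg, "train.py")))
```

Keeping the in-pod restart count on the launcher config, separate from any JobSet-level policy, mirrors the distinction drawn in the table: the former governs retries inside a running pod, the latter governs what happens once a pod has failed.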