flytekitplugins.kfpytorch.error_handling
Handle errors in elastic training jobs.
Directory
Methods
| Method | Description |
|---|---|
| `create_recoverable_error_file()` | Create a file to signal to the agent process that an exception in the worker process is recoverable. |
| `is_recoverable_worker_error()` | Check if the error in the worker process is recoverable. |
Variables
| Property | Type | Description |
|---|---|---|
| `RECOVERABLE_ERROR_FILE_NAME` | `str` | |
Methods
create_recoverable_error_file()
def create_recoverable_error_file()

Create a file to signal to the agent process that an exception in the worker process is recoverable.
Torch’s elastic_launch gives the agent process access to exceptions raised in the worker
processes only as strings in an error file. Instead of parsing this error file in the agent process for
the string FlyteRecoverableException - which would not detect exceptions inheriting from
FlyteRecoverableException - we create a file in the worker process to signal to the agent process
that the exception is recoverable. The file is created in the directory where the default
torch elastic error file is written.
Raises:
ValueError: If the environment variable TORCHELASTIC_ERROR_FILE is not set.
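
The signalling mechanism described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the plugin's implementation: `write_recoverable_marker` and `MARKER_NAME` are hypothetical names, and the marker file name `recoverable_error` is taken from the description of `is_recoverable_worker_error()`. That this name equals the value of `RECOVERABLE_ERROR_FILE_NAME` is an assumption.

```python
import os

# Assumed marker name: the documentation says the agent looks for a file
# named "recoverable_error" next to the torch elastic error file.
MARKER_NAME = "recoverable_error"


def write_recoverable_marker() -> str:
    """Create an empty marker file in the directory of the torch elastic error file.

    Hypothetical helper for illustration only. Returns the marker path.
    """
    error_file = os.environ.get("TORCHELASTIC_ERROR_FILE")
    if error_file is None:
        raise ValueError("TORCHELASTIC_ERROR_FILE is not set")

    marker_path = os.path.join(os.path.dirname(error_file), MARKER_NAME)
    # Opening in write mode creates the file; its contents do not matter,
    # only its existence is used as the signal.
    with open(marker_path, "w"):
        pass
    return marker_path
```

A worker process would call such a helper from the `except` branch that catches a recoverable exception, before re-raising, so that subclasses of the exception are covered without any string parsing in the agent.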
is_recoverable_worker_error()
def is_recoverable_worker_error(
    failure,
) -> bool

Check if the error in the worker process is recoverable.
The error is considered recoverable if the directory containing the torch elastic error file contains
a file named recoverable_error.
| Parameter | Type |
|---|---|
| `failure` | |
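
The check on the agent side can be sketched in the same spirit. How the real function derives the error file directory from `failure` is not described here, so this sketch takes the error file path directly. `has_recoverable_marker` and `MARKER_NAME` are hypothetical names used only for illustration.

```python
import os

# Assumed marker name, taken from the description above.
MARKER_NAME = "recoverable_error"


def has_recoverable_marker(error_file: str) -> bool:
    """Return True if a marker file exists in the same directory as error_file.

    Hypothetical helper: the real function receives a failure object rather
    than a path.
    """
    marker_path = os.path.join(os.path.dirname(error_file), MARKER_NAME)
    return os.path.exists(marker_path)
```

Because the decision depends only on whether the marker file exists, the agent never needs to inspect the exception text written by the worker.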