# Retries and timeouts

## Retry types

Union.ai allows you to automatically retry failing tasks. This section explains the configuration and application of retries.

Errors causing task failure are categorized into two main types, influencing the retry logic differently:

* `SYSTEM`: These errors arise from infrastructure-related failures, such as hardware malfunctions or network issues.
  They are typically transient and can often be resolved with a retry.

* `USER`: These errors are due to issues in the user-defined code, like a value error or a logic mistake, which usually require code modifications to resolve.

## Configuring retries

Retries in Union.ai are configurable to address both `USER` and `SYSTEM` errors, allowing for tailored fault tolerance strategies:

`USER` error can be handled by setting the `retries` attribute in the task decorator to define how many times a task should retry.
This requires a `FlyteRecoverableException` to be raised in the task definition, any other exception will not be retried:

```python
import random
from flytekit import task
from flytekit.exceptions.user import FlyteRecoverableException

@task(retries=3)
def compute_mean(data: List[float]) -> float:
    if random() < 0.05:
        raise FlyteRecoverableException("Something bad happened 🔥")
    return sum(data) / len(data)
```

## Retrying interruptible tasks

Tasks marked as interruptible can be preempted and retried without counting against the USER error budget.
This is useful for tasks running on preemptible compute resources like spot instances.

See [Interruptible instances](https://www.union.ai/docs/v1/union/user-guide/core-concepts/tasks/task-hardware-environment/retries-and-timeouts/interruptible-instances)

## Retrying map tasks

For map tasks, the interruptible behavior aligns with that of regular tasks. The retries field in the task annotation is not necessary for handling SYSTEM errors, as these are managed by the platform’s configuration. Alternatively, the USER budget is set by defining retries in the task decorator.

See [Map tasks](https://www.union.ai/docs/v1/union/user-guide/core-concepts/tasks/task-hardware-environment/map-tasks).

## Timeouts

To protect against zombie tasks that hang due to system-level issues, you can supply the timeout argument to the task decorator to make sure that problematic tasks adhere to a maximum runtime.

In this example, we make sure that the task is terminated after it’s been running for more that one hour.

```python
from datetime import timedelta

@task(timeout=timedelta(hours=1))
def compute_mean(data: List[float]) -> float:
    return sum(data) / len(data)
```

Notice that the timeout argument takes a built-in Python `timedelta` object.

---
**Source**: https://github.com/unionai/unionai-docs/blob/main/content/user-guide/core-concepts/tasks/task-hardware-environment/retries-and-timeouts.md
**HTML**: https://www.union.ai/docs/v1/union/user-guide/core-concepts/tasks/task-hardware-environment/retries-and-timeouts/
