Data retention policy
Data categories
- Workflow execution data:
  - Task inputs and outputs (that is, primitive type literals).
  - `FlyteFile`/`FlyteDirectory` and other large offloaded data objects (like `DataFrame`s), both in their default locations and in any custom `raw-data-prefix` locations that may have been specified at execution time.
  - Flyte `Deck` data.
  - Artifact data.
  - Internal metadata used by Union.ai.
- Fast-registered code:
  - Local code artifacts that will be copied into the Flyte task container at runtime when using `union register` or `union run --remote --copy-all`.
- Flyte plugin metadata (for example, Spark history server data).
The versions discussed below are at the object-store level and are not related to the versions of workflows, tasks, and other Union.ai entities that you see in the Union.ai UI.
How policies are specified
Retention policies are configured on the object store bucket(s) that you use for Union.ai; a minimal AWS example is sketched after the list below.
- For AWS S3 buckets, use S3 Lifecycle policies.
- For GCP GCS buckets, use Object Lifecycle Management.
- For Azure Blob Storage, use lifecycle management policies.
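As an illustration, here is a minimal sketch of how the non-current-version rule from the example policy below could be applied to an S3 bucket with boto3. The bucket name and rule ID are hypothetical placeholders; the exact bucket and any prefix scoping depend on your Union.ai data plane configuration.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; substitute the bucket backing your Union.ai data plane.
BUCKET = "my-union-data-bucket"

# Lifecycle rule: keep current versions indefinitely (no Expiration action) and
# permanently delete non-current versions 7 days after they become non-current.
# Non-current versions only exist on buckets with versioning enabled.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent-after-7-days",  # hypothetical rule ID
                "Filter": {"Prefix": ""},  # applies to the whole bucket
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
            }
        ]
    },
)
```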
Deletion of current versions
For current versions, deletion due to expiration of the retention period means the object is moved to a non-current version, which we refer to as soft-deletion.
Deletion of non-current versions
For non-current versions, deletion due to expiration of the retention period means permanent deletion.
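For concreteness, here is a small sketch (with a hypothetical bucket and key) of how you could distinguish the two states on an S3 bucket: a soft-deleted object shows up as a delete marker on top of one or more non-current versions, while a permanently deleted object has no versions left at all.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key; substitute the actual object you are inspecting.
BUCKET = "my-union-data-bucket"
KEY = "example/offloaded-object.parquet"

resp = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
versions = resp.get("Versions", [])
delete_markers = resp.get("DeleteMarkers", [])

if any(m["IsLatest"] for m in delete_markers):
    # The current version is a delete marker: the data has been soft-deleted
    # but still exists as one or more non-current versions.
    print("soft-deleted; non-current versions remaining:", len(versions))
elif not versions:
    # No versions and no delete markers: the data has been permanently deleted.
    print("permanently deleted")
else:
    print("current version still present")
```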
Example policy
| | Workflow execution data | Fast-registered code | Flyte plugin metadata |
|---|---|---|---|
| Current version | unlimited | unlimited | unlimited |
| Non-current version | 7 days | 7 days | 7 days |
- The retention policy for current versions in all categories is unlimited, meaning that auto-deletion is disabled.
  - If you change this to a specified number of days, auto-deletion will occur after that time period, but because it applies to current versions the data object will be soft-deleted (that is, moved to a non-current version), not permanently deleted.
- The retention policy for non-current versions in all categories is 7 days, meaning that auto-deletion will occur after 7 days and that the data will be permanently deleted.
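The same example policy can be expressed on a GCS bucket. Below is a minimal sketch using the google-cloud-storage client, assuming a reasonably recent version that supports the `days_since_noncurrent_time` lifecycle condition; the bucket name is a hypothetical placeholder.

```python
from google.cloud import storage

client = storage.Client()

# Hypothetical bucket name; substitute the bucket backing your Union.ai data plane.
# Object Versioning must be enabled on the bucket for non-current versions to exist.
bucket = client.get_bucket("my-union-data-bucket")

# No delete rule for live (current) versions, so their retention stays unlimited.
# Permanently delete non-current versions 7 days after they become non-current.
bucket.add_lifecycle_delete_rule(days_since_noncurrent_time=7)
bucket.patch()
```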
Attempting to access deleted data
If you attempt to access deleted data, you will receive an error:
- When workflow node input/output data is deleted, the Input/Output tabs in the UI will display a Not Found error.
- When `Deck` data is deleted, the `Deck` view in the UI will display a Not Found error.
- When artifacts are deleted, the artifacts UI will still work, but it will display a URL that points to an artifact that no longer exists.

To remedy these types of errors, you will have to re-run the workflow that generated the data in question.

- When fast-registered code data is deleted, the workflow execution will fail.

To remedy this type of error, you will have to both re-register and re-run the workflow.
Separate sets of policies per cluster
If you have a multi-cluster setup, you can specify a different set of retention policies (one per category) for each cluster.
Data retention and task caching
When enabling data retention, task caching will be adjusted accordingly. To avoid attempts to retrieve cached data that has already been deleted, the maximum age of the cache will always be configured to be less than the sum of the current-version and non-current-version retention periods. For example, with a current-version retention of 30 days and a non-current-version retention of 7 days, the cache age would be kept below 37 days.