Data retention policy

Each data plane uses an object store (an AWS S3 bucket, a GCS bucket, or an ABS container) to store the data used in the execution of workflows. As a Union.ai administrator, you can specify retention policies for this data when setting up your data plane.

Data categories

There are three categories of data:

  1. Workflow execution data:
    • Task inputs and outputs (that is, primitive type literals)
    • FlyteFile/FlyteDirectory and other large offloaded data objects (like DataFrames), both in their default locations and in any custom raw-data-prefix locations that may have been specified at execution time
    • Flyte Deck data
    • Artifact data
    • Internal metadata used by Union.ai
  2. Fast-registered code:
    • Local code artifacts that will be copied into the Flyte task container at runtime when using union register or union run --remote --copy-all.
  3. Flyte plugin metadata (for example, Spark history server data).

Object versions are not the same as Union.ai entity versions

The versions discussed here are at the object level and are not related to the versions of workflows, tasks and other Union.ai entities that you see in the Union.ai UI.

How policies are specified

The policies are configured on the object store bucket(s) that you use for Union.ai.

Deletion of current versions

For current versions, deletion due to a retention period running out means moving the object to a non-current version, which we refer to as soft-deletion.

Deletion of non-current versions

For non-current versions, deletion due to a retention period running out means permanent deletion.
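
The two-stage behavior described above can be sketched as a small state function. This is only an illustrative model, not Union.ai code: the function name `object_state`, the use of `None` for "unlimited", and the exact boundary handling are assumptions made for the example. It also assumes the object is written once and never overwritten.

```python
from typing import Optional


def object_state(
    age_days: int,
    current_days: Optional[int],
    noncurrent_days: Optional[int],
) -> str:
    """Return the state of an object version written `age_days` ago.

    `current_days` and `noncurrent_days` are the two retention periods;
    None stands for "unlimited" (auto-deletion disabled). Hypothetical helper.
    """
    # While the current-version retention period has not run out,
    # the object is still the current version.
    if current_days is None or age_days < current_days:
        return "current"
    # After that, the object is soft-deleted: it lives on as a
    # non-current version until the second period runs out.
    if noncurrent_days is None or age_days < current_days + noncurrent_days:
        return "non-current (soft-deleted)"
    # Both periods have run out: the data is gone for good.
    return "permanently deleted"
```

With a current-version period of 30 days and a non-current period of 7 days, an untouched object in this model stays current for 30 days, is soft-deleted for the following 7 days, and is permanently deleted after that.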

Example policy

                      Workflow execution data   Fast-registered code   Flyte plugin metadata
Current version       unlimited                 unlimited              unlimited
Non-current version   7 days                    7 days                 7 days

  • The retention policy for current versions in all categories is unlimited, meaning that auto-deletion is disabled.
    • If you change this to a specified number of days, then auto-deletion will occur after that time period, but because it applies to current versions the data object will be soft-deleted (that is, moved to a non-current version), not permanently deleted.
  • The retention policy for non-current versions in all categories is 7 days, meaning that auto-deletion will occur after 7 days and that the data will be permanently deleted.
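
For readers using AWS S3, a policy of this shape corresponds roughly to a bucket lifecycle rule on a versioned bucket. The sketch below only builds the rule as a plain dictionary. It is not a description of how Union.ai applies the policy. The helper name `lifecycle_rule`, the rule ID, and the empty prefix (which matches the whole bucket) are assumptions for illustration, since the actual per-category prefixes are not given here.

```python
from typing import Any, Dict, Optional


def lifecycle_rule(
    rule_id: str,
    prefix: str,
    current_days: Optional[int],
    noncurrent_days: Optional[int],
) -> Dict[str, Any]:
    """Build one S3 lifecycle rule; None means "unlimited" (no auto-deletion)."""
    rule: Dict[str, Any] = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
    }
    if current_days is not None:
        # On a versioned bucket, expiring the current version adds a delete
        # marker, so the object becomes a non-current version (soft-deletion).
        rule["Expiration"] = {"Days": current_days}
    if noncurrent_days is not None:
        # Non-current versions are removed permanently after this many days.
        rule["NoncurrentVersionExpiration"] = {"NoncurrentDays": noncurrent_days}
    return rule


# The example policy: current versions unlimited, non-current versions 7 days.
# The rule ID and the whole-bucket prefix "" are hypothetical.
example_policy = {
    "Rules": [lifecycle_rule("example-retention", "", None, 7)]
}
```

A dictionary like `example_policy` has the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. GCS and ABS have their own lifecycle management formats, which are not shown here.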

Attempting to access deleted data

If you attempt to access deleted data, you will receive an error:

  • When workflow node input/output data is deleted, the Input/Output tabs in the UI will display a Not Found error.
  • When Deck data is deleted, the Deck view in the UI will display a Not Found error.
  • When artifacts are deleted, the artifacts UI will work, but it will display a URL that points to an artifact that no longer exists.

To remedy these types of errors, you will have to re-run the workflow that generated the data in question.

  • When fast-registered code is deleted, the workflow execution will fail.

To remedy this type of error, you will have to both re-register and re-run the workflow.

Separate sets of policies per cluster

If you have a multi-cluster setup, you can specify a different set of retention policies (one per category) for each cluster.

Data retention and task caching

When data retention is enabled, task caching is adjusted accordingly. To avoid attempts to retrieve cache data that has already been deleted, the maximum age of the cache is always configured to be less than the sum of both retention periods (current and non-current).
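
As a worked example of this constraint, the following sketch computes the upper bound on the cache age from the two retention periods. The function names and the use of `None` for "unlimited" are assumptions made for illustration. The actual adjustment is performed by Union.ai and its implementation is not shown here.

```python
from typing import Optional


def cache_age_upper_bound_days(
    current_days: Optional[int],
    noncurrent_days: Optional[int],
) -> Optional[int]:
    """Sum of both retention periods, or None if either is unlimited."""
    if current_days is None or noncurrent_days is None:
        # With an unlimited period the data is never permanently deleted
        # by the policy, so retention imposes no bound on the cache age.
        return None
    return current_days + noncurrent_days


def is_cache_age_allowed(
    cache_age_days: int,
    current_days: Optional[int],
    noncurrent_days: Optional[int],
) -> bool:
    """True if the cache age is strictly less than the sum of both periods."""
    bound = cache_age_upper_bound_days(current_days, noncurrent_days)
    return bound is None or cache_age_days < bound
```

For instance, with a current-version period of 30 days and a non-current period of 7 days, the bound is 37 days, so a cache age of 36 days is allowed and one of 37 days is not.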