Cluster recommendations
Union.ai is capable of running on any Kubernetes cluster. This includes managed Kubernetes services such as Google Kubernetes Engine (GKE), Azure Kubernetes Service (AKS), and Amazon Elastic Kubernetes Service (EKS), as well as self-managed Kubernetes clusters.
While many configurations are supported, we have some recommendations to ensure the best performance and reliability of your Union deployment.
Kubernetes Versions
We recommend running Kubernetes versions that are actively supported by the Kubernetes community. This typically means running one of the most recent three minor versions. For example, if the most recent version is 1.32, we recommend running 1.32, 1.31, or 1.30.
Networking Requirements
Many Container Network Interface (CNI) plugins require planning for IP address allocation capacity. For example, Amazon’s VPC CNI and GKE’s Dataplane v2 allocate IP addresses to Kubernetes Pods out of one or more or your VPC’s subnets. If you are using one of these CNI plugins, you should ensure that your VPC’s subnets have enough available IP addresses to support the number of concurrent tasks you expect to run.
We recommend using at least a /16
CIDR range (65,536 addresses), you may optionally sub-divide this range into smaller subnets to
support multiple availability zones or other network segmentation requirements.
In short, you should aim to have at least 1 IP address available for each task you expect to run concurrently.
Performance Recommendations
Node Pools
It is recommended but not required to use separate node pools for the Union services and the Union worker pods. This allows you to guard against resource contention between Union services and other tasks running in your cluster. You can find additional information in the Configuring Node Pools section.
AWS
S3
Each data plane uses an object store (an AWS S3 bucket, GCS bucket or ABS container) that is used to store data used in the execution of workflows. As a Union.ai administrator, you can specify retention policies for this data when setting up your data plane (learn more about the different types of data categories stored by the data plane.)
Union recommends the use of two S3 buckets
- metadata bucket: contains workflow execution data such as Task inputs and outputs, etc
- fast registration bucket: contain local code artifacts that will be copied into the Flyte task container at runtime when using
union register
orunion run --remote --copy-all
.
Note: You can choose to use a single bucket in your dataplane
Data Retention
Union recommends using Lifecycle Policy on these buckets to manage the storage costs. See Data retention policy for more information.
IAM
You will need to enable access to your S3 buckets from the cluster.
- Update the EKS Node IAM role for your cluster to allow the data plane nodes to use your S3 buckets. This can be done by creating and attaching a new IAM policy which enables access to your S3 buckets. Use
union-flyte-worker
as the name of the new policy. The permissions for the policy will be:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Statement1",
"Effect": "Allow",
"Action": [
"s3:DeleteObject*",
"s3:GetObject*",
"s3:ListBucket",
"s3:PutObject*"
],
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/*"
]
}
]
}
- Attach this policy to your node group IAM role
- Create an IAM OIDC provider for your EKS cluster.
- Create a new role named
union-flyte-role
to enable applications in a Pod’s containers to make API requests to AWS services using AWS Identity and Access Management (IAM) permissions.
The Trust Policy for this role will be:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::$account_id:oidc-provider/$oidc_provider"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringLike": {
"$oidc_provider:aud": "sts.amazonaws.com",
"$oidc_provider:sub": "system:serviceaccount:*:*"
}
}
}
]
}
where $account_id
is your AWS account ID and $oidc_provider
is the OIDC provider you created above.
NOTE: You can obtain these values using the AWS CLI:
aws eks describe-cluster --region $cloud_region --name $cluster_name --query "cluster.identity.oidc.issuer" --output text
- Attach the
union-flyte-worker
policy created above to this new role.
EKS configuration
Union recommends installing the following EKS add-ons:
- CoreDNS
- Amazon VPC CNI
- Kube-proxy
Union supports Autoscaling and the use of spot (interruptible) instances.
AKS
Secure access
Union recommends using Microsoft Entra Workload ID to securely access Azure resources.
Ensure your AKS cluster is enabled as OIDC Issuer
Backend services
- Create a User Assigned Managed Identity with Federated Credentials that map to the following Kubernetes Service Accounts:
Subject Identifier
system:serviceaccount:<NAMESPACE>:flytepropeller-system
system:serviceaccount:<NAMESPACE>:flytepropeller-webhook-system
system:serviceaccount:<NAMESPACE>:operator-system
system:serviceaccount:<NAMESPACE>:proxy-system
Where <NAMESPACE>
is where you plan to install the Union operator (union
by default)
- Assign the
Storage Blob Data Owner
role to this Identity at the Storage Account level.
Workers
This is the Identity that the Pods created for each execution will use to access Azure resources. Those Pods use the default
K8s Service Account on each project-domain namespace, unless otherwise specified.
- Create a User Assigned Managed Identity with Federated Credentials that map to the
default
K8s Service Account:
Subject Identifier
-
system:serviceaccount:development:default
-
system:serviceaccount:staging:default
-
system:serviceaccount:production:default
-
Assign the
Storage Blob Data Owner
role to this Identity at the Storage Account level.
Node pools
By default, the Union installation request the following resources:
CPU (vCPUs) | Memory (GiB) | |
---|---|---|
Requests | 14 | 27.1 |
Limits | 17 | 32 |
For GPU access, Union injects tolerations and label selectors to execution Pods. Learn more and validate your AKS nodes have matching labels and/or taints.