Deploy the data plane

If you have not yet set up the required Nebius resources (MK8s cluster, Object Storage bucket, service account, access key), see Prepare infrastructure first.

Assumptions

  • You have a Union.ai organization, and you know the control plane URL for your organization.
  • You have a cluster name provided by or coordinated with Union.
  • You have a Nebius Managed Kubernetes cluster running one of the three most recent minor Kubernetes versions.
  • You have a Nebius Object Storage bucket, service account, and access key as described in Prepare infrastructure.

Prerequisites

  • Install Helm 3.
  • Install uctl.
  • Install the flyte CLI (used later to run a sample workflow).
  • Install the Nebius CLI and authenticate with nebius profile create.

Deploy the Union.ai operator

  1. Fetch credentials for the Nebius MK8s cluster where you want to deploy the data plane, and point KUBECONFIG at the resulting kubeconfig file:

    nebius mk8s cluster get-credentials --id <CLUSTER_ID> --external
    export KUBECONFIG=<PATH_TO_KUBECONFIG>
  2. Configure the Union CLI and provision data plane resources:

    uctl config init --host=<ORG_NAME>.union.ai
    uctl selfserve provision-dataplane-resources --clusterName <CLUSTER_NAME> --provider metal

    The command generates a YAML values file specific to the metal provider, including the secrets necessary so your data plane can communicate with Union’s control plane.

  3. Update the generated values file with your Nebius-specific storage configuration. Replace the placeholders with your actual credentials and settings.

    host: <ORG_NAME>.union.ai
    clusterName: <CLUSTER_NAME>
    orgName: <ORG_NAME>
    provider: metal
    
    storage:
      accessKey: <YOUR_BUCKET_ACCESS_KEY>
      bucketName: <YOUR_STORAGE_BUCKET_NAME>
      endpoint: https://storage.<REGION>.nebius.cloud
      fastRegistrationBucketName: <YOUR_STORAGE_BUCKET_NAME>
      provider: compat
      region: <REGION>
      secretKey: <YOUR_BUCKET_SECRET_KEY>
    
    secrets:
      admin:
        create: true
        clientId: <CLIENT_ID>
        clientSecret: <CLIENT_SECRET>

    The uctl selfserve provision-dataplane-resources command in step 2 generates the <CLIENT_ID> and <CLIENT_SECRET> values and writes them to the values file. Don’t modify them.

  4. Add the Union.ai Helm repo:

    helm repo add unionai https://unionai.github.io/helm-charts/
    helm repo update
  5. Install the data plane. Replace <PATH_TO_VALUES_FILE> with the path to the Helm values file you customized in step 3.

    helm upgrade --install unionai-dataplane unionai/dataplane \
      --namespace union --create-namespace \
      --values <PATH_TO_VALUES_FILE> \
      --timeout 10m
  6. Verify the pods are running:

    kubectl get pods -n union

    When the deployment succeeds, all pods show a Running status, including union-operator-proxy, union-operator-buildkit, and executor.

  7. Verify the cluster is registered with the control plane:

    uctl get cluster

    The output is similar to the following:

    NAME            ORG       STATE          HEALTH
    union-nebius    my-org    STATE_ENABLED  HEALTHY
  8. Create an API key for your organization. This is required for v2 workflow executions on the data plane. If you have already created one, rerun the same command to propagate the key to the new cluster:

    uctl create apikey --keyName EAGER_API_KEY --org <ORG_NAME>

    If you receive a PermissionDenied error, contact Union.ai support to have the permission enabled for your organization.
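
Before running the helm upgrade command in step 5, it can help to confirm that no placeholder from step 3 is left in the values file. The following is a minimal Python sketch, not part of the Union.ai tooling. The function name find_placeholders and the sample text are hypothetical, and the sketch assumes placeholders follow the <UPPER_CASE> convention used on this page.

```python
import re

# Matches tokens such as <REGION> or <YOUR_BUCKET_SECRET_KEY>.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*>")


def find_placeholders(values_text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs still present in the text."""
    hits = []
    for number, line in enumerate(values_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Hypothetical excerpt of a values file with one placeholder left in.
sample = """\
storage:
  bucketName: my-bucket
  endpoint: https://storage.<REGION>.nebius.cloud
  region: eu-north1
"""

for number, token in find_placeholders(sample):
    print(f"line {number}: replace {token}")
```

To check a real file, read it with pathlib.Path(...).read_text() and pass the text to find_placeholders. An empty result means no <UPPER_CASE> tokens remain; it does not validate the values themselves.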

GPU node configuration (Nebius-specific)

Follow these steps to run GPU workloads on Nebius:

  1. Ensure the NVIDIA device plugin is installed and your task definitions request GPU resources. Nebius MK8s pre-installs the NVIDIA GPU operator on GPU node groups, so no additional setup is typically required. Learn more about how to add nodes with GPUs to a cluster.

  2. Configure the Union backend to inject the required tolerations and label selectors so that only tasks that require GPUs are scheduled on GPU-enabled nodes:

    1. Identify the node(s) that have GPU devices available:

      kubectl get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}{"\n"}{end}'
    2. Get the labels of a GPU node:

      kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | jq

      Nebius nodes typically include a label that identifies the instance type. For example, for a node with NVIDIA H200 GPUs:

      beta.kubernetes.io/instance-type=gpu-h200-sxm
    3. If the GPU device supports MIG partitions, the node typically also has a label indicating the partition profile. For example:

      nvidia.com/gpu-partition-size: 2g.35gb
  3. Update your Helm values file with the information gathered in the previous steps:

    # all the existing content of your values file
    ...
    
    # ADD
    config:
      k8s:
        plugins:
          k8s:
            gpu-device-node-label: "beta.kubernetes.io/instance-type"
            accelerator-devices:
              - H200: "gpu-h200-sxm"
            gpu-partition-size-node-label: "nvidia.com/gpu-partition-size"
  4. Update your installed release:

    helm upgrade unionai-dataplane unionai/dataplane \
      --namespace union \
      --values <PATH_TO_VALUES_FILE> \
      --timeout 10m
  5. After completing the preceding steps, request GPU devices or MIG partitions directly in the Flyte task definition:

    from flyte import Resources
    
    @env.task(resources=Resources(gpu="H200:1", memory="64Gi"))
    def train_model(...):
        ...
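
The mapping from node labels (step 2) to the Helm values fragment (step 3) can be sketched in Python. This is an illustration under stated assumptions, not a Union.ai utility. The function name gpu_plugin_config is hypothetical, the label keys are the ones shown in the examples above, and the device name (H200 here) is something you choose to match the accelerator you request in your tasks.

```python
import json

# Label keys taken from the examples on this page.
INSTANCE_TYPE_LABEL = "beta.kubernetes.io/instance-type"
PARTITION_LABEL = "nvidia.com/gpu-partition-size"


def gpu_plugin_config(labels: dict[str, str], device_name: str) -> dict:
    """Build the config.k8s.plugins.k8s mapping from one node's labels."""
    if INSTANCE_TYPE_LABEL not in labels:
        raise KeyError(f"node has no {INSTANCE_TYPE_LABEL} label")
    plugin = {
        "gpu-device-node-label": INSTANCE_TYPE_LABEL,
        "accelerator-devices": [{device_name: labels[INSTANCE_TYPE_LABEL]}],
    }
    # Only reference the partition label when the node actually carries it.
    if PARTITION_LABEL in labels:
        plugin["gpu-partition-size-node-label"] = PARTITION_LABEL
    return {"config": {"k8s": {"plugins": {"k8s": plugin}}}}


# Example input shaped like the label output of the kubectl command in step 2.
node_labels = {
    "beta.kubernetes.io/instance-type": "gpu-h200-sxm",
    "nvidia.com/gpu-partition-size": "2g.35gb",
}
fragment = gpu_plugin_config(node_labels, "H200")
print(json.dumps(fragment, indent=2))
```

The sketch prints JSON for inspection only. Copy the resulting keys and values into your YAML values file by hand, keeping the rest of the file unchanged.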

Working with the Nebius Container Registry

Flyte executions bundle your code and run it inside a container in the Nebius MK8s cluster. The contents of the image include the flyte package, your task code, and any other dependency your workflow requires.

Flyte automates the image build and uses a layered mechanism to detect changes efficiently. You can decide where to store the images. This section covers the configuration for storing your container images in Nebius Container Registry.

  1. Obtain a long-lived token from Nebius as described in Working in a CI/CD environment.

  2. Copy the static key token value from the previous step (it usually starts with v1...) and store it in a shell variable:

    TOKEN='v1.CmQK...'
  3. Encode it into a docker config file (replace the registry region accordingly):

    cat > docker-config-nebius.json <<EOF
    {
      "auths": {
        "cr.eu-north1.nebius.cloud": {
          "auth": "$(printf '%s' "iam:${TOKEN}" | base64 | tr -d '\n')"
        }
      }
    }
    EOF
  4. Create an image pull secret:

    flyte create secret --type image_pull nebius-image-secret \
      --from-docker-config \
      --docker-config-path docker-config-nebius.json \
      --registries cr.eu-north1.nebius.cloud
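  5. Reference the image pull secret in your Flyte Image definition. Replace <your-secret-name> with the secret name from step 4 (nebius-image-secret in that example):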

    custom_image = flyte.Image.from_debian_base(
        registry="cr.eu-north1.nebius.cloud/e00...",
        registry_secret="<your-secret-name>",
    )
  6. Request the secret in your Flyte TaskEnvironment so tasks can pull the image:

    env = flyte.TaskEnvironment(
        name="hello_v2",
        image=custom_image,
        secrets=["<your-secret-name>"],
    )
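
If you prefer to generate the docker config file from step 3 without shell tooling, the same structure can be produced in Python. This is a minimal sketch using only the standard library. The function name nebius_docker_config and the token value are hypothetical, and the "iam" username mirrors the shell example in step 3.

```python
import base64
import json


def nebius_docker_config(token: str, registry: str) -> str:
    """Return docker config JSON with an "auth" entry for the given registry."""
    # The username is the literal string "iam", as in the shell example.
    credentials = f"iam:{token}".encode("utf-8")
    # b64encode returns a single line, which keeps the JSON string valid.
    auth = base64.b64encode(credentials).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": auth}}}, indent=2)


# "v1.example-token" is a made-up value for illustration only.
config_text = nebius_docker_config("v1.example-token", "cr.eu-north1.nebius.cloud")
print(config_text)
```

Write the returned text to docker-config-nebius.json and pass that path to the flyte create secret command in step 4. Treat the file as a credential and keep it out of version control.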

Test a workflow

To run a sample workflow, complete the following steps:

  1. Create a Flyte CLI configuration file at the path .flyte/config.yaml in your project directory. Replace <ORG_NAME> and <PROJECT_NAME> with your organization and project identifiers.

    admin:
      endpoint: dns:///<ORG_NAME>.union.ai
    image:
      builder: remote
    task:
      domain: development
      org: <ORG_NAME>
      project: <PROJECT_NAME>
  2. Run a sample workflow:

    flyte run --image ghcr.io/flyteorg/flyte:py3.13-v2.0.2 \
      hello_world.py main --n 5

    If the remote image builder isn’t enabled for your organization, use the --image flag with a pre-built container image as in the preceding flyte run example.

  3. Check the run status. Replace <RUN_NAME> with the workflow run identifier.

    flyte get run <RUN_NAME>

    Look for ACTION_PHASE_SUCCEEDED in the output to confirm the workflow completed successfully.
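
When you set up several projects, the configuration file from step 1 can be generated instead of written by hand. The following Python sketch renders the same keys shown in step 1 and writes them to .flyte/config.yaml. The function names render_flyte_config and write_flyte_config are hypothetical, and the organization and project values in the usage lines are examples only.

```python
from pathlib import Path


def render_flyte_config(org: str, project: str, domain: str = "development") -> str:
    """Render the config shown in step 1 for the given org and project."""
    return (
        "admin:\n"
        f"  endpoint: dns:///{org}.union.ai\n"
        "image:\n"
        "  builder: remote\n"
        "task:\n"
        f"  domain: {domain}\n"
        f"  org: {org}\n"
        f"  project: {project}\n"
    )


def write_flyte_config(project_dir: Path, org: str, project: str) -> Path:
    """Write .flyte/config.yaml under project_dir and return its path."""
    config_path = project_dir / ".flyte" / "config.yaml"
    # Create the .flyte directory if it does not exist yet.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(render_flyte_config(org, project))
    return config_path


# Example values; replace them with your own organization and project.
print(render_flyte_config("my-org", "my-project"))
```

Call write_flyte_config(Path("."), "my-org", "my-project") from your project directory to create the file, then continue with step 2.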
