Flyte 1 vs. Flyte 2 vs. Union.ai

All three share the same Python-native authoring model. Your workflows don’t change. What changes is the runtime: how far it scales, how fast it executes, and how much of the infrastructure your team has to own.

Try the devbox

A free, local sandbox to explore the Union.ai platform.

Compare Features

Flyte workflows run on Union.ai without rewriting. The comparison below covers execution architecture, runtime capabilities, and operational model: the parts that determine whether your platform can keep up with your team.

Flyte 2 and Union.ai share the same Python-native authoring, dynamic workflows, and typed exception handling.

Workflow fanout
~5K tasks
bounded by map-task mechanics and K8s control-plane throughput
~10K tasks
improved orchestration, same underlying K8s scheduling path
250K+ tasks
high-cardinality scheduling runs through a purpose-built execution substrate, bypassing pod-scheduler and etcd write pressure
Workflow executions / hour
~10K/hr
~10K/hr
~1M/hr
designed for continuous evaluation services, experiment CI, and high-frequency model testing
Task executions / minute
~150/min
each task pays pod scheduling, image pull, container start, and Python import overhead
~250/min
Flyte 2 improves control flow
~5,000/min
warm execution eliminates repeated Kubernetes lifecycle overhead for short-running tasks
Cold start latency
~30s
~30s
<1s
sub-100ms for tasks running on warm reusable containers
Concurrent actions
~500
bounded by controller and global cluster limits
~500
~30K
concurrency is tunable at the project and domain level
GPU reuse across invocations
every task requests a GPU, starts a container, runs, and releases; accelerator sits idle during setup/teardown
same pod-per-task model
reusable containers keep the process and GPU context warm across repeated calls; tool calls, rerankers, embedding steps, and chained inference stop wasting accelerator time on lifecycle overhead. Smart batching techniques enabled by GPU reuse can push GPU utilization toward 100%.
Python-native workflow authoring
Static DAG DSL
workflows must be fully defined at compile time; loops, conditionals, and branching require Flyte-specific constructs
Pure Python
tasks can call tasks directly with loops, conditionals, and normal async control flow
Pure Python
same authoring model; Union.ai adds production runtime capabilities around identical code, including priority control and throttling via Queues
Workflow sandboxing for generated code
Monty sandbox starts in microseconds and structurally blocks filesystem access, network I/O, OS calls, and arbitrary imports; heavy computation runs in isolated containers
same sandbox foundation with Union.ai runtime, serving, and ops layers around it
Typed exception handling across task boundaries
failures surface through task state and static retry policies only
OOM, spot preemption, and custom errors can be caught and handled in Python control flow
catch OOM and retry with more memory; catch spot interruption and resume from checkpoint; branch recovery logic without rebuilding the workflow
Runtime resource overrides
resource shape is set at compile time; adapting to data size or failure mode requires redesign
override GPU, memory, image, retry policy, and env at execution time
same override model with Union.ai's multi-cluster routing available as an additional dispatch layer
Live model / agent / API serving
batch system; a separate serving stack is required
Partial
batch workflows and optional serving via KServe
serve models, agents, APIs, and applications directly from Union.ai's custom engine, with a custom identity-aware networking layer
Realtime inference
GPU-backed inference endpoints with autoscaling, reading artifacts produced by Union.ai workflows
Agent execution runtime
every tool call or short unit of work pays full batch-system overhead; impractical for chained steps
Partial
Flyte 2 improves control-flow expressiveness significantly, but short tasks still pay pod-per-task startup cost
No new DSL, just Python. Everything runs in the user's cloud, fully secured and zero-trust by design, and SLMs can be hosted locally. No fragmentation: train, process data, run agents, and serve on one platform. No vendor lock-in. MCP servers can be deployed privately within your cloud.
Unified batch and realtime lifecycle
training/eval pipelines and serving are separate systems
the same data plane runs training workflows and serves live endpoints: preprocess, train, evaluate, register, serve, and observe without stitching together two platforms
Task-level cluster routing
cross-cluster execution requires external orchestration or separate Flyte deployments
placement is workflow-level; individual tasks cannot be dynamically routed to different clusters
Union.ai's global scheduler routes individual tasks dynamically across registered clusters by resource type, cost, availability, region, or policy, without splitting the workflow. Queues add priority and concurrency control per project or domain. One workflow can preprocess on CPU spot, fine-tune on H100s, and validate on cheaper GPUs.
Fail-fast resource validation
unschedulable pods sit Pending until someone inspects K8s events
if requested hardware doesn't exist, pods remain Pending with no workflow-level explanation
validates resource requests against actual schedulable inventory before pod submission; no more workflows stuck for hours because someone requested A100s in a T4 cluster
Accelerated datasets
each pod downloads from object storage at startup unless the customer builds shared cache infrastructure
same per-pod download behavior from S3/GCS
large read-only datasets defined once and pre-mounted as shared volumes (EFS/FSx for Lustre); eliminates 8 to 12 minutes of per-pod download time before useful work starts
Remote image building
local Docker build, push to registry, reference in code
same local Docker/registry flow
ImageSpec builds images remotely inside your data plane via Kaniko; works from notebooks, CI, and Apple Silicon with no local Docker or registry credential management
Deployment path
Manual
Helm, Postgres, ingress, object store, IAM, auth, logging, and upgrades assembled and operated by your team
Manual
same Helm-driven deployment with the same operational burden
Self-service Terraform (BYOC)
provisions data plane, agent, IAM, buckets, and connectivity; platform teams validate workloads in the first hour instead of spending the first sprint wiring infrastructure
Time to first workflow
Days
for an experienced K8s/IAM/Flyte team
~2 days
with K8s experience, following the docs, and hitting no blockers
<1 hour
via self-service Terraform, plus optional Union.ai onboarding support
Distributed training
Kubeflow operator only
Clustered tasks via Kubernetes jobsets
Clustered tasks with native metrics and observability, gang scheduling on dedicated clusters, ability to run on the primary training node for fast iteration
Realtime and persisted logs
External only
historical search requires your own CloudWatch, Loki, Elasticsearch, or equivalent
External only
logs are not retained after pod termination without a separately managed logging backend
Union.ai collects and indexes task logs in your data plane; query logs from last Tuesday without asking platform engineering to re-run the job or dig through cloud log groups
Per-task CPU/GPU/memory profiling
External only
requires Prometheus/Grafana, Datadog, or custom sidecars
External only
same external monitoring dependency
CPU, GPU, and memory time-series graphs scoped to each task execution in the UI; see that a fine-tuning job ran at 3% GPU utilization, or that a task OOMs at 14.8GB every time, without correlating workflow IDs to a separate dashboard
Cost attribution
no native path from cloud bill to workflow, team, project, or execution
cost broken down by project, domain, workflow, and execution with configurable resource pricing; answer which team or model burned the GPU budget without building a separate cost pipeline on top of Kubernetes billing exports
Artifact registry
typed blobs and execution metadata exist, but no browsable registry with tags, versions, or lineage
raw blobs and metadata are insufficient for model governance or discoverability
artifacts are typed, tagged, versioned, and lineage-linked; find which dataset trained a model, which workflow produced it, and trigger downstream evaluations automatically when a new artifact appears
Cross-run lineage and provenance
limited to execution metadata and user conventions
full provenance graph queryable through the UI and SDK; when a dataset is bad, immediately identify which models, evaluations, and downstream artifacts are affected
Ray and Spark dashboards
Self-managed
teams run Ray Dashboard and Spark History Server themselves and solve ingress, auth, and VPN access independently
Self-managed
linked from the task detail view and proxied securely through Union.ai auth and RBAC; no custom ingress, no port hunting
Secrets management
Manual
K8s Secrets with manual Vault integration; secrets may live unencrypted in etcd depending on deployment configuration
Manual
same K8s Secrets/Vault integration with no managed lifecycle
‘flyte create secret’ writes to AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, HashiCorp Vault, or a custom backend; credentials are injected at runtime and never baked into images or K8s YAML
Fine-grained RBAC
project- and domain-scoped viewer/developer/admin roles mapped from SSO groups; separate dev from prod and team from team without creating a new cluster for every access boundary
SSO and IdP group sync
Basic OIDC
Basic OIDC
managed SAML/OIDC with IdP group sync into Union.ai roles; onboard teams to a shared AI platform without building a custom permissions admin system
Zero-trust data path
Self-managed
data stays in customer infrastructure, but the customer owns full control plane operations and cost
Self-managed
all data, logs, I/O, and auxiliary UIs stay within your VPC via Direct-to-DataPlane routing. Union hosts the control plane, eliminating the ~20% infrastructure cost of running it yourself, without ever touching customer data
Production support
Community only
GitHub issues and Slack; no SLA, no escalation path
Community only
GitHub issues and Slack; no SLA
enterprise support channel with direct escalation to Union.ai engineering; when production AI infrastructure breaks before a model release, your platform team has an accountable vendor to call, not just a GitHub issue to file
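
The typed exception handling row above describes catching an out-of-memory failure and retrying with more memory. The following is a minimal sketch of that control flow in plain asyncio Python. It does not use the Flyte SDK: `OutOfMemory`, `train`, and `run_with_escalation` are hypothetical names invented for illustration, and the memory check is simulated rather than measured.

```python
import asyncio


class OutOfMemory(Exception):
    """Hypothetical stand-in for a typed OOM error raised by a task."""


async def train(dataset_gb: float, memory_gb: int) -> str:
    # Simulated task: fails when the dataset does not fit in memory.
    if dataset_gb > memory_gb:
        raise OutOfMemory(f"needed {dataset_gb} GB, had {memory_gb} GB")
    return f"trained with {memory_gb} GB"


async def run_with_escalation(dataset_gb: float, memory_gb: int = 8,
                              max_memory_gb: int = 64) -> str:
    # Catch the typed error in ordinary control flow and retry with
    # double the memory, up to an assumed ceiling.
    while True:
        try:
            return await train(dataset_gb, memory_gb)
        except OutOfMemory:
            if memory_gb * 2 > max_memory_gb:
                raise  # ceiling reached: surface the failure
            memory_gb *= 2


if __name__ == "__main__":
    print(asyncio.run(run_with_escalation(dataset_gb=20)))
```

With a 20 GB dataset and an 8 GB starting allocation, the sketch escalates twice and succeeds at 32 GB. A real task would request the larger allocation through the platform's override mechanism rather than a function argument.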

Frequently asked questions

Union.ai outperforms any OSS alternative on scale and performance in production. It supports 50K+ actions per workflow, 10,000+ concurrent actions per run, and cold start under 5 seconds. Reusable warm-start containers, per-action GPU and CPU profiling, cost attribution per team and workflow, and fail-fast resource validation at launch are the capabilities that separate a platform you can run experiments on from one you can run a business on.
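
To make "fail-fast resource validation at launch" concrete, here is a small sketch of the idea: compare a request against schedulable inventory before anything is submitted. This is an illustration only, not Union.ai's implementation; `Request`, `UnschedulableRequest`, `validate`, and the inventory dictionary are all assumed names and shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    gpu_type: str
    gpu_count: int


class UnschedulableRequest(Exception):
    """Raised before submission when no capacity can satisfy the request."""


def validate(request: Request, inventory: dict[str, int]) -> None:
    # inventory maps a GPU type to the number of schedulable GPUs of that type.
    available = inventory.get(request.gpu_type, 0)
    if available < request.gpu_count:
        raise UnschedulableRequest(
            f"requested {request.gpu_count}x {request.gpu_type}, "
            f"but only {available} are schedulable"
        )


if __name__ == "__main__":
    t4_cluster = {"T4": 8}  # assumed inventory for a T4-only cluster
    validate(Request("T4", 4), t4_cluster)  # passes silently
    try:
        validate(Request("A100", 1), t4_cluster)
    except UnschedulableRequest as err:
        print(f"rejected at launch: {err}")
```

The point of the design is where the error appears: the caller gets an explanation at launch instead of a pod that sits Pending.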

Most orchestrators launch a new Kubernetes pod per action, which adds ~10 seconds of overhead before your code runs. Union.ai supports reusable containers: warm containers that are shared across similar tasks. Cold start drops to under 100ms and the GPU stays allocated across invocations. For teams building agentic AI, RAG pipelines, or multi-step inference workflows, this removes most of the per-call startup cost.
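
A back-of-the-envelope model shows why this matters for short tasks. The sketch below uses the overhead figures quoted above (~10 seconds per pod, 100 ms as an upper bound for a warm container) and an assumed workload of 1,000 sequential calls of 200 ms each; the workload and the function name `total_seconds` are illustrative assumptions, not measurements.

```python
def total_seconds(n_tasks: int, work_s: float, overhead_s: float) -> float:
    # Sequential model: every invocation pays startup overhead plus useful work.
    return n_tasks * (overhead_s + work_s)


if __name__ == "__main__":
    n, work = 1_000, 0.2  # assumed workload: 1,000 calls of 200 ms each
    cold = total_seconds(n, work, overhead_s=10.0)  # new pod per action
    warm = total_seconds(n, work, overhead_s=0.1)   # reusable warm container
    print(f"pod per action: {cold:,.0f} s")
    print(f"warm container: {warm:,.0f} s")
    print(f"useful work share: {n * work / cold:.1%} vs {n * work / warm:.1%}")
```

Under these assumptions the pod-per-action path takes about 10,200 seconds against about 300 seconds for the warm path, and the gap widens as tasks get shorter.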

Flyte workflows run on Union.ai without rewriting. The SDK is compatible and the authoring model is identical. The migration is mostly operational and straightforward. Most teams run their first workflow on Union.ai within an hour of starting setup.

Flyte OSS is free to license. Operating it (or any open-source orchestrator) is not free. A stable production deployment requires a significant amount of manual maintenance that gets more costly as you scale. Engineers must manage Helm values, Postgres, ingress config, a separate secrets solution, an external log aggregation stack, and ongoing K8s maintenance. Union.ai offloads this maintenance so your team focuses on workflows, not infrastructure. The break-even on engineer time tends to come faster than most teams expect.

Scale is one part of the value. The features that tend to matter first for smaller teams are data lineage, persistent logs and built-in observability, and managed secrets that pass a security review without custom engineering. RBAC and cost attribution matter as soon as a second team starts touching the same platform. The operational overhead of self-managed Flyte tends to grow faster than the team itself does.
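
As a rough illustration of what cost attribution involves once several teams share a platform, the sketch below rolls up per-execution usage into per-project cost using configurable prices. The record fields, the prices in `PRICE_PER_HOUR`, and the function `cost_by_project` are assumptions made for this example and do not reflect Union.ai's data model.

```python
from collections import defaultdict

# Assumed hourly prices; real pricing would be configured per deployment.
PRICE_PER_HOUR = {"cpu": 0.04, "gpu": 2.50}


def cost_by_project(executions: list[dict]) -> dict[str, float]:
    # Each record holds a project name plus the CPU-hours and GPU-hours it used.
    totals: dict[str, float] = defaultdict(float)
    for run in executions:
        totals[run["project"]] += (
            run["cpu_hours"] * PRICE_PER_HOUR["cpu"]
            + run["gpu_hours"] * PRICE_PER_HOUR["gpu"]
        )
    return dict(totals)


if __name__ == "__main__":
    runs = [
        {"project": "search", "cpu_hours": 10, "gpu_hours": 2},
        {"project": "search", "cpu_hours": 0, "gpu_hours": 4},
        {"project": "ads", "cpu_hours": 100, "gpu_hours": 0},
    ]
    for project, cost in cost_by_project(runs).items():
        print(f"{project}: ${cost:.2f}")
```

The same grouping extends to domain, workflow, or execution by changing the key used for the totals.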

Union.ai’s zero-trust security architecture means data NEVER transits outside your secure cloud. No model weights, pipeline outputs, or execution logs leave your environment. This is more secure than the industry status quo, where you’re required to trust a vendor to handle your data safely.

Start today and scale with confidence.