/ better AI pipelines by design

Infrastructure for AI, ML & Data

For developers managing AI, ML, and data workflows in production, the challenges extend well beyond scheduling and orchestrating DAGs. Union.ai addresses these complexities by offering a comprehensive infrastructure management platform designed for the nuances of such environments.

Union optimizes resources across teams and implements cost-effective strategies that can reduce expenses by up to 66%. Moreover, it's engineered to fit within your own cloud ecosystem, ensuring a robust and tailored infrastructure that scales with your technical demands.

View product
/ Union: just bring your compute, we bring Flyte

Powerful DAGs, observability & cost-efficient engineering

Union is a fully-managed Flyte platform deployed in your VPC that provides a single-endpoint workflow orchestration and compute service to engineers building data and ML products.

Get built-in dashboards, live logging, and task-level resource monitoring to pinpoint resource bottlenecks and simplify debugging, resulting in optimized infrastructure and faster experimentation.

Get a demo
/ from engineers for engineers

AI engineering for engineers

Union is an open AI orchestration platform that simplifies AI infrastructure so you can develop, deploy, and innovate faster. Unlike popular but simplistic AI engineering orchestrators, Union wrangles the infrastructure setup and management as well.

Write your code in Python, collaborate across departments, and enjoy full reproducibility and auditability. Union lets you focus on what matters.

Explore docs
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

from flytekit import task, workflow

@task
def get_data() -> pd.DataFrame:
    return load_digits(as_frame=True).frame

@task
def train_model(data: pd.DataFrame) -> MLPClassifier:
    features = data.drop("target", axis="columns")
    target = data["target"]
    return MLPClassifier().fit(features, target)

@workflow
def training_workflow() -> MLPClassifier:
    data = get_data()
    return train_model(data=data)

Write your Python code locally, execute it remotely

Enjoy the freedom to write Python code that runs both locally and remotely in your Kubernetes cluster. Take advantage of full parallelization and utilization of all Kubernetes nodes without writing Dockerfiles or YAML.

1. Run Local
2. Scale Remote
3. Deploy
import huggingface_hub as hh
import transformers as tr
from datasets import load_dataset

from flytekit import task
from flytekit.types.directory import FlyteDirectory

@task
def train(
    model_id: str,
    dataset_id: str,
    dataset_name: str,
) -> FlyteDirectory:

    # authenticate
    hh.login(token="...")

    # load the dataset, model, and tokenizer
    dataset = load_dataset(dataset_id, dataset_name)
    model = tr.AutoModelForCausalLM.from_pretrained(model_id, ...)
    tokenizer = tr.AutoTokenizer.from_pretrained(model_id, ...)

    # prepare dataset
    dataset = dataset["train"].shuffle().map(tokenizer, ...)

    # define and run the trainer
    trainer = tr.Trainer(model=model, train_dataset=dataset, ...)
    print("Training model")
    trainer.train()

    # save and return model directory
    output_path = "./model"
    print("Saving model")
    trainer.model.save_pretrained(output_path)
    return FlyteDirectory(path=output_path)
import huggingface_hub as hh
import transformers as tr
from datasets import load_dataset
from io import BytesIO

from flytekit import task, ImageSpec, Resources
from flytekit.types.directory import FlyteDirectory


image_spec = ImageSpec(
    name="llm_training",
    registry="ghcr.io/unionai-oss",
    requirements="requirements.txt",
    python_version="3.9",
    cuda="11.7.1",
    env={"VENV": "/opt/venv"},
)

@task(
    cache=True,
    cache_version="0",
    requests=Resources(mem="100Gi", cpu="32", gpu="8"),
    container_image=image_spec,
)
def train(
    model_id: str,
    dataset_id: str,
    dataset_name: str,
) -> FlyteDirectory:
    ...
@task(...)
def train(...) -> FlyteDirectory:
    ...

@task(container_image=image_spec)
def deploy(model_dir: FlyteDirectory, repo_id: str) -> str:
    model_dir.download()
    hh.login(token="...")
    
    # upload readme and model files
    api = hh.HfApi()
    repo_url = api.create_repo(repo_id, exist_ok=True)
    readme = "..."
    api.upload_file(
        path_or_fileobj=BytesIO(readme.encode()),
        path_in_repo="README.md",
        repo_id=repo_id,
    )
    api.upload_folder(
        repo_id=repo_id,
        folder_path=model_dir.path,
    )
    return str(repo_url)

@workflow
def train_and_deploy(
    model_id: str,
    dataset_id: str,
    dataset_name: str,
    repo_id: str,
) -> str:
    model_dir = train(
        model_id=model_id,
        dataset_id=dataset_id,
        dataset_name=dataset_name,
    )
    return deploy(model_dir=model_dir, repo_id=repo_id)
$ pyflyte run llm_training.py train \
    --model_id EleutherAI/pythia-70m \
    --dataset_id togethercomputer/RedPajama-Data-V2 \
    --dataset_name sample

Running Execution on local.
Map: 100%|████████████| 1050391/1050391
Training model
{'train_runtime': 4.5401, ...}
100%|███████████████████| 100/100
Saving model
file:///var/folders/4q/frdnh9l10h53gggw1m59gr9m0000gp/T/flyte-f2qjyme6/raw/a888e295fefbdae4023ec2b35e53edcb

$ ls /var/folders/4q/frdnh9l10h53gggw1m59gr9m0000gp/T/flyte-f2qjyme6/raw/a888e295fefbdae4023ec2b35e53edcb

config.json
generation_config.json
model.safetensors
pytorch_model.bin
special_tokens_map.json
tokenizer.json
tokenizer_config.json
training_args.json
$ pyflyte run --remote llm_training.py train \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset_id togethercomputer/RedPajama-Data-V2 \
    --dataset_name default

Running Execution on Remote.
Image ghcr.io/unionai-oss/llm_training:5quiCD_S3VoDsP0Sr3ZWIA.. found. Skip building.

[✔] Go to https://org.unionai.cloud/console/projects/flytesnacks/domains/development/executions/fe661d1127e84438bb8e to see execution in the console.
$ pyflyte run --remote llm_training.py train_and_deploy \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset_id togethercomputer/RedPajama-Data-V2 \
    --dataset_name default \
    --repo_id unionai/Llama-2-7b-hf-finetuned

Running Execution on Remote.
Image ghcr.io/unionai-oss/llm_training:5quiCD_S3VoDsP0Sr3ZWIA.. found. Skip building.
Image ghcr.io/unionai-oss/llm_training:5quiCD_S3VoDsP0Sr3ZWIA.. found. Skip building.

[✔] Go to https://org.unionai.cloud/console/projects/flytesnacks/domains/development/executions/fe661d1127e84438bb8e to see execution in the console.
/ the better replacement for Airflow and Kubeflow

Purpose-built for lineage-aware pipeline orchestration

Bring your own Airflow code (BYOAC) and take advantage of modern AI orchestration features—out of the box! Get full reproducibility, auditability, experiment tracking, cross-team task sharing, compile-time error checking, and automatic artifact capture.

Explore features
Airflow
Union
Versioning

Easily experiment and iterate in isolation with versioned tasks and workflows.

Multi-tenancy

A centralized infrastructure for your team and organization enables multiple users to share the same platform while maintaining their own distinct data and configurations.

Type checking

Strongly typed inputs and outputs simplify data validation and highlight incompatibilities between tasks, making it easier to identify and troubleshoot errors before launching the workflow.

Caching

Caching the output of task executions can accelerate subsequent executions and prevent wasted resources.

Data lineage

As a data-aware platform, Union simplifies rollbacks and error tracking.

Immutability

Immutable executions help ensure reproducibility by preventing any changes to the state of an execution.

Recovery

Rerun only failed tasks in a workflow to save time, resources, and more easily debug.

Human-in-the-loop

Enable human intervention to supervise, tune, and test workflows, resulting in improved accuracy and safety.

Intra-task checkpointing

Checkpoint progress within a task execution to save time and resources in the event of task failure.

Reproducibility

Every task is versioned and every dependency set is captured, making it easy to share workflows across teams and reproduce results.

/ supporting innovation across industries

We manage the infrastructure so you can build what matters

Union is the AI orchestration and infrastructure platform of choice for many top data and ML teams globally. Esteemed companies such as Woven Planet and AbCellera have transitioned their workflows from Airflow or Kubeflow to Union.

Why?
Union is up to 66 percent more cost-efficient with your compute resources, solves complex infrastructure challenges, and is built for rapid iteration across teams.

View case studies

Globally trusted & tested

10k+ Community members
1M+ Downloads per month
30+ Fortune 100 companies

Join our developer community

“As engineers, a lot of this might be table stakes for us. But for data scientists, being able to get [financial analytics] up and running on Flyte™ and getting all of this stuff for free has been a really big win for them.”

Dylan Wilder, Engineering Manager at Spotify

“During our evaluation stage, we did some stress tests to understand whether Flyte™ can satisfy our requirements, and it provided us with a good result.”

Pradithya Aria Pura, Principal Software Engineer at Gojek

“Given the scale at which some of these tasks run, compute can get really expensive. So being able to add an interruptible argument to the task decorator for certain tasks has been really useful to cut costs.”

Jeev Balakrishnan, Software Engineer at Freenome

“Union Cloud solves our operational complexity problems across diverse workloads, whether it is running data cleaning & pre-processing workflows or protein structure ML predictions for low-volume, high-complexity scientific workloads to large-scale scientific simulations. Additionally, the platform can drive down the relative cost of protein production by orders of magnitude. With Union Cloud as our standardized workflow orchestration platform, we can stop managing our own systems and infrastructure, and instead focus on antibody discovery and development.”

Alex Ford, Head of Data Platform at AbCellera Biologics

“Gojek is experiencing rapid growth and incorporating machine learning into various products. To sustain this growth and guarantee success, a reliable and scalable pipeline solution is critical. Flyte plays a vital role as a key component of Gojek’s ML Platform by providing exactly that.”

Pradithya Aria Pura, Principal Software Engineer at Gojek

“Our contribution velocity and the rate at which we're contributing is a reflection of our confidence in Flyte™ long term as the de facto workflow orchestration engine. I really think Flyte™ has got the model absolutely correct.”

Kenny Workman, Co-founder and CTO at LatchBio

“Versioning, caching and the different domains we can have in Flyte™ prompted us to move from Airflow to Flyte™ because you don’t really need to think about them and they are … available out of the box in Flyte™.”

Stephen Batifol, Machine Learning Engineer at Wolt

“We're mainly using Flyte™ because of its cloud native capabilities. We do everything in the cloud, and we also don't want to be limited to a single cloud provider. So having the ability to run everything through Kubernetes is amazing for us.”

Maarten de Jong, Python Developer at Blackshark.ai

“FlyteFile is a really nice abstraction on a distributed platform. [I can say,] ‘I need this file,’ and Flyte™ takes care of downloading it, uploading it and only accessing it when we need to. We generate large binary files in netcdf format, so not having to worry about transferring and copying those files has been really nice.”

Nicholas LoFaso, Senior Platform Software Engineer at MethaneSAT