Model training & fine-tuning

Confidently run large-scale training or fine-tuning on GPU clusters across clouds and on-premises

Union’s scalable architecture, flexible framework support, and robust workflow orchestration ensure reproducibility, efficiency, and collaboration in machine learning development. Train anything from small models to complex ones, including GPU-accelerated deep learning and hyperparameter optimization.

Balance performance and cost efficiency for training and fine-tuning

Train on a single GPU, multiple GPUs on a single node, or scale across multiple nodes. Leverage heterogeneous clusters with various accelerators, including Nvidia GPUs, Google TPUs, and AWS silicon. Utilize fractional GPUs for non-intensive tasks and take advantage of GPUs on spot nodes. All training infrastructure is shared and ephemeral, scaling to zero upon completion.

Orchestrate the entire training lifecycle on a unified platform

Training a model often involves multiple steps, from data processing to evaluation and validation. Orchestrate the entire training lifecycle on a unified platform. Fine-tune models reactively as new data arrives, or automatically trigger downstream predictions after a successful training run. Catalog all models and access their versions through a unified model registry.

Work seamlessly with your preferred languages, ML frameworks, and libraries

Leverage Union’s expansive support for ML frameworks and libraries, along with its Agent framework and pre-built components suited to your needs. Run distributed training with PyTorch or TensorFlow by defining training tasks as containerized functions.

Build reliable, interpretable, and trustworthy models collaboratively across teams

Track end-to-end data and model lineage across workflows and teams using Artifacts, a registry for models and data. Trace any model prediction back to the specific dataset used for training. Immutable executions help you confidently reproduce and verify results, providing transparency and interpretability for your models.


“Cradle addressed its data provenance requirements by leveraging key Flyte functionality. Since everything is versioned in Flyte, it is possible to trace what code and image produced which outputs from which inputs.”

Eli Bixby
Co-Founder & ML Lead at Cradle