Production-Grade ML Pipelines: Flyte™ vs. Kubeflow
Want to deploy ML models in production without worrying about managing infrastructure? Meet Flyte and Kubeflow. Both are Kubernetes-native platforms that help orchestrate ML workflows and infrastructure. Both scale well and are robust because they run on Kubernetes, but they offer very different developer experiences: each addresses the demand for infrastructure orchestrators that support ML, yet Flyte’s modus operandi is quite different from Kubeflow’s.
It’s no secret that ML is compute-intensive and complex, and ML orchestrators can speed up both pipeline iteration and deployment. Many ML practitioners, however, see deployment as outside their core work: they want to focus on building ML pipelines, not on using tools like Docker and Kubernetes to handle resource allocation and infrastructure automation.
Kubeflow Pipelines (KFP) is a platform to build, deploy and scale ML pipelines using Docker containers. In a nutshell, it is an infrastructure orchestrator for ML pipelines that helps assemble core ML components into full-stack pipelines. KFP is one part of the Kubeflow ecosystem, and it is the part most comparable with Flyte. Kubeflow orchestration entails a significant learning curve for many ML engineers because it requires them to understand and code infrastructure constructs. Kubeflow Pipelines v2 is a huge improvement over v1 but still imposes significant overhead on Kubeflow’s end users, especially data scientists, data engineers and ML engineers:
- Kubeflow is built as a thin layer on top of Kubernetes that automates some Kubernetes management systems. It offers limited management of Kubernetes configuration in Python, but it still requires the user to know how Kubernetes works.
- If ML practitioners are unfamiliar with Kubernetes, it’s hard to get Kubeflow deployed, even locally.
- Kubeflow has a significant learning curve and requires a lot of effort to get it off the ground.
- The v2 Python DSL isn’t purely Pythonic, which makes it difficult for Python developers to handle.
- Kubeflow Pipelines v2 supports only a minimal set of type annotations, which makes using it as an extension to Python code cumbersome, error-prone and difficult.
- Pages labeled “out of date” make it hard to trust Kubeflow documentation.
“... [T]rying to figure out when we’d done something wrong versus when the problem was outdated documentation. This slowed everything down.” —Kubeflow: Not Yet Ready for Production?
- Kubeflow is prone to dependency hell.
“For example, upgrading the KFServing component required upgrading Istio. … This upgrade broke access to the dashboard because the newer Istio version was incompatible with AWS authentication.” —Kubeflow: Not Yet Ready for Production?
Flyte was built to make data scientists happy, and data scientists don’t necessarily need to know Kubernetes. Although Flyte is built primarily on Kubernetes, its primitives abstract Kubernetes constructs away from ML practitioners. Writing code with Flyte’s Python SDK feels like writing ordinary Python, and local deployment requires just two commands: <span class="code-inline">flytectl demo start</span> and <span class="code-inline">pyflyte register <package-or-module></span>. When using Flyte, ML practitioners don’t need to tinker with Kubernetes constructs (at least on the user side).
Flyte’s fundamentally different from Kubeflow in three ways:
- Flyte lets ML practitioners create without having to navigate infrastructure jargon and Kubernetes details. It segregates the user and the platform teams and lets the user team — data scientists, ML practitioners and data engineers — focus on building models instead of setting up infrastructure. Kubeflow requires Kubernetes and DevOps expertise, which may slow down the development of ML pipelines because not all ML practitioners are comfortable with Kubernetes and Ops. Flyte is built on Kubernetes, too, and it offers an abstraction that can be removed for complicated use cases; however, 80% of use cases require minimal knowledge of Kubernetes.
- Flyte’s Python SDK (Flytekit) lets ML practitioners write Python code, while the Kubeflow Python SDK feels like a new, infrastructure-oriented DSL.
- Flyte supports varied data types and transformations: You can pass Pandas DataFrames among Flyte tasks, load a DataFrame to a BigQuery table using structured datasets, offload data to and download data from cloud URIs using FlyteFiles, and more. Meanwhile, Kubeflow enforces a type system, but it doesn’t support data types beyond fundamental Python types and artifacts/files. Kubeflow needs to be told what to do when it encounters another type, such as an S3 URI. Flyte, however, automates the interaction with S3 (and GCS); supports communication within and among different cloud services and the local file system; and reduces the need to write boilerplate code.
“We've migrated about 50% of all training pipelines over to Flyte from Kubeflow. In several cases we saw an 80% reduction in boilerplate between workflows and tasks vs. the Kubeflow pipeline and components. Overall, Flyte is a far simpler system to reason about with respect to how the code actually executions, and it's more self-serve for our research team to handle.”
—Rahul Mehta, ML Infrastructure/Platform Lead @ Theorem LP
When choosing an orchestrator, it’s vital to consider the features you need as well as the ease with which you can deploy and iterate. It’s important to be able to iterate on your ML pipelines quickly, catch bugs early on, deploy seamlessly on any cloud provider, and ultimately alleviate the complexity and pain points of ML practitioners.
With Kubeflow, there’s no question about deployment: it scales really well. It also has an extensive number of integrations. However, the developer experience leaves much to be desired.
The following table offers a feature comparison between Flyte and Kubeflow.
Note: Kubeflow Pipelines v2 is in the pre-release stage and not yet stable. The v2 docs are being continually improved, and links to v2 documentation are not yet stable.
Flyte ticks all the boxes, whereas Kubeflow, although performant and scalable, requires Kubernetes and DevOps expertise. Let’s take a look at the features in more detail:
Components and Tasks
Flyte tasks are independent units of execution that when run in a specific order produce a Flyte workflow. The functional equivalents in Kubeflow are component and pipeline, respectively. However, Kubeflow’s way of constructing pipelines isn’t as straightforward as Flyte’s because:
- For custom container components and output artifacts, the output type annotation has to be declared among the input arguments.
- An output from a lightweight Python component can be passed as an input to a downstream component only using the <span class="code-inline">.output</span> attribute of the source task.
- Kubeflow’s containerized and custom container components lean toward an infrastructure DSL, unlike lightweight components, which are Pythonic. Code imported from different modules must be refactored to use containerized components, which adds friction to the developer experience and slows development cycles.
- Kubeflow prefers containerized and custom container components to lightweight components for production usage — thus moving a step closer to infrastructure DSL.
“Lightweight components should be used if your component implementation can be written as a standalone Python function and does not require an abundance of source code. … For more involved components and for production usage, prefer containerized components and custom container components for their increased flexibility.” —Kubeflow pipelines v2 docs
- Single-component executions aren’t supported, which might be useful in cases where a standalone component’s functionality needs to be tested.
Flyte addresses all the above pain points because:
- Flyte’s tasks are akin to Python functions. Inputs and outputs can be handled like a typical Python function that doesn’t need to call attributes. In addition, output type annotations needn’t be declared as part of input arguments declaration.
- All Flyte task variants — including dynamic workflows, reference tasks and map tasks — may use different decorators, but the fundamental behavior remains the same.
- Code can be imported from Flyte modules the same way as Python modules. Registered tasks can also be imported using the <span class="code-inline">reference_task</span> decorator.
- Flyte tasks are the preferred unit for production usage and can adapt to all kinds of use cases.
- Single-task executions are supported by Flyte and can simplify iterating on the task’s definition without having to write a new workflow definition.
Flyte also supports workflow offloading and launch-plan composition, and it doesn’t repeat tasks multiple times. Hence, workflows can be gigantic — in some cases, up to 100k+ nodes!
Often you’ll want to trigger a workflow/pipeline with different sets of inputs. For example, you may want to share a workflow with all the inputs set with a colleague who can then simply kick off the execution. You may also want to share a workflow with a different set of inputs with another colleague. In such a case, it’s beneficial to create launch plans to launch your workflows.
Kubeflow enables scheduling a pipeline multiple times with different sets of inputs. For one-off executions, however, creating multiple experiments means compiling the pipeline, obtaining the IR YAML file and then creating an experiment. A single pipeline can have multiple experiments, but the code needs to be compiled before an experiment is created. Moreover, no parameters other than the input arguments can be provided during the compilation step.
Flyte enables ML practitioners to create launch plans that bind a partial or complete list of inputs along with optional run-time overrides (like a service account, notifications or annotations). This critical feature enhances team collaboration and enables you to issue run-time overrides right from your Python code.
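The binding idea behind launch plans can be pictured with plain Python. The sketch below is not Flytekit code; it uses <span class="code-inline">functools.partial</span> to model a workflow whose inputs are fixed ahead of time, either completely or in part. The names <span class="code-inline">train_workflow</span>, <span class="code-inline">full_run</span> and <span class="code-inline">quick_run</span> are hypothetical.

```python
from functools import partial


def train_workflow(dataset: str, epochs: int, learning_rate: float) -> str:
    """Stand-in for a workflow; returns a description of the run."""
    return f"trained on {dataset} for {epochs} epochs at lr={learning_rate}"


# A "launch plan" with every input bound: a colleague only has to call it.
full_run = partial(train_workflow, dataset="sales_2023", epochs=10, learning_rate=0.01)

# A "launch plan" with a partial binding: the caller still supplies the dataset.
quick_run = partial(train_workflow, epochs=1, learning_rate=0.1)

if __name__ == "__main__":
    print(full_run())
    print(quick_run(dataset="sales_sample"))
```

A real launch plan also carries run-time overrides such as notifications, which this sketch leaves out.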
Map tasks are helpful to run an operation over a static/dynamic list of inputs. Use cases of map tasks include:
- Several inputs running through the same code logic
- Multiple data batches processed in parallel
- Hyperparameter optimization
With Kubeflow <span class="code-inline">ParallelFor</span> — a parallel for loop over a set of items — data passing isn’t easy to implement because the collection of outputs over a dynamic set of items isn’t straightforward. Also, <span class="code-inline">ParallelFor</span> in Kubeflow is a high-level construct. The Flyte equivalent to <span class="code-inline">ParallelFor</span> is a deeper integration called map task that enables quicker iterations with little overhead and works well for large fan-out tasks. Because Flyte can run map tasks within a single workflow node, it doesn’t create a node for every instance, which boosts performance.
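To make the fan-out pattern concrete, here is a minimal plain-Python sketch that runs the same logic over a list of inputs concurrently and collects the outputs in input order. It is neither Flyte’s map task nor Kubeflow’s <span class="code-inline">ParallelFor</span>; it relies on the standard library’s <span class="code-inline">concurrent.futures</span>, and the names <span class="code-inline">score_batch</span> and <span class="code-inline">map_over</span> are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def score_batch(batch: List[float]) -> float:
    """The per-item logic: here, the mean of one data batch."""
    return sum(batch) / len(batch)


def map_over(batches: List[List[float]], max_workers: int = 4) -> List[float]:
    """Run score_batch over every batch concurrently.

    executor.map yields results in input order, so collecting the
    outputs of the fan-out is a plain list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(score_batch, batches))


if __name__ == "__main__":
    print(map_over([[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]))
```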
Dynamism in DAGs
For the most part, ML is dynamic, so it’s important to be able to construct dynamic DAGs. Here are some example use cases taken from the Flyte dynamic workflows blog post:
- Modifying the code logic dynamically, such as determining the number of training regions, programmatically stopping the training if the error surges, introducing validation steps dynamically, or running data-parallel and sharded training
- Deciding on parameters dynamically during feature extraction
- Building an AutoML pipeline
- Tuning hyperparameters dynamically while a pipeline is in progress
Kubeflow supports dynamism with DSL recursion, which presents its own set of problems: It’s a little awkward to construct dynamic DAGs with recursion, it doesn’t work well with deep workflows, output cannot be dynamically resolved and types are ignored. Moreover, KFP v2 doesn’t yet provide support for recursion.
In Flyte, dynamic workflows enable the construction of dynamic DAGs. When a Flyte task is decorated with <span class="code-inline">@dynamic</span>, Flyte evaluates the code at runtime and determines DAG structure. Flyte dynamic workflows offer much more flexibility to compose and run dynamic DAGs:
- Dynamism isn’t restricted to recursion
- Data passing isn’t any different from general Flyte tasks
- Types are respected
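The idea of a DAG whose shape is known only at runtime can be sketched in plain Python. This is a conceptual model, not Flytekit’s <span class="code-inline">@dynamic</span> implementation: the hypothetical <span class="code-inline">build_dag</span> function decides how many training nodes to create and whether to add a validation step based on its inputs.

```python
from typing import Dict, List


def build_dag(regions: List[str], validate: bool) -> Dict[str, List[str]]:
    """Return a DAG as {node: [upstream nodes]}, shaped by runtime inputs."""
    dag: Dict[str, List[str]] = {"load_data": []}
    train_nodes = []
    for region in regions:  # the number of training nodes is known only at runtime
        node = f"train_{region}"
        dag[node] = ["load_data"]
        train_nodes.append(node)
    if validate:  # a validation step is introduced dynamically
        dag["validate"] = train_nodes
        dag["report"] = ["validate"]
    else:
        dag["report"] = train_nodes
    return dag


if __name__ == "__main__":
    for node, upstream in build_dag(["eu", "us"], validate=True).items():
        print(node, "<-", upstream)
```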
Python is a dynamically typed programming language. That means it checks types at runtime as opposed to compile-time. It also supports the concept of gradual typing, which means you can gradually introduce types into your code. Putting type hints to work in Python is becoming increasingly important because:
- Type hints help catch certain errors before running the code
- Type hints help document your code
- Type hints improve IDEs and linters
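As a small illustration of putting hints to work, the hypothetical <span class="code-inline">check_types</span> decorator below reads a function’s type hints and rejects mismatched arguments before the function body runs. It is a plain-Python sketch that handles only simple classes such as <span class="code-inline">int</span> and <span class="code-inline">float</span>; it does not show how Flyte or any static type checker is implemented.

```python
import inspect
from functools import wraps
from typing import get_type_hints


def check_types(func):
    """Validate arguments against plain-class type hints before calling func."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes (int, str, ...) are checked in this sketch.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} expected {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def split_data(rows: int, test_fraction: float) -> int:
    """Return the number of rows held out for testing."""
    return int(rows * test_fraction)
```

Calling <span class="code-inline">split_data("100", 0.25)</span> raises a <span class="code-inline">TypeError</span> up front instead of failing somewhere inside the function.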
Flyte is a strong supporter of type validation and already provides full support for various Python types. ML-specific types like <span class="code-inline">torch.Tensor</span>, <span class="code-inline">torch.nn.Module</span> and <span class="code-inline">np.ndarray</span>, as well as the Spark DataFrame type <span class="code-inline">pyspark.sql.DataFrame</span>, are also natively supported by Flyte, which helps make ML pipelines more robust.
Flyte also type-checks input arguments in launch forms on the UI.
Kubeflow’s type-checking is brittle. Only a handful of types are validated by Kubeflow — and not on the UI. Boilerplate code often finds its way into Kubeflow code, e.g., if specifying an output artifact, a dataset/model artifact needs to be saved to the output artifact <span class="code-inline">.path</span> every time it’s returned as an output of a Kubeflow component. <span class="code-inline">pandas.DataFrame</span> — a standard data structure widely used in data-intensive fields — isn’t supported either. Moreover, no ML-specific types are supported by Kubeflow, although Kubeflow’s primarily used for ML pipelines.
Notifications are particularly helpful when a job fails. Imagine a critical pipeline failing and nobody noticing. Kubeflow doesn’t provide any native support for notifying users; a notification mechanism can only be implemented by relying on Argo, Vertex AI or a custom handler. Flyte, on the other hand, supports sending notifications via email, Slack and PagerDuty. Users can also configure notifications to alert them when a workflow succeeds or fails.
Often you may want to rely on a custom Docker image that you built to run your pipelines. Kubeflow supports providing custom Docker images with containerized components. Custom images can be built by running <span class="code-inline">kfp component build src/ --push-image</span> where <span class="code-inline">src</span> contains all the source code. The pipeline can then be compiled and executed like any other Kubeflow pipeline.
Flyte supports raw containers using the <span class="code-inline">ContainerTask</span> class. Custom images can be provided using the <span class="code-inline">container_image</span> argument of <span class="code-inline">@task</span>, but for more control over the container, a <span class="code-inline">ContainerTask</span> can be used. The registration and execution of code on the Flyte backend follow the same pattern as for a regular <span class="code-inline">@task</span>, which simplifies iterating on workflows.
As per the Kubeflow docs, containerized components can be used in production; however, the additional build step that needs to be triggered to build the Docker image hinders the iteration of pipelines because the image needs to be built every time there’s a code change.
In Flyte, when dependencies beyond those in the default flytekit image are required, a custom image can be provided while registering the workflows using the command <span class="code-inline">pyflyte register --image <image> <package-or-module></span>. However, a Docker image need not be built every time there’s a code change. This is known as fast registration, which is enabled by default in the <span class="code-inline">pyflyte register</span> command. It saves time and speeds up development.
Kubeflow’s type system currently supports a fairly limited set of types, and there’s no plug-in support available to add custom types. Flyte, on the other hand, enables adding custom types using Type Transformers; for an example, see the <span class="code-inline">PyTorch2ONNX</span> type. Type Transformers are simple to understand, and they are the way to contribute new types to the Flyte type system.
Kubeflow ships with several serving integrations to serve ML models, including BentoML, Seldon, Triton and KServe. Flyte doesn’t currently ship with any model serving integrations, but UnionML, an ML wrapper built on Flyte, supports FastAPI for serving, and BentoML support is slated for the upcoming UnionML release.
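The general shape of a pluggable type system can be sketched in a few lines of plain Python. This is a conceptual model under stated assumptions, not Flytekit’s Type Transformer API: the registry <span class="code-inline">_TRANSFORMERS</span>, the functions <span class="code-inline">register</span>, <span class="code-inline">to_literal</span> and <span class="code-inline">from_literal</span>, and the <span class="code-inline">Point</span> type are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Type

# Hypothetical registry: Python type -> (serialize, deserialize) functions.
_TRANSFORMERS: Dict[Type, Tuple[Callable[[Any], str], Callable[[str], Any]]] = {}


def register(py_type: Type, serialize: Callable[[Any], str],
             deserialize: Callable[[str], Any]) -> None:
    """Teach the type system how to move py_type across task boundaries."""
    _TRANSFORMERS[py_type] = (serialize, deserialize)


def to_literal(value: Any) -> str:
    """Serialize a value using the transformer registered for its type."""
    serialize, _ = _TRANSFORMERS[type(value)]
    return serialize(value)


def from_literal(py_type: Type, literal: str) -> Any:
    """Rebuild a value of py_type from its serialized form."""
    _, deserialize = _TRANSFORMERS[py_type]
    return deserialize(literal)


@dataclass
class Point:
    x: float
    y: float


# Adding a custom type is a single registration call.
register(
    Point,
    lambda p: f"{p.x},{p.y}",
    lambda s: Point(*(float(part) for part in s.split(","))),
)
```

An unregistered type fails with a <span class="code-inline">KeyError</span> in this sketch, which mirrors the situation of a type system that has no plug-in point for new types.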
Checkpointing is an important feature when training ML models. Training is expensive, and storing snapshots of the model lets you continue subsequent executions from the failed state instead of running them from the beginning.
Flyte provides an intra-task checkpointing feature that can be leveraged from within <span class="code-inline">@task</span>. This feature supports the use of AWS spot instances or GCP preemptible instances to lower costs.
Kubeflow allows running pipelines on spot instances. However, it provides no checkpointing feature, so when an instance is reclaimed, the progress of the execution is lost.
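The value of checkpointing is easy to see in a toy training loop. The sketch below is plain Python and does not use Flyte’s checkpointing API: the hypothetical <span class="code-inline">train</span> function writes a JSON snapshot after every step and resumes from it on the next call, and the <span class="code-inline">fail_at</span> parameter simulates a preempted spot instance.

```python
import json
import os
import tempfile


def train(total_steps: int, checkpoint_path: str, fail_at: int = -1) -> int:
    """Run a toy training loop that resumes from checkpoint_path if it exists."""
    step, loss = 0, 100.0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        step, loss = state["step"], state["loss"]

    while step < total_steps:
        if step == fail_at:
            raise RuntimeError(f"preempted at step {step}")  # simulated interruption
        loss *= 0.5  # stand-in for one training step
        step += 1
        with open(checkpoint_path, "w") as f:  # snapshot after every step
            json.dump({"step": step, "loss": loss}, f)
    return step


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    try:
        train(total_steps=5, checkpoint_path=path, fail_at=3)
    except RuntimeError as err:
        print(err)
    # The second attempt picks up at step 3 instead of starting over.
    print("finished at step", train(total_steps=5, checkpoint_path=path))
```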
Kubeflow can be extended to accommodate other platforms but the integrations for the most part are bulky in the case of backend plugins.
Because Flyte backend plugins don’t require you to run pods for simple API calls, integrations are less bulky. For easier debugging, the Flyte UI includes queries to services and other relevant details.
Recovery mode in Flyte makes it easy to recover an individual execution by copying all successful node executions and running from the failed nodes. The “recover” button on the Flyte UI helps recover a failed execution. This is a critical feature for compute-intensive ML workflows; running a workflow from scratch when an abrupt failure crops up irrespective of the status of a task consumes resources unnecessarily. Ideally, skipping successful task node executions means better resource management and quicker iterations.
Kubeflow doesn’t currently support recovering partial executions. Without checkpointing or recovery mode, pipelines must run from scratch after an abrupt failure.
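The recovery idea, reusing successful node results and rerunning only from the failed node, can be modeled in plain Python. This sketch is not how Flyte implements recovery mode; the function <span class="code-inline">run_pipeline</span> and its <span class="code-inline">completed</span> store are hypothetical.

```python
from typing import Callable, Dict, List


def run_pipeline(
    steps: Dict[str, Callable[[], str]],
    completed: Dict[str, str],
) -> List[str]:
    """Run steps in order, skipping any whose output is already in `completed`.

    Returns the names of the steps that actually executed in this attempt.
    """
    executed = []
    for name, step in steps.items():
        if name in completed:
            continue  # reuse the earlier successful result
        completed[name] = step()  # raises on failure; earlier results are kept
        executed.append(name)
    return executed


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_train() -> str:
        """Fails on the first call, succeeds afterwards."""
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise RuntimeError("node lost")
        return "model"

    steps = {"load": lambda: "data", "train": flaky_train, "report": lambda: "pdf"}
    results: Dict[str, str] = {}
    try:
        run_pipeline(steps, results)
    except RuntimeError as err:
        print("first attempt failed:", err)
    print("recovered by running:", run_pipeline(steps, results))
```

On the second attempt the already successful <span class="code-inline">load</span> step is skipped, so only the failed step and the steps after it consume resources.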
In the next section, let’s look at how Kubeflow’s code ergonomics differ from Flyte's.
Code Ergonomics Comparison
In this section, we compare Flyte and Kubeflow approaches to defining simple and advanced ML pipelines, using passages copied from the Kubeflow quickstart guide. This should also serve as a migration guide in case you want to migrate from Kubeflow to Flyte.
[Table: simple ML pipeline in Kubeflow Pipelines v2 and in Flyte, each with pipeline code, trigger code via CLI and trigger code via Python SDK]
A Flyte workflow or task can be triggered:
- from the CLI
- on the UI
- programmatically, using the FlyteRemote API
Advanced ML Pipeline
[Table: advanced ML pipeline in Kubeflow Pipelines v2 and in Flyte, each with pipeline code, trigger code via CLI and trigger code via Python SDK]
Flyte is more closely aligned with Pythonic syntax than Kubeflow, which seems to have its own DSL. The code execution experience remains the same but local deployment (spinning up a relevant cluster) is far easier with Flyte than with Kubeflow.
The Problem with Triggering Executions in Kubeflow
The <span class="code-inline">Trigger(CLI)</span> sections in the above tables specify the commands to run to compile and execute the workflow/pipeline. Kubeflow CLI provides two commands to compile-and-create the execution on the Kubeflow backend:
- <span class="code-inline">kfp dsl compile --py <python-file> --output <compiled-result-path></span>
- <span class="code-inline">kfp run create --experiment-name <> --package-file <pipeline-package></span>
If the <span class="code-inline">python-file</span> consists of multiple components or pipelines, the <span class="code-inline">--function</span> argument can be used to specify the component or pipeline that needs to be compiled. With Flyte, however, multiple tasks and workflows can be serialized/compiled and registered with a single command.
The following three commands serialize and register code on the Flyte backend:
- <span class="code-inline">pyflyte run --remote <python-file> <workflow-or-task></span>
- <span class="code-inline">pyflyte register <package-or-module></span>
- <span class="code-inline">pyflyte --pkgs <package> package</span> + <span class="code-inline">flytectl register files --project <project> --domain <domain> --archive <archive> --version <version></span>
<span class="code-inline">pyflyte run</span> is a lightweight, convenient command that operates on a single file and is easy to implement. <span class="code-inline">pyflyte register</span> is more of a production-grade command that can register multiple workflows and tasks at the same time. <span class="code-inline">pyflyte package</span> + <span class="code-inline">flytectl register</span> is helpful when there are multiple FlyteAdmins and can compile-register-execute multiple tasks and workflows.
Providing Custom Images
A component in Kubeflow and a task in Flyte may need to be associated with custom images if the dependencies are specialized (a standard use case in ML workflows). In that case, the Kubeflow component is called a “containerized component.”
[Table: providing custom images in Kubeflow Pipelines v2 and in Flyte]
Union Cloud: Hosted Flyte
Your choice of orchestrator plays a vital role in determining how quickly you can get your pipelines into production and the time you spend fixing errors that hinder the development and deployment processes.
Union Cloud, a hosted version of Flyte, simplifies the maintenance and deployment of Flyte, freeing data and ML teams from infrastructure setup and constraints. In no time, you can get your Flyte cluster up and running to deploy your workflows. Join our waitlist to try Union Cloud!
Kubeflow is a sophisticated tool for ML practitioners who are well-versed in Kubernetes or OK with the learning curve it imposes. For teams who want to hit the ground running right away, Kubeflow may be an impediment, and Kubeflow alternatives have long been a topic of discussion.
Flyte was designed to help ML practitioners create production-grade ML pipelines in no time. Flyte leverages the scalability offered by Kubernetes but abstracts away its language to make it more accessible to ML teams. We also recommend you check out UnionML for a simplified experience while leveraging the ML capabilities of Flyte.
Many companies have made the transition from Kubeflow to Flyte, and we’ve heard teams tell us about the time they’ve saved developing and deploying ML pipelines. We hope Flyte can help you, too.
Let us know what you think of this piece. We’d love to hear from you!