=== PAGE: https://www.union.ai/docs/v1/flyte === # Documentation Welcome to the documentation. ## Subpages - **Flyte** - **Tutorials** - **Integrations** - **Reference** - **Community** - **Architecture** - **Platform deployment** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide === # Flyte Flyte is a free and open source platform that provides a full suite of powerful features for orchestrating AI workflows. Flyte empowers AI development teams to rapidly ship high-quality code to production by offering optimized performance, unparalleled resource efficiency, and a delightful workflow authoring experience. You deploy and manage Flyte yourself, on your own cloud infrastructure. > [!NOTE] > This documentation for open-source Flyte is maintained by Union.ai. > > You can switch to the documentation for the commercial versions with the selector above. ### 💡 **Introduction** Flyte is the leading open-source Kubernetes-native workflow orchestrator. ### 🔢 **Getting started** Build your first Flyte workflow, exploring the major features of the platform along the way. ### 🔗 **Core concepts** Understand the core concepts of the Flyte platform. ### 🔗 **Development cycle** Explore the Flyte development cycle from experimentation to production. ### 🔗 **Data input/output** Manage the input and output of data in your Flyte workflow. ### 🔗 **Programming** Learn about Flyte-specific programming constructs. ## Subpages - **Introduction** - **Getting started** - **Core concepts** - **Development cycle** - **Data input/output** - **Programming** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/introduction === # Introduction [Flyte](https://flyte.org) is a free and open source platform that provides a full suite of powerful features for orchestrating AI workflows. ## Flyte Flyte provides the building blocks needed for an end-to-end AI platform: * Reusable, immutable tasks and workflows * Declarative task-level resource provisioning * GitOps-style versioning and branching * Strongly typed interfaces between tasks, enabling more reliable code * Caching, intra-task checkpointing, and spot instance provisioning * Task parallelism with *map tasks* * Dynamic workflows created at runtime for process flexibility ## Trying out Flyte You can try out Flyte in a couple of ways: * To set up a local cluster on your own machine, go to **Getting started**. * To try a turn-key cloud service that includes all of Flyte plus additional features, go to [Union.ai Serverless Getting started](/docs/v1/serverless//user-guide/getting-started). ## Flyte in production For production use, you will need to [deploy and manage Flyte on your own cloud infrastructure](../deployment/_index). If you prefer a managed solution, have a look at: * [Union.ai Serverless](/docs/v1/serverless/). * [Union.ai BYOC (Bring Your Own Cloud)](/docs/v1/byoc/). === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/getting-started === # Getting started This section gives you a quick introduction to writing and running Flyte workflows. ## Try Flyte in your browser You can try Flyte in your browser without any setup simply by [signing up for **Union.ai Serverless**](https://signup.union.ai/). [Union.ai Serverless is a fully-hosted version of Flyte](/docs/v1/serverless/) with additional features. ## Try Flyte on your local machine You can also install Flyte's SDK (called `flytekit`) and a local Flyte cluster to run workflows on your local machine. To get started, follow the instructions on the next page, **Getting started > Local setup**.
## Subpages - **Getting started > Local setup** - **Getting started > First project** - **Getting started > Understanding the code** - **Getting started > Running your workflow** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/getting-started/local-setup === # Local setup In this section we will set up your local environment so that you can start building and deploying Flyte workflows from your local machine. ## Install `uv` First, [install `uv`](https://docs.astral.sh/uv/#getting-started). > [!NOTE] Using `uv` as best practice > The `uv` tool is our [recommended package and project manager](https://docs.astral.sh/uv/). > It replaces `pip`, `pip-tools`, `pipx`, `poetry`, `pyenv`, `twine`, `virtualenv`, and more. > > You can, of course, use other tools, > but all discussion in these pages will use `uv`, > so you will have to adapt the directions as appropriate. ## Ensure the correct version of Python is installed Flytekit requires Python `>=3.9,<3.13`. We recommend using `3.12`. You can install it with: ```shell $ uv python install 3.12 ``` > [!NOTE] Uninstall higher versions of Python > When installing Python packages "as tools" (as we do below with the `flytekit` package), > `uv` will default to the latest version of Python available on your system. > If you have a version `>=3.13` installed, you will need to uninstall it since `flytekit` requires `>=3.9,<3.13`. ## Install the `pyflyte` CLI Once `uv` is installed, use it to install the `pyflyte` CLI by installing the `flytekit` Python package: ```shell $ uv tool install flytekit ``` This will make the `pyflyte` CLI globally available on your system. > [!NOTE] Add the installation location to your PATH > `uv` installs tools in `~/.local/bin` by default. > Make sure this location is in your `PATH`, so you can run the `pyflyte` command from anywhere. > `uv` provides a convenience command to do this: `uv tool update-shell`. > > Note that later in this guide we will be running the `pyflyte` CLI to run your workflows. > In those cases you will be running `pyflyte` within the Python virtual environment of your workflow project. > You will not be using this globally installed instance of `pyflyte`. > This instance of `pyflyte` is only used during the configuration step, below, when no projects yet exist. ## Install Docker and get access to a container registry Flyte tasks are run in containers. Every container requires a container image that defines its software environment. When developing and running Flyte tasks and workflows, an important part of the process is building those images and pushing them to a container registry. Your Flyte installation then pulls down these images when it spins up the containers that run your tasks. To build and push the images you need to have Docker (or an equivalent container runtime) installed on your local machine. Go to [the Docker website](https://docs.docker.com/get-docker/) for installation directions. You will also need access to a container registry where you can push your images. Furthermore, the pushed images will need to be accessible to the Flyte installation you are using (the registry must be reachable and the images themselves must have the appropriate permissions; for example, a public registry like `ghcr.io` with the images set to public would work). ## Install `flytectl` to set up a local cluster For production use you will need to install Flyte in your cloud infrastructure (see **Platform deployment**). Here we are using a local cluster for experimentation and demonstration purposes.
To set up a local cluster you must first install the `flytectl` CLI. > [!NOTE] Flytectl vs Pyflyte > `flytectl` is different from `pyflyte`. > > `pyflyte` is a Python program and part of the `flytekit` SDK. > It is the primary command-line tool used during Flyte development. > > `flytectl` is a compiled binary (written in Go) and is used for performing certain administrative tasks > (see **Flytectl CLI** for details). To install `flytectl`, follow these instructions: ### macOS To install `flytectl` on a Mac, use [Homebrew](https://brew.sh/), `curl`, or download the binary manually. **Homebrew** ```shell $ brew tap flyteorg/homebrew-tap $ brew install flytectl ``` **curl** To use `curl`, set `BINDIR` to the install location (it defaults to `./bin`) and run the following command: ```shell $ curl -sL https://ctl.flyte.org/install | sudo bash -s -- -b /usr/local/bin ``` **Manual download** To download manually, see the [`flytectl` releases](https://github.com/flyteorg/flytectl/releases). ### Linux To install `flytectl` on Linux, use `curl` or download the binary manually. **curl** To use `curl`, set `BINDIR` to the installation location (it defaults to `./bin`) and run the following command (note that [jq](https://jqlang.org/) needs to be installed to run this script): ```shell $ curl -sL https://ctl.flyte.org/install | sudo bash -s -- -b /usr/local/bin ``` **Manual download** To download manually, see the [`flytectl` releases](https://github.com/flyteorg/flytectl/releases). ### Windows To install `flytectl` on Windows, use `curl`, or download the binary manually. **curl** To use `curl`, in a Linux shell (such as [WSL](https://learn.microsoft.com/en-us/windows/wsl/install)), set `BINDIR` to the install location (it defaults to `./bin`) and run the following command: ```shell $ curl -sL https://ctl.flyte.org/install | sudo bash -s -- -b /usr/local/bin ``` **Manual download** To download manually, see the [`flytectl` releases](https://github.com/flyteorg/flytectl/releases). ## Start Docker and the local cluster Once you have installed Docker and `flytectl`, do the following: 1. Start the Docker daemon. 2. Use `flytectl` to set up a local Flyte cluster by running: ```shell $ flytectl demo start ``` This will start a local Flyte cluster on your machine and place a configuration file in `~/.flyte/config-sandbox.yaml` that contains the connection information to connect `pyflyte` (and `flytectl`) to that cluster. The local Flyte cluster will be available at `localhost:30080`. > [!NOTE] Try Flyte technology through Union.ai Serverless > Alternatively, you can try using Flyte technology through Union.ai Serverless. > With Union.ai Serverless you do not need to install a local cluster and can start > experimenting immediately on a full cloud deployment. > You can even use the Workspaces in-browser IDE to quickly iterate on code. > See [Union.ai Serverless > Getting started](/docs/v1/serverless//user-guide/getting-started) for more details. ## Configure the connection to your cluster To configure the connection from `pyflyte` and `flytectl` to your Flyte instance, set the `FLYTECTL_CONFIG` environment variable to point to the configuration file that `flytectl` created: ```shell $ export FLYTECTL_CONFIG=~/.flyte/config-sandbox.yaml ``` This will allow you to interact with your local Flyte cluster. To interact with a Flyte cluster in the cloud you will need to adjust the configuration and point the environment variable to the new configuration file.
Alternatively, you can always specify the configuration file on the command line when invoking `pyflyte` or `flytectl` by using the `--config` flag. For example: ```shell $ pyflyte --config ~/.my-config-location/my-config.yaml run my_script.py my_workflow ``` See **Development cycle > Running in a local cluster** for more details on the format of the `yaml` file. ## Check your CLI configuration To check your CLI configuration, run: ```shell $ pyflyte info ``` You should get a response like this: ```shell $ pyflyte info ╭──────────────────────────── Flytekit CLI Info ────────────────────────────╮ │ │ │ This CLI is meant to be used within a virtual environment that has Flytekit installed. Ideally it is used to iterate on your Flyte │ │ workflows and tasks. │ │ │ │ Flytekit Version: 1.15.0 │ │ Flyte Backend Version: │ │ Flyte Backend Endpoint: │ │ │ ╰────────────────────────────────────────────────────────────────────────────╯ ``` === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/getting-started/first-project === # First project In this section we will set up a new project. This involves creating a local project directory holding your project code and a corresponding Flyte project to which you will deploy that code using the `pyflyte` CLI. ## Create a new Flyte project Create a new project on your local Flyte cluster: ```shell $ flytectl create project \ --id "my-project" \ --labels "my-label=my-project" \ --description "My Flyte project" \ --name "My project" ``` You can see the project you just created by going to `http://localhost:30080` in your browser. > [!NOTE] Default project > Flyte provides a default project (called `flytesnacks`) where all your workflows will be > registered unless you specify otherwise. > In this section, however, we will be using the project we just created, not the default. ## Initialize a local project We will use the `pyflyte init` command to initialize a new local project corresponding to the project created on your Flyte instance: ```shell $ pyflyte init --template flyte-simple my-project ``` The resulting directory will look like this: ```shell ├── LICENSE ├── README.md ├── hello_world.py ├── pyproject.toml └── uv.lock ``` > [!NOTE] Local project directory name same as Flyte project ID > It is good practice to name your local project directory the same as your > Flyte project ID, as we have done here. Next, let's look at the contents of the local project directory. Continue to **Getting started > Understanding the code**. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/getting-started/understanding-the-code === # Understanding the code This is a simple "Hello, world!"
example consisting of a flat directory: ```shell ├── LICENSE ├── README.md ├── hello_world.py ├── pyproject.toml └── uv.lock ``` ## Python code The `hello_world.py` file illustrates the essential components of a Flyte workflow: ```python # Hello World import flytekit as fl import os image_spec = fl.ImageSpec( # The name of the image. This image will be used by the `say_hello` task. name="say-hello-image", # Lock file with dependencies to be installed in the image. requirements="uv.lock", # Image registry to which this image will be pushed. # Set the environment variable FLYTE_IMAGE_REGISTRY to the URL of your registry. # The image will be built on your local machine, so ensure that Docker is running. # Ensure that the pushed image is accessible to your Flyte cluster, so that it can pull the image # when it spins up the task container. registry=os.environ['FLYTE_IMAGE_REGISTRY'] ) @fl.task(container_image=image_spec) def say_hello(name: str) -> str: return f"Hello, {name}!" @fl.workflow def hello_world_wf(name: str = "world") -> str: greeting = say_hello(name=name) return greeting ``` ### ImageSpec The `ImageSpec` object is used to define the container image that will run the tasks in the workflow. Here we have the simplest possible `ImageSpec` object, which specifies: * The `name` of the image. * This name will be used to identify the image in the container registry. * The `requirements` parameter. * We specify that the requirements should be read from the `uv.lock` file. * The `registry` to which the image will be pushed. * Here we use the environment variable `FLYTE_IMAGE_REGISTRY` to hold the URL of the registry. * You must ensure that this environment variable is correctly set before you register the workflow. * You must also ensure that when the image is pushed to the registry, it will be accessible to your Flyte cluster, so that it can pull the image when it spins up the task container. See **Development cycle > ImageSpec** for more information. ### Tasks The `@fl.task` decorator indicates a Python function that defines a task (see **Core concepts > Tasks**). A task takes some input and produces an output. When deployed to a Flyte cluster, each task runs in its own Kubernetes pod. For a full list of task parameters, see **Core concepts > Tasks > Task parameters**. ### Workflow The `@fl.workflow` decorator indicates a function that defines a workflow (see **Core concepts > Workflows**). This function contains references to the tasks defined elsewhere in the code. A workflow appears to be a Python function but is actually a [DSL](https://en.wikipedia.org/wiki/Domain-specific_language) that only supports a subset of Python syntax and semantics. When deployed to Flyte, the workflow function is compiled to construct the directed acyclic graph (DAG) of tasks, defining the order of execution of task pods and the data flow dependencies between them. > [!NOTE] `@fl.task` and `@fl.workflow` syntax > * The `@fl.task` and `@fl.workflow` decorators will only work on functions at the top-level > scope of the module. > * You can invoke tasks and workflows as regular Python functions and even import and use them in > other Python modules or scripts. > * Task and workflow function signatures must be type-annotated with Python type hints. > * Task and workflow functions must be invoked with keyword arguments. ## pyproject.toml The `pyproject.toml` file is the standard project configuration file used by `uv`. It specifies the project dependencies and the Python version to use.
The default `pyproject.toml` file created by `pyflyte init` from the `flyte-simple` template looks like this: ```toml [project] name = "flyte-simple" version = "0.1.0" description = "A simple Flyte project" readme = "README.md" requires-python = ">=3.9,<3.13" dependencies = ["flytekit"] ``` (You can update the `name` and `description` to match the actual name of your project, `my-project`, if you like). The most important part of the file is the list of dependencies, in this case consisting of only one package, `flytekit`. See [uv > Configuration > Configuration files](https://docs.astral.sh/uv/configuration/files/) for details. ## uv.lock The `uv.lock` file is generated from `pyproject.toml` by the `uv sync` command. It contains the exact versions of the dependencies required by the project. The `uv.lock` included in the `init` template may not reflect the latest versions of the dependencies, so you should update it by doing a fresh `uv sync`. See [uv > Concepts > Projects > Locking and syncing](https://docs.astral.sh/uv/concepts/projects/sync/) for details. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/getting-started/running-your-workflow === # Running your workflow ## Python virtual environment The first step is to ensure that your `uv.lock` file is properly generated from your `pyproject.toml` file and that your local Python virtual environment is properly set up. Using `uv`, you can install the dependencies with the command: ```shell $ uv sync ``` You can then activate the virtual environment with: ```shell $ source .venv/bin/activate ``` > [!NOTE] `activate` vs `uv run` > When running the `pyflyte` CLI within your local project you must run it in the virtual > environment _associated with_ that project. > This differs from our earlier usage of the tool, when we installed `pyflyte` globally > in order to set up its configuration (see **Getting started > Local setup**). > > To run `pyflyte` within your project's virtual environment using `uv`, > you can prefix it with the `uv run` command. For example: > > `uv run pyflyte ...` > > Alternatively, you can activate the virtual environment with `source .venv/bin/activate` and then > run the `pyflyte` command directly. > > In our examples we assume that you are doing the latter. ## Run the code locally Because tasks and workflows are defined as regular Python functions, they can be executed in your local Python environment. You can run the workflow locally with the `pyflyte run` command: ```shell $ pyflyte run hello_world.py hello_world_wf ``` You should see output like this: ```shell Running Execution on local. Hello, world! ``` You can also pass in parameters to the workflow (assuming they are declared in the workflow function): ```shell $ pyflyte run hello_world.py hello_world_wf --name="everybody" ``` You should see output like this: ```shell Running Execution on local. Hello, everybody! ``` ## Running remotely on Flyte in the cloud Running your code in your local Python environment is useful for testing and debugging. But to run your workflows at scale, you will need to deploy them (or, as we say, "register" them) onto your Flyte instance in the cloud. When task and workflow code is registered: * The `@fl.task` function is loaded into a container defined by the `ImageSpec` object specified in the `container_image` parameter of the decorator. * The `@fl.workflow` function is compiled into a directed acyclic graph that controls the running of the tasks invoked within it.
To run the workflow on Flyte in the cloud, use the `pyflyte run` command with the `--remote` flag: ```shell $ pyflyte run --remote --project my-project --domain development hello_world.py hello_world_wf ``` The output displays a URL that links to the workflow execution in the UI: ```shell 👍 Build submitted! ⏳ Waiting for build to finish at: https:///org/... ✅ Build completed in 0:01:57! [✔] Go to https:///org/... to see execution in the UI. ``` Click the link to see the execution in the UI. ## Register the workflow without running Above we used the `pyflyte run --remote` command to register and immediately run a workflow on Flyte. This is useful for quick testing, but for more complex workflows you may want to register the workflow first and then run it from the Flyte interface. To do this, you can use the `pyflyte register` command to register the workflow code with Flyte. The form of the command is: ```shell $ pyflyte register [<options>] <path> ``` In our case, from within the `my-project` directory, you would do: ```shell $ pyflyte register --project my-project --domain development . ``` This registers all code in the current directory to Flyte but does not immediately run anything. You should see the following output (or similar) in your terminal: ```shell Running pyflyte register from /Users/my-user/scratch/my-project with images ImageConfig(default_image=Image(name='default', fqn='cr.flyte.org/flyteorg/flytekit', tag='py3.12-1.14.6', digest=None), images=[Image(name='default', fqn='cr.flyte.org/flyteorg/flytekit', tag='py3.12-1.14.6', digest=None)]) and image destination folder /root on 1 package(s) ('/Users/my-user/scratch/my-project',) Registering against demo.hosted.unionai.cloud Detected Root /Users/my-user/my-project, using this to create deployable package... Loading packages ['my-project'] under source root /Users/my-user/my-project No output path provided, using a temporary directory at /var/folders/vn/72xlcb5d5jbbb3kk_q71sqww0000gn/T/tmphdu9wf6_ instead Computed version is sSFSdBXwUmM98sYv930bSQ Image say-hello-image:lIpeqcBrlB8DlBq0NEMR3g found. Skip building. Serializing and registering 3 flyte entities [✔] Task: my-project.hello_world.say_hello [✔] Workflow: my-project.hello_world.hello_world_wf [✔] Launch Plan: my-project.hello_world.hello_world_wf Successfully registered 3 entities ``` ## Run the workflow from the Flyte interface To run the workflow, you need to go to the Flyte interface: 1. Navigate to the Flyte dashboard. 2. In the left sidebar, click **Workflows**. 3. Search for your workflow, then select the workflow from the search results. 4. On the workflow page, click **Launch Workflow**. 5. In the "Create New Execution" dialog, you can change the workflow version, launch plan, and inputs (if present). Click "Advanced options" to change the security context, labels, annotations, max parallelism, override the interruptible flag, and overwrite cached inputs. 6. To execute the workflow, click **Launch**. You should see the workflow status change to "Running", then "Succeeded" as the execution progresses. To view the workflow execution graph, click the **Graph** tab above the running workflow. ## View the workflow execution on Flyte When you view the workflow execution graph, you will see the following: ![Graph](../../_static/images/user-guide/getting-started/running-your-workflow/graph.png) Above the graph, there is metadata that describes the workflow execution, such as the duration and the workflow version.
Next, click on a task node to open up a sidebar that contains additional information about that task: ![Sidebar](../../_static/images/user-guide/getting-started/running-your-workflow/sidebar.png) === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts === # Core concepts Flyte is a platform for building and orchestrating the execution of interconnected software processes across machines in a computer cluster. In Flyte terminology, the software processes are called *tasks* and the overall organization of connections between tasks is called a *workflow*. The tasks in a workflow are connected to each other by their inputs and outputs. The output of one task becomes the input of another. More precisely, a workflow in Flyte is a *directed acyclic graph (DAG)* of *nodes* where each node is a unit of execution and the edges between nodes represent the flow of data between them. The most common type of node is a task node (which encapsulates a task), though there are also workflow nodes (which encapsulate subworkflows) and branch nodes. In most contexts we just say that a workflow is a DAG of tasks. You define tasks and workflows in Python using the Flytekit SDK. The Flytekit SDK provides a set of decorators and classes that allow you to define tasks and workflows in a way that is easy to understand and work with. Once defined, tasks and workflows are deployed to your Flyte instance (we say they are *registered* to the instance), where they are compiled into a form that can be executed on your Flyte cluster. In addition to tasks and workflows, another important concept in Flyte is the launch plan (see **Core concepts > Launch plans**). A launch plan is like a template that can be used to define the inputs to a workflow. Triggering a launch plan will launch its associated workflow with the specified parameters. ## Defining tasks and workflows Using the Flytekit SDK, tasks and workflows are defined as Python functions using the `@fl.task` and `@fl.workflow` decorators, respectively: ```python import flytekit as fl @fl.task def task_1(a: int, b: int, c: int) -> int: return a + b + c @fl.task def task_2(m: int, n: int) -> int: return m * n @fl.task def task_3(x: int, y: int) -> int: return x - y @fl.workflow def my_workflow(a: int, b: int, c: int, m: int, n: int) -> int: x = task_1(a=a, b=b, c=c) y = task_2(m=m, n=n) return task_3(x=x, y=y) ``` Here we see three tasks defined using the `@fl.task` decorator and a workflow defined using the `@fl.workflow` decorator. The workflow calls `task_1` and `task_2` and passes the results to `task_3` before finally outputting the result of `task_3`. When the workflow is registered, Flyte compiles the workflow into a directed acyclic graph (DAG) based on the input/output dependencies between the tasks. The DAG is then used to execute the tasks in the correct order, taking advantage of any parallelism that is possible. For example, the workflow above results in the following DAG: ![Workflow DAG](../../_static/images/user-guide/core-concepts/workflow-dag.png) ### Type annotation is required One important difference between Flyte and generic Python is that in Flyte all inputs and outputs *must be type annotated*. This is because tasks are strongly typed, meaning that the types of the inputs and outputs are validated at deployment time. See **Core concepts > Tasks > Tasks are strongly typed** for more details.
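For example, the annotations on a task's parameters and return value are what Flyte uses to build the task's typed interface. A minimal sketch (this `scale` task is purely illustrative and not part of the example project):

```python
import flytekit as fl

@fl.task
def scale(value: float, factor: float = 2.0) -> float:
    # Both inputs and the output are annotated, so Flyte can derive a strongly
    # typed interface for this task when it is registered.
    return value * factor

# Omitting the annotations (e.g. `def scale(value, factor): ...`) leaves Flyte
# unable to build that interface, and Flytekit will typically flag the task
# when the module is loaded.
```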
### Workflows *are not* full Python functions The definition of a workflow must be a valid Python function, so it can be run locally as a normal Python function during development, but only *a subset of Python syntax is allowed*, because it must also be compiled into a DAG that is deployed and executed on Flyte. *Technically then, the language of a workflow function is a domain-specific language (DSL) that is a subset of Python.* See **Core concepts > Workflows** for more details. ## Registering tasks and workflows ### Registering on the command line with `pyflyte` or `flytectl` In most cases, workflows and tasks (and possibly other things, such as launch plans) are defined in your project code and registered as a bundle using `pyflyte` or `flytectl`. For example: ```shell $ pyflyte register ./workflows --project my_project --domain development ``` Tasks can also be registered individually, but it is more common to register them alongside the workflow that uses them. See **Development cycle > Running your code**. ### Registering in Python with `FlyteRemote` As with all Flyte command line actions, you can also perform registration of workflows and tasks programmatically with [`FlyteRemote`](), specifically, [`FlyteRemote.register_script`](), [`FlyteRemote.register_workflow`](), and [`FlyteRemote.register_task`](). ## Results of registration When the code above is registered to Flyte, it results in the creation of five objects: * The tasks `workflows.my_example.task_1`, `workflows.my_example.task_2`, and `workflows.my_example.task_3` (see **Core concepts > Tasks** for more details). * The workflow `workflows.my_example.my_workflow`. * The default launch plan `workflows.my_example.my_workflow` (see **Core concepts > Launch plans** for more details). Notice that the task and workflow names are derived from the path, file name, and function name of the Python code that defines them: `<path>.<file name>.<function name>`. The default launch plan for a workflow always has the same name as its workflow. ## Changing tasks and workflows Tasks and workflows are changed by altering their definition in code and re-registering. When a task or workflow with the same project, domain, and name as a preexisting one is re-registered, a new version of that entity is created. ## Inspecting tasks and workflows ### Inspecting workflows in the UI Select **Workflows** in the sidebar to display a list of all the registered workflows in the project and domain. You can search the workflows by name. Click on a workflow in the list to see the **workflow view**. The sections in this view are as follows: * **Recent Workflow Versions**: A list of recent versions of this workflow. Select a version to see the **Workflow version view**. This view shows the DAG and a list of all versions of the workflow. You can switch between versions with the radio buttons. * **All Executions in the Workflow**: A list of all executions of this workflow. Click on an execution to go to the **Core concepts > Workflows > Viewing workflow executions**. * **Launch Workflow button**: In the top right of the workflow view, you can click the **Launch Workflow** button to run the workflow with the default inputs. ### Inspecting tasks in the UI Select **Tasks** in the sidebar to display a list of all the registered tasks in the project and domain. You can search the tasks by name. To filter for only those that are archived, check the **Show Only Archived Tasks** box.
Click on a task in the list to see the task view. The sections in the task view are as follows: * **Inputs & Outputs**: The name and type of each input and output for the latest version of this task. * **Recent Task Versions**: A list of recent versions of this task. Select a version to see the **Task version view**: This view shows the task details and a list of all versions of the task. You can switch between versions with the radio buttons. See **Core concepts > Tasks** for more information. * **All Executions in the Task**: A list of all executions of this task. Click on an execution to go to the execution view. * **Launch Task button**: In the top right of the task view, you can click the **Launch Task** button to run the task with the default inputs. ### Inspecting workflows on the command line with `flytectl` To view all workflows within a project and domain: ```shell $ flytectl get workflows \ --project <project> \ --domain <domain> ``` To view a specific workflow: ```shell $ flytectl get workflow \ --project <project> \ --domain <domain> \ <workflow-name> ``` See **Flytectl CLI** for more details. ### Inspecting tasks on the command line with `flytectl` To view all tasks within a project and domain: ```shell $ flytectl get tasks \ --project <project> \ --domain <domain> ``` To view a specific task: ```shell $ flytectl get task \ --project <project> \ --domain <domain> \ <task-name> ``` See **Flytectl CLI** for more details. ### Inspecting tasks and workflows in Python with `FlyteRemote` Use the method [`FlyteRemote.fetch_workflow`]() or [`FlyteRemote.client.get_workflow`]() to get a workflow. See [`FlyteRemote`]() for more options and details. Use the method [`FlyteRemote.fetch_task`]() or [`FlyteRemote.client.get_task`]() to get a task. See [`FlyteRemote`]() for more options and details. ## Running tasks and workflows ### Running a task or workflow in the UI To run a workflow in the UI, click the **Launch Workflow** button in the workflow view. You can also run individual tasks in the UI by clicking the **Launch Task** button in the task view. ### Running a task or workflow locally on the command line with `pyflyte` or `python` You can execute a Flyte workflow or task locally simply by calling it just like any regular Python function. For example, you can add the following to the above code: ```python if __name__ == "__main__": my_workflow(a=1, b=2, c=3, m=4, n=5) ``` If the file is saved as `my_example.py`, you can run it locally using the following command: ```shell $ python my_example.py ``` Alternatively, you can run the workflow locally with the following `pyflyte run` command: ```shell $ pyflyte run my_example.py my_workflow --a 1 --b 2 --c 3 --m 4 --n 5 ``` This has the advantage of allowing you to specify the input values as command line arguments. For more details on running workflows and tasks, see **Development cycle**. ### Running a task or workflow remotely on the command line with `pyflyte` To run a workflow remotely on your Flyte installation, use the following command (this assumes that you have set up a project on your Flyte installation; see **Development cycle > Setting up a production project**): ```shell $ pyflyte run --remote my_example.py my_workflow --a 1 --b 2 --c 3 --m 4 --n 5 ``` ### Running a task or workflow remotely in Python with `FlyteRemote` To run a workflow or task remotely in Python, use the method [`FlyteRemote.execute`](). See [`FlyteRemote`]() for more options and details.
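Putting these `FlyteRemote` pieces together, a minimal sketch of fetching and executing the workflow above might look like the following (the project, domain, and entity names are illustrative and assume the code was registered as shown earlier):

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Connect using your current Flyte configuration (for the local demo cluster,
# this is the config file created by `flytectl demo start`).
remote = FlyteRemote(
    config=Config.auto(),
    default_project="my_project",   # illustrative project and domain
    default_domain="development",
)

# Fetch the registered workflow by name (optionally pin a specific version).
wf = remote.fetch_workflow(name="workflows.my_example.my_workflow")

# Launch an execution, wait for it to finish, and read the outputs.
execution = remote.execute(wf, inputs={"a": 1, "b": 2, "c": 3, "m": 4, "n": 5})
execution = remote.wait(execution)
print(execution.outputs)
```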
## Subpages - **Core concepts > Workflows** - **Core concepts > Tasks** - **Core concepts > Launch plans** - **Core concepts > Caching** - **Core concepts > Named outputs** - **Core concepts > ImageSpec** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows === # Workflows So far in our discussion of workflows, we have focused on top-level workflows decorated with `@fl.workflow`. These are, in fact, more accurately termed **Core concepts > Workflows > Standard workflows** to differentiate them from the other types of workflows that exist in Flyte: **Core concepts > Workflows > Subworkflows and sub-launch plans**, **Core concepts > Workflows > Dynamic workflows**, and **Core concepts > Workflows > Imperative workflows**. In this section, we will delve deeper into the fundamentals of all of these workflow types, including their syntax, structure, and behavior. ## Subpages - **Core concepts > Workflows > Standard workflows** - **Core concepts > Workflows > Subworkflows and sub-launch plans** - **Core concepts > Workflows > Dynamic workflows** - **Core concepts > Workflows > Imperative workflows** - **Core concepts > Workflows > Launching workflows** - **Core concepts > Workflows > Viewing workflows** - **Core concepts > Workflows > Viewing workflow executions** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows/standard-workflows === # Standard workflows A standard workflow is defined by a Python function decorated with the `@fl.workflow` decorator. The function is written in a domain-specific language (DSL), a subset of Python syntax that describes the directed acyclic graph (DAG) that is deployed and executed on Flyte. The syntax of a standard workflow definition can only include the following: * Calls to functions decorated with `@fl.task` and assignment of variables to the returned values. * Calls to other functions decorated with `@fl.workflow` and assignment of variables to the returned values (see **Core concepts > Workflows > Subworkflows and sub-launch plans**). * Calls to **Core concepts > Launch plans** (see **Core concepts > Workflows > Subworkflows and sub-launch plans > When to use sub-launch plans**). * Calls to functions decorated with `@fl.dynamic` and assignment of variables to the returned values (see **Core concepts > Workflows > Dynamic workflows**). * The special `conditional` construct (see **Programming > Conditionals**). * Statements using the chaining operator (see **Programming > Chaining Entities**). ## Evaluation of a standard workflow When a standard workflow is **Core concepts > Workflows > Standard workflows > run locally in a Python environment**, it is executed as a normal Python function. However, when it is registered to Flyte, the top-level `@fl.workflow`-decorated function is evaluated as follows: * Inputs to the workflow are materialized as lazily evaluated promises, which are propagated to downstream tasks and subworkflows. * All values returned by calls to functions decorated with `@fl.task` or `@fl.dynamic` are also materialized as lazily evaluated promises. The resulting structure is used to construct the directed acyclic graph (DAG) and deploy the required containers to the cluster. The actual evaluation of these promises occurs when the tasks (or dynamic workflows) are executed in their respective containers. ## Conditional construct Because standard workflows cannot directly include Python `if` statements, a special `conditional` construct is provided that allows you to define conditional logic in a workflow.
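As a rough sketch of what this construct looks like (the `halve` and `negate` tasks here are purely illustrative):

```python
import flytekit as fl
from flytekit import conditional

@fl.task
def halve(x: int) -> int:
    return x // 2

@fl.task
def negate(x: int) -> int:
    return -x

@fl.workflow
def branching_wf(x: int = 4) -> int:
    # The comparison is recorded against the input promise and evaluated
    # when the workflow actually runs.
    return (
        conditional("sign_check")
        .if_(x >= 0)
        .then(halve(x=x))
        .else_()
        .then(negate(x=x))
    )
```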
For details, see **Programming > Conditionals**. ## Chaining operator When Flyte builds the DAG for a standard workflow, it uses the passing of values from one task to another to determine the dependency relationships between tasks. There may be cases where you want to define a dependency between two tasks that is not based on the output of one task being passed as an input to another. In that case, you can use the chaining operator `>>` to define the dependencies between tasks. For details, see **Programming > Chaining Entities**. ## Workflow decorator parameters The `@fl.workflow` decorator can take the following parameters: * `failure_policy`: Use the options in **Flytekit SDK**. * `interruptible`: Indicates if tasks launched from this workflow are interruptible by default. See **Core concepts > Tasks > Task hardware environment > Interruptible instances**. * `on_failure`: Invoke this workflow or task on failure. The workflow specified must have the same parameter signature as the current workflow, with an additional parameter called `error`. * `docs`: A description entity for the workflow. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows/subworkflows-and-sub-launch-plans === # Subworkflows and sub-launch plans In Flyte it is possible to invoke one workflow from within another. A parent workflow can invoke a child workflow in two ways: as a **subworkflow** or via a **Core concepts > Launch plans > Running launch plans > Sub-launch plans**. In both cases the child workflow is defined and registered normally, exists in the system normally, and can be run independently. But, if the child workflow is invoked from within the parent **by directly calling the child's function**, then it becomes a **subworkflow**. The DAG of the subworkflow is embedded directly into the DAG of the parent and effectively becomes part of the parent workflow execution, sharing the same execution ID and execution context. On the other hand, if the child workflow is invoked from within the parent **via its launch plan** (see **Core concepts > Launch plans**), this is called a **sub-launch plan**. It results in a new top-level workflow execution being invoked with its own execution ID and execution context. It also appears as a separate top-level entity in the system. The only difference is that it happens to have been kicked off from within another workflow instead of from the command line or the UI. Here is an example: ```python import flytekit as fl @fl.task def t(a: int, b: int) -> int: return a + b @fl.workflow def sub_wf(a: int, b: int) -> int: return t(a=a, b=b) # Get the default launch plan of sub_wf, which we name sub_wf_lp sub_wf_lp = fl.LaunchPlan.get_or_create(sub_wf) @fl.workflow def main_wf(): # Invoke sub_wf directly. # An embedded subworkflow results. sub_wf(a=3, b=4) # Invoke sub_wf through its default launch plan, here called sub_wf_lp. # A separate top-level execution (a sub-launch plan) results. sub_wf_lp(a=1, b=2) ``` ## When to use subworkflows Subworkflows allow you to manage parallelism between a workflow and its launched sub-flows, as they execute within the same context as the parent workflow. Consequently, all nodes of a subworkflow adhere to the overall constraints imposed by the parent workflow. Here's an example illustrating the calculation of slope, intercept and the corresponding y-value.
```python import flytekit as fl @fl.task def slope(x: list[int], y: list[int]) -> float: sum_xy = sum([x[i] * y[i] for i in range(len(x))]) sum_x_squared = sum([x[i] ** 2 for i in range(len(x))]) n = len(x) return (n * sum_xy - sum(x) * sum(y)) / (n * sum_x_squared - sum(x) ** 2) @fl.task def intercept(x: list[int], y: list[int], slope: float) -> float: mean_x = sum(x) / len(x) mean_y = sum(y) / len(y) intercept = mean_y - slope * mean_x return intercept @fl.workflow def slope_intercept_wf(x: list[int], y: list[int]) -> (float, float): slope_value = slope(x=x, y=y) intercept_value = intercept(x=x, y=y, slope=slope_value) return (slope_value, intercept_value) @fl.task def regression_line(val: int, slope_value: float, intercept_value: float) -> float: return (slope_value * val) + intercept_value # y = mx + c @fl.workflow def regression_line_wf(val: int = 5, x: list[int] = [-3, 0, 3], y: list[int] = [7, 4, -2]) -> float: slope_value, intercept_value = slope_intercept_wf(x=x, y=y) return regression_line(val=val, slope_value=slope_value, intercept_value=intercept_value) ``` The `slope_intercept_wf` computes the slope and intercept of the regression line. Subsequently, the `regression_line_wf` triggers `slope_intercept_wf` and then computes the y-value. It is possible to nest a workflow that contains a subworkflow within yet another workflow. Workflows can be easily constructed from other workflows, even if they also function as standalone entities. For example, each workflow in the example below has the capability to exist and run independently: ```python import flytekit as fl @fl.workflow def nested_regression_line_wf() -> float: return regression_line_wf() ``` ## When to use sub-launch plans Sub-launch plans can be useful for implementing exceptionally large or complicated workflows that can't be adequately implemented as **Core concepts > Workflows > Dynamic workflows** or **Core concepts > Workflows > Subworkflows and sub-launch plans > map tasks**. Dynamic workflows and map tasks share the same context and single underlying Kubernetes resource definitions. Workflows invoked via sub-launch plans do not share the same context. They are executed as separate top-level entities, allowing for better parallelism and scale. Here is an example of invoking a workflow multiple times through its launch plan: ```python import flytekit as fl @fl.task def my_task(a: int, b: int, c: int) -> int: return a + b + c @fl.workflow def my_workflow(a: int, b: int, c: int) -> int: return my_task(a=a, b=b, c=c) my_workflow_lp = fl.LaunchPlan.get_or_create(my_workflow) @fl.workflow def wf() -> list[int]: return [my_workflow_lp(a=i, b=i, c=i) for i in [1, 2, 3]] ``` === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows/dynamic-workflows === # Dynamic workflows A workflow whose directed acyclic graph (DAG) is computed at run-time is a [`dynamic`]() workflow. The tasks in a dynamic workflow are executed at runtime using dynamic inputs. A dynamic workflow shares similarities with the [`workflow`](), as it uses a Python-esque domain-specific language to declare dependencies between the tasks or define new workflows. A key distinction lies in the dynamic workflow being assessed at runtime. This means that the inputs are initially materialized and forwarded to the dynamic workflow, resembling the behavior of a task. However, the return value from a dynamic workflow is a [`Promise`]() object, which can be materialized by the subsequent tasks.
Think of a dynamic workflow as a combination of a task and a workflow. It is used to dynamically decide the parameters of a workflow at runtime and is both compiled and executed at runtime. Dynamic workflows become essential when you need to do the following: - Handle conditional logic - Modify the logic of the code at runtime - Change or decide on feature extraction parameters on the fly ## Defining a dynamic workflow You can define a dynamic workflow using the `@fl.dynamic` decorator. Within the `@fl.dynamic` context, each invocation of a [`task`]() or a derivative of the [`Task`]() class leads to deferred evaluation using a Promise, rather than the immediate materialization of the actual value. While nesting other `@fl.dynamic` and `@fl.workflow` constructs within this task is possible, direct interaction with the outputs of a task/workflow is limited, as they are lazily evaluated. If you need to interact with the outputs, we recommend separating the logic in a dynamic workflow and creating a new task to read and resolve the outputs. The example below uses a dynamic workflow to count the common characters between any two strings. We define a task that returns the index of a character, where A-Z/a-z is equivalent to 0-25: ```python import flytekit as fl @fl.task def return_index(character: str) -> int: if character.islower(): return ord(character) - ord("a") else: return ord(character) - ord("A") ``` We also create a task that updates a 26-element frequency list by incrementing the count for a given character index: ```python @fl.task def update_list(freq_list: list[int], list_index: int) -> list[int]: freq_list[list_index] += 1 return freq_list ``` We define a task to calculate the number of common characters between the two strings: ```python @fl.task def derive_count(freq1: list[int], freq2: list[int]) -> int: count = 0 for i in range(26): count += min(freq1[i], freq2[i]) return count ``` We define a dynamic workflow to accomplish the following: 1. Initialize a zero-filled, 26-element frequency list to be passed to the `update_list` task. 2. Iterate through each character of the first string (`s1`) and populate the frequency list. 3. Iterate through each character of the second string (`s2`) and populate the frequency list. 4. Determine the number of common characters by comparing the two frequency lists. The looping process depends on the number of characters in both strings, which is unknown until runtime: ```python @fl.dynamic def count_characters(s1: str, s2: str) -> int: # s1 and s2 should be accessible # Initialize lists with 26 slots each, one for each letter of the alphabet (lower and upper case share a slot) freq1 = [0] * 26 freq2 = [0] * 26 # Loop through characters in s1 for i in range(len(s1)): # Calculate the index for the current character in the alphabet index = return_index(character=s1[i]) # Update the frequency list for s1 freq1 = update_list(freq_list=freq1, list_index=index) # index and freq1 are not accessible as they are promises # Loop through characters in s2 for i in range(len(s2)): # Calculate the index for the current character in the alphabet index = return_index(character=s2[i]) # Update the frequency list for s2 freq2 = update_list(freq_list=freq2, list_index=index) # index and freq2 are not accessible as they are promises # Count the common characters between s1 and s2 return derive_count(freq1=freq1, freq2=freq2) ``` A dynamic workflow is modeled as a task in the Flyte backend, but the body of the function is executed to produce a workflow at runtime.
In both dynamic and static workflows, the outputs of tasks are Promise objects. Flyte executes the dynamic workflow within its container, resulting in a compiled DAG, which is then accessible in the UI. It uses the information acquired during the dynamic task's execution to schedule and execute each task within the dynamic workflow. Visualization of the dynamic workflow's graph in the UI is only available after it has completed its execution. When a dynamic workflow is executed, it generates the entire workflow structure as its output, termed the *futures file*. This name reflects the fact that the workflow has yet to be executed, so all subsequent outputs are considered futures. > [!NOTE] > Local execution works when a `@fl.dynamic` decorator is used because Flytekit treats it as a task that runs with native Python inputs. Finally, we define a standard workflow that triggers the dynamic workflow: ```python @fl.workflow def start_wf(s1: str, s2: str) -> int: return count_characters(s1=s1, s2=s2) ``` You can run the workflow locally as follows: ```python if __name__ == "__main__": print(start_wf(s1="Pear", s2="Earth")) ``` ## Advantages of dynamic workflows ### Flexibility Dynamic workflows streamline the process of building pipelines, offering the flexibility to design workflows according to the unique requirements of your project. This level of adaptability is not achievable with static workflows. ### Lower pressure on `etcd` The workflow Custom Resource Definition (CRD) and the states associated with static workflows are stored in `etcd`, the Kubernetes database. This database maintains Flyte workflow CRDs as key-value pairs, tracking the status of each node's execution. However, `etcd` has a hard limit on data size, encompassing the workflow and node status sizes, so it is important to ensure that static workflows don't excessively consume memory. In contrast, dynamic workflows offload the workflow specification (including node/task definitions and connections) to the object store. Still, the statuses of nodes are stored in the workflow CRD within `etcd`. Dynamic workflows help alleviate some pressure on `etcd` storage space, providing a solution to mitigate storage constraints. ## Dynamic workflows vs. map tasks Dynamic tasks come with overhead for large fan-out tasks as they store metadata for the entire workflow. In contrast, **Core concepts > Workflows > Dynamic workflows > map tasks** prove efficient for such extensive fan-out scenarios since they refrain from storing metadata, resulting in less noticeable overhead. ## Using dynamic workflows to achieve recursion Merge sort is a perfect example to showcase how to seamlessly achieve recursion using dynamic workflows. Flyte imposes limitations on the depth of recursion to prevent misuse and potential impacts on the overall stability of the system. ```python import flytekit as fl @fl.task def split(numbers: list[int]) -> tuple[list[int], list[int]]: length = len(numbers) return ( numbers[0 : int(length / 2)], numbers[int(length / 2) :] ) @fl.task def merge(sorted_list1: list[int], sorted_list2: list[int]) -> list[int]: result = [] while len(sorted_list1) > 0 and len(sorted_list2) > 0: # Compare the current element of the first array with the current element of the second array. # If the element in the first array is smaller, append it to the result and increment the first array index. # Otherwise, do the same with the second array.
if sorted_list1[0] < sorted_list2[0]: result.append(sorted_list1.pop(0)) else: result.append(sorted_list2.pop(0)) # Extend the result with the remaining elements from both arrays result.extend(sorted_list1) result.extend(sorted_list2) return result @fl.task def sort_locally(numbers: list[int]) -> list[int]: return sorted(numbers) @fl.dynamic def merge_sort_remotely(numbers: list[int], threshold: int) -> list[int]: split1, split2 = split(numbers=numbers) sorted1 = merge_sort(numbers=split1, threshold=threshold) sorted2 = merge_sort(numbers=split2, threshold=threshold) return merge(sorted_list1=sorted1, sorted_list2=sorted2) @fl.dynamic def merge_sort(numbers: list[int], threshold: int = 5) -> list[int]: if len(numbers) <= threshold: return sort_locally(numbers=numbers) else: return merge_sort_remotely(numbers=numbers, threshold=threshold) ``` By simply adding the `@fl.dynamic` annotation, the `merge_sort_remotely` function transforms into a plan of execution, generating a workflow with four distinct nodes. These nodes run remotely on potentially different hosts, with Flyte ensuring proper data reference passing and maintaining execution order with maximum possible parallelism. `@fl.dynamic` is essential in this context because the number of times `merge_sort` needs to be triggered is unknown at compile time. The dynamic workflow `merge_sort_remotely` calls `merge_sort`, which subsequently calls `merge_sort_remotely` again, creating a recursive and flexible execution structure. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows/imperative-workflows === # Imperative workflows Workflows are commonly created by applying the `@fl.workflow` decorator to Python functions. During compilation, this involves processing the function's body and utilizing subsequent calls to underlying tasks to establish and record the workflow structure. This is the *declarative* approach and is suitable when manually drafting the workflow. However, in cases where workflows are constructed programmatically, an imperative style is more appropriate. For instance, if tasks have been defined already, their sequence and dependencies might have been specified in textual form (perhaps during a transition from a legacy system). In such scenarios, you want to orchestrate these tasks. This is where Flyte's imperative workflows come into play, allowing you to programmatically construct workflows. ## Example To begin, we define the `slope` and `intercept` tasks: ```python import flytekit as fl @fl.task def slope(x: list[int], y: list[int]) -> float: sum_xy = sum([x[i] * y[i] for i in range(len(x))]) sum_x_squared = sum([x[i] ** 2 for i in range(len(x))]) n = len(x) return (n * sum_xy - sum(x) * sum(y)) / (n * sum_x_squared - sum(x) ** 2) @fl.task def intercept(x: list[int], y: list[int], slope: float) -> float: mean_x = sum(x) / len(x) mean_y = sum(y) / len(y) intercept = mean_y - slope * mean_x return intercept ``` Create an imperative workflow: ```python imperative_wf = fl.Workflow(name="imperative_workflow") ``` Add the workflow inputs to the imperative workflow: ```python imperative_wf.add_workflow_input("x", list[int]) imperative_wf.add_workflow_input("y", list[int]) ``` > If you want to assign default values to the workflow inputs, you can create a **Core concepts > Launch plans**.
Add the tasks that need to be triggered from within the workflow: ```python node_t1 = imperative_wf.add_entity(slope, x=imperative_wf.inputs["x"], y=imperative_wf.inputs["y"]) node_t2 = imperative_wf.add_entity( intercept, x=imperative_wf.inputs["x"], y=imperative_wf.inputs["y"], slope=node_t1.outputs["o0"] ) ``` Lastly, add the workflow output: ```python imperative_wf.add_workflow_output("wf_output", node_t2.outputs["o0"]) ``` You can execute the workflow locally as follows: ```python if __name__ == "__main__": print(f"Running imperative_wf() {imperative_wf(x=[-3, 0, 3], y=[7, 4, -2])}") ``` You also have the option to provide a list of inputs and retrieve a list of outputs from the workflow: ```python # `some_task` stands in for a task (defined elsewhere) that accepts a list input. wf_input_y = imperative_wf.add_workflow_input("y", list[str]) node_t3 = imperative_wf.add_entity(some_task, a=[imperative_wf.inputs["x"], wf_input_y]) imperative_wf.add_workflow_output( "list_of_outputs", [node_t1.outputs["o0"], node_t2.outputs["o0"]], python_type=list[str], ) ``` === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows/launching-workflows === # Launching workflows From the **Core concepts > Workflows > Viewing workflows > Workflow view** (accessed, for example, by selecting a workflow in the **Core concepts > Workflows > Viewing workflows > Workflows list**) you can select **Launch Workflow** in the top right. This opens the **New Execution** dialog for workflows: ![New execution dialog settings](../../../_static/images/user-guide/core-concepts/workflows/launching-workflows/new-execution-dialog-settings.png) At the top you can select: * The specific version of this workflow that you want to launch. * The launch plan to be used to launch this workflow (by default it is set to the **Core concepts > Launch plans > Default launch plan**). Along the left side the following sections are available: * **Inputs**: The input parameters of the workflow function appear here as fields to be filled in. * **Settings**: * **Execution name**: A custom name for this execution. If not specified, a name will be generated. * **Overwrite cached outputs**: A boolean. If set to `True`, this execution will overwrite any previously-computed cached outputs. * **Raw output data config**: Remote path prefix to store raw output data. By default, workflow output will be written to the built-in metadata storage. Alternatively, you can specify a custom location for output at the organization, project-domain, or individual execution levels. This field is for specifying this setting at the workflow execution level. If this field is filled in it overrides any settings at higher levels. The parameter is expected to be a URL to a writable resource (for example, `http://s3.amazonaws.com/my-bucket/`). See **Data input/output > Task input and output > Raw data store**. * **Max parallelism**: Number of workflow nodes that can be executed in parallel. If not specified, project/domain defaults are used. If 0 then no limit is applied. * **Force interruptible**: A three-valued setting for overriding the interruptible setting of the workflow for this particular execution. If not set, the workflow's interruptible setting is used. If set and **enabled** then `interruptible=True` is used for this execution. If set and **disabled** then `interruptible=False` is used for this execution. See **Core concepts > Tasks > Task hardware environment > Interruptible instances** * **Service account**: The service account to use for this execution. If not specified, the default is used.
* **Environment variables**: Environment variables that will be available to tasks in this workflow execution. * **Labels**: Labels to apply to the execution resource. * **Notifications**: **Core concepts > Launch plans > Notifications** configured for this workflow execution. * **Debug**: The workflow execution details for debugging purposes. Select **Launch** to launch the workflow execution. This will take you to the **Core concepts > Workflows > Viewing workflow executions**. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows/viewing-workflows === # Viewing workflows ## Workflows list The workflows list shows all workflows in the current project and domain: ![Workflows list](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflows/workflows-list.png) You can search the list by name and filter for only those that are archived. To archive a workflow, select the archive icon ![Archive icon](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflows/archive-icon.png). Each entry in the list provides some basic information about the workflow: * **Last execution time**: The time of the most recent execution of this workflow. * **Last 10 executions**: The status of the last 10 executions of this workflow. * **Inputs**: The input type for the workflow. * **Outputs**: The output type for the workflow. * **Description**: The description of the workflow. Select an entry on the list to go to that **Core concepts > Workflows > Viewing workflows > Workflow view**. ## Workflow view The workflow view provides details about a specific workflow. ![Workflow view](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflows/workflow-view.png) This view provides: * A list of recent workflow versions: Selecting a version will take you to the **Core concepts > Workflows > Viewing workflows > Workflow view > Workflow versions list**. * A list of recent executions: Selecting an execution will take you to the **Core concepts > Workflows > Viewing workflow executions**. ### Workflow versions list The workflow versions list shows the a list of all versions of this workflow along with a graph view of the workflow structure: ![Workflow version list](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflows/workflow-versions-list.png) ### Workflow and task descriptions Flyte enables the use of docstrings to document your code. Docstrings are stored in the control plane and displayed on the UI for each workflow or task. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/workflows/viewing-workflow-executions === # Viewing workflow executions The **Executions list** shows all executions in a project and domain combination. An execution represents a single run of all or part of a workflow (including subworkflows and individual tasks). You can access it from the **Executions** link in the left navigation. ![Executions list](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflow-executions/executions-list.png) ## Domain Settings This section displays any domain-level settings that have been configured for this project-domain combination. They are: * Security Context * Labels * Annotations * Raw output data config * Max parallelism ## All Executions in the Project For each execution in this project and domain you can see the following: * A graph of the **last 100 executions in the project**. 
* **Start time**: Select to view the **Core concepts > Workflows > Viewing workflow executions > Execution view**. * **Workflow/Task**: The **Core concepts > Workflows > Viewing workflows** or **Core concepts > Tasks > Viewing tasks** that ran in this execution. * **Version**: The version of the workflow or task that ran in this execution. * **Launch Plan**: The **Core concepts > Launch plans > Viewing launch plans** that was used to launch this execution. * **Schedule**: The schedule that was used to launch this execution (if any). * **Execution ID**: The ID of the execution. * **Status**: The status of the execution. One of **QUEUED**, **RUNNING**, **SUCCEEDED**, **FAILED** or **UNKNOWN**. * **Duration**: The duration of the execution. ## Execution view The execution view appears when you launch a workflow or task or select an already completed execution. An execution represents a single run of all or part of a workflow (including subworkflows and individual tasks). ![Execution view - nodes](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflow-executions/execution-view-nodes.png) > [!NOTE] > An execution usually represents the run of an entire workflow. > But, because workflows are composed of tasks (and sometimes subworkflows) and Flyte caches the outputs of those independently of the workflows in which they participate, it sometimes makes sense to execute a task or subworkflow independently. The top part of execution view provides detailed general information about the execution. The bottom part provides three tabs displaying different aspects of the execution: **Nodes**, **Graph**, and **Timeline**. ### Nodes The default tab within the execution view is the **Nodes** tab. It shows a list of the Flyte nodes that make up this execution (A node in Flyte is either a task or a (sub-)workflow). Selecting an item in the list opens the right panel showing more details of that specific node: ![](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflow-executions/execution-view-node-side-panel.png) The top part of the side panel provides detailed information about the node as well as the **Rerun task** button. Below that, you have the following tabs: **Executions**, **Inputs**, **Outputs**, and **Task**. The **Executions** tab gives you details on the execution of this particular node as well as access to: * **Task level monitoring**: You can access the **Core concepts > Tasks > Task hardware environment > Task-level monitoring** information by selecting **View Utilization**. * **Logs**: You can access logs by clicking the text under **Logs**. See **Core concepts > Tasks > Viewing logs**. The **Inputs**, **Outputs** tabs display the data that was passed into and out of the node, respectively. If this node is a task (as opposed to a subworkflow) then the **Task** tab displays the Task definition structure. ### Graph The Graph tab displays a visual representation of the execution as a directed acyclic graph: ![](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflow-executions/execution-view-graph.png) ### Timeline The Timeline tab displays a visualization showing the timing of each task in the execution: ![](../../../_static/images/user-guide/core-concepts/workflows/viewing-workflow-executions/execution-view-timeline.png) === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks === # Tasks Tasks are the fundamental units of compute in Flyte. 
They are independently executable, strongly typed, and containerized building blocks that make up workflows. Workflows are constructed by chaining together tasks, with the output of one task feeding into the input of the next to form a directed acyclic graph. ## Tasks are independently executable Tasks are designed to be independently executable, meaning that they can be run in isolation from other tasks. And since most tasks are just Python functions, they can be executed on your local machine, making it easy to unit test and debug tasks locally before deploying them to Flyte. Because they are independently executable, tasks can also be shared and reused across multiple workflows and, as long as their logic is deterministic, their input and outputs can be **Core concepts > Caching** to save compute resources and execution time. ## Tasks are strongly typed Tasks have strongly typed inputs and outputs, which are validated at deployment time. This helps catch bugs early and ensures that the data passing through tasks and workflows is compatible with the explicitly stated types. Under the hood, Flyte uses the [Flyte type system]() and translates between the Flyte types and the Python types. Python type annotations make sure that the data passing through tasks and workflows is compatible with the explicitly stated types defined through a function signature. The Flyte type system is also used for caching, data lineage tracking, and automatic serialization and deserialization of data as itโ€™s passed from one task to another. ## Tasks are containerized While (most) tasks are locally executable, when a task is deployed to Flyte as part of the registration process it is containerized and run in its own independent Kubernetes pod. This allows tasks to have their own independent set of [software dependencies](./task-software-environment/_index) and [hardware requirements](./task-hardware-environment/_index). For example, a task that requires a GPU can be deployed to Flyte with a GPU-enabled container image, while a task that requires a specific version of a software library can be deployed with that version of the library installed. ## Tasks are named, versioned, and immutable The fully qualified name of a task is a combination of its project, domain, and name. To update a task, you change it and re-register it under the same fully qualified name. This creates a new version of the task while the old version remains available. At the version level task are, therefore, immutable. This immutability is important for ensuring that workflows are reproducible and that the data lineage is accurate. ## Tasks are (usually) deterministic and cacheable When deciding if a unit of execution is suitable to be encapsulated as a task, consider the following questions: * Is there a well-defined graceful/successful exit criteria for the task? * A task is expected to exit after completion of input processing. * Is it deterministic and repeatable? * Under certain circumstances, a task might be cached or rerun with the same inputs. It is expected to produce the same output every time. You should, for example, avoid using random number generators with the current clock as seed. * Is it a pure function? That is, does it have side effects that are unknown to the system? * It is recommended to avoid side effects in tasks. * When side effects are unavoidable, ensure that the operations are idempotent. For details on task caching, see **Core concepts > Caching**. 
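For instance, a task that satisfies these guidelines (deterministic, no side effects) can be marked as cacheable; the sketch below is illustrative, and caching itself is covered in **Core concepts > Caching**:

```python
import flytekit as fl

@fl.task(cache=True, cache_version="1.0")
def normalize(values: list[float]) -> list[float]:
    # Deterministic: the same input list always produces the same output,
    # so a cached result can be reused safely.
    total = sum(values)
    return [v / total for v in values]
```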
## Workflows can contain many types of tasks One of the most powerful features of Flyte is the ability to run widely differing computational workloads as tasks with a single workflow. Because of the way that Flyte is architected, tasks within a single workflow can differ along many dimensions. While the total number of ways that tasks can be configured is quite large, the options fall into three categories: * **Task type**: These include standard Python tasks, map tasks, raw container tasks, and many specialized plugin tasks. For more information, see **Core concepts > Tasks > Other task types**. * **Software environment**: Define the task container image, dependencies, and even programming language. For more information, see [Task software environment](./task-software-environment/_index). * **Hardware environment**: Define the resource requirements (processor numbers, storage amounts) and machine node characteristics (CPU and GPU type). For more information, see [Task hardware environment](./task-hardware-environment/_index). ### Mix and match task characteristics Along these three dimensions, you can mix and match characteristics to build a task definition that performs exactly the job you want, while still taking advantage of all the features provided at the workflow level like output caching, versioning, and reproducibility. Tasks with diverse characteristics can be combined into a single workflow. For example, a workflow might contain: * A **Python task running on your default container image** with default dependencies and a default resource and hardware profile. * A **Python task running on a container image with additional dependencies** configured to run on machine nodes with a specific type of GPU. * A **raw container task** running a Java process. * A **plugin task** running a Spark job that spawns its own cluster-in-a-cluster. * A **map task** that runs multiple copies of a Python task in parallel. The ability to build workflows from such a wide variety of heterogeneous tasks makes Flyte uniquely flexible. > [!NOTE] > Not all parameters are compatible. For example, with specialized plugin task types, some configurations are > not available (this depends on task plugin details). ## Task configuration The `@fl.task` decorator can take a number of parameters that allow you to configure the task's behavior. For example, you can specify the task's software dependencies, hardware requirements, caching behavior, retry behavior, and more. For more information, see **Core concepts > Tasks > Task parameters**. ## Subpages - **Core concepts > Tasks > Map Tasks** - **Core concepts > Tasks > Other task types** - **Core concepts > Tasks > Task parameters** - **Core concepts > Tasks > Launching tasks** - **Core concepts > Tasks > Viewing tasks** - **Core concepts > Tasks > Task software environment** - **Core concepts > Tasks > Viewing logs** - **Core concepts > Tasks > Reference tasks** - **Core concepts > Tasks > Task hardware environment** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/map-tasks === ## Map tasks A map task allows you to execute many instances of a task within a single workflow node. This enables you to execute a task across a set of inputs without having to create a node for each input, resulting in significant performance improvements. Map tasks find application in various scenarios, including: * When multiple inputs require running through the same code logic. * Processing multiple data batches concurrently. 
Just like normal tasks, map tasks are automatically parallelized to the extent possible given the resources available in the cluster.

```python
import flytekit as fl

THRESHOLD = 11

@fl.task
def detect_anomalies(data_point: int) -> bool:
    return data_point > THRESHOLD

@fl.workflow
def map_workflow(data: list[int] = [10, 12, 11, 10, 13, 12, 100, 11, 12, 10]) -> list[bool]:
    # Use the map task to apply the anomaly detection function to each data point
    return fl.map_task(detect_anomalies)(data_point=data)
```

> [!NOTE]
> Map tasks can also map over launch plans. For more information and example code, see **Core concepts > Launch plans > Mapping over launch plans**.

To customize resource allocations, such as memory usage for individual map tasks, you can use `with_overrides`. Here's an example using the `detect_anomalies` map task within a workflow:

```python
@fl.workflow
def map_workflow_with_resource_overrides(
    data: list[int] = [10, 12, 11, 10, 13, 12, 100, 11, 12, 10]
) -> list[bool]:
    return (
        fl.map_task(detect_anomalies)(data_point=data)
        .with_overrides(requests=fl.Resources(mem="2Gi"))
    )
```

You can also configure `concurrency` and `min_success_ratio` for a map task:

- `concurrency` limits the number of mapped tasks that can run in parallel to the specified batch size. If the input size exceeds the concurrency value, multiple batches will run serially until all inputs are processed. If left unspecified, concurrency is unbounded.
- `min_success_ratio` determines the minimum fraction of total jobs that must complete successfully before terminating the map task and marking it as successful.

```python
import typing

@fl.workflow
def map_workflow_with_additional_params(
    data: list[int] = [10, 12, 11, 10, 13, 12, 100, 11, 12, 10]
) -> list[typing.Optional[bool]]:
    return fl.map_task(
        detect_anomalies,
        concurrency=1,
        min_success_ratio=0.75
    )(data_point=data)
```

For more details, see the [Map Task example](https://github.com/unionai-oss/union-cloud-docs-examples/tree/main/map_task) in the examples repository and the [Map Tasks]() section.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-types ===

# Other task types

Task types include:

* **`PythonFunctionTask`**: This Python class represents the standard default task. It is the type that is created when you use the `@fl.task` decorator.
* **`ContainerTask`**: This Python class represents a raw container. It allows you to use any image you like, giving you complete control of the task.
* **Shell tasks**: Use them to execute `bash` scripts within Flyte.
* **Specialized plugin tasks**: These include both specialized classes and specialized configurations of the `PythonFunctionTask`. They implement integrations with third-party systems.

## PythonFunctionTask

This is the task type that is created when you add the `@fl.task` decorator to a Python function. It represents a Python function that will be run within a single container. For example:

```python
import pandas as pd
from sklearn.datasets import load_wine

import flytekit as fl

@fl.task
def get_data() -> pd.DataFrame:
    """Get the wine dataset."""
    return load_wine(as_frame=True).frame
```

See the [Python Function Task example](https://github.com/unionai-oss/union-cloud-docs-examples/tree/main/python_function_task).

This is the most common task variant and the one that, thus far, we have focused on in this documentation.

## ContainerTask

This task variant represents a raw container, with no assumptions made about what is running within it.
Here is an example of declaring a `ContainerTask`: ```python greeting_task = ContainerTask( name="echo_and_return_greeting", image="alpine:latest", input_data_dir="/var/inputs", output_data_dir="/var/outputs", inputs=kwtypes(name=str), outputs=kwtypes(greeting=str), command=["/bin/sh", "-c", "echo 'Hello, my name is {{.inputs.name}}.' | tee -a /var/outputs/greeting"], ) ``` The `ContainerTask` enables you to include a task in your workflow that executes arbitrary code in any language, not just Python. In the following example, the tasks calculate an ellipse area. This name has to be unique in the entire project. Users can specify: `input_data_dir` -> where inputs will be written to. `output_data_dir` -> where Flyte will expect the outputs to exist. The `inputs` and `outputs` specify the interface for the task; thus it should be an ordered dictionary of typed input and output variables. The image field specifies the container image for the task, either as an image name or an ImageSpec. To access the file that is not included in the image, use ImageSpec to copy files or directories into container `/root`. Cache can be enabled in a ContainerTask by configuring the cache settings in the `TaskMetadata` in the metadata parameter. ```python calculate_ellipse_area_haskell = ContainerTask( name="ellipse-area-metadata-haskell", input_data_dir="/var/inputs", output_data_dir="/var/outputs", inputs=kwtypes(a=float, b=float), outputs=kwtypes(area=float, metadata=str), image="ghcr.io/flyteorg/rawcontainers-haskell:v2", command=[ "./calculate-ellipse-area", "{{.inputs.a}}", "{{.inputs.b}}", "/var/outputs", ], metadata=TaskMetadata(cache=True, cache_version="1.0"), ) calculate_ellipse_area_julia = ContainerTask( name="ellipse-area-metadata-julia", input_data_dir="/var/inputs", output_data_dir="/var/outputs", inputs=kwtypes(a=float, b=float), outputs=kwtypes(area=float, metadata=str), image="ghcr.io/flyteorg/rawcontainers-julia:v2", command=[ "julia", "calculate-ellipse-area.jl", "{{.inputs.a}}", "{{.inputs.b}}", "/var/outputs", ], metadata=TaskMetadata(cache=True, cache_version="1.0"), ) @workflow def wf(a: float, b: float): area_haskell, metadata_haskell = calculate_ellipse_area_haskell(a=a, b=b) area_julia, metadata_julia = calculate_ellipse_area_julia(a=a, b=b) ``` See the [Container Task example](https://github.com/unionai-oss/union-cloud-docs-examples/tree/main/container_task). ## Shell tasks Shell tasks enable the execution of shell scripts within Flyte. To create a shell task, provide a name for it, specify the bash script to be executed, and define inputs and outputs if needed: ### Example ```python from pathlib import Path from typing import Tuple import flytekit as fl from flytekit import kwtypes from flytekit.extras.tasks.shell import OutputLocation, ShellTask t1 = ShellTask( name="task_1", debug=True, script=""" set -ex echo "Hey there! Let's run some bash scripts using a shell task." echo "Showcasing shell tasks." >> {inputs.x} if grep "shell" {inputs.x} then echo "Found it!" >> {inputs.x} else echo "Not found!" 
fi """, inputs=kwtypes(x=FlyteFile), output_locs=[OutputLocation(var="i", var_type=FlyteFile, location="{inputs.x}")], ) t2 = ShellTask( name="task_2", debug=True, script=""" set -ex cp {inputs.x} {inputs.y} tar -zcvf {outputs.j} {inputs.y} """, inputs=kwtypes(x=FlyteFile, y=FlyteDirectory), output_locs=[OutputLocation(var="j", var_type=FlyteFile, location="{inputs.y}.tar.gz")], ) t3 = ShellTask( name="task_3", debug=True, script=""" set -ex tar -zxvf {inputs.z} cat {inputs.y}/$(basename {inputs.x}) | wc -m > {outputs.k} """, inputs=kwtypes(x=FlyteFile, y=FlyteDirectory, z=FlyteFile), output_locs=[OutputLocation(var="k", var_type=FlyteFile, location="output.txt")], ) ``` Here's a breakdown of the parameters of the `ShellTask`: - The `inputs` parameter allows you to specify the types of inputs that the task will accept - The `output_locs` parameter is used to define the output locations, which can be `FlyteFile` or `FlyteDirectory` - The `script` parameter contains the actual bash script that will be executed (`{inputs.x}`, `{outputs.j}`, etc. will be replaced with the actual input and output values). - The `debug` parameter is helpful for debugging purposes We define a task to instantiate `FlyteFile` and `FlyteDirectory`. A `.gitkeep` file is created in the `FlyteDirectory` as a placeholder to ensure the directory exists: ```python @fl.task def create_entities() -> Tuple[fl.FlyteFile, fl.FlyteDirectory]: working_dir = Path(fl.current_context().working_directory) flytefile = working_dir / "test.txt" flytefile.touch() flytedir = working_dir / "testdata" flytedir.mkdir(exist_ok=True) flytedir_file = flytedir / ".gitkeep" flytedir_file.touch() return flytefile, flytedir ``` We create a workflow to define the dependencies between the tasks: ```python @fl.workflow def shell_task_wf() -> fl.FlyteFile: x, y = create_entities() t1_out = t1(x=x) t2_out = t2(x=t1_out, y=y) t3_out = t3(x=x, y=y, z=t2_out) return t3_out ``` You can run the workflow locally: ```python if __name__ == "__main__": print(f"Running shell_task_wf() {shell_task_wf()}") ``` ## Specialized plugin task classes and configs Flyte supports a wide variety of plugin tasks. Some of these are enabled as specialized task classes, others as specialized configurations of the default `@fl.task` (`PythonFunctionTask`). They enable things like: * Querying external databases (AWS Athena, BigQuery, DuckDB, SQL, Snowflake, Hive). * Executing specialized processing right in Flyte (Spark in virtual cluster, Dask in Virtual cluster, Sagemaker, Airflow, Modin, Ray, MPI and Horovod). * Handing off processing to external services(AWS Batch, Spark on Databricks, Ray on external cluster). * Data transformation (Great Expectations, DBT, Dolt, ONNX, Pandera). * Data tracking and presentation (MLFlow, Papermill). See the [Integration section]() for examples. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-parameters === # Task parameters You pass the following parameters to the `@fl.task` decorator: * `accelerator`: The accelerator to use for this task. For more information, see [Specifying accelerators](). * `cache`: See **Core concepts > Caching**. * `cache_serialize`: See **Core concepts > Caching**. * `cache_version`: See **Core concepts > Caching**. * `cache_ignore_input_vars`: Input variables that should not be included when calculating the hash for the cache. * `container_image`: See **Core concepts > ImageSpec**. * `deprecated`: A string that can be used to provide a warning message for deprecated task. 
The absence of a string, or an empty string, indicates that the task is active and not deprecated.
* `docs`: Documentation about this task.
* `enable_deck`: If true, this task will output a Deck which can be used to visualize the task execution. See **Development cycle > Decks**.

```python
@fl.task(enable_deck=True)
def my_task(my_str: str):
    print(f"hello {my_str}")
```

* `environment`: See **Core concepts > Tasks > Task software environment > Environment variables**.
* `interruptible`: See **Core concepts > Tasks > Task hardware environment > Interruptible instances**.
* `limits`: See **Core concepts > Tasks > Task hardware environment > Customizing task resources**.
* `node_dependency_hints`: A list of tasks, launch plans, or workflows that this task depends on. This is only for dynamic tasks/workflows, where Flyte cannot automatically determine the dependencies prior to runtime. Even on dynamic tasks this is optional, but in some scenarios it will make registering the workflow easier, because it allows registration to be done the same way as for static tasks/workflows. For example, this is useful for running launch plans dynamically, because launch plans must be registered before they can be run. Tasks and workflows do not have this requirement.

```python
from flytekit import LaunchPlan

@fl.workflow
def workflow0():
    launchplan0 = LaunchPlan.get_or_create(workflow0)

    # Specify node_dependency_hints so that launchplan0
    # will be registered on flyteadmin, despite this being a dynamic task.
    @fl.dynamic(node_dependency_hints=[launchplan0])
    def launch_dynamically():
        # To run a sub-launchplan it must have previously been registered on flyteadmin.
        return [launchplan0] * 10
```

* `pod_template`: See **Core concepts > Tasks > Task parameters > Task hardware environment**.
* `pod_template_name`: See **Core concepts > Tasks > Task parameters > Task hardware environment**.
* `requests`: See **Core concepts > Tasks > Task hardware environment > Customizing task resources**.
* `retries`: Number of times to retry this task during a workflow execution. Tasks can define a retry strategy to let the system know how to handle failures (for example, retry 3 times on any kind of error). There are two kinds of retries: *system retries* and *user retries*. For more information, see **Core concepts > Tasks > Task hardware environment > Interruptible instances**.
* `secret_requests`: See **Development cycle > Managing secrets**.
* `task_config`: Configuration for a specific task type. See the [Flyte Connectors documentation](../../integrations/connectors) and [Flyte plugins documentation]() for the right object to use.
* `task_resolver`: Provide a custom task resolver.
* `timeout`: The maximum amount of time that one execution of this task is allowed to run. The execution will be terminated if the runtime exceeds the given timeout (approximately). To ensure that the system is always making progress, tasks must be guaranteed to end gracefully/successfully. The system defines a default timeout period for tasks, and task authors can define their own timeout period, after which the task is marked as `failure`. Note that a timed-out task will be retried if it has a retry strategy defined. The timeout can also be specified in the [TaskMetadata]().
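To tie several of the parameters above together, here is a minimal sketch of a task that combines retry, timeout, caching, and environment settings (the specific values are illustrative, not recommendations):

```python
from datetime import timedelta

import flytekit as fl

@fl.task(
    retries=3,                          # retry up to 3 times on failure
    timeout=timedelta(minutes=10),      # terminate the execution if it runs longer than this
    cache=True,
    cache_version="1.0",
    environment={"LOG_LEVEL": "INFO"},  # illustrative environment variable
)
def fetch_and_clean(url: str) -> str:
    # Illustrative body only.
    return url
```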
## Use `partial` to provide default arguments to tasks You can use the `functools.partial` function to assign default or constant values to the parameters of your tasks: ```python import functools import flytekit as fl @fl.task def slope(x: list[int], y: list[int]) -> float: sum_xy = sum([x[i] * y[i] for i in range(len(x))]) sum_x_squared = sum([x[i] ** 2 for i in range(len(x))]) n = len(x) return (n * sum_xy - sum(x) * sum(y)) / (n * sum_x_squared - sum(x) ** 2) @fl.workflow def simple_wf_with_partial(x: list[int], y: list[int]) -> float: partial_task = functools.partial(slope, x=x) return partial_task(y=y) ``` ## Named outputs By default, Flyte employs a standardized convention to assign names to the outputs of tasks or workflows. Each output is sequentially labeled as `o1`, `o2`, `o3`, ... `on`, where `o` serves as the standard prefix, and `1`, `2`, ... `n` indicates the positional index within the returned values. However, Flyte allows the customization of output names for tasks or workflows. This customization becomes beneficial when you're returning multiple outputs and you wish to assign a distinct name to each of them. The following example illustrates the process of assigning names to outputs for both a task and a workflow. Define a `NamedTuple` and assign it as an output to a task: ```python import flytekit as fl from typing import NamedTuple slope_value = NamedTuple("slope_value", [("slope", float)]) @fl.task def slope(x: list[int], y: list[int]) -> slope_value: sum_xy = sum([x[i] * y[i] for i in range(len(x))]) sum_x_squared = sum([x[i] ** 2 for i in range(len(x))]) n = len(x) return (n * sum_xy - sum(x) * sum(y)) / (n * sum_x_squared - sum(x) ** 2) ``` Likewise, assign a `NamedTuple` to the output of `intercept` task: ```python intercept_value = NamedTuple("intercept_value", [("intercept", float)]) @fl.task def intercept(x: list[int], y: list[int], slope: float) -> intercept_value: mean_x = sum(x) / len(x) mean_y = sum(y) / len(y) intercept = mean_y - slope * mean_x return intercept ``` > [!NOTE] > While it's possible to create `NamedTuple`s directly within the code, > it's often better to declare them explicitly. This helps prevent potential linting errors in tools like mypy. > > ```python > def slope() -> NamedTuple("slope_value", slope=float): > pass > ``` You can easily unpack the `NamedTuple` outputs directly within a workflow. Additionally, you can also have the workflow return a `NamedTuple` as an output. > [!NOTE] > Remember that we are extracting individual task execution outputs by dereferencing them. 
> This is necessary because `NamedTuple`s function as tuples and require this dereferencing: ```python slope_and_intercept_values = NamedTuple("slope_and_intercept_values", [("slope", float), ("intercept", float)]) @fl.workflow def simple_wf_with_named_outputs(x: list[int] = [-3, 0, 3], y: list[int] = [7, 4, -2]) -> slope_and_intercept_values: slope_value = slope(x=x, y=y) intercept_value = intercept(x=x, y=y, slope=slope_value.slope) return slope_and_intercept_values(slope=slope_value.slope, intercept=intercept_value.intercept) ``` You can run the workflow locally as follows: ```python if __name__ == "__main__": print(f"Running simple_wf_with_named_outputs() {simple_wf_with_named_outputs()}") ``` === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/launching-tasks === # Launching tasks From the **Core concepts > Tasks > Viewing tasks > Task view** (accessed, for example, by selecting a task in the **Core concepts > Tasks > Viewing tasks > Tasks list**) you can select **Launch Task** in the top right: This opens the **New Execution** dialog for tasks: ![](../../../_static/images/user-guide/core-concepts/tasks/launching-tasks/new-execution-dialog.png) The settings are similar to those for workflows. At the top you can select: * The specific version of this task that you want to launch. Along the left side the following sections are available: * **Inputs**: The input parameters of the task function appear here as fields to be filled in. * **Settings**: * **Execution name**: A custom name for this execution. If not specified, a name will be generated. * **Overwrite cached outputs**: A boolean. If set to `True`, this execution will overwrite any previously-computed cached outputs. * **Raw output data config**: Remote path prefix to store raw output data. By default, workflow output will be written to the built-in metadata storage. Alternatively, you can specify a custom location for output at the organization, project-domain, or individual execution levels. This field is for specifying this setting at the workflow execution level. If this field is filled in it overrides any settings at higher levels. The parameter is expected to be a URL to a writable resource (for example, `http://s3.amazonaws.com/my-bucket/`). See **Data input/output > Task input and output > Raw data store** **Max parallelism**: Number of workflow nodes that can be executed in parallel. If not specified, project/domain defaults are used. If 0 then no limit is applied. * **Force interruptible**: A three valued setting for overriding the interruptible setting of the workflow for this particular execution. If not set, the workflow's interruptible setting is used. If set and **enabled** then `interruptible=True` is used for this execution. If set and **disabled** then `interruptible=False` is used for this execution. See **Core concepts > Tasks > Task hardware environment > Interruptible instances** * **Service account**: The service account to use for this execution. If not specified, the default is used. * **Environment variables**: Environment variables that will be available to tasks in this workflow execution. * **Labels**: Labels to apply to the execution resource. * **Notifications**: **Core concepts > Launch plans > Notifications** configured for this workflow execution. * **Debug**: The workflow execution details for debugging purposes. Select **Launch** to launch the task execution. This will take you to the **Core concepts > Workflows > Viewing workflow executions**. 
=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/viewing-tasks === # Viewing tasks ## Tasks list Selecting **Tasks** in the sidebar displays a list of all the registered tasks: ![Tasks list](../../../_static/images/user-guide/core-concepts/tasks/viewing-tasks/tasks-list.png) You can search the tasks by name and filter for only those that are archived. Each task in the list displays some basic information about the task: * **Inputs**: The input type for the task. * **Outputs**: The output type for the task. * **Description**: A description of the task. Select an entry on the list to go to that **Core concepts > Tasks > Viewing tasks > Task view**. ## Task view Selecting an individual task from the **Core concepts > Tasks > Viewing tasks > Tasks list** will take you to the task view: ![Task view](../../../_static/images/user-guide/core-concepts/tasks/viewing-tasks/task-view.png) Here you can see: * **Inputs & Outputs**: The input and output types for the task. * Recent task versions. Selecting one of these takes you to the **Core concepts > Tasks > Viewing tasks > Task view > Task versions list** * Recent executions of this task. Selecting one of these takes you to the **Core concepts > Workflows > Viewing workflow executions**. ### Task versions list The task versions list give you detailed information about a specific version of a task: ![Task versions list](../../../_static/images/user-guide/core-concepts/tasks/viewing-tasks/task-versions-list.png) * **Image**: The Docker image used to run this task. * **Env Vars**: The environment variables used by this task. * **Commands**: The JSON object defining this task. At the bottom is a list of all versions of the task with the current one selected. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-software-environment === # Task software environment The @fl.task decorator provides the following parameters to specify the software environment in which a task runs: * `container_image`: Can be either a string referencing a specific image on a container repository, or an ImageSpec defining a build. See **Core concepts > Tasks > Task software environment > Local image building** for details. * `environment`: See **Core concepts > Tasks > Task software environment > Environment variables** for details. ## Subpages - **Core concepts > Tasks > Task software environment > Local image building** - **Core concepts > Tasks > Task software environment > ImageSpec with ECR** - **Core concepts > Tasks > Task software environment > ImageSpec with GAR** - **Core concepts > Tasks > Task software environment > ImageSpec with ACR** - **Core concepts > Tasks > Task software environment > Environment variables** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-software-environment/image-spec === # Local image building With Flyte, every task in a workflow runs within its own dedicated container. Since a container requires a container image to run, every task in Flyte must have a container image associated with it. You can specify the container image to be used by a task by defining an `ImageSpec` object and passing it to the `container_image` parameter of the `@fl.task` decorator. When you register the workflow, the container image is built locally and pushed to the container registry that you specify. When the workflow is executed, the container image is pulled from that registry and used to run the task. 
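In outline, the pattern looks like the following sketch (the registry value is a placeholder for a registry you control, and the task body is illustrative):

```python
import flytekit as fl

# Placeholder registry; substitute a registry that your Flyte cluster can pull from.
image_spec = fl.ImageSpec(
    name="my-task-image",
    registry="<my-registry>",
    requirements="requirements.txt",
)

@fl.task(container_image=image_spec)
def my_task() -> str:
    return "runs inside the image built from image_spec"
```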
> [!NOTE]
> See the [ImageSpec API documentation]() for full documentation of `ImageSpec` class parameters and methods.

To illustrate the process, we will walk through an example.

## Project structure

```shell
├── requirements.txt
└── workflows
    ├── __init__.py
    └── imagespec-simple-example.py
```

### requirements.txt

```shell
flytekit
pandas
```

### imagespec-simple-example.py

```python
import typing
import pandas as pd
import flytekit as fl

image_spec = fl.ImageSpec(
    registry="ghcr.io/<my-github-org>",
    name="simple-example-image",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="requirements.txt"
)

@fl.task(container_image=image_spec)
def get_pandas_dataframe() -> typing.Tuple[pd.DataFrame, pd.Series]:
    df = pd.read_csv("https://storage.googleapis.com/download.tensorflow.org/data/heart.csv")
    print(df.head())
    return df[["age", "thalach", "trestbps", "chol", "oldpeak"]], df.pop("target")

@fl.workflow()
def wf() -> typing.Tuple[pd.DataFrame, pd.Series]:
    return get_pandas_dataframe()
```

## Install and configure `pyflyte` and Docker

To install Docker, see **Getting started > Local setup > Install Docker and get access to a container registry**. To configure `pyflyte` to connect to your Flyte instance, see [Getting started](../../../getting-started/_index).

## Set up an image registry

You will need an image registry where the container image can be stored and pulled by Flyte when the task is executed. You can use any image registry that you have access to, including public registries like Docker Hub or GitHub Container Registry. Alternatively, you can use a registry that is part of your organization's infrastructure, such as AWS Elastic Container Registry (ECR) or Google Artifact Registry (GAR).

The registry that you choose must be one that is accessible to the Flyte instance where the workflow will be executed. Additionally, you will need to ensure that the specific image, once pushed to the registry, is itself publicly accessible.

In this example, we use GitHub's `ghcr.io` container registry. See [Working with the Container registry](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry) for more information.

## Authenticate to the registry

You will need to set up your local Docker client to authenticate with GHCR. This is needed for the `pyflyte` CLI to be able to push the image built according to the `ImageSpec` to GHCR. Follow the directions in [Working with the Container registry > Authenticating to the Container registry](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry#authenticating-to-the-container-registry).

## Set up your project and domain on Flyte

You will need to set up a project on your Flyte instance to which you can register your workflow. See **Development cycle > Setting up a production project**.

## Understand the requirements

The `requirements.txt` file contains the `flytekit` package and the `pandas` package, both of which are needed by the task.

## Set up a virtual Python environment

Set up a virtual Python environment and install the dependencies defined in the `requirements.txt` file. Assuming you are in the local project root, run `pip install -r requirements.txt`.

## Run the workflow locally

You can now run the workflow locally. In the project root directory, run: `pyflyte run workflows/imagespec-simple-example.py wf`. See **Development cycle > Running your code** for more details.
> [!NOTE] > When you run the workflow in your local Python environment, the image is not built or pushed (in fact, no container image is used at all). ## Register the workflow To register the workflow to Flyte, in the local project root, run: ```shell $ pyflyte register workflows/imagespec-simple-example.py ``` `pyflyte` will build the container image and push it to the registry that you specified in the `ImageSpec` object. It will then register the workflow to Flyte. To see the registered workflow, go to the UI and navigate to the project and domain that you created above. ## Ensure that the image is publicly accessible If you are using the `ghcr.io` image registry, you must switch the visibility of your container image to Public before you can run your workflow on Flyte. See [Configuring a package's access control and visibility](https://docs.github.com/en/packages/learn-github-packages/configuring-a-packages-access-control-and-visibility#about-inheritance-of-access-permissions-and-visibility). ## Run the workflow on Flyte Assuming your image is publicly accessible, you can now run the workflow on Flyte by clicking **Launch Workflow**. > [!WARNING] Make sure your image is accessible > If you try to run a workflow that uses a private container image or an image that is inaccessible for some other reason, the system will return an error: > > ``` > ... Failed to pull image ... > ... Error: ErrImagePull > ... Back-off pulling image ... > ... Error: ImagePullBackOff > ``` ## Multi-image workflows You can also specify different images per task within the same workflow. This is particularly useful if some tasks in your workflow have a different set of dependencies where most of the other tasks can use another image. In this example we specify two tasks: one that uses CPUs and another that uses GPUs. For the former task, we use the default image that ships with flytekit while for the latter task, we specify a pre-built image that enables distributed training with the Kubeflow Pytorch integration. ```python import numpy as np import torch.nn as nn @task( requests=Resources(cpu="2", mem="16Gi"), container_image="ghcr.io/flyteorg/flytekit:py3.9-latest", ) def get_data() -> Tuple[np.ndarray, np.ndarray]: ... # get dataset as numpy ndarrays @task( requests=Resources(cpu="4", gpu="1", mem="16Gi"), container_image="ghcr.io/flyteorg/flytecookbook:kfpytorch-latest", ) def train_model(features: np.ndarray, target: np.ndarray) -> nn.Module: ... # train a model using gpus ``` === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-software-environment/image-spec-with-ecr === # ImageSpec with ECR In this section we explain how to set up and use AWS Elastic Container Registry (ECR) to build and deploy task container images using `ImageSpec`. ## Set up the image repository Unlike GitHub Container Registry, ECR does not allow you to simply push an arbitrarily named image to the registry. Instead, you must first create a repository in the ECR instance and then push the image to that repository. > [!NOTE] Registry, repository, and image > In ECR terminology the **registry** is the top-level storage service. The registry holds a collection of **repositories**. > Each repository corresponds to a named image and holds all versions of that image. > > When you push an image to a registry, you are actually pushing it to a repository within that registry. > Strictly speaking, the term *image* refers to a specific *image version* within that repository. 
This means that you have to decide on the name of your image and create a repository by that name first, before registering your workflow.

We will assume the following:

* The ECR instance you will be using has the base URL `123456789012.dkr.ecr.us-east-1.amazonaws.com`.
* Your image will be called `simple-example-image`.

In the AWS console, go to **Amazon ECR > Repositories** and find the correct ECR registry. Under **Create a Repository**, click **Get Started**:

![](../../../../_static/images/user-guide/core-concepts/tasks/task-software-environment/imagespec-with-ecr/create-repository-1.png)

On the **Create repository** page:

* Select **Private** for the repository visibility, assuming you want to make it private. You can, alternatively, select **Public**, but in most cases, the main reason for using ECR is to keep your images private.
* Enter the name of the repository:

![](../../../../_static/images/user-guide/core-concepts/tasks/task-software-environment/imagespec-with-ecr/create-repository-2.png)

and then scroll down to click **Create repository**:

![](../../../../_static/images/user-guide/core-concepts/tasks/task-software-environment/imagespec-with-ecr/create-repository-3.png)

Your repository is now created.

## Authenticate to the registry

You will need to set up your local Docker client to authenticate with ECR. This is needed for `pyflyte` to be able to push the image built according to the `ImageSpec` to ECR.

To do this, you will need to [install the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html), use it to run the `aws ecr get-login-password` command to get the appropriate password, then perform a `docker login` with that password. See [Private registry authentication](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html) for details.

## Register your workflow to Flyte

You can register tasks with `ImageSpec` declarations that reference this repository. For example, to use the example repository shown here, we would alter the Python code in **Core concepts > Tasks > Task software environment** to have the following `ImageSpec` declaration:

```python
image_spec = fl.ImageSpec(
    registry="123456789012.dkr.ecr.us-east-1.amazonaws.com",
    name="simple-example-image",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="image-requirements.txt"
)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-software-environment/image-spec-with-gar ===

# ImageSpec with GAR

In this section we explain how to set up and use Google Artifact Registry (GAR) to build and deploy task container images using `ImageSpec`.

## Set up the image repository

Unlike GitHub Container Registry, GAR does not allow you to simply push an arbitrarily named image to the registry. Instead, you must first create a repository in the GAR instance and then push the image to that repository.

> [!NOTE] Registry, repository, and image
> In GAR terminology the **registry** is the top-level storage service. The registry holds a collection of **repositories**.
> Each repository in turn holds some number of images, and each specific image name can have different versions.
>
> Note that this differs from the arrangement in AWS ECR where the repository name and image name are essentially the same.
>
> When you push an image to GAR, you are actually pushing it to an image name within a repository within that registry.
> Strictly speaking, the term *image* refers to a specific *image version* within that repository.
This means that you have to decide on the name of your repository and create it, before registering your workflow. You can, however, decide on the image name later, when you push the image to the repository. We will assume the following: * The GAR instance you will be using has the base URL `us-east1-docker.pkg.dev/my-union-dataplane/my-registry/`. * Your repository will be called `my-image-repository`. * Your image will be called `simple-example-image`. In the GCP console, within the same project as your Flyte installation, go to **Artifact Registry**. Create a new one by clicking **Create Repository**: ![](../../../../_static/images/user-guide/core-concepts/tasks/task-software-environment/imagespec-with-gar/gar-create-repository-1.png) On the **Create repository** page, * Enter the name of the repository. In this example it would be `my-image-repository`. * Select **Docker** for the artifact type. * Select the region. If you want to access the GAR without further configuration, make sure this the same region as your Flyte cluster. * Click **Create**: ![](../../../../_static/images/user-guide/core-concepts/tasks/task-software-environment/imagespec-with-gar/gar-create-repository-2.png) Your GAR repository is now created. ## Authenticate to the registry You will need to set up your local Docker client to authenticate with GAR. This is needed for `pyflyte` to be able to push the image built according to the `ImageSpec` to GAR. Directions can be found in the GAR console interface. Click on **Setup Instructions**: ![](../../../../_static/images/user-guide/core-concepts/tasks/task-software-environment/imagespec-with-gar/gar-setup-instructions.png) The directions are also reproduced below. (We show the directions for the `us-east1` region. You may need to adjust the command accordingly): > [!NOTE] Setup Instructions > Follow the steps below to configure your client to push and pull packages using this repository. > You can also [view more detailed instructions here](https://cloud.google.com/artifact-registry/docs/docker/authentication?authuser=1). > For more information about working with artifacts in this repository, see the [documentation](https://cloud.google.com/artifact-registry/docs/docker?authuser=1). > > **Initialize gcloud** > > The [Google Cloud SDK](https://cloud.google.com/sdk/docs/?authuser=1) is used to generate an access token when authenticating with Artifact Registry. > Make sure that it is installed and initialized with [Application Default Credentials](https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login?authuser=1) before proceeding. > > **Configure Docker** > > Run the following command to configure `gcloud` as the credential helper for the Artifact Registry domain associated with this repository's location: > > ```shell > $ gcloud auth configure-docker us-east1-docker.pkg.dev > ``` ## Register your workflow to Flyte You can now register tasks with `ImageSpec` declarations that reference this repository. 
For example, to use the example GAR repository shown here, we would alter the Python code in **Core concepts > Tasks > Task software environment** to have the following `ImageSpec` declaration:

```python
image_spec = fl.ImageSpec(
    registry="us-east1-docker.pkg.dev/my-union-dataplane/my-registry/my-image-repository",
    name="simple-example-image",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="image-requirements.txt"
)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-software-environment/image-spec-with-acr ===

# ImageSpec with ACR

In this section we explain how to use [Azure Container Registry (ACR)](https://azure.microsoft.com/en-us/products/container-registry) to build and deploy task container images using `ImageSpec`.

Before proceeding, make sure that you have [enabled Azure Container Registry](../../../integrations/enabling-azure-resources/enabling-azure-container-registry) for your Flyte installation.

## Authenticate to the registry

Authenticate with the container registry:

```bash
az login
az acr login --name <registry-name>
```

Refer to [Individual login with Microsoft Entra ID](https://learn.microsoft.com/en-us/azure/container-registry/container-registry-authentication?tabs=azure-cli#individual-login-with-microsoft-entra-id) in the Azure documentation for additional details.

## Register your workflow to Flyte

You can now register tasks with `ImageSpec` declarations that reference this repository. For example, to use an existing ACR repository, we would alter the Python code in **Core concepts > Tasks > Task software environment** to have the following `ImageSpec` declaration:

```python
image_spec = fl.ImageSpec(
    registry="<registry-name>.azurecr.io",
    name="my-repository/simple-example-image",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="image-requirements.txt"
)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-software-environment/environment-variables ===

# Environment variables

The `environment` parameter lets you specify the values of any variables that you want to be present within the task container execution environment. For example:

```python
import os

import flytekit as fl

@fl.task(environment={"MY_ENV_VAR": "my_value"})
def my_task() -> str:
    return os.environ["MY_ENV_VAR"]
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/viewing-logs ===

# Viewing logs

In the **Core concepts > Workflows > Viewing workflow executions**, selecting a task from the list in the **Nodes** tab will open the task details in the right panel. Within that panel, in the **Execution** tab, you will find the stack trace associated with that node:

![Task logs link](../../../_static/images/user-guide/core-concepts/tasks/viewing-logs/viewing_logs_flyte.png)

Also, if you configure the Kubernetes dashboard, the Flyte console will display a link to the specific Pod logs.

## Cloud provider logs

In addition to the **Task Logs** link, you will also see a link to your cloud provider's logs (**Cloudwatch Logs** for AWS, **Stackdriver Logs** for GCP, and **Azure Logs** for Azure):

![Cloud provider logs link](../../../_static/images/user-guide/core-concepts/tasks/viewing-logs/cloud-provider-logs-link.png)

Assuming you are logged into your cloud provider account with the appropriate permissions, this link will take you to the logs specific to the container in which this particular task execution is running.
=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/reference-tasks ===

# Reference tasks

A `reference_task` references tasks that have already been defined, serialized, and registered. You can reference tasks from other projects and create workflows that use tasks declared by others. These tasks can be in their own containers, Python runtimes, flytekit versions, and even different languages.

> [!NOTE]
> Reference tasks cannot be run locally. To test locally, mock them out.

## Example

1. Create a file called `task.py` and insert this content into it:

```python
import flytekit as fl

@fl.task
def add_two_numbers(a: int, b: int) -> int:
    return a + b
```

2. Register the task:

```shell
$ pyflyte register --project flytesnacks --domain development --version v1 task.py
```

3. Create a separate file `wf_ref_task.py` and copy the following code into it:

```python
import flytekit as fl
from flytekit import reference_task

@reference_task(
    project="flytesnacks",
    domain="development",
    name="task.add_two_numbers",
    version="v1",
)
def add_two_numbers(a: int, b: int) -> int:
    ...

@fl.workflow
def wf(a: int, b: int) -> int:
    return add_two_numbers(a=a, b=b)
```

4. Register the `wf` workflow:

```shell
$ pyflyte register --project flytesnacks --domain development wf_ref_task.py
```

5. In the Flyte UI, run the workflow `wf_ref_task.wf`.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-hardware-environment ===

# Task hardware environment

## Customizing task resources

You can customize the hardware environment in which your task code executes. Depending on your needs, there are two different ways to define and register tasks with their own custom hardware requirements:

* Configuration in the `@fl.task` decorator
* Defining a `PodTemplate`

### Using the `@fl.task` decorator

You can specify `requests` and `limits` on:

* CPU number
* GPU number
* Memory size
* Ephemeral storage size

See **Core concepts > Tasks > Task hardware environment > Customizing task resources** for details.

### Using PodTemplate

If your needs are more complex, you can use Kubernetes-level configuration to constrain a task to only run on a specific machine type. This requires that you set up the required machine types and node groups with the appropriate node assignment configuration (node selector labels, node affinities, taints, tolerations, etc.). In your task definition you then use a `PodTemplate` that uses the matching node assignment configuration to make sure that the task will only be scheduled on the appropriate machine type.

### The `pod_template` and `pod_template_name` `@fl.task` parameters

The `pod_template` parameter can be used to supply a custom Kubernetes `PodTemplate` to the task. This can be used to define details about node selectors, affinity, tolerations, and other Kubernetes-specific settings. (A sketch is shown at the end of this page.) The `pod_template_name` is a related parameter that can be used to specify the name of an already existing `PodTemplate` resource to be used by this task. For details, see [Configuring task pods with Kubernetes PodTemplates]().

## Accelerators

If you specify GPUs, you can also specify the type of GPU to be used by setting the `accelerator` parameter. See **Core concepts > Tasks > Task hardware environment > Accelerators** for more information.

## Task-level monitoring

You can also monitor the hardware resources used by a task. See **Core concepts > Tasks > Task hardware environment > Task-level monitoring** for details.
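As referenced above, here is a sketch of the `pod_template` approach, assuming the `kubernetes` Python client is installed. The node selector label and toleration values are hypothetical and must match the node assignment configuration actually set up in your cluster:

```python
import flytekit as fl
from flytekit import PodTemplate
from kubernetes.client import V1PodSpec, V1Toleration

# Hypothetical node selector and toleration; adjust to your cluster's configuration.
gpu_pool = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[],
        node_selector={"node-group": "gpu-a100"},
        tolerations=[
            V1Toleration(key="gpu", operator="Equal", value="a100", effect="NoSchedule"),
        ],
    ),
)

@fl.task(pod_template=gpu_pool)
def train_model() -> None:
    ...
```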
## Subpages - **Core concepts > Tasks > Task hardware environment > Customizing task resources** - **Core concepts > Tasks > Task hardware environment > Accelerators** - **Core concepts > Tasks > Task hardware environment > Retries and timeouts** - **Core concepts > Tasks > Task hardware environment > Interruptible instances** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-hardware-environment/customizing-task-resources === # Customizing task resources When defining a task function, you can specify resource requirements for the pod that runs the task. Flyte will take this into account to ensure that the task pod is scheduled to run on a Kubernetes node that meets the specified resource profile. Resources are specified in the `@fl.task` decorator. Here is an example:

```python
import flytekit as fl
from flytekit import Resources
from flytekit.extras.accelerators import GPUAccelerator

@fl.task(
    requests=Resources(mem="120Gi", cpu="44", ephemeral_storage="100Gi"),
    limits=Resources(mem="200Gi", cpu="100", gpu="12", ephemeral_storage="200Gi"),
    accelerator=GPUAccelerator("nvidia-tesla-a100")
)
def my_task():
    ...
```

There are three separate resource-related settings: * `requests` * `limits` * `accelerator` ## The `requests` and `limits` settings The `requests` and `limits` settings each takes a [`Resources`]() object, which itself has four possible attributes: * `cpu`: Number of CPU cores (in whole numbers or millicores (`m`)). * `gpu`: Number of GPU cores (in whole numbers or millicores (`m`)). * `mem`: Main memory (in `Mi`, `Gi`, etc.). * `ephemeral_storage`: Ephemeral storage (in `Mi`, `Gi` etc.). Note that CPU and GPU allocations can be specified either as whole numbers or in millicores (`m`). For example, `cpu="2500m"` means two and a half CPU cores and `gpu="3000m"` means three GPU cores. The `requests` setting tells the system that the task requires _at least_ the resources specified and therefore the pod running this task should be scheduled only on a node that meets or exceeds the resource profile specified. The `limits` setting serves as a hard upper bound on the resource profile of nodes to be scheduled to run the task. The task will not be scheduled on a node that exceeds the resource profile specified (in any of the specified attributes). > [!NOTE] GPUs take only `limits` > GPUs should only be specified in the `limits` section of the task decorator: > * You should specify GPU requirements only in `limits`, not in `requests`, because Kubernetes will use the `limits` value as the `requests` value anyway. > * You _can_ specify GPU in both `limits` and `requests` but the two values must be equal. > * You cannot specify GPU `requests` without specifying `limits`. ## The `accelerator` setting The `accelerator` setting further specifies the *type* of specialized hardware required for the task. This can be a GPU, a specific variation of a GPU, a fractional GPU, or a different hardware device, such as a TPU. See **Core concepts > Tasks > Task hardware environment > Accelerators** for more information. ## The `with_overrides` method When `requests`, `limits`, or `accelerator` are specified in the `@fl.task` decorator, they apply every time that a task is invoked from a workflow. In some cases, you may wish to change the resources specified from one invocation to another. To do that, use the [`with_overrides` method](../../../../api-reference/flytekit-sdk/packages/flytekit.core.node#with_overrides) of the task function. For example:

```python
@fl.task
def my_task(ff: FlyteFile):
    ...
@fl.workflow
def my_workflow():
    my_task(ff=smallFile)
    my_task(ff=bigFile).with_overrides(requests=Resources(mem="120Gi", cpu="10"))
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-hardware-environment/accelerators === # Accelerators Flyte allows you to specify the number of GPUs available for a given task (see **Core concepts > Tasks > Task hardware environment > Customizing task resources**). However, in some cases, you may want to be more specific about the type of GPU or other specialized device to be used. You can use the `accelerator` parameter to specify specific GPU types, variations of GPU types, fractional GPUs, or other specialized hardware devices such as TPUs. Each device type has a constant name that you can use to specify the device in the `accelerator` parameter. For example:

```python
from flytekit.extras.accelerators import A100

@fl.task(
    limits=Resources(gpu="1"),
    accelerator=A100,
)
def my_task():
    ...
```

## Using predefined accelerator constants There are a number of predefined accelerator constants available in the `flytekit.extras.accelerators` module. The predefined list is not exhaustive, but it includes the most common accelerators. If you know the name of the accelerator, but there is no predefined constant for it, you can simply pass the string name to the task decorator directly. Note that in order for a specific accelerator to be available in your Flyte installation, it must have been provisioned in your Flyte cluster as part of your **Platform deployment**. If using the constants, you can import them directly from the module, e.g.:

```python
from flytekit.extras.accelerators import T4

@fl.task(
    limits=Resources(gpu="1"),
    accelerator=T4,
)
def my_task():
    ...
```

If you want to use a fractional GPU, you can use one of the partition attributes of the accelerator constant, e.g.:

```python
from flytekit.extras.accelerators import A100

@fl.task(
    limits=Resources(gpu="1"),
    accelerator=A100.partition_2g_10gb,
)
def my_task():
    ...
```

## List of predefined accelerator constants * `A10G`: [NVIDIA A10 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) * `L4`: [NVIDIA L4 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/l4/) * `K80`: [NVIDIA Tesla K80 GPU](https://www.nvidia.com/en-gb/data-center/tesla-k80/) * `M60`: [NVIDIA Tesla M60 GPU](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/nvidia-m60-datasheet.pdf) * `P4`: [NVIDIA Tesla P4 GPU](https://images.nvidia.com/content/pdf/tesla/184457-Tesla-P4-Datasheet-NV-Final-Letter-Web.pdf) * `P100`: [NVIDIA Tesla P100 GPU](https://www.nvidia.com/en-us/data-center/tesla-p100/) * `T4`: [NVIDIA T4 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/tesla-t4/) * `V100`: [NVIDIA Tesla V100 GPU](https://www.nvidia.com/en-us/data-center/tesla-v100/) * `A100`: An entire [NVIDIA A100 GPU](https://www.nvidia.com/en-us/data-center/a100/). Fractional partitions are also available: * `A100.partition_1g_5gb`: 5GB partition of an A100 GPU. * `A100.partition_2g_10gb`: 10GB partition of an A100 GPU - 2x5GB slices with 2/7th of the SM (streaming multiprocessor). * `A100.partition_3g_20gb`: 20GB partition of an A100 GPU - 4x5GB slices, with 3/7th fraction of the SM. * `A100.partition_4g_20gb`: 20GB partition of an A100 GPU - 4x5GB slices, with 4/7th fraction of the SM. * `A100.partition_7g_40gb`: 40GB partition of an A100 GPU - 8x5GB slices, with 7/7th fraction of the SM.
* `A100_80GB`: An entire [NVIDIA A100 80GB GPU](https://www.nvidia.com/en-us/data-center/a100/). Fractional partitions are also available: * `A100_80GB.partition_1g_10gb`: 10GB partition of an A100 80GB GPU - 2x5GB slices with 1/7th of the SM (streaming multiprocessor). * `A100_80GB.partition_2g_20gb`: 20GB partition of an A100 80GB GPU - 4x5GB slices with 2/7th of the SM. * `A100_80GB.partition_3g_40gb`: 40GB partition of an A100 80GB GPU - 8x5GB slices with 3/7th of the SM. * `A100_80GB.partition_4g_40gb`: 40GB partition of an A100 80GB GPU - 8x5GB slices with 4/7th of the SM. * `A100_80GB.partition_7g_80gb`: 80GB partition of an A100 80GB GPU - 16x5GB slices with 7/7th of the SM. For more information on partitioning, see [Partitioned GPUs](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#partitioning). === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-hardware-environment/retries-and-timeouts === # Retries and timeouts ## Retry types Flyte allows you to automatically retry failing tasks. This section explains the configuration and application of retries. Errors causing task failure are categorized into two main types, influencing the retry logic differently: * `SYSTEM`: These errors arise from infrastructure-related failures, such as hardware malfunctions or network issues. They are typically transient and can often be resolved with a retry. * `USER`: These errors are due to issues in the user-defined code, like a value error or a logic mistake, which usually require code modifications to resolve. ## Configuring retries Retries in Flyte are configurable to address both `USER` and `SYSTEM` errors, allowing for tailored fault tolerance strategies: `USER` errors can be handled by setting the `retries` attribute in the task decorator to define how many times a task should retry. This requires a `FlyteRecoverableException` to be raised in the task definition; any other exception will not be retried:

```python
from random import random
from typing import List

from flytekit import task
from flytekit.exceptions.user import FlyteRecoverableException

@task(retries=3)
def compute_mean(data: List[float]) -> float:
    if random() < 0.05:
        raise FlyteRecoverableException("Something bad happened 🔥")
    return sum(data) / len(data)
```

`SYSTEM` errors are managed at the platform level, through settings internal to the Flyte implementation such as `max-node-retries-system-failures` in the FlytePropeller configuration. This setting helps manage retries without requiring changes to the task code. Additionally, the `interruptible-failure-threshold` option in the `node-config` key defines how many system-level retries are considered interruptible. This is particularly useful for tasks running on preemptible instances. For more details, refer to the Flyte Propeller Configuration. ## Retrying interruptible tasks Tasks marked as interruptible can be preempted and retried without counting against the USER error budget. This is useful for tasks running on preemptible compute resources like spot instances. See **Core concepts > Tasks > Task hardware environment > Interruptible instances** ## Retrying map tasks For map tasks, the interruptible behavior aligns with that of regular tasks. The retries field in the task annotation is not necessary for handling SYSTEM errors, as these are managed by the platform's configuration. Alternatively, the USER budget is set by defining retries in the task decorator. See **Core concepts > Tasks > Map Tasks**.
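For illustration, here is a minimal sketch of setting the `USER` retry budget on the task that a map task runs over (the `square` task and `squares` workflow names are illustrative):

```python
from typing import List

import flytekit as fl
from flytekit import map_task

# The USER retry budget is declared on the underlying task.
@fl.task(retries=3)
def square(x: int) -> int:
    return x * x

@fl.workflow
def squares(xs: List[int]) -> List[int]:
    # Each mapped instance inherits the task-level retry configuration.
    return map_task(square)(x=xs)
```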
## Timeouts To protect against zombie tasks that hang due to system-level issues, you can supply the `timeout` argument to the task decorator to make sure that problematic tasks adhere to a maximum runtime. In this example, we make sure that the task is terminated after it's been running for more than one hour.

```python
from datetime import timedelta
from typing import List

from flytekit import task

@task(timeout=timedelta(hours=1))
def compute_mean(data: List[float]) -> float:
    return sum(data) / len(data)
```

Notice that the timeout argument takes a built-in Python `timedelta` object. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/tasks/task-hardware-environment/interruptible-instances === # Interruptible instances > [!NOTE] > In AWS, the term *spot instance* is used. > In GCP, the equivalent term is *spot VM*. > Here we use the term *interruptible instance* generically for both providers. An interruptible instance is a machine instance made available to your cluster by your cloud provider that is not guaranteed to be always available. As a result, interruptible instances are cheaper than regular instances. In order to use an interruptible instance for a compute workload you have to be prepared for the possibility that an attempt to run the workload could fail due to lack of available resources and will need to be retried. When setting up your **Platform deployment**, among the options available is the choice of whether to use interruptible instances. For each interruptible instance node group that you specify, we recommend that you configure an additional on-demand node group (otherwise identical to the interruptible one) so that this on-demand node group will be used as a fallback when attempts to complete the task on the interruptible instance have failed. ## Configuring tasks to use interruptible instances To schedule tasks on interruptible instances and retry them if they fail, specify the `interruptible` and `retries` parameters in the `@fl.task` decorator. For example:

```python
@fl.task(interruptible=True, retries=3)
```

* A task will only be scheduled on an interruptible instance if it has the parameter `interruptible=True` (or if its workflow has the parameter `interruptible=True` and the task does not have an explicit `interruptible` parameter). * An interruptible task, like any other task, can have a `retries` parameter. * If an interruptible task does not have an explicitly set `retries` parameter, then the `retries` value defaults to `1`. * An interruptible task with `retries=n` will be attempted `n` times on an interruptible instance. If it still fails after `n` attempts, the final (`n+1`) retry will be done on the fallback on-demand instance. ## Workflow level interruptible The `interruptible` parameter is also available at the workflow level (see **Core concepts > Workflows**). If you set it there, it will apply to all tasks in the workflow that do not themselves have an explicit value set. A task-level interruptible setting always overrides whatever the workflow-level setting is. ## Advantages and disadvantages of interruptible instances The advantage of using an interruptible instance for a task is simply that it is less costly than using an on-demand instance (all other parameters being equal). However, there are two main disadvantages: 1. The task is successfully scheduled on an interruptible instance but is interrupted. In the worst case scenario, for `retries=n` the task may be interrupted `n` times until, finally, the fallback on-demand instance is used. Clearly, this may be a problem for time-critical tasks. 2.
Interruptible instances of the selected node type may simply be unavailable on the initial attempt to schedule. When this happens, the task may hang indefinitely until an interruptible instance becomes available. Note that this is a distinct failure mode from the previous one where an interruptible node is successfully scheduled but is then interrupted. In general, we recommend that you use interruptible instances whenever available, but only for tasks that are not time-critical. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans === # Launch plans A launch plan is a template for a workflow invocation. It brings together: * A **Core concepts > Workflows** * A (possibly partial) set of inputs required to initiate that workflow * Optionally, **Core concepts > Launch plans > Notifications** and **Core concepts > Launch plans > Schedules** When invoked, the launch plan starts the workflow, passing the inputs as parameters. If the launch plan does not contain the entire set of required workflow inputs, additional input arguments must be provided at execution time. ## Default launch plan Every workflow automatically comes with a *default launch plan*. This launch plan does not define any default inputs, so they must all be provided at execution time. A default launch plan always has the same name as its workflow. ## Launch plans are versioned Like tasks and workflows, launch plans are versioned. A launch plan can be updated to change, for example, the set of inputs, the schedule, or the notifications. Each update creates a new version of the launch plan. ## Custom launch plans Additional launch plans, other than the default one, can be defined for any workflow. In general, a given workflow can be associated with multiple launch plans, but a given launch plan is always associated with exactly one workflow. ## Viewing launch plans for a workflow To view the launch plans for a given workflow, in the UI, navigate to the workflow's page and click **Launch Workflow**. You can choose which launch plan to use to launch the workflow from the **Launch Plan** dropdown menu. The default launch plan will be selected by default. If you have not defined any custom launch plans for the workflow, only the default plan will be available. If you have defined one or more custom launch plans, they will be available in the dropdown menu along with the default launch plan. For more details, see **Core concepts > Launch plans > Running launch plans**. ## Registering a launch plan ### Registering a launch plan on the command line In most cases, launch plans are defined alongside the workflows and tasks in your project code and registered as a bundle with the other entities using the CLI (see **Development cycle > Running your code**). ### Registering a launch plan in Python with `FlyteRemote` As with all Flyte command line actions, you can also perform registration of launch plans programmatically with [`FlyteRemote`](../../development-cycle/union-remote), specifically, `FlyteRemote.register_launch_plan`. 
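For example, consider a module `workflows/launch_plan_example.py` that defines a task, a workflow, and a custom launch plan along the following lines (a minimal sketch; the function bodies, the `name` input, and its default values are illustrative):

```python
# workflows/launch_plan_example.py
import flytekit as fl

@fl.task
def my_task(name: str) -> str:
    return f"Hello, {name}!"

@fl.workflow
def my_workflow(name: str = "world") -> str:
    return my_task(name=name)

# A custom launch plan, in addition to the default launch plan
# that is created automatically for my_workflow.
my_workflow_custom_lp = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="my_workflow_custom_lp",
    default_inputs={"name": "Flyte"},
)
```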
### Results of registration When the code above is registered to Flyte, it results in the creation of four objects: * The task `workflows.launch_plan_example.my_task` * The workflow `workflows.launch_plan_example.my_workflow` * The default launch plan `workflows.launch_plan_example.my_workflow` (notice that it has the same name as the workflow) * The custom launch plan `my_workflow_custom_lp` (this is the one we defined in the code above) ### Changing a launch plan Launch plans are changed by altering their definition in code and re-registering. When a launch plan with the same project, domain, and name as a preexisting one is re-registered, a new version of that launch plan is created. ## Subpages - **Core concepts > Launch plans > Defining launch plans** - **Core concepts > Launch plans > Viewing launch plans** - **Core concepts > Launch plans > Notifications** - **Core concepts > Launch plans > Schedules** - **Core concepts > Launch plans > Activating and deactivating** - **Core concepts > Launch plans > Running launch plans** - **Core concepts > Launch plans > Reference launch plans** - **Core concepts > Launch plans > Concurrency control** === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/defining-launch-plans === # Defining launch plans You can define a launch plan with the [`LaunchPlan` class](../../../api-reference/flytekit-sdk/packages/flytekit.core.launch_plan). This is a simple example of defining a launch plan:

```python
import flytekit as fl

@fl.workflow
def my_workflow(a: int, b: str) -> str:
    return f"Result: {a} and {b}"

# Create a default launch plan
default_lp = fl.LaunchPlan.get_or_create(workflow=my_workflow)

# Create a named launch plan
named_lp = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="my_custom_launch_plan"
)
```

## Default and Fixed Inputs Default inputs can be overridden at execution time, while fixed inputs cannot be changed.

```python
import flytekit as fl

# Launch plan with default inputs
lp_with_defaults = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="with_defaults",
    default_inputs={"a": 42, "b": "default_value"}
)

# Launch plan with fixed inputs
lp_with_fixed = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="with_fixed",
    fixed_inputs={"a": 100}  # 'a' will always be 100, only 'b' can be specified
)

# Combining default and fixed inputs
lp_combined = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="combined_inputs",
    default_inputs={"b": "default_string"},
    fixed_inputs={"a": 200}
)
```

## Scheduled Execution

```python
import flytekit as fl
from datetime import timedelta
from flytekit.core.schedule import CronSchedule, FixedRate

# Using a cron schedule (runs at 10:00 AM UTC every Monday)
cron_lp = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="weekly_monday",
    default_inputs={"a": 1, "b": "weekly"},
    schedule=CronSchedule(
        schedule="0 10 * * 1",  # Cron expression: minute hour day-of-month month day-of-week
        kickoff_time_input_arg=None
    )
)

# Using a fixed rate schedule (runs every 6 hours)
fixed_rate_lp = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="every_six_hours",
    default_inputs={"a": 1, "b": "periodic"},
    schedule=FixedRate(
        duration=timedelta(hours=6)
    )
)
```

## Labels and Annotations Labels and annotations help with organization and can be used for filtering or adding metadata.
```python
import flytekit as fl
from flytekit.models.common import Labels, Annotations

# Adding labels and annotations
lp_with_metadata = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="with_metadata",
    default_inputs={"a": 1, "b": "metadata"},
    labels=Labels({"team": "data-science", "env": "staging"}),
    annotations=Annotations({"description": "Launch plan for testing", "owner": "jane.doe"})
)
```

## Execution Parameters

```python
import flytekit as fl

# Setting max parallelism to limit concurrent task execution
lp_with_parallelism = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="with_parallelism",
    default_inputs={"a": 1, "b": "parallel"},
    max_parallelism=10  # Only 10 task nodes can run concurrently
)

# Disable caching for this launch plan's executions
lp_no_cache = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="no_cache",
    default_inputs={"a": 1, "b": "fresh"},
    overwrite_cache=True  # Always execute fresh, ignoring cached results
)

# Auto-activate on registration
lp_auto_activate = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="auto_active",
    default_inputs={"a": 1, "b": "active"},
    auto_activate=True  # Launch plan will be active immediately after registration
)
```

## Security and Authentication We can also override the auth role (either an IAM role or a Kubernetes service account) used to execute a launch plan.

```python
import flytekit as fl
from flytekit.models.common import AuthRole
from flytekit.models.security import Identity, SecurityContext

# Setting auth role for the launch plan
lp_with_auth = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="with_auth",
    default_inputs={"a": 1, "b": "secure"},
    auth_role=AuthRole(
        assumable_iam_role="arn:aws:iam::12345678:role/my-execution-role"
    )
)

# Setting security context
lp_with_security = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="with_security",
    default_inputs={"a": 1, "b": "context"},
    security_context=SecurityContext(
        run_as=Identity(k8s_service_account="my-service-account")
    )
)
```

## Raw Output Data Configuration

```python
import flytekit as fl
from flytekit.models.common import RawOutputDataConfig

# Configure where large outputs should be stored
lp_with_output_config = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="with_output_config",
    default_inputs={"a": 1, "b": "output"},
    raw_output_data_config=RawOutputDataConfig(
        output_location_prefix="s3://my-bucket/workflow-outputs/"
    )
)
```

## Putting It All Together A comprehensive example follows below. This custom launch plan combines default and fixed inputs, a schedule, notifications, labels and annotations, execution parameters, an auth role, and a raw output data configuration.

```python
import flytekit as fl
from flytekit import CronSchedule, Email, WorkflowExecutionPhase
from flytekit.models.common import Annotations, AuthRole, Labels, RawOutputDataConfig

comprehensive_lp = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="comprehensive_example",
    default_inputs={"b": "configurable"},
    fixed_inputs={"a": 42},
    schedule=CronSchedule(schedule="0 9 * * *"),  # Daily at 9 AM UTC
    notifications=[
        Email(
            phases=[WorkflowExecutionPhase.SUCCEEDED, WorkflowExecutionPhase.FAILED],
            recipients_email=["team@example.com"]
        )
    ],
    labels=Labels({"env": "production", "team": "data"}),
    annotations=Annotations({"description": "Daily data processing"}),
    max_parallelism=20,
    overwrite_cache=False,
    auto_activate=True,
    auth_role=AuthRole(assumable_iam_role="arn:aws:iam::12345678:role/workflow-role"),
    raw_output_data_config=RawOutputDataConfig(
        output_location_prefix="s3://results-bucket/daily-run/"
    )
)
```

These examples demonstrate the flexibility of Launch Plans in Flyte, allowing you to customize execution parameters, inputs, schedules, and more to suit your workflow requirements.
=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/viewing-launch-plans === # Viewing launch plans ## Viewing launch plans in the UI Select **Launch Plans** in the sidebar to display a list of all the registered launch plans in the project and domain: ![Launch plans list](../../../_static/images/user-guide/core-concepts/launch-plans/viewing-launch-plans/launch-plans-list.png) You can search the launch plans by name and filter for only those that are archived. The columns in the launch plans table are defined as follows: * **Name**: The name of the launch plan. Click to inspect a specific launch plan in detail. * **Triggers**: * If the launch plan is active, a green **Active** badge is shown. When a launch plan is active, any attached schedule will be in effect and the launch plan will be invoked according to that schedule. * Shows whether the launch plan has a trigger (see **Core concepts > Launch plans > Reactive workflows**). To filter for only those launch plans with a trigger, check the **Has Triggers** box in the top right. * **Last Execution**: The last execution timestamp of this launch plan, irrespective of how the last execution was invoked (by schedule, by trigger, or manually). * **Last 10 Executions**: A visual representation of the last 10 executions of this launch plan, irrespective of how these executions were invoked (by schedule, by trigger, or manually). Select an entry on the list to go to that specific launch plan: ![Launch plan view](../../../_static/images/user-guide/core-concepts/launch-plans/viewing-launch-plans/launch-plan-view.png) Here you can see: * **Launch Plan Detail (Latest Version)**: * **Expected Inputs**: The input and output types for the launch plan. * **Fixed Inputs**: If the launch plan includes predefined input values, they are shown here. * **Launch Plan Versions**: A list of all versions of this launch plan. * **All executions in the Launch Plan**: A list of all executions of this launch plan. In the top right you can see if this launch plan is active (and if it is, which version, specifically, is active). There is also a control for changing the active version or deactivating the launch plan entirely. See **Core concepts > Launch plans > Activating and deactivating** for more details. ## Viewing launch plans on the command line with `uctl` To view all launch plans within a project and domain:

```shell
$ uctl get launchplans \
    --project <project> \
    --domain <domain>
```

To view a specific launch plan:

```shell
$ uctl get launchplan \
    --project <project> \
    --domain <domain> \
    <launch-plan-name>
```

See the **Uctl CLI** for more details. ## Viewing launch plans in Python with `FlyteRemote` Use the method `FlyteRemote.client.list_launch_plans_paginated` to get the list of launch plans. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/notifications === # Notifications A launch plan may be associated with one or more notifications, which are triggered when the launch plan's workflow is completed. There are three types of notifications: * `Email`: Sends an email to the specified recipients. * `PagerDuty`: Sends a PagerDuty notification to the PagerDuty service (with recipients specified). PagerDuty then forwards the notification as per your PagerDuty configuration. * `Slack`: Sends a Slack notification to the email address of a specified channel. This requires that you configure your Slack account to accept notifications. Separate notifications can be sent depending on the specific end state of the workflow.
The options are: * `WorkflowExecutionPhase.ABORTED` * `WorkflowExecutionPhase.FAILED` * `WorkflowExecutionPhase.SUCCEEDED` * `WorkflowExecutionPhase.TIMED_OUT` For example:

```python
from datetime import datetime

import flytekit as fl
from flytekit import (
    WorkflowExecutionPhase,
    Email,
    PagerDuty,
    Slack
)

@fl.task
def add_numbers(a: int, b: int, c: int) -> int:
    return a + b + c

@fl.task
def generate_message(s: int, kickoff_time: datetime) -> str:
    return f"sum: {s} at {kickoff_time}"

@fl.workflow
def my_workflow(a: int, b: int, c: int, kickoff_time: datetime) -> str:
    return generate_message(
        s=add_numbers(a=a, b=b, c=c),
        kickoff_time=kickoff_time,
    )

fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="my_workflow_custom_lp",
    fixed_inputs={"a": 3},
    default_inputs={"b": 4, "c": 5},
    notifications=[
        Email(
            phases=[WorkflowExecutionPhase.FAILED],
            recipients_email=["me@example.com", "you@example.com"],
        ),
        PagerDuty(
            phases=[WorkflowExecutionPhase.SUCCEEDED],
            recipients_email=["myboss@example.com"],
        ),
        Slack(
            phases=[
                WorkflowExecutionPhase.SUCCEEDED,
                WorkflowExecutionPhase.ABORTED,
                WorkflowExecutionPhase.TIMED_OUT,
            ],
            recipients_email=["your_slack_channel_email"],
        ),
    ],
)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/schedules === # Schedules Launch plans let you schedule the invocation of your workflows. A launch plan can be associated with one or more schedules, where at most one schedule is active at any one time. If a schedule is activated on a launch plan, the workflow will be invoked automatically by the system at the scheduled time with the inputs provided by the launch plan. Schedules can be either fixed-rate or `cron`-based. To set up a schedule, you can use the `schedule` parameter of the `LaunchPlan.get_or_create()` method. ## Fixed-rate schedules In the following example we add a [FixedRate](../../../api-reference/flytekit-sdk/packages/flytekit.core.schedule#flytekitcoreschedulefixedrate) schedule that will invoke the workflow every 10 minutes.

```python
from datetime import timedelta

import flytekit as fl
from flytekit import FixedRate

@fl.task
def my_task(a: int, b: int, c: int) -> int:
    return a + b + c

@fl.workflow
def my_workflow(a: int, b: int, c: int) -> int:
    return my_task(a=a, b=b, c=c)

fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="my_workflow_custom_lp",
    fixed_inputs={"a": 3},
    default_inputs={"b": 4, "c": 5},
    schedule=FixedRate(
        duration=timedelta(minutes=10)
    )
)
```

Above, we defined the duration of the `FixedRate` schedule using `minutes`. Fixed rate schedules can also be defined using `days` or `hours`. ## Cron schedules A [`CronSchedule`](../../../api-reference/flytekit-sdk/packages/flytekit.core.schedule#flytekitcoreschedulecronschedule) allows you to specify a schedule using a `cron` expression:

```python
import flytekit as fl
from flytekit import CronSchedule

@fl.task
def my_task(a: int, b: int, c: int) -> int:
    return a + b + c

@fl.workflow
def my_workflow(a: int, b: int, c: int) -> int:
    return my_task(a=a, b=b, c=c)

fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="my_workflow_custom_lp",
    fixed_inputs={"a": 3},
    default_inputs={"b": 4, "c": 5},
    schedule=CronSchedule(
        schedule="*/10 * * * *"
    )
)
```

### Cron expression format A `cron` expression is a string that defines a schedule using five space-separated fields, each representing a time unit. The format of the string is:

```
minute hour day-of-month month day-of-week
```

Each field can contain values and special characters.
The fields are defined as follows:

| Field | Values | Special characters |
|----------------|---------------------|--------------------|
| `minute` | `0-59` | `* / , -` |
| `hour` | `0-23` | `* / , -` |
| `day-of-month` | `1-31` | `* / , - ?` |
| `month` | `1-12` or `JAN-DEC` | `* / , -` |
| `day-of-week` | `0-6` or `SUN-SAT` | `* / , - ?` |

* The `month` and `day-of-week` abbreviations are not case-sensitive. * The `,` (comma) is used to specify multiple values. For example, in the `month` field, `JAN,FEB,MAR` means every January, February, and March. * The `-` (dash) specifies a range of values. For example, in the `day-of-month` field, `1-15` means every day from `1` through `15` of the specified month. * The `*` (asterisk) specifies all values of the field. For example, in the `hour` field, `*` means every hour (on the hour), from `0` to `23`. You cannot use `*` in both the `day-of-month` and `day-of-week` fields in the same `cron` expression. If you use it in one, you must use `?` in the other. * The `/` (slash) specifies increments. For example, in the `minute` field, `1/10` means every tenth minute, starting from the first minute of the hour (that is, the 1st, 11th, 21st, and 31st minute, and so on). * The `?` (question mark) specifies any value of the field. For example, in the `day-of-month` field you could enter `7` and, if any day of the week was acceptable, you would enter `?` in the `day-of-week` field. ### Cron expression examples

| Expression | Description |
|--------------------|-------------------------------------------|
| `0 0 * * *` | Midnight every day. |
| `0 12 * * MON-FRI` | Noon every weekday. |
| `0 0 1 * *` | Midnight on the first day of every month. |
| `0 0 * JAN,JUL *` | Midnight every day in January and July. |
| `*/5 * * * *` | Every five minutes. |
| `30 2 * * 1` | At 2:30 AM every Monday. |
| `0 0 15 * ?` | Midnight on the 15th of every month. |

### Cron aliases The following aliases are also available. An alias is used in place of an entire `cron` expression.

| Alias | Description | Equivalent to |
|------------|-----------------------------------------------------------------------|-----------------|
| `@yearly` | Once a year at midnight at the start of 1 January. | `0 0 1 1 *` |
| `@monthly` | Once a month at midnight at the start of the first day of the month. | `0 0 1 * *` |
| `@weekly` | Once a week at midnight at the start of Sunday. | `0 0 * * 0` |
| `@daily` | Once a day at midnight. | `0 0 * * *` |
| `@hourly` | Once an hour at the beginning of the hour. | `0 * * * *` |

## kickoff_time_input_arg Both `FixedRate` and `CronSchedule` can take an optional parameter called `kickoff_time_input_arg`. This parameter is used to specify the name of a workflow input argument. Each time the system invokes the workflow via this schedule, the time of the invocation will be passed to the workflow through the specified parameter.
For example:

```python
from datetime import datetime, timedelta

import flytekit as fl
from flytekit import FixedRate

@fl.task
def my_task(a: int, b: int, c: int) -> int:
    return a + b + c

@fl.task
def generate_message(s: int, kickoff_time: datetime) -> str:
    return f"sum: {s} at {kickoff_time}"

@fl.workflow
def my_workflow(a: int, b: int, c: int, kickoff_time: datetime) -> str:
    return generate_message(s=my_task(a=a, b=b, c=c), kickoff_time=kickoff_time)

fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="my_workflow_custom_lp",
    fixed_inputs={"a": 3},
    default_inputs={"b": 4, "c": 5},
    schedule=FixedRate(
        duration=timedelta(minutes=10),
        kickoff_time_input_arg="kickoff_time"
    )
)
```

Here, each time the schedule calls `my_workflow`, the invocation time is passed in the `kickoff_time` argument. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/activating-and-deactivating === # Activating and deactivating You can set an active/inactive status on launch plans. Specifically: * Among the versions of a given launch plan (as defined by name), at most one can be set to active. All others are inactive. * If a launch plan version that has a schedule attached is activated, then its schedule also becomes active and its workflow will be invoked automatically according to that schedule. * When a launch plan version with a schedule is inactive, its schedule is inactive and will not be used to invoke its workflow. Launch plans that do not have schedules attached can also have an active version. For such non-scheduled launch plans, this status serves as a flag that can be used to distinguish one version from among the others. It can, for example, be used by management logic to determine which version of a launch plan to use for new invocations. Upon registration of a new launch plan, the first version is automatically inactive. If it has a schedule attached, the schedule is also inactive. Once activated, a launch plan version remains active even as new, later, versions are registered. A launch plan version with a schedule attached can be activated through either the UI, `uctl`, or [`FlyteRemote`](../../../user-guide/development-cycle/union-remote). ## Activating and deactivating a launch plan in the UI To activate a launch plan, go to the launch plan view and click **Add active launch plan** in the top right corner of the screen: ![Activate schedule](../../../_static/images/user-guide/core-concepts/launch-plans/activating-and-deactivating/add-active-launch-plan.png) A modal will appear that lets you select which launch plan version to activate: ![Activate schedule](../../../_static/images/user-guide/core-concepts/launch-plans/activating-and-deactivating/update-active-launch-plan-dialog.png) This modal will contain all versions of the launch plan that have an attached schedule. Note that at most one version (and therefore at most one schedule) of a launch plan can be active at any given time. Selecting the launch plan version and clicking **Update** activates the launch plan version and schedule. The launch plan version and schedule are now activated. The launch plan will be triggered according to the schedule going forward. > [!WARNING] > Non-scheduled launch plans cannot be activated via the UI. > The UI does not support activating launch plans that do not have schedules attached. > You can activate them with `uctl` or `FlyteRemote`. To deactivate a launch plan, navigate to a launch plan with an active schedule, click the **...** icon in the top-right corner of the screen beside **Active launch plan**, and click **Deactivate**.
![Deactivate schedule](../../../_static/images/user-guide/core-concepts/launch-plans/activating-and-deactivating/deactivate-launch-plan.png) A confirmation modal will appear, allowing you to deactivate the launch plan and its schedule. > [!WARNING] > Non-scheduled launch plans cannot be deactivated via the UI. > The UI does not support deactivating launch plans that do not have schedules attached. > You can deactivate them with `uctl` or `FlyteRemote`. ## Activating and deactivating a launch plan on the command line with `uctl` To activate a launch plan version with `uctl`, execute the following command:

```shell
$ uctl update launchplan \
    --activate \
    --project <project> \
    --domain <domain> \
    <launch-plan-name> \
    --version <version>
```

To deactivate a launch plan version with `uctl`, execute the following command:

```shell
$ uctl update launchplan \
    --deactivate \
    --project <project> \
    --domain <domain> \
    <launch-plan-name> \
    --version <version>
```

See **Uctl CLI** for more details. ## Activating and deactivating a launch plan in Python with `FlyteRemote` To activate a launch plan version using `FlyteRemote`:

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),
    default_project="<project>",
    default_domain="<domain>"
)
launch_plan = remote.fetch_launch_plan(name="<launch-plan-name>", version="<version>")
remote.client.update_launch_plan(launch_plan.id, "ACTIVE")
```

To deactivate a launch plan version using `FlyteRemote`:

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),
    default_project="<project>",
    default_domain="<domain>"
)
launch_plan = remote.fetch_launch_plan(name="<launch-plan-name>", version="<version>")
remote.client.update_launch_plan(launch_plan.id, "INACTIVE")
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/running-launch-plans === # Running launch plans ## Running a launch plan in the UI To invoke a launch plan, go to the **Workflows** list, select the desired workflow, and click **Launch Workflow**. In the new execution dialog, select the desired launch plan from the **Launch Plan** dropdown menu and click **Launch**. ## Running a launch plan on the command line with `uctl` To invoke a launch plan via the command line, first generate the execution spec file for the launch plan:

```shell
$ uctl get launchplan \
    --project <project> \
    --domain <domain> \
    <launch-plan-name> \
    --execFile <execution-spec-file>.yaml
```

Then you can execute the launch plan with the following command:

```shell
$ uctl create execution \
    --project <project> \
    --domain <domain> \
    --execFile <execution-spec-file>.yaml
```

See **Uctl CLI** for more details. ## Running a launch plan in Python with `FlyteRemote` The following code executes a launch plan using `FlyteRemote`:

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),
    default_project="<project>",
    default_domain="<domain>"
)
launch_plan = remote.fetch_launch_plan(name="<launch-plan-name>", version="<version>")
remote.execute(launch_plan, inputs=<inputs>)
```

See [FlyteRemote](../../development-cycle/union-remote) for more details. ## Sub-launch plans The above invocation examples assume you want to run your launch plan as a top-level entity within your project. However, you can also invoke a launch plan from *within a workflow*, creating a *sub-launch plan*. This causes the invoked launch plan to kick off its workflow, passing any parameters specified to that workflow. This differs from the case of **Core concepts > Workflows > Subworkflows and sub-launch plans** where you invoke one workflow function from within another. A subworkflow becomes part of the execution graph of the parent workflow and shares the same execution ID and context.
On the other hand, when a sub-launch plan is invoked a full, top-level workflow is kicked off with its own execution ID and context. See **Core concepts > Workflows > Subworkflows and sub-launch plans** for more details. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/reference-launch-plans === # Reference launch plans A reference launch plan references previously defined, serialized, and registered launch plans. You can reference launch plans from other projects and create workflows that use launch plans declared by others. When you create a reference launch plan, be sure to verify that the workflow interface corresponds to that of the referenced workflow. > [!NOTE] > Reference launch plans cannot be run locally. To test locally, mock them out. ## Example In this example, we create a reference launch plan for the [`simple_wf`](https://github.com/flyteorg/flytesnacks/blob/master/examples/basics/basics/workflow.py#L25) workflow from the [Flytesnacks repository](https://github.com/flyteorg/flytesnacks). 1. Clone the Flytesnacks repository:

```shell
$ git clone git@github.com:flyteorg/flytesnacks.git
```

2. Navigate to the `basics` directory:

```shell
$ cd flytesnacks/examples/basics
```

3. Register the `simple_wf` workflow:

```shell
$ pyflyte register --project flytesnacks --domain development --version v1 basics/workflow.py
```

4. Create a file called `simple_wf_ref_lp.py` and copy the following code into it:

```python
import flytekit as fl
from flytekit import reference_launch_plan

@reference_launch_plan(
    project="flytesnacks",
    domain="development",
    name="basics.workflow.simple_wf",
    version="v1",
)
def simple_wf_lp(
    x: list[int], y: list[int]
) -> float:
    return 1.0

@fl.workflow
def run_simple_wf() -> float:
    x = [-8, 2, 4]
    y = [-2, 4, 7]
    return simple_wf_lp(x=x, y=y)
```

5. Register the `run_simple_wf` workflow:

```shell
$ pyflyte register simple_wf_ref_lp.py
```

6. In the Flyte UI, run the workflow `run_simple_wf`. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/launch-plans/concurrency-control === # Concurrency control Concurrency control allows you to limit the number of concurrently running workflow executions for a specific launch plan, identified by its unique `project`, `domain`, and `name`. This control is applied across all versions of that launch plan. > [!NOTE] > To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/productionizing/). ## How it works When a new execution for a launch plan with a `ConcurrencyPolicy` is requested, Flyte performs a check to count the number of currently active executions for that same launch plan (`project/domain/name`), irrespective of their versions. This check is done using a database query that joins the `executions` table with the `launch_plans` table. It filters for executions that are in an active phase (e.g., `QUEUED`, `RUNNING`, `ABORTING`, etc.) and belong to the launch plan name being triggered. If the number of active executions is already at or above the `max_concurrency` limit defined in the policy of the launch plan version being triggered, the new execution will be handled according to the specified `behavior`. ## Basic usage Here's an example of how to define a launch plan with concurrency control: ```python from flytekit import ConcurrencyPolicy, ConcurrencyLimitBehavior, LaunchPlan, workflow @workflow def my_workflow() -> str: return "Hello, World!"
# Create a launch plan with concurrency control concurrency_limited_lp = LaunchPlan.get_or_create( name="my_concurrent_lp", workflow=my_workflow, concurrency=ConcurrencyPolicy( max_concurrency=3, behavior=ConcurrencyLimitBehavior.SKIP, ), ) ``` ## Scheduled workflows with concurrency control Concurrency control is particularly useful for scheduled workflows to prevent overlapping executions: ```python from flytekit import ConcurrencyPolicy, ConcurrencyLimitBehavior, CronSchedule, LaunchPlan, workflow @workflow def scheduled_workflow() -> str: # This workflow might take a long time to complete return "Processing complete" # Create a scheduled launch plan with concurrency control scheduled_lp = LaunchPlan.get_or_create( name="my_scheduled_concurrent_lp", workflow=scheduled_workflow, concurrency=ConcurrencyPolicy( max_concurrency=1, # Only allow one execution at a time behavior=ConcurrencyLimitBehavior.SKIP, ), schedule=CronSchedule(schedule="*/5 * * * *"), # Runs every 5 minutes ) ``` ## Defining the policy A `ConcurrencyPolicy` is defined with two main parameters: - `max_concurrency` (integer): The maximum number of workflows that can be running concurrently for this launch plan name. - `behavior` (enum): What to do when the `max_concurrency` limit is reached. Currently, only `SKIP` is supported, which means new executions will not be created if the limit is hit. ```python from flytekit import ConcurrencyPolicy, ConcurrencyLimitBehavior policy = ConcurrencyPolicy( max_concurrency=5, behavior=ConcurrencyLimitBehavior.SKIP ) ``` ## Key behaviors and considerations ### Version-agnostic check, version-specific enforcement The concurrency check counts all active workflow executions of a given launch plan (`project/domain/name`). However, the enforcement (i.e., the `max_concurrency` limit and `behavior`) is based on the `ConcurrencyPolicy` defined in the specific version of the launch plan you are trying to launch. **Example scenario:** 1. Launch plan `MyLP` version `v1` has a `ConcurrencyPolicy` with `max_concurrency = 3`. 2. Three executions of `MyLP` (they could be `v1` or any other version) are currently running. 3. You try to launch `MyLP` version `v2`, which has a `ConcurrencyPolicy` with `max_concurrency = 10`. - **Result**: This `v2` execution will launch successfully because its own limit (10) is not breached by the current 3 active executions. 4. Now, with 4 total active executions (3 original + the new `v2`), you try to launch `MyLP` version `v1` again. - **Result**: This `v1` execution will **fail**. The check sees 4 active executions, and `v1`'s policy only allows a maximum of 3. ### Concurrency limit on manual trigger Upon manual trigger of an execution (via `pyflyte` for example) which would breach the concurrency limit, you should see this error in the console: ```bash _InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.RESOURCE_EXHAUSTED details = "Concurrency limit (1) reached for launch plan my_workflow_lp. Skipping execution." > ``` ### Scheduled execution behavior When the scheduler attempts to trigger an execution and the concurrency limit is met, the creation will fail and the error message from FlyteAdmin will be logged in FlyteScheduler logs. **This will be transparent to the user. A skipped execution will not appear as skipped in the UI or project execution page**. ## Limitations ### "At most" enforcement While the system aims to respect `max_concurrency`, it acts as an "at most" limit. 
Due to the nature of scheduling, workflow execution durations, and the timing of the concurrency check (at launch time), there might be periods where the number of active executions is below `max_concurrency` even if the system could theoretically run more. For example, if `max_concurrency` is 5 and all 5 workflows finish before the next scheduled check/trigger, the count will drop. The system prevents exceeding the limit but doesn't actively try to always maintain `max_concurrency` running instances. ### Notifications for skipped executions Currently, there is no built-in notification system for skipped executions. When a scheduled execution is skipped due to concurrency limits, it will be logged in FlyteScheduler but no user notification will be sent. This is an area for future enhancement. ## Best practices 1. **Use with scheduled workflows**: Concurrency control is most beneficial for scheduled workflows that might take longer than the schedule interval to complete. 2. **Set appropriate limits**: Consider your system resources and the resource requirements of your workflows when setting `max_concurrency`. 3. **Monitor skipped executions**: Regularly check FlyteAdmin logs to monitor if executions are being skipped due to concurrency limits. 4. **Version management**: Be aware that different versions of the same launch plan can have different concurrency policies, but the check is performed across all versions. === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/caching === # Caching Flyte allows you to cache the output of nodes (**Core concepts > Tasks**, **Core concepts > Workflows > Subworkflows and sub-launch plans**) to make subsequent executions faster. Caching is useful when many executions of identical code with the same input may occur. Here's a video with a brief explanation and demo, focused on task caching: 📺 [Watch on YouTube](https://www.youtube.com/watch?v=WNkThCp-gqo) ## Inputs caching In Flyte, input caching allows tasks to automatically cache the input data required for execution. This feature is particularly useful in scenarios where tasks may need to be re-executed, such as during retries due to failures or when manually triggered by users. By caching input data, Flyte optimizes workflow performance and resource usage, preventing unnecessary recomputation of task inputs. ## Outputs caching Output caching in Flyte allows users to cache the results of tasks to avoid redundant computations. This feature is especially valuable for tasks that perform expensive or time-consuming operations where the results are unlikely to change frequently. > [!NOTE] > * Caching is available and can be individually enabled for all nodes *within* a workflow directed acyclic graph (DAG). > * Nodes in this sense include tasks, subworkflows (workflows called directly within another workflow), and sub-launch plans (launch plans called within a workflow). > * Caching is *not available* for top-level workflows or launch plans (that is, those invoked from UI or CLI). > * By default, caching is *disabled* on all tasks, subworkflows and sub-launch plans, to avoid unintended consequences when caching executions with side effects. It must be explicitly enabled on any node where caching is desired. ## Enabling and configuring caching Caching can be enabled by setting the `cache` parameter of the `@fl.task` (for tasks) decorator or `with_overrides` method (for subworkflows or sub-launch plans) to a `Cache` object.
The parameters of the `Cache` object are used to configure the caching behavior. For example:

```python
import flytekit as fl

# Define a task and enable caching for it
@fl.task(cache=fl.Cache(version="1.0", serialize=True, ignored_inputs=["a"]))
def sum(a: int, b: int, c: int) -> int:
    return a + b + c

# Define a workflow to be used as a subworkflow
@fl.workflow
def child_wf(a: int, b: int, c: int) -> list[int]:
    return [
        sum(a=a, b=b, c=c)
        for _ in range(5)
    ]

# Define a launch plan to be used as a sub-launch plan
child_lp = fl.LaunchPlan.get_or_create(child_wf)

# Define a parent workflow that uses the subworkflow
@fl.workflow
def parent_wf_with_subwf(input: int = 0):
    return [
        # Enable caching on the subworkflow
        child_wf(a=input, b=3, c=4).with_overrides(cache=fl.Cache(version="1.0", serialize=True, ignored_inputs=["a"]))
        for i in [1, 2, 3]
    ]

# Define a parent workflow that uses the sub-launch plan
@fl.workflow
def parent_wf_with_sublp(input: int = 0):
    return [
        child_lp(a=input, b=1, c=2).with_overrides(cache=fl.Cache(version="1.0", serialize=True, ignored_inputs=["a"]))
        for i in [1, 2, 3]
    ]
```

In the above example, caching is enabled at multiple levels: * At the task level, in the `@fl.task` decorator of the task `sum`. * At the workflow level, in the `with_overrides` method of the invocation of the workflow `child_wf`. * At the launch plan level, in the `with_overrides` method of the invocation of the launch plan `child_lp`. In each case, the result of the execution is cached and reused in subsequent executions. Here the reuse is demonstrated by calling the `child_wf` and `child_lp` workflows multiple times with the same inputs. Additionally, if the same node is invoked again with the same inputs (excluding input "a", as it is ignored for purposes of cache key calculation) the cached result is returned immediately instead of re-executing the process. This applies even if the cached node is invoked externally through the UI or CLI. ## The `Cache` object The [Cache]() object takes the following parameters: * `version` (`Optional[str]`): Part of the cache key. Changing this version number tells Flyte to ignore previous cached results and run the task again if the task's function has changed. This allows you to explicitly indicate when a change has been made to the task that should invalidate any existing cached results. Note that this is not the only change that will invalidate the cache (see below). Also, note that you can manually trigger cache invalidation per execution using the `overwrite-cache` flag. * `serialize` (`bool`): Enables or disables **Core concepts > Caching > Cache serialization**. When enabled, Flyte ensures that a single instance of the node is run before any other instances that would otherwise run concurrently. This allows the initial instance to cache its result and lets the later instances reuse the resulting cached outputs. If not set, cache serialization is disabled. * `ignored_inputs` (`Union[Tuple[str, ...], str]`): Input variables that should not be included when calculating the hash for the cache. If not set, no inputs are ignored. * `policies` (`Optional[Union[List[CachePolicy], CachePolicy]]`): A list of cache policies to generate the version hash. * `salt` (`str`): A [salt]() used in the hash generation. A salt is a random value that is combined with the input values before hashing. ## The `overwrite-cache` flag When launching the execution of a workflow, launch plan or task, you can use the `overwrite-cache` flag to invalidate the cache and force re-execution.
### Overwrite cache on the command line The `overwrite-cache` flag can be used from the command line with the `pyflyte run` command. For example:

```shell
$ pyflyte run --remote --overwrite-cache example.py wf
```

### Overwrite cache in the UI You can also trigger cache invalidation when launching an execution from the UI by checking **Override** in the launch dialog: ![Overwrite cache flag in the UI](../../_static/images/user-guide/core-concepts/caching/overwrite-cached-outputs.png) ### Overwrite cache programmatically When using `FlyteRemote`, you can use the `overwrite_cache` parameter in the [`FlyteRemote.execute`]() method:

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",
    default_domain="development"
)
wf = remote.fetch_workflow(name="workflows.example.wf")
execution = remote.execute(wf, inputs={"name": "Kermit"}, overwrite_cache=True)
```

## How caching works When a node (with caching enabled) completes on Flyte, a **key-value entry** is created in the **caching table**. The **value** of the entry is the output. The **key** is composed of: * **Project:** A task run under one project cannot use the cached task execution from another project which would cause inadvertent results between project teams that could result in data corruption. * **Domain:** To separate test, staging, and production data, task executions are not shared across these environments. * **Cache version**: When task functionality changes, you can change the cache version of the task. Flyte will know not to use older cached task executions and create a new cache entry on the subsequent execution. * **Node signature:** The cache is specific to the signature associated with the execution. The signature comprises the name, input parameter names/types, and the output parameter name/type of the node. If the signature changes, the cache entry is invalidated. * **Input values:** A well-formed Flyte node always produces deterministic outputs. This means that, given a set of input values, every execution should have identical outputs. When an execution is cached, the input values are part of the cache key. If a node is run with a new set of inputs, a new cache entry is created for the combination of that particular entity with those particular inputs. The result is that within a given project and domain, a cache entry is created for each distinct combination of name, signature, cache version, and input set for every node that has caching enabled. If the same node with the same input values is encountered again, the cached output is used instead of running the process again. ### Explicit cache version When a change to code is made that should invalidate the cache for that node, you can explicitly indicate this by incrementing the `version` parameter value. For a task example, see below. (For workflows and launch plans, the parameter would be specified in the `with_overrides` method.)

```python
@fl.task(cache=fl.Cache(version="1.1"))
def t(n: int) -> int:
    return n * n + 1
```

Here the `version` parameter has been bumped from `1.0` to `1.1`, invalidating the existing cache. The next time the task is called it will be executed and the result re-cached under an updated key. However, if you change the version back to `1.0`, you will get a "cache hit" again and skip the execution of the task code. If used, the `version` parameter must be explicitly changed in order to invalidate the cache.
Not every Git revision of a node will necessarily invalidate the cache. A change in Git SHA does not necessarily correlate to a change in functionality. You can refine your code without invalidating the cache as long as you explicitly use, and don't change, the `version` parameter (or the signature, see below) of the node.

The idea behind this is to decouple purely cosmetic changes (for example, changed documentation or renamed variables) from changes to logic that can affect the process's result. When you use Git (or any version control system), you have a new version per code change. Since the behavior of most nodes in a Git repository will remain unchanged, you don't want their cached outputs to be lost. When a node's behavior does change though, you can bump `version` to invalidate the cache entry and make the system recompute the outputs.

### Node signature

If you modify the signature of a node by adding, removing, or editing input parameters or output return types, Flyte invalidates the cache entries for that node. During the next execution, Flyte executes the process again and caches the outputs as new values stored under an updated key.

### Caching when running locally

The description above applies to caching when executing a node remotely on your Flyte cluster. Caching is also available when **Development cycle > Running in a local cluster**.

When running locally the caching mechanism is the same except that the cache key does not include **project** or **domain** (since there are none). The cache key is composed only of **cache version**, **signature**, and **inputs**.

The results of local executions are stored under `~/.flyte/local-cache/`. Similar to the remote case, a local cache entry for a node will be invalidated if either the `cache_version` or the signature is modified. In addition, the local cache can also be emptied by running

```shell
$ pyflyte local-cache clear
```

This removes the contents of the `~/.flyte/local-cache/` directory.

Occasionally, you may want to disable the local cache for testing purposes, without making any code changes to your task decorators. You can set the `FLYTE_LOCAL_CACHE_ENABLED` environment variable to `false` in your terminal in order to bypass caching temporarily.

## Cache serialization

Cache serialization means only executing a single instance of a unique cacheable task (determined by the `cache_version` parameter and task signature) at a time. Using this mechanism, Flyte ensures that during multiple concurrent executions of a task only a single instance is evaluated, and all others wait until completion and reuse the resulting cached outputs.

Ensuring serialized evaluation requires a small degree of overhead to coordinate executions using a lightweight artifact reservation system. Therefore, this should be viewed as an extension to rather than a replacement for non-serialized cacheable tasks. It is particularly well suited to long-running or otherwise computationally expensive tasks executed in scenarios similar to the following examples:

* Periodically scheduled workflows where a single task evaluation duration may span multiple scheduled executions.
* Running a commonly shared task within different workflows (which receive the same inputs).

### Enabling cache serialization

Task cache serialization is disabled by default to avoid unexpected behavior for task executions. To enable it, set `serialize=True` in the `@fl.task` decorator. The cache key definitions follow the same rules as non-serialized cache tasks.
```python
@fl.task(cache=fl.Cache(version="1.1", serialize=True))
def t(n: int) -> int:
    return n * n
```

In the above example, calling `t(n=2)` multiple times concurrently (even in different executions or workflows) will only execute the multiplication operation once. Concurrently evaluated tasks will wait for completion of the first instance before reusing the cached results, and subsequent evaluations will instantly reuse existing cache results.

### How does cache serialization work?

The cache serialization paradigm introduces a new artifact reservation system. Executions with cache serialization enabled use this reservation system to acquire an artifact reservation, indicating that they are actively evaluating a node, and release the reservation once the execution is completed. Flyte uses a clock-skew algorithm to define reservation timeouts. Therefore, executions are required to periodically extend the reservation during their run.

The first execution of a serializable node will successfully acquire the artifact reservation. Execution will be performed as usual and upon completion, the results are written to the cache, and the reservation is released. Concurrently executed node instances (those that would otherwise run in parallel with the initial execution) will observe an active reservation, in which case these instances will wait until the next reevaluation and perform another check. Once the initial execution completes, they will reuse the cached results, as will any subsequent instances of the same node.

Flyte handles execution failures using a timeout on the reservation. If the execution currently holding the reservation fails to extend it before it times out, another execution may acquire the reservation and begin processing.

## Caching of offloaded objects

In some cases, the default behavior displayed by Flyte's caching feature might not match the user's intuition. For example, this code makes use of pandas dataframes:

```python
@fl.task
def foo(a: int, b: str) -> pandas.DataFrame:
    df = pandas.DataFrame(...)
    ...
    return df


@fl.task(cache=True)
def bar(df: pandas.DataFrame) -> int:
    ...


@fl.workflow
def wf(a: int, b: str):
    df = foo(a=a, b=b)
    v = bar(df=df)
```

If run twice with the same inputs, one would expect that `bar` would trigger a cache hit, but that's not the case because of the way dataframes are represented in Flyte.

However, Flyte provides a new way to control the caching behavior of literals. This is done via a `typing.Annotated` call on the node signature. For example, in order to cache the result of calls to `bar`, you can rewrite the code above like this:

```python
from typing import Annotated

import pandas
from flytekit import HashMethod


def hash_pandas_dataframe(df: pandas.DataFrame) -> str:
    return str(pandas.util.hash_pandas_object(df))


@fl.task
def foo_1(a: int, b: str) -> Annotated[pandas.DataFrame, HashMethod(hash_pandas_dataframe)]:
    df = pandas.DataFrame(...)
    ...
    return df


@fl.task(cache=True)
def bar_1(df: pandas.DataFrame) -> int:
    ...


@fl.workflow
def wf_1(a: int, b: str):
    df = foo_1(a=a, b=b)
    v = bar_1(df=df)
```

Note how the output of the task `foo_1` is annotated with an object of type `HashMethod`. Essentially, it represents a function that produces a hash that is used as part of the cache key calculation when calling the task `bar_1`.

### How does caching of offloaded objects work?

Recall how input values are taken into account to derive a cache key. This is done by turning the literal representation into a string and using that string as part of the cache key.
In the case of dataframes annotated with `HashMethod`, we use the hash as the representation of the literal. In other words, the literal hash is used in the cache key. This feature also works in local execution.

Here's a complete example of the feature:

```python
import time
from typing import Annotated

import flytekit as fl
import pandas
from flytekit import HashMethod
from flytekit.core.node_creation import create_node


def hash_pandas_dataframe(df: pandas.DataFrame) -> str:
    return str(pandas.util.hash_pandas_object(df))


@fl.task
def uncached_data_reading_task() -> Annotated[pandas.DataFrame, HashMethod(hash_pandas_dataframe)]:
    return pandas.DataFrame({"column_1": [1, 2, 3]})


@fl.task(cache=True)
def cached_data_processing_task(df: pandas.DataFrame) -> pandas.DataFrame:
    time.sleep(1)
    return df * 2


@fl.task
def compare_dataframes(df1: pandas.DataFrame, df2: pandas.DataFrame):
    assert df1.equals(df2)


@fl.workflow
def cached_dataframe_wf():
    raw_data = uncached_data_reading_task()

    # Execute `cached_data_processing_task` twice, but force those
    # two executions to happen serially to demonstrate how the second run
    # hits the cache.
    t1_node = create_node(cached_data_processing_task, df=raw_data)
    t2_node = create_node(cached_data_processing_task, df=raw_data)
    t1_node >> t2_node

    # Confirm that the dataframes actually match
    compare_dataframes(df1=t1_node.o0, df2=t2_node.o0)


if __name__ == "__main__":
    df1 = cached_dataframe_wf()
    print(f"Running cached_dataframe_wf once : {df1}")
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/named-outputs ===

# Named outputs

By default, Flyte employs a standardized convention to assign names to the outputs of tasks or workflows. Each output is sequentially labeled as `o1`, `o2`, `o3`, and so on.

You can, however, customize these output names by using a `NamedTuple`.

To begin, import the required dependencies:

```python
# basics/named_outputs.py
from typing import NamedTuple

import flytekit as fl
```

Here we define a `NamedTuple` and assign it as an output to a task called `slope`:

```python
slope_value = NamedTuple("slope_value", [("slope", float)])


@fl.task
def slope(x: list[int], y: list[int]) -> slope_value:
    sum_xy = sum([x[i] * y[i] for i in range(len(x))])
    sum_x_squared = sum([x[i] ** 2 for i in range(len(x))])
    n = len(x)
    return (n * sum_xy - sum(x) * sum(y)) / (n * sum_x_squared - sum(x) ** 2)
```

Similarly, we define another `NamedTuple` and assign it to the output of another task, `intercept`:

```python
intercept_value = NamedTuple("intercept_value", [("intercept", float)])


@fl.task
def intercept(x: list[int], y: list[int], slope: float) -> intercept_value:
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    intercept = mean_y - slope * mean_x
    return intercept
```

> [!Note]
> While it's possible to create `NamedTuples` directly within the code,
> it's often better to declare them explicitly.
> This helps prevent potential linting errors in tools like `mypy`.
>
> ```python
> def slope() -> NamedTuple("slope_value", slope=float):
>     pass
> ```

You can easily unpack the `NamedTuple` outputs directly within a workflow. Additionally, you can also have the workflow return a `NamedTuple` as an output.

> [!Note]
> Remember that we are extracting individual task execution outputs by dereferencing them.
> This is necessary because `NamedTuples` function as tuples and require dereferencing.
```python
slope_and_intercept_values = NamedTuple("slope_and_intercept_values", [("slope", float), ("intercept", float)])


@fl.workflow
def simple_wf_with_named_outputs(x: list[int] = [-3, 0, 3], y: list[int] = [7, 4, -2]) -> slope_and_intercept_values:
    slope_value = slope(x=x, y=y)
    intercept_value = intercept(x=x, y=y, slope=slope_value.slope)
    return slope_and_intercept_values(slope=slope_value.slope, intercept=intercept_value.intercept)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/core-concepts/image-spec ===

# ImageSpec

In this section, you will uncover how Flyte utilizes Docker images to construct containers under the hood, and you'll learn how to craft your own images to encompass all the necessary dependencies for your tasks or workflows. You will explore how to execute a raw container with custom commands, indicate multiple container images within a single workflow, and get familiar with the ins and outs of `ImageSpec`!

`ImageSpec` allows you to customize the container image for your Flyte tasks without a Dockerfile. `ImageSpec` speeds up the build process by allowing you to reuse previously downloaded packages from the PyPI and APT caches.

By default, the `ImageSpec` will be built using the default Docker builder, but you can always specify your own, e.g. [flytekitplugins-envd](https://github.com/flyteorg/flytekit/plugins/flytekit-envd/flytekitplugins/envd/image_builder.py#L25), which uses envd to build the ImageSpec.

For every `flytekit.PythonFunctionTask` task or a task decorated with the `@task` decorator, you can specify rules for binding container images. By default, flytekit binds a single container image, i.e., the [default Docker image](https://ghcr.io/flyteorg/flytekit), to all tasks. To modify this behavior, use the `container_image` parameter available in the `flytekit.task` decorator, and pass an `ImageSpec` definition.

Before building the image, flytekit checks the container registry to see if the image already exists. If the image does not exist, flytekit will build the image before registering the workflow and replace the image name in the task template with the newly built image name.

> [!NOTE] Prerequisites
> * Make sure `docker` is running on your local machine.
> * When using a registry in ImageSpec, `docker login` is required to push the image.

## Install Python or APT packages

You can specify Python packages and APT packages in the `ImageSpec`. These specified packages will be added on top of the [default image](https://github.com/flyteorg/flytekit/blob/master/Dockerfile), which can be found in the flytekit Dockerfile. More specifically, flytekit invokes the [DefaultImages.default_image()](https://github.com/flyteorg/flytekit/blob/master/flytekit/configuration/default_images.py#L26-L27) function. This function determines and returns the default image based on the Python version and flytekit version. For example, if you are using Python 3.8 and flytekit 1.6.0, the default image assigned will be `ghcr.io/flyteorg/flytekit:py3.8-1.6.0`.

> [!NOTE] Prerequisites
> Replace `ghcr.io/flyteorg` with a container registry you can publish to.
> To upload the image to the local registry in the demo cluster,
> indicate the registry as `localhost:30000` using the `registry` argument to `ImageSpec`.
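For example, a minimal sketch (assuming the demo cluster's local registry and a hypothetical image name) might look like this:

```python
from flytekit import ImageSpec

# Hypothetical example: build the image and push it to the demo cluster's
# local registry mentioned in the note above.
demo_image_spec = ImageSpec(
    name="my-demo-image",        # hypothetical image name
    registry="localhost:30000",  # the demo cluster's local registry
    packages=["pandas"],
)
```

The following example installs both Python and APT packages on top of the default image: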
```python
from flytekit import ImageSpec

sklearn_image_spec = ImageSpec(
    packages=["scikit-learn", "tensorflow==2.5.0"],
    apt_packages=["curl", "wget"],
)
```

> [!WARNING]
> Images built with ImageSpec do **not** include CA certificates by default, which can break TLS validation and block access to remote storage when Polars uses its native Rust-based networking (e.g., when using `polars.scan_parquet()`).
> **Solution:** Add `"ca-certificates"` to `apt_packages` in your `ImageSpec`.

## Install Conda packages

Define the `ImageSpec` to install packages from a specific conda channel.

```python
image_spec = ImageSpec(
    conda_packages=["langchain"],
    conda_channels=["conda-forge"],  # List of channels to pull packages from.
)
```

## Use different Python versions in the image

You can specify the Python version in the `ImageSpec` to build the image with a different Python version.

```python
image_spec = ImageSpec(
    packages=["pandas"],
    python_version="3.9",
)
```

## Import modules only in a specific `ImageSpec` environment

The `is_container()` method is used to determine whether the task is utilizing the image constructed from the `ImageSpec`. If the task is indeed using the image built from the `ImageSpec`, it will return `True`. This approach helps minimize module loading time and prevents unnecessary dependency installation within a single image.

In the following example, both `task1` and `task2` will import the `pandas` module. However, `tensorflow` will only be imported in `task2`.

```python
from flytekit import ImageSpec, task
import pandas as pd

pandas_image_spec = ImageSpec(
    packages=["pandas"],
    registry="ghcr.io/flyteorg",
)

tensorflow_image_spec = ImageSpec(
    packages=["tensorflow", "pandas"],
    registry="ghcr.io/flyteorg",
)

# Return if and only if the task is using the image built from tensorflow_image_spec.
if tensorflow_image_spec.is_container():
    import tensorflow as tf


@task(container_image=pandas_image_spec)
def task1() -> pd.DataFrame:
    return pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [1, 22]})


@task(container_image=tensorflow_image_spec)
def task2() -> int:
    num_gpus = len(tf.config.list_physical_devices('GPU'))
    print("Num GPUs Available: ", num_gpus)
    return num_gpus
```

## Install CUDA in the image

There are a few ways to install CUDA in the image.

### Use Nvidia docker image

CUDA is pre-installed in the Nvidia docker image. You can specify the base image in the `ImageSpec`.

```python
image_spec = ImageSpec(
    base_image="nvidia/cuda:12.6.1-cudnn-devel-ubuntu22.04",
    packages=["tensorflow", "pandas"],
    python_version="3.9",
)
```

### Install packages from extra index

CUDA can be installed by specifying the `pip_extra_index_url` in the `ImageSpec`.

```python
image_spec = ImageSpec(
    name="pytorch-mnist",
    packages=["torch", "torchvision", "flytekitplugins-kfpytorch"],
    pip_extra_index_url=["https://download.pytorch.org/whl/cu118"],
)
```

## Build an image in a different architecture

You can specify the platform in the `ImageSpec` to build the image in a different architecture, such as `linux/arm64` or `darwin/arm64`.

```python
image_spec = ImageSpec(
    packages=["pandas"],
    platform="linux/arm64",
)
```

## Install flytekit from GitHub

When you update flytekit, you may want to test the changes with your tasks. You can install flytekit from a specific commit hash in the `ImageSpec`.
```python
new_flytekit = "git+https://github.com/flyteorg/flytekit@90a4455c2cc2b3e171dfff69f605f47d48ea1ff1"
new_spark_plugins = f"git+https://github.com/flyteorg/flytekit.git@90a4455c2cc2b3e171dfff69f605f47d48ea1ff1#subdirectory=plugins/flytekit-spark"

image_spec = ImageSpec(
    apt_packages=["git"],
    packages=[new_flytekit, new_spark_plugins],
    registry="ghcr.io/flyteorg",
)
```

## Customize the tag of the image

You can customize the tag of the image by specifying the `tag_format` in the `ImageSpec`. In the following example, the tag will be `<spec_hash>-dev`, where `<spec_hash>` is the hash computed from the `ImageSpec`.

```python
image_spec = ImageSpec(
    name="my-image",
    packages=["pandas"],
    tag_format="{spec_hash}-dev",
)
```

## Copy additional files or directories

You can specify files or directories to be copied into the container `/root`, allowing users to access the required files. The directory structure will match the relative path. Since Docker only supports relative paths, absolute paths and paths outside the current working directory (e.g., paths with "../") are not allowed.

```python
from flytekit import task, workflow, ImageSpec

image_spec = ImageSpec(
    name="image_with_copy",
    copy=["files/input.txt"],
)


@task(container_image=image_spec)
def my_task() -> str:
    with open("/root/files/input.txt", "r") as f:
        return f.read()
```

## Define ImageSpec in a YAML File

You can override the container image by providing an ImageSpec YAML file to the `pyflyte run` or `pyflyte register` command. This allows for greater flexibility in specifying a custom container image. For example:

```yaml
# imageSpec.yaml
python_version: 3.11
packages:
  - sklearn
env:
  Debug: "True"
```

Use `pyflyte` to run the workflow with this image:

```shell
$ pyflyte run --remote --image imageSpec.yaml image_spec.py wf
```

## Build the image without registering the workflow

If you only want to build the image without registering the workflow, you can use the `pyflyte build` command.

```shell
$ pyflyte build --remote image_spec.py wf
```

## Force push an image

In some cases, you may want to force an image to rebuild, even if the ImageSpec hasn't changed. To overwrite an existing image, pass `FLYTE_FORCE_PUSH_IMAGE_SPEC=True` to the `pyflyte` command.

```bash
FLYTE_FORCE_PUSH_IMAGE_SPEC=True pyflyte run --remote image_spec.py wf
```

You can also force push an image in the Python code by calling the `force_push()` method.

```python
image = ImageSpec(packages=["pandas"]).force_push()
```

## Getting source files into ImageSpec

Typically, getting source code files into a task's image at run time on a live Flyte backend is done through the fast registration mechanism. However, if your `ImageSpec` constructor specifies a `source_root` and the `copy` argument is set to something other than `CopyFileDetection.NO_COPY`, then files will be copied regardless of fast registration status.

If the `source_root` and `copy` fields of an `ImageSpec` are left blank, then whether or not your source files are copied into the built `ImageSpec` image depends on whether or not you use fast registration. Please see **Development cycle > Running your code** for the full explanation.

Since files are sometimes copied into the built image, the tag that is published for an ImageSpec will change based on whether fast register is enabled, and the contents of any files copied.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle ===

# Development cycle

This section covers developing production-ready workflows for Flyte.
## Subpages

- **Development cycle > Project structure**
- **Development cycle > Projects and domains**
- **Development cycle > Building workflows**
- **Development cycle > Setting up a production project**
- **Development cycle > Local dependencies**
- **Development cycle > ImageSpec**
- **Development cycle > Running your code**
- **Development cycle > Overriding parameters**
- **Development cycle > Run details**
- **Development cycle > Debugging with interactive tasks**
- **Development cycle > Task resource validation**
- **Development cycle > Running in a local cluster**
- **Development cycle > Jupyter notebooks**
- **Development cycle > Decks**
- **Development cycle > Migrating from Airflow to Flyte**
- **Development cycle > FlyteRemote**
- **Development cycle > Testing**

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/project-structure ===

# Project structure

Organizing a workflow project repository effectively is key for ensuring scalability, collaboration, and easy maintenance. Here are best practices for structuring a Flyte workflow project repo, covering task organization, workflow management, dependency handling, and documentation.

## Recommended Directory Structure

A typical Flyte workflow project structure could look like this:

```shell
├── .github/workflows/
├── .gitignore
├── docs/
│   └── README.md
├── src/
│   ├── core/                # Core logic specific to the use case
│   │   ├── __init__.py
│   │   ├── model.py
│   │   ├── data.py
│   │   └── structs.py
│   ├── tasks/               # Contains individual tasks
│   │   ├── __init__.py
│   │   ├── preprocess.py
│   │   ├── fit.py
│   │   ├── test.py
│   │   └── plot.py
│   ├── workflows/           # Contains workflow definitions
│   │   ├── __init__.py
│   │   ├── inference.py
│   │   └── train.py
│   └── orchestration/       # For helper constructs (e.g., secrets, images)
│       ├── __init__.py
│       └── constants.py
├── uv.lock
└── pyproject.toml
```

This structure is designed to ensure each project component has a clear, logical home, making it easy for team members to find and modify files.

## Organizing Tasks and Workflows

In Flyte, tasks are the building blocks of workflows, so it's important to structure them intuitively:

* **Tasks**: Store each task in its own file within the `tasks/` directory. If multiple tasks are closely related, consider grouping them within a module. Alternatively, each task can have its own module to allow more granular organization, and sub-directories could be used to group similar tasks.
* **Workflows**: Store workflows, which combine tasks into end-to-end processes, in the `workflows/` directory. This separation ensures workflows are organized independently from core task logic, promoting modularity and reuse.

## Orchestration Directory for Helper Constructs

Include a directory, such as `orchestration/` or `union_utils/`, for constructs that facilitate workflow orchestration. This can house helper files like:

* **Secrets**: Definitions for accessing secrets (e.g., API keys) in Flyte.
* **ImageSpec**: A tool that simplifies container management, allowing you to avoid writing Dockerfiles directly.

## Core Logic for Workflow-Specific Functionality

Use a `core/` directory for business logic specific to your workflows.
This keeps the core application code separate from workflow orchestration code, improving maintainability and making it easier for new team members to understand core functionality.

## Importance of `__init__.py`

Adding `__init__.py` files within each directory is essential:

* **For Imports**: These files make the directory a Python package, enabling proper imports across modules.
* **For Flyte's Fast Registration**: When performing fast registration, Flyte considers the first directory without an `__init__.py` as the root. Flyte will then package the root and its contents into a tarball, streamlining the registration process and avoiding the need to rebuild the container image every time you make code changes.

## Monorepo vs Multi-repo: Choosing a structure

When working with multiple teams, you have two main options:

* **Monorepo**: A single repository shared across all teams, which can simplify dependency management and allow for shared constructs. However, it can introduce complexity in permissions and version control for different teams.
* **Multi-repo**: Separate repositories for each team or project can improve isolation and control. In this case, consider creating shared, installable packages for constructs that multiple teams use, ensuring consistency without merging codebases.

## CI/CD

The GitHub action should:

* Register (and promote if needed) on merge to a domain branch.
* Execute on merge of input YAML.
* Inject the Git SHA as the entity version.

## Documentation and Docstrings

Writing clear docstrings is encouraged, as they are automatically propagated to the Flyte UI. This provides useful context for anyone viewing the workflows and tasks in the UI, reducing the need to consult source code for explanations.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/projects-and-domains ===

# Projects and domains

Projects and domains are the principal organizational categories into which you group your workflows in Flyte.

Projects define groups of tasks, workflows, launch plans and other entities that share a functional purpose. Domains represent distinct steps through which the entities in a project transition as they proceed through the development cycle.

By default, Flyte provides three domains: `development`, `staging`, and `production`. During onboarding, you can configure your Flyte instance to have different domains. Speak to the Flyte team for more information.

Projects and domains are orthogonal to each other, meaning that a project has multiple domains and a domain has multiple projects. Here is an example arrangement:

|           | Development       | Staging           | Production        |
|-----------|-------------------|-------------------|-------------------|
| Project 1 | workflow_1 (v2.0) | workflow_1 (v1.0) | workflow_1 (v1.0) |
| Project 2 | workflow_2 (v2.0) | workflow_2 (v1.0) | workflow_2 (v1.0) |

## Projects

Projects represent independent workflows related to specific teams, business areas, or applications. Each project is isolated from others, but workflows can reference entities (workflows or tasks) from other projects to reuse generalizable resources.

## Domains

Domains represent distinct environments orthogonal to the set of projects in your org within Flyte, such as development, staging, and production. These enable dedicated configurations, permissions, secrets, cached execution history, and resource allocations for each environment, preventing unintended impact on other projects and/or domains.
Using domains allows for a clear separation between environments, helping ensure that development and testing don't interfere with production workflows. A production domain ensures a "clean slate" so that cached development executions do not result in unexpected behavior. Additionally, secrets may be configured for external production data sources.

## When to use different Flyte projects?

Projects help group independent workflows related to specific teams, business areas, or applications. Generally speaking, each independent team or ML product should have its own Flyte project. Even though these are isolated from one another, teams may reference entities (workflows or tasks) from other Flyte projects to reuse generalizable resources. For example, one team may create a generalizable task to train common model types. However, this requires advanced collaboration and common coding standards.

When setting up workflows in Flyte, effective use of **projects** and **domains** is key to managing environments, permissions, and resource allocation. Below are best practices to consider when organizing workflows in Flyte.

## Projects and Domains: The Power of the Project-Domain Pair

Flyte uses a project-domain pair to create isolated configurations for workflows. This pairing allows for:

* **Dedicated Permissions**: Through Role-Based Access Control (RBAC), users can be assigned roles with tailored permissions, such as contributor or admin, specific to individual project-domain pairs. This allows fine-grained control over who can manage or execute workflows within each pair, ensuring that permissions are both targeted and secure. More details in **Administration > User management > Custom roles and policies**.
* **Resource and Execution Monitoring**: Track and monitor resource utilization, executions, and performance metrics on a dashboard unique to each project-domain pair. This helps maintain visibility over workflow execution and ensures optimal performance. More details in **Administration > Resources**.
* **Resource Allocations and Quotas**: By setting quotas for each project-domain pair, Flyte can ensure that workflows do not exceed designated limits, preventing any project or domain from unintentionally impacting resources available to others. Additionally, you can configure unique resource defaults, such as memory, CPU, and storage allocations, for each project-domain pair. This allows each pair to meet the specific requirements of its workflows, which is particularly valuable given the unique needs across different projects. More details in **Core concepts > Tasks > Task hardware environment > Customizing task resources > Execution defaults and resource quotas** and **Administration > Resources**.
* **Configuring Secrets**: Flyte allows you to configure secrets at the project-domain level, ensuring sensitive information, such as API keys and tokens, is accessible only within the specific workflows that need them. This enhances security by isolating secrets according to the project and domain, reducing the risk of unauthorized access across environments. More details in **Development cycle > Managing secrets**.

## Domains: Clear Environment Separation

Domains represent distinct environments within Flyte, allowing clear separation between development, staging, and production. This structure helps prevent cross-environment interference, ensuring that changes made in development or testing do not affect production workflows.
Using domains for this separation ensures that workflows can evolve in a controlled manner across different stages, from initial development through to production deployment.

## Projects: Organizing Workflows by Teams, Business Areas, or Applications

Projects in Flyte are designed to group independent workflows around specific teams, business functions, or applications. By aligning projects to organizational structure, you can simplify access control and permissions while encouraging a clean separation of workflows across different teams or use cases. Although workflows can reference each other across projects, it's generally cleaner to maintain independent workflows within each project to avoid complexity.

Flyte's CLI tools and SDKs provide options to specify projects and domains easily:

* **CLI Commands**: In most commands within the `pyflyte` and `flytectl` CLIs, you can specify the project and domain by using the `--project` and `--domain` flags, enabling precise control over which project-domain pair a command applies to. More details in **Pyflyte CLI** and **Flytectl CLI**.
* **Python SDK**: When working with the `flytekit` SDK, you can leverage `FlyteRemote` to define the project and domain for workflow interactions programmatically, ensuring that all actions occur in the intended environment. More details [here](union-remote).

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/building-workflows ===

# Building workflows

## When should I decompose tasks?

There are several reasons why one may choose to decompose a task into smaller tasks. Doing so may result in better computational performance, improved cache performance, and taking advantage of interruptible tasks. However, decomposition comes at the cost of the overhead among tasks, including spinning up nodes and downloading data. In some cases, these costs may be remediated by using **Core concepts > Actors**.

### Differing runtime requirements

Firstly, decomposition provides support for heterogeneous environments among the operations in the task. For example, you may have some large task that trains a machine learning model and then uses the model to run batch inference on your test data. However, training a model typically requires significantly more memory than inference. For that reason, given large enough scale, it could actually be beneficial to decompose this large task into two tasks that (1) train a model and then (2) run batch inference. By doing so, you could request significantly less memory for the second task in order to save on the expense of this workflow.

If you are working with even more data, then you might benefit from decomposing the batch inference task via `map_task` such that you may further parallelize this operation, substantially reducing the runtime of this step. Generally speaking, decomposition provides infrastructural flexibility regarding the ability to define resources, dependencies, and execution parallelism.

### Improved cache performance

Secondly, you may decompose large tasks into smaller tasks to enable "fine-grained" caching. In other words, each unique task provides an automated "checkpoint" system. Thus, by breaking down a large workflow into its many natural tasks, one may minimize redundant work among multiple serial workflow executions. This is especially useful during rapid, iterative development, during which a user may attempt to run the same workflow multiple times in a short period of time.
โ€œFine-grainedโ€ caching will dramatically improve productivity while executing workflows both locally and remotely. ### Take advantage of interruptible tasks Lastly, one may utilize โ€œfine-grainedโ€ caching to leverage interruptible tasks. Interruptible tasks will attempt to run on spot instances or spot VMs, where possible. These nodes are interruptible, meaning that the task may occasionally fail due to another organization willing to pay more to use it. However, these spot instances can be substantially cheaper than their non-interruptible counterparts (on-demand instances / VMs). By utilizing โ€œfine-grainedโ€ caching, one may reap the significant cost savings on interruptible tasks while minimizing the effects of having their tasks being interrupted. ## When should I parallelize tasks? In general, parallelize early and often. A lot of Flyteโ€™s powerful ergonomics like caching and workflow recovery happen at the task level, as mentioned above. Decomposing into smaller tasks and parallelizing enables for a performant and fault-tolerant workflow. One caveat is for very short duration tasks, where the overhead of spinning up a pod and cleaning it up negates any benefits of parallelism. With reusable containers via **Core concepts > Actors**, however, these overheads are transparently obviated, providing the best of both worlds at the cost of some up-front work in setting up that environment. In any case, it may be useful to batch the inputs and outputs to amortize any overheads. Please be mindful to keep the sequencing of inputs within a batch, and of the batches themselves, to ensure reliable cache hits. ### Parallelization constructs The two main parallelization constructs in Flyte are the **Development cycle > Building workflows > map task** and the **Core concepts > Workflows > Dynamic workflows**. They accomplish roughly the same goal but are implemented quite differently and have different advantages. Dynamic tasks are more akin to a `for` loop, iterating over inputs sequentially. The parallelism is controlled by the overall workflow parallelism. Map tasks are more efficient and have no such sequencing guarantees. They also have their own concurrency setting separate from the overall workflow and can have a minimum failure threshold of their constituent tasks. A deeper explanation of their differences is available [here]() while examples of how to use them together can be found [here](). ## When should I use caching? Caching should be enabled once the body of a task has stabilized. Cache keys are implicitly derived from the task signature, most notably the inputs and outputs. If the body of a task changes without a modification to the signature, and the same inputs are used, it will produce a cache hit. This can result in unexpected behavior when iterating on the core functionality of the task and expecting different inputs downstream. Moreover, caching will not introspect the contents of a `FlyteFile` for example. If the same URI is used as input with completely different contents, it will also produce a cache hit. For these reasons, itโ€™s wise to add an explicit cache key so that it can be invalidated at any time. Despite these caveats, caching is a huge time saver during workflow development. Caching upstream tasks enable a rapid run through of the workflow up to the node youโ€™re iterating on. Additionally, caching can be valuable in complex parallelization scenarios where youโ€™re debugging the failure state of large map tasks, for example. 
In production, if your cluster is under heavy resource constraints, caching can allow a workflow to complete across re-runs as more and more tasks are able to return successfully with each run. While not an ideal scenario, caching can help soften the blow of production failures. With these caveats in mind, there are very few scenarios where caching isn't warranted.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/setting-up-a-project ===

# Setting up a production project

In Flyte, your work is organized in a hierarchy with the following structure:

* **Organization**: Your Flyte instance, accessible at a specific URL like `flyte.my-company.com`.
* **Domains**: Within an organization there are (typically) three domains, `development`, `staging`, and `production`, used to organize your code during the development process. You can configure a custom set of domains to suit your needs during **Configuring your data plane**.
* **Projects**: Orthogonal to domains, projects are used to organize your code into logical groups. You can create as many projects as you need. A given workflow will reside in a specific project.

For example, let's say `my_workflow` is a workflow in `my_project`. When you start work on `my_workflow` you would typically register it in the project-domain `my_project/development`. As you work on successive iterations of the workflow you might promote `my_workflow` to `my_project/staging` and eventually `my_project/production`. Promotion is done simply by **Development cycle > Running your code**.

## Terminology

In everyday use, the term "project" is often used to refer to not just the Flyte entity that holds a set of workflows, but also to the local directory in which you are developing those workflows, and to the GitHub (or other SCM) repository that you are using to store the same workflow code. To avoid confusion, in this guide we will stick to the following naming conventions:

* **Flyte project**: The entity in your Flyte instance that holds a set of workflows, as described above. Often referred to simply as a **project**.
* **Local project**: The local directory (usually the working directory of a GitHub repository) in which you are developing workflows.

## Create a Flyte project

Ensure that you have **Getting started > Local setup > Install `flytectl` to set up a local cluster** and the connection to your Flyte cluster **Development cycle > Setting up a production project > properly configured**.

Now, create a new project on your Flyte cluster:

```shell
$ flytectl create project \
      --id "my-project" \
      --labels "my-label=my-project" \
      --description "My Flyte project" \
      --name "My project"
```

## Creating a local production project directory using `pyflyte init`

Earlier, in the [Getting started](../getting-started/_index) section we used `pyflyte init` to create a new local project based on the `flyte-simple` template. Here, we will do the same, but use the `flyte-production` template.
Perform the following command:

```shell
$ pyflyte init --template union-production my-project
```

## Directory structure

In the `my-project` directory you'll see the following file structure:

```shell
├── LICENSE
├── README.md
├── docs
│   └── docs.md
├── pyproject.toml
├── src
│   ├── core
│   │   ├── __init__.py
│   │   └── core.py
│   ├── orchestration
│   │   ├── __init__.py
│   │   └── orchestration.py
│   ├── tasks
│   │   ├── __init__.py
│   │   └── say_hello.py
│   └── workflows
│       ├── __init__.py
│       └── hello_world.py
└── uv.lock
```

You can create your own conventions and file structure for your production projects, but this template provides a good starting point. However, the separate `workflows` subdirectory and the contained `__init__.py` file are significant. We will discuss them when we cover **Development cycle > Running your code**.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/local-dependencies ===

# Local dependencies

During the development cycle you will want to be able to run your workflows both locally on your machine and remotely on Flyte. To enable this, you need to ensure that the required dependencies are installed in both places.

Here we will explain how to install your dependencies locally. For information on how to make your dependencies available on Flyte, see **Development cycle > ImageSpec**.

## Define your dependencies in your `pyproject.toml`

We recommend using the [`uv` tool](https://docs.astral.sh/uv/) for project and dependency management. When using `uv`, the best way to declare your dependencies is to list them under `dependencies` in your `pyproject.toml` file, like this:

```toml
[project]
name = "union-simple"
version = "0.1.0"
description = "A simple Flyte project"
readme = "README.md"
requires-python = ">=3.9,<3.13"
dependencies = ["union"]
```

## Create a Python virtual environment

Ensure that your Python virtual environment is properly set up with the required dependencies. Using `uv`, you can install the dependencies with the command:

```shell
$ uv sync
```

You can then activate the virtual environment with:

```shell
$ source .venv/bin/activate
```

> [!NOTE] `activate` vs `uv run`
> When running the Pyflyte CLI within your local project you must run it in the virtual environment _associated with_ that project.
>
> To run `pyflyte` within your project's virtual environment using `uv`, you can prefix it with the `uv run` command. For example:
>
> `uv run pyflyte ...`
>
> Alternatively, you can activate the virtual environment with `source .venv/bin/activate` and then run the `pyflyte` command directly.
> In our examples we assume that you are doing the latter.

Having installed your dependencies in your local environment, you can now **Development cycle > Running your code**. The next step is to ensure that the same dependencies are also **Development cycle > ImageSpec**.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/image-spec ===

# ImageSpec

During the development cycle you will want to be able to run your workflows both locally on your machine and remotely on Flyte, so you will need to ensure that the required dependencies are installed in both environments. Here we will explain how to set up the dependencies for your workflow to run remotely on Flyte. For information on how to make your dependencies available locally, see **Development cycle > Local dependencies**.
When a workflow is deployed to Flyte, each task is set up to run in its own container in the Kubernetes cluster. You specify the dependencies as part of the definition of the container image to be used for each task using the `ImageSpec` class. For example:

```python
import flytekit as fl

image_spec = fl.ImageSpec(
    name="say-hello-image",
    requirements="uv.lock",
)


@fl.task(container_image=image_spec)
def say_hello(name: str) -> str:
    return f"Hello, {name}!"


@fl.workflow
def hello_world_wf(name: str = "world") -> str:
    greeting = say_hello(name=name)
    return greeting
```

Here, the `ImageSpec` class is used to specify the container image to be used for the `say_hello` task.

* The `name` parameter specifies the name of the image. This name will be used to identify the image in the container registry.
* The `requirements` parameter specifies the path to a file (relative to the directory in which the `pyflyte run` or `pyflyte register` command is invoked) that specifies the dependencies to be installed in the image. The file may be:
  * A `requirements.txt` file.
  * A `uv.lock` file generated by the `uv sync` command.
  * A `poetry.lock` file generated by the `poetry install` command.
  * A `pyproject.toml` file.

When you execute the `pyflyte run` or `pyflyte register` command, Flyte will build the container image defined in the `ImageSpec` block (as well as registering the tasks and workflows defined in your code).

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/running-your-code ===

# Running your code

## Set up your development environment

If you have not already done so, follow the **Getting started** section to sign in to Flyte, and set up your local environment.

## CLI commands for running your code

The Pyflyte CLI and Flytectl CLI provide commands that allow you to deploy and run your code at different stages of the development cycle:

1. `pyflyte run`: For deploying and running a single script immediately in your local Python environment.
2. `pyflyte run --remote`: For deploying and running a single script immediately in the cloud on Flyte.
3. `pyflyte register`: For deploying multiple scripts to Flyte and running them from the Web interface.
4. `pyflyte package` and `flytectl register`: For deploying workflows to production and for scripting within a CI/CD pipeline.

> [!NOTE]
> In some cases, you may want to test your code in a local cluster before deploying it to Flyte.
> This step corresponds to using the commands 2, 3, or 4, but targeting your local cluster instead of Flyte.
> For more details, see **Development cycle > Running in a local cluster**.

## Registration pattern summary

The following diagram provides a summarized view of the different registration patterns:

![Registration patterns](../../_static/images/user-guide/development-cycle/running-your-code/registration-patterns.png)

## Running a script in local Python with `pyflyte run` {#running-a-script-in-local-python}

During the development cycle you will want to run a specific workflow or task in your local Python environment to test it. To quickly try out the code locally use `pyflyte run`:

```shell
$ pyflyte run workflows/example.py wf --name 'Albert'
```

Here you are invoking `pyflyte run` and passing the name of the Python file and the name of the workflow within that file that you want to run. In addition, you are passing the named parameter `name` and its value. This command is useful for quickly testing a workflow locally to check for basic errors.
For more details see [pyflyte run details](./details-of-pyflyte-run).

## Running a script on Flyte with `pyflyte run --remote`

To quickly run a workflow on Flyte, use `pyflyte run --remote`:

```shell
$ pyflyte run --remote --project basic-example --domain development workflows/example.py wf --name 'Albert'
```

Here we are invoking `pyflyte run --remote` and passing:

* The project, `basic-example`
* The domain, `development`
* The Python file, `workflows/example.py`
* The workflow within that file that you want to run, `wf`
* The named parameter `name`, and its value

This command will:

* Build the container image defined in your `ImageSpec`.
* Push the image to the container registry specified in that `ImageSpec`. Don't forget to make the image accessible to Flyte. For example, if you are using GitHub Container Registry, you will need to make the image public.
* Package up your code and deploy it to the specified project and domain in Flyte.
* Run the workflow on Flyte.

This command is useful for quickly deploying and running a specific workflow on Flyte. For more details see [pyflyte run details](./details-of-pyflyte-run).

## Running tasks through flytectl

This is a multi-step process where we create an execution spec file, update the spec file, and then create the execution.

### Generate execution spec file

```shell
$ flytectl launch task --project flytesnacks --domain development --name workflows.example.generate_normal_df --version v1
```

### Update the input spec file for arguments to the workflow

```yaml
iamRoleARN: 'arn:aws:iam::12345678:role/defaultrole'
inputs:
  n: 200
  mean: 0.0
  sigma: 1.0
kubeServiceAcct: ""
targetDomain: ""
targetProject: ""
task: workflows.example.generate_normal_df
version: "v1"
```

### Create execution using the exec spec file

```shell
$ flytectl create execution -p flytesnacks -d development --execFile exec_spec.yaml
```

### Monitor the execution by providing the execution id from create command

```shell
$ flytectl get execution -p flytesnacks -d development
```

## Running workflows through flytectl

Workflows on their own are not runnable directly. However, a launch plan is always bound to a workflow (at least the auto-created default launch plan) and you can use launch plans to `launch` a workflow. The default launch plan for a workflow has the same name as its workflow and all argument defaults are also identical.

Tasks can also be executed using the launch command. One difference between running a task and a workflow via launch plans is that launch plans cannot be associated with a task. This is to avoid attaching triggers and scheduling to tasks.

## Running launchplans through flytectl

This is a multi-step process where we create an execution spec file, update the spec file, and then create the execution. More details can be found in **Flytectl CLI > flytectl create > flytectl create execution**.
### Generate an execution spec file

```shell
$ flytectl get launchplan -p flytesnacks -d development myapp.workflows.example.my_wf --execFile exec_spec.yaml
```

### Update the input spec file for arguments to the workflow

```yaml
inputs:
  name: "adam"
```

### Create execution using the exec spec file

```shell
$ flytectl create execution -p flytesnacks -d development --execFile exec_spec.yaml
```

### Monitor the execution by providing the execution id from create command

```bash
$ flytectl get execution -p flytesnacks -d development
```

## Deploying your code to Flyte with `pyflyte register`

```shell
$ pyflyte register workflows --project basic-example --domain development
```

Here we are registering all the code in the `workflows` directory to the project `basic-example` in the domain `development`.

This command will:

* Build the container image defined in your `ImageSpec`.
* Package up your code and deploy it to the specified project and domain in Flyte.

The package will contain the code in the Python package located in the `workflows` directory. Note that the presence of the `__init__.py` file in this directory is necessary in order to make it a Python package.

The command will not run the workflow. You can run it from the Web interface. This command is useful for deploying your full set of workflows to Flyte for testing.

### Fast registration

`pyflyte register` packages up your code through a mechanism called fast registration. Fast registration is useful when you already have a container image that's hosted in your container registry of choice, and you change your workflow/task code without any changes in your system-level/Python dependencies. At a high level, fast registration:

* Packages and zips up the directory/file that you specify as the argument to `pyflyte register`, along with any files in the root directory of your project. The result of this is a tarball that is packaged into a `.tar.gz` file, which also includes the serialized task (in `protobuf` format) and workflow specifications defined in your workflow code.
* Registers the package to the specified cluster and uploads the tarball containing the user-defined code into the configured blob store (e.g. S3, GCS).

At workflow execution time, Flyte knows to automatically inject the zipped up task/workflow code into the running container, thereby overriding the user-defined tasks/workflows that were originally baked into the image.

> [!NOTE] `WORKDIR`, `PYTHONPATH`, and `PATH`
> When executing any of the above commands, the archive that gets created is extracted wherever the `WORKDIR` is set.
> This can be handled directly via the `WORKDIR` directive in a `Dockerfile`, or specified via `source_root` if using `ImageSpec`.
> This is important for discovering code and executables via `PATH` or `PYTHONPATH`.
> A common pattern for making your Python packages fully discoverable is to have a top-level `src` folder, adding that to your `PYTHONPATH`,
> and making all your imports absolute.
> This avoids having to "install" your Python project in the image at any point, e.g. via `pip install -e`.

## Inspecting executions

Flytectl supports inspecting executions by retrieving their details. For a deeper dive, refer to the [Reference](../../api-reference/flytectl-cli/_index) guide.

Monitor the execution by providing the execution id from the create command, which can be a task or workflow execution.
```shell
$ flytectl get execution -p flytesnacks -d development
```

For more details use the `--details` flag, which shows node executions along with the task executions on them.

```shell
$ flytectl get execution -p flytesnacks -d development --details
```

If you prefer to see a YAML/JSON view of the details, change the output format using the `-o` flag.

```shell
$ flytectl get execution -p flytesnacks -d development --details -o yaml
```

To see the results of the execution you can inspect the node closure `outputUri` in the detailed YAML output.

```shell
"outputUri": "s3://my-s3-bucket/metadata/propeller/flytesnacks-development-/n0/data/0/outputs.pb"
```

## Deploying your code to production

### Package your code with `pyflyte package`

The combination of `pyflyte package` and `flytectl register` is the standard way of deploying your code to production. This method is often used in scripts to **Development cycle > CI/CD deployment**.

First, package your workflows:

```shell
$ pyflyte --pkgs workflows package
```

This will create a tar file called `flyte-package.tgz` of the Python package located in the `workflows` directory. Note that the presence of the `__init__.py` file in this directory is necessary in order to make it a Python package.

> [!NOTE]
> You can specify multiple workflow directories using the following command:
>
> `pyflyte --pkgs DIR1 --pkgs DIR2 package ...`
>
> This is useful in cases where you want to register two different projects that you maintain in a single place.
>
> If you encounter a `ModuleNotFoundError` when packaging, use the `--source` option to include the correct source paths. For instance:
>
> `pyflyte --pkgs <packages> package --source ./src -f`

### Register the package with `flytectl register`

Once the code is packaged you register it using the `flytectl` CLI:

```shell
$ flytectl register files \
    --project basic-example \
    --domain development \
    --archive flyte-package.tgz \
    --version "$(git rev-parse HEAD)"
```

Let's break down what each flag is doing here:

* `--project`: The target Flyte project.
* `--domain`: The target domain. Usually one of `development`, `staging`, or `production`.
* `--archive`: This argument allows you to pass in a package file, which in this case is the `flyte-package.tgz` produced earlier.
* `--version`: This is a version string that can be any string, but we recommend using the Git SHA in general, especially in production use cases.

See [Flytectl CLI](../../api-reference/flytectl-cli/_index) for more details.

## Using pyflyte register versus pyflyte package + flytectl register

As a rule of thumb, `pyflyte register` works well when you are working on a single cluster and iterating quickly on your task/workflow code. On the other hand, `pyflyte package` and `flytectl register` is appropriate if you are:

* Working with multiple clusters, since it uses a portable package
* Deploying workflows to a production context
* Testing your workflows in your CI/CD infrastructure.

> [!NOTE] Programmatic Python API
> You can also perform the equivalent of the three methods of registration using a [FlyteRemote object](../development-cycle/union-remote/_index).

## Image management and registration method

The `ImageSpec` construct available in `flytekit` also has a mechanism to copy files into the image being built. Its behavior depends on the type of registration used:

* If fast register is used, then it's assumed that you don't also want to copy source files into the built image.
* If fast register is not used (which is the default for `pyflyte package`, or if `pyflyte register --copy none` is specified), then it's assumed that you do want source files copied into the built image.
* If your `ImageSpec` constructor specifies a `source_root` and the `copy` argument is set to something other than `CopyFileDetection.NO_COPY`, then files will be copied regardless of fast registration status.

## Building your own images

While we recommend that you use `ImageSpec` and the `envd` image builder on registration, you can, if you wish, build and deploy your own images separately.

You can start with `pyflyte init --template basic-template-dockerfile`; the resulting template project includes a `docker_build.sh` script that you can use to build and tag a container according to the recommended practice:

```shell
$ ./docker_build.sh
```

By default, the `docker_build.sh` script:

* Uses the `PROJECT_NAME` specified in the `pyflyte` command, which in this case is `my_project`.
* Will not use any remote registry.
* Uses the Git SHA to version your tasks and workflows.

You can override the default values with the following flags:

```shell
$ ./docker_build.sh -p PROJECT_NAME -r REGISTRY -v VERSION
```

For example, if you want to push your Docker image to GitHub's container registry you can specify the `-r ghcr.io` flag.

> [!NOTE]
> The `docker_build.sh` script is purely for convenience; you can always roll your own way of building Docker containers.

Once you've built the image, you can push it to the specified registry. For example, if you're using the GitHub container registry, do the following:

```shell
$ docker login ghcr.io
$ docker push TAG
```

## CI/CD with Flyte and GitHub Actions

You can use any of the commands we learned in this guide to register, execute, or test Flyte workflows in your CI/CD process.

Flyte provides two GitHub actions that facilitate this:

* `flyte-setup-action`: This action handles the installation of `flytectl` in your action runner.
* `flyte-register-action`: This action uses `flytectl register` under the hood to handle registration of packages, for example, the `.tgz` archives that are created by `pyflyte package`.

### Some CI/CD best practices

In the case where workflows are registered on each commit in your build pipelines, consider the following recommendations:

* **Versioning Strategy**: Determining the version of the build for different types of commits makes them consistent and identifiable. For commits on feature branches, use `{branch-name}-{short-commit-hash}`, and for the ones on main branches, use `main-{short-commit-hash}`. Use version numbers for released (tagged) versions (see the sketch after this list).
* **Workflow Serialization and Registration**: Workflows should be serialized and registered based on the versioning of the build and the container image. Depending on whether the build is for a feature branch or `main`, the registration domain should be adjusted accordingly.
* **Container Image Specification**: When managing multiple images across tasks within a workflow, use the `--image` flag during registration to specify which image to use. This avoids hardcoding the image within the task definition, promoting reusability and flexibility in workflows.
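For instance, the branch-based versioning convention above could be computed by a small helper in your pipeline and passed to `flytectl register --version`. This is a hypothetical sketch (the function name and branch-detection logic are illustrative and not part of the GitHub actions listed above):

```python
import subprocess

def registration_version(branch: str) -> str:
    # Derive a version string following the "{branch-name}-{short-commit-hash}"
    # convention described above; builds on "main" get a "main-" prefix.
    short_sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return f"{'main' if branch == 'main' else branch}-{short_sha}"

# Example: pass the result to `flytectl register files ... --version <value>`.
print(registration_version("my-feature-branch"))
```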
=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/overriding-parameters ===

# Overriding parameters

The `with_overrides` method allows you to specify parameter overrides on [tasks](../core-concepts/tasks/_index) and **Core concepts > Workflows > Subworkflows and sub-launch plans** at execution time. This is useful when you want to change the behavior of a task, subworkflow, or sub-launch plan without modifying the original definition.

## Task parameters

When calling a task, you can specify the following parameters in `with_overrides`:

* `accelerator`: Specify **Core concepts > Tasks > Task hardware environment > Accelerators**.
* `cache_serialize`: Enable **Core concepts > Caching**.
* `cache_version`: Specify the **Core concepts > Caching**.
* `cache`: Enable **Core concepts > Caching**.
* `container_image`: Specify a **Core concepts > Tasks > Task software environment > Local image building**.
* `interruptible`: Specify whether the task is **Core concepts > Tasks > Task hardware environment > Interruptible instances**.
* `limits`: Specify **Core concepts > Tasks > Task hardware environment > Customizing task resources**.
* `name`: Give a specific name to this task execution. This will appear in the workflow flowchart in the UI (see **Development cycle > Overriding parameters > below**).
* `node_name`: Give a specific name to the DAG node for this task. This will appear in the workflow flowchart in the UI (see **Development cycle > Overriding parameters > below**).
* `requests`: Specify **Core concepts > Tasks > Task hardware environment > Customizing task resources**.
* `retries`: Specify the **Core concepts > Tasks > Task parameters**.
* `task_config`: Specify a **Core concepts > Tasks > Task parameters**.
* `timeout`: Specify the **Core concepts > Tasks > Task parameters**.

For example, if you have a task that does not have caching enabled, you can use `with_overrides` to enable caching at execution time as follows:

```python
my_task(a=1, b=2, c=3).with_overrides(cache=True)
```

### Using `with_overrides` with `name` and `node_name`

Using `with_overrides` with `name` on a task is a particularly useful feature. For example, you can use `with_overrides(name="my_task")` to give a specific name to a task execution, which will appear in the UI. The name can be chosen or generated at invocation time without modifying the task definition.

```python
@fl.workflow
def wf() -> int:
    my_task(a=1, b=1, c=1).with_overrides(name="my_task_1")
    my_task(a=2, b=2, c=2).with_overrides(name="my_task_2", node_name="my_node_2")
    return my_task(a=1, b=1, c=1)
```

The above code would produce the following workflow display in the UI:

![Overriding name](../../_static/images/user-guide/development-cycle/overriding-parameters/override-name.png)

There is also a related parameter called `node_name` that can be used to give a specific name to the DAG node for this task. The DAG node name is usually autogenerated as `n0`, `n1`, `n2`, etc. It appears in the `node` column of the workflow table. Overriding `node_name` results in the autogenerated name being replaced by the specified name:

![Overriding node name](../../_static/images/user-guide/development-cycle/overriding-parameters/override-node-name.png)

Note that the `node_name` was specified as `my_node_2` in the code but appears as `my-node-2` in the UI. This is due to the fact that Kubernetes node names cannot contain underscores. Flyte automatically alters the name to be Kubernetes-compliant.
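Beyond `name` and `node_name`, several of the parameters listed above can be combined in a single `with_overrides` call. The following is a minimal sketch (assuming a `my_task` like the one used earlier) that bumps resources and retries and renames the execution, all without touching the task definition:

```python
import flytekit as fl

@fl.task
def my_task(a: int, b: int, c: int) -> int:
    return a + b + c

@fl.workflow
def wf() -> int:
    # Override resources, retry count, and the display name for this one invocation only.
    return my_task(a=1, b=2, c=3).with_overrides(
        requests=fl.Resources(cpu="2", mem="4Gi"),
        limits=fl.Resources(cpu="2", mem="4Gi"),
        retries=3,
        name="my_task_big",
    )
```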
## Subworkflow and sub-launch plan parameters

When calling a workflow or launch plan from within a high-level workflow (in other words, when invoking a subworkflow or sub-launch plan), you can specify the following parameters in `with_overrides`:

* `cache_serialize`: Enable **Core concepts > Caching**.
* `cache_version`: Specify the **Core concepts > Caching**.
* `cache`: Enable **Core concepts > Caching**.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/run-details ===

# Run details

The `pyflyte run` command is used to run a specific workflow or task in your local Python environment or on Flyte. In this section we will discuss some details of how and why to use it.

## Passing parameters

`pyflyte run` enables you to execute a specific workflow using the syntax:

```shell
$ pyflyte run <script> <workflow>
```

Keyword arguments can be supplied to `pyflyte run` by passing them in like this:

```shell
--<keyword> <value>
```

For example, above we invoked `pyflyte run` with script `example.py`, workflow `wf`, and named parameter `name`:

```shell
$ pyflyte run example.py wf --name 'Albert'
```

The value `Albert` is passed for the parameter `name`.

With `snake_case` argument names, you have to convert them to `kebab-case`. For example, if the code were altered to accept a `last_name` parameter, then the following command:

```shell
$ pyflyte run example.py wf --last-name 'Einstein'
```

passes the value `Einstein` for that parameter.

## Why `pyflyte run` rather than `python`?

You could add a `main` guard at the end of the script like this:

```python
if __name__ == "__main__":
    training_workflow(hyperparameters={"C": 0.1})
```

This would let you run it with `python example.py`, though you have to hard-code your arguments. It would become even more verbose if you wanted to pass in your arguments:

```python
if __name__ == "__main__":
    import json
    from argparse import ArgumentParser

    parser = ArgumentParser()
    parser.add_argument("--hyperparameters", type=json.loads)
    ...  # add the other options

    args = parser.parse_args()
    training_workflow(hyperparameters=args.hyperparameters)
```

`pyflyte run` is less verbose and more convenient for running workflows with arguments.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/debugging-with-interactive-tasks ===

# Debugging with interactive tasks

With interactive tasks you can inspect and debug live task code directly in the UI in an embedded Visual Studio Code IDE.

## Enabling interactive tasks in your code

To enable interactive tasks, you need to:

* Include `flytekitplugins-flyteinteractive` as a dependency
* Use the `@vscode` decorator on the tasks you want to make interactive.

The `@vscode` decorator, when applied, converts a task into a Visual Studio Code server during runtime. This process overrides the standard execution of the task's function body, initiating a command to start a Visual Studio Code server instead.

> [!NOTE] No need for ingress or port forwarding
> The Flyte interactive tasks feature is an adaptation of the open-source
> **External service backend plugins > FlyteInteractive**.
> It improves on the open-source version by removing the need for ingress
> configuration or port forwarding, providing a more seamless debugging
> experience.

## Basic example

The following example demonstrates interactive tasks in a simple workflow.
### requirements.txt

This `requirements.txt` file is used by all the examples in this section:

```text
flytekit
flytekitplugins-flyteinteractive
```

### example.py

```python
"""Flyte workflow example of interactive tasks (@vscode)"""

import flytekit as fl
from flytekitplugins.flyteinteractive import vscode

image = fl.ImageSpec(
    registry="",
    name="interactive-tasks-example",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="requirements.txt"
)

@fl.task(container_image=image)
@vscode
def say_hello(name: str) -> str:
    s = f"Hello, {name}!"
    return s

@fl.workflow
def wf(name: str = "world") -> str:
    greeting = say_hello(name=name)
    return greeting
```

## Register and run the workflow

To register the code to a project on Flyte and run the workflow, follow the directions in **Development cycle > Running your code**.

## Access the IDE

1. Select the first task in the workflow page (in this example the task is called `say_hello`). The task info pane will appear on the right side of the page.
2. Wait until the task is in the **Running** state and the **VSCode (User)** link appears.
3. Click the **VSCode (User)** link.

![VSCode link](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/vscode-link.png)

## Inspect the task code

Once the IDE opens, you will be able to see your task code in the editor.

![Inspect code](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/inspect-code.png)

## Interactive debugging

To run the task in VSCode, click the _Run and debug_ symbol on the left rail of the IDE and select the **Interactive Debugging** configuration.

![Interactive debugging](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/interactive-debugging.png)

Click the **Play** button beside the configuration drop-down to run the task. This will run your task with inputs from the previous task. To inspect intermediate states, set breakpoints in the Python code and use the debugger for tracing.

> [!NOTE] No task output written to Flyte storage
> It's important to note that during the debugging phase the task runs entirely within VSCode and does not write the output to Flyte storage.

## Update your code

You can edit your code in the VSCode environment and run the task again to see the changes. Note, however, that the changes will not be automatically persisted anywhere. You will have to manually copy and paste the changes back to your local environment.

## Resume task

After you finish debugging, you can resume your task with updated code by executing the **Resume Task** configuration. This will terminate the code server, run the task with inputs from the previous task, and write the output to Flyte storage.

> [!NOTE] Remember to persist your code
> Remember to persist your code (for example, by checking it into GitHub) before resuming the task, since you will lose the connection to the VSCode server afterwards.
![Resume task](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/resume-task.png)

## Auxiliary Python files

You will notice that aside from your code, there are some additional files in the VSCode file explorer that have been automatically generated by the system:

### flyteinteractive_interactive_entrypoint.py

The `flyteinteractive_interactive_entrypoint.py` script implements the **Interactive Debugging** action that we used above:

![Interactive entrypoint](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/flyteinteractive-interactive-entrypoint-py.png)

### flyteinteractive_resume_task.py

The `flyteinteractive_resume_task.py` script implements the **Resume Task** action that we used above:

![Resume task](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/flyteinteractive-resume-task-py.png)

### launch.json

The `launch.json` file in the `.vscode` directory configures the **Interactive Debugging** and **Resume Task** actions.

![launch.json](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/launch-json.png)

## Integrated terminal

In addition to using the convenience functions defined by the auxiliary files, you can also run your Python script directly from the integrated terminal with `python` followed by the script name (in this example, `python hello.py`).

![Interactive terminal](../../_static/images/user-guide/development-cycle/debugging-with-interactive-tasks/interactive-terminal.png)

## Install extensions

As with local VSCode, you can install a variety of extensions to assist development. Available extensions differ from official VSCode for legal reasons and are hosted on the [Open VSX Registry](https://open-vsx.org/).

Python and Jupyter extensions are installed by default. Additional extensions can be added by defining a configuration object and passing it to the `@vscode` decorator, as shown below:

### example-extensions.py

```python
"""Flyte workflow example of interactive tasks (@vscode) with extensions"""

import flytekit as fl
from flytekitplugins.flyteinteractive import COPILOT_EXTENSION, VscodeConfig, vscode

image = fl.ImageSpec(
    registry="",
    name="interactive-tasks-example",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="requirements.txt"
)

config = VscodeConfig()
config.add_extensions(COPILOT_EXTENSION)  # Use predefined URL
config.add_extensions(
    "https://open-vsx.org/api/vscodevim/vim/1.27.0/file/vscodevim.vim-1.27.0.vsix"
)  # Copy raw URL from Open VSX

@fl.task(container_image=image)
@vscode(config=config)
def say_hello(name: str) -> str:
    s = f"Hello, {name}!"
    return s

@fl.workflow
def wf(name: str = "world") -> str:
    greeting = say_hello(name=name)
    return greeting
```

## Manage resources

To manage resources, the VSCode server is terminated after a period of idleness (no active HTTP connections). Idleness is monitored via a heartbeat file. The `max_idle_seconds` parameter can be used to set the maximum number of seconds the VSCode server can be idle before it is terminated.

### example-manage-resources.py

```python
"""Flyte workflow example of interactive tasks (@vscode) with max_idle_seconds"""

import flytekit as fl
from flytekitplugins.flyteinteractive import vscode

image = fl.ImageSpec(
    registry="",
    name="interactive-tasks-example",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="requirements.txt"
)

@fl.task(container_image=image)
@vscode(max_idle_seconds=60000)
def say_hello(name: str) -> str:
    s = f"Hello, {name}!"
    return s

@fl.workflow
def wf(name: str = "world") -> str:
    greeting = say_hello(name=name)
    return greeting
```

## Pre and post hooks

Interactive tasks also allow the registration of functions to be executed both before and after VSCode starts. This can be used for tasks requiring setup or cleanup.

### example-pre-post-hooks.py

```python
"""Flyte workflow example of interactive tasks (@vscode) with pre and post hooks"""

import flytekit as fl
from flytekitplugins.flyteinteractive import vscode

image = fl.ImageSpec(
    registry="",
    name="interactive-tasks-example",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="requirements.txt"
)

def set_up_proxy():
    print("set up")

def push_code():
    print("push code")

@fl.task(container_image=image)
@vscode(pre_execute=set_up_proxy, post_execute=push_code)
def say_hello(name: str) -> str:
    s = f"Hello, {name}!"
    return s

@fl.workflow
def wf(name: str = "world") -> str:
    greeting = say_hello(name=name)
    return greeting
```

## Only initiate VSCode on task failure

The system can also be set to only initiate VSCode _after a task failure_, preventing task termination and thus enabling inspection. This is done by setting the `run_task_first` parameter to `True`.

### example-run-task-first.py

```python
"""Flyte workflow example of interactive tasks (@vscode) with run_task_first"""

import flytekit as fl
from flytekitplugins.flyteinteractive import vscode

image = fl.ImageSpec(
    registry="",
    name="interactive-tasks-example",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="requirements.txt"
)

@fl.task(container_image=image)
@vscode(run_task_first=True)
def say_hello(name: str) -> str:
    s = f"Hello, {name}!"
    return s

@fl.workflow
def wf(name: str = "world") -> str:
    greeting = say_hello(name=name)
    return greeting
```

## Debugging execution issues

The inspection of task and workflow execution provides log links to debug things further. Using the `--details` flag you can view node executions with log links.

```shell
└── n1 - FAILED - 2021-06-30 08:51:07.3111846 +0000 UTC - 2021-06-30 08:51:17.192852 +0000 UTC
    └── Attempt :0
        └── Task - FAILED - 2021-06-30 08:51:07.3111846 +0000 UTC - 2021-06-30 08:51:17.192852 +0000 UTC
            └── Logs :
                └── Name :Kubernetes Logs (User)
                └── URI :http://localhost:30082/#/log/flytectldemo-development/f3a5a4034960f4aa1a09-n1-0/pod?namespace=flytectldemo-development
```

Additionally, you can check the pods launched in the `<project>-<domain>` namespace:

```shell
$ kubectl get pods -n <project>-<domain>
```

The launched pods will have a prefix of the execution name along with a suffix of `nodeId`:

```shell
NAME                        READY   STATUS         RESTARTS   AGE
f65009af77f284e50959-n0-0   0/1     ErrImagePull   0          18h
```

For example, above we see that the `STATUS` indicates an issue with pulling the image.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/task-resource-validation ===

# Task resource validation

In Flyte, when you attempt to execute a workflow with unsatisfiable resource requests, we fail the execution immediately rather than allowing it to queue forever. We intercept execution creation requests in the executions service to validate that their resource requirements can be met, and fast-fail if not.
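For example, a task whose resource requests exceed anything available in the cluster will be rejected when you try to create an execution. The snippet below is a hypothetical sketch (the task name simply echoes the error message that follows; the request values are deliberately unsatisfiable):

```python
import flytekit as fl

# Deliberately unsatisfiable requests: no node offers this much CPU or memory,
# so execution creation is rejected up front instead of queueing forever.
@fl.task(requests=fl.Resources(cpu="1000", mem="10Ti"))
def fotd_directory() -> str:
    return "done"
```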
A failed validation returns a message similar to

```text
Request failed with status code 400 rpc error: code = InvalidArgument desc = no node satisfies task 'workflows.fotd.fotd_directory' resource requests
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/running-in-a-local-cluster ===

# Running in a local cluster

## Running in a local Kubernetes cluster

Ultimately you will be running your workflows in a Kubernetes cluster in Flyte. But it can be handy to try out a workflow in a cluster on your local machine.

First, ensure that you have [Docker](https://www.docker.com/products/docker-desktop/) (or a similar OCI-compliant container engine) installed locally and that _the daemon is running_.

Then start the demo cluster using `flytectl`:

```shell
$ flytectl demo start
```

### Configuration

When `flytectl` starts the cluster in your local container engine it also writes configuration information to the directory `~/.flyte/`. Most importantly, it creates the file `~/.flyte/config-sandbox.yaml`. This file holds (among other things) the location of the Kubernetes cluster to which we will be deploying the workflow:

```yaml
admin:
  endpoint: localhost:30080
  authType: Pkce
  insecure: true
console:
  endpoint: http://localhost:30080
logger:
  show-source: true
  level: 0
```

Right now this file indicates that the target cluster is your local Docker instance (`localhost:30080`), but later we will change it to point to your Flyte cluster.

Later invocations of `flytectl` or `pyflyte` will need to know the location of the target cluster. This can be provided in two ways:

1. Explicitly passing the location of the config file on the command line:
   * `flytectl --config ~/.flyte/config-sandbox.yaml`
   * `pyflyte --config ~/.flyte/config-sandbox.yaml`
2. Setting the environment variable `FLYTECTL_CONFIG` to the location of the config file:
   * `export FLYTECTL_CONFIG=~/.flyte/config-sandbox.yaml`

> [!NOTE]
> In this guide, we assume that you have set the `FLYTECTL_CONFIG` environment variable in your shell to the location of the configuration file.

### Start the workflow

Now you can run your workflow in the local cluster simply by adding the `--remote` flag to your `pyflyte` command:

```shell
$ pyflyte run --remote \
    workflows/example.py \
    training_workflow \
    --hyperparameters '{"C": 0.1}'
```

The output supplies a URL to your workflow execution in the UI.

### Inspect the results

Navigate to the URL produced by `pyflyte run` to see your workflow in the Flyte UI.

## Local cluster with default image

```shell
$ pyflyte run --remote my_file.py my_workflow
```

_Where `pyflyte` is configured to point to the local cluster started with `flytectl demo start`._

* Task code runs in the environment of the default image in your local cluster.
* Python code is dynamically overlaid into the container at runtime.
* Only supports Python code whose dependencies are installed in the default image.
* Includes a local S3.
* Supports some plugins but not all.
* Single workflow runs immediately.
* Workflow is registered to a default project.
* Useful for demos.

## Local cluster with custom image

```shell
$ pyflyte run --remote \
    --image my_cr.io/my_org/my_image:latest \
    my_file.py \
    my_workflow
```

_Where `pyflyte` is configured to point to the local cluster started with `flytectl demo start`._

* Task code runs in the environment of your custom image (`my_cr.io/my_org/my_image:latest`) in your local cluster.
* Python code is dynamically overlaid into the container at runtime.
* Supports any Python dependencies you wish, since you have full control of the image.
* Includes a local S3.
* Supports some plugins but not all.
* Single workflow runs immediately.
* Workflow is registered to a default project.
* Useful for advanced testing during the development cycle.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/jupyter-notebooks ===

# Jupyter notebooks

Flyte supports the development, running, and debugging of tasks and workflows in an interactive Jupyter notebook environment, which accelerates the iteration speed when building data- or machine learning-driven applications.

## Write your workflows and tasks in cells

When building tasks and workflows in a notebook, you write the code in cells as you normally would. From those cells you can run the code locally (i.e., in the notebook itself, not on Flyte) by clicking the run button, as you would in any notebook.

## Enable the notebook to register workflows to Flyte

To enable the tasks and workflows in your notebook to be easily registered and run on your Flyte instance, you need to set up an _interactive_ `FlyteRemote` object and then use it to invoke the remote executions.

First, in a cell, create an interactive FlyteRemote object:

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),
    default_project="default",
    default_domain="development",
    interactive_mode_enabled=True,
)
```

The `interactive_mode_enabled` flag must be set to `True` when running in a Jupyter notebook environment, enabling interactive registration and execution of workflows.

Next, set up the execution invocation in another cell:

```python
execution = remote.execute(my_task, inputs={"name": "Joe"})
execution = remote.execute(my_wf, inputs={"name": "Anne"})
```

The interactive FlyteRemote client re-registers an entity whenever it's redefined in the notebook, including when you re-execute a cell containing the entity definition, even if the entity remains unchanged. This behavior facilitates iterative development and debugging of tasks and workflows in a Jupyter notebook.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/decks ===

# Decks

Decks lets you display customized data visualizations from within your task code. Decks are rendered as HTML and appear right in the Flyte UI when you run your workflow.

> [!NOTE]
> Decks is an opt-in feature; to enable it, set `enable_deck` to `True` in the task parameters.

To begin, import the dependencies:

```python
import flytekit as fl
from flytekit.deck.renderer import MarkdownRenderer
from sklearn.decomposition import PCA
import plotly.express as px
import plotly
```

> [!NOTE]
> The renderers are packaged separately from `flytekit` itself.
> To enable the `MarkdownRenderer` imported above
> you first have to install the package `flytekitplugins-deck-standard`
> in your local Python environment and include it in your `ImageSpec` (as shown below).

We create a new deck named `pca` and render Markdown content along with a [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) plot.
Now, declare the required dependencies in an `ImageSpec`:

```python
custom_image = fl.ImageSpec(
    packages=[
        "flytekitplugins-deck-standard",
        "markdown",
        "pandas",
        "pillow",
        "plotly",
        "pyarrow",
        "scikit-learn",
        "ydata_profiling",
    ],
)
```

Next, we define the task that will construct the figure and create the Deck:

```python
@fl.task(enable_deck=True, container_image=custom_image)
def pca_plot():
    iris_df = px.data.iris()
    X = iris_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
    pca = PCA(n_components=3)
    components = pca.fit_transform(X)
    total_var = pca.explained_variance_ratio_.sum() * 100
    fig = px.scatter_3d(
        components,
        x=0,
        y=1,
        z=2,
        color=iris_df["species"],
        title=f"Total Explained Variance: {total_var:.2f}%",
        labels={"0": "PC 1", "1": "PC 2", "2": "PC 3"},
    )
    main_deck = fl.Deck("pca", MarkdownRenderer().to_html("### Principal Component Analysis"))
    main_deck.append(plotly.io.to_html(fig))
```

Note the usage of `append` to append the Plotly figure to the Markdown deck.

The following is the expected output containing the path to the `deck.html` file:

```
{"asctime": "2023-07-11 13:16:04,558", "name": "flytekit", "levelname": "INFO", "message": "pca_plot task creates flyte deck html to file:///var/folders/6f/xcgm46ds59j7g__gfxmkgdf80000gn/T/flyte-0_8qfjdd/sandbox/local_flytekit/c085853af5a175edb17b11cd338cbd61/deck.html"}
```

![Union deck plot](../../_static/images/user-guide/development-cycle/decks/flyte-deck-plot-local.webp)

Once you execute this task on the Flyte instance, you can access the deck by going to the task view and clicking the _Deck_ button:

![Union deck button](../../_static/images/user-guide/development-cycle/decks/flyte-deck-button.png)

## Deck tabs

Each Deck has a minimum of three tabs: input, output, and default. The input and output tabs are used to render the input and output data of the task, while the default deck can be used to create custom renderings such as line plots, scatter plots, Markdown text, etc. Additionally, you can create other tabs as well.

## Deck renderers

> [!NOTE]
> The renderers are packaged separately from `flytekit` itself.
> To enable them you first have to install the package `flytekitplugins-deck-standard`
> in your local Python environment and include it in your `ImageSpec`.

### Frame profiling renderer

The frame profiling renderer creates a profile report from a Pandas DataFrame.

```python
import flytekit as fl
import pandas as pd
from flytekitplugins.deck.renderer import FrameProfilingRenderer

@fl.task(enable_deck=True, container_image=custom_image)
def frame_renderer() -> None:
    df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
    fl.Deck("Frame Renderer", FrameProfilingRenderer().to_html(df=df))
```

![Frame renderer](../../_static/images/user-guide/development-cycle/decks/flyte-decks-frame-renderer.png)

### Top-frame renderer

The top-frame renderer renders a DataFrame as an HTML table.

```python
import flytekit as fl
from typing import Annotated
from flytekit.deck import TopFrameRenderer

@fl.task(enable_deck=True, container_image=custom_image)
def top_frame_renderer() -> Annotated[pd.DataFrame, TopFrameRenderer(1)]:
    return pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
```

![Top frame renderer](../../_static/images/user-guide/development-cycle/decks/flyte-decks-top-frame-renderer.png)

### Markdown renderer

The Markdown renderer converts a Markdown string into HTML.
```python
import flytekit as fl
from flytekit.deck import MarkdownRenderer

@fl.task(enable_deck=True, container_image=custom_image)
def markdown_renderer() -> None:
    fl.current_context().default_deck.append(
        MarkdownRenderer().to_html("You can install flytekit using this command: ```import flytekit```")
    )
```

![Markdown renderer](../../_static/images/user-guide/development-cycle/decks/flyte-decks-markdown-renderer.png)

### Box renderer

The box renderer groups rows of a DataFrame together into a box-and-whisker mark to visualize their distribution.

Each box extends from the first quartile (Q1) to the third quartile (Q3). The median (Q2) is indicated by a line within the box. Typically, the whiskers extend to the edges of the box, plus or minus 1.5 times the interquartile range (IQR: Q3 - Q1).

```python
import flytekit as fl
from flytekitplugins.deck.renderer import BoxRenderer

@fl.task(enable_deck=True, container_image=custom_image)
def box_renderer() -> None:
    iris_df = px.data.iris()
    fl.Deck("Box Plot", BoxRenderer("sepal_length").to_html(iris_df))
```

![Box renderer](../../_static/images/user-guide/development-cycle/decks/flyte-decks-box-renderer.png)

### Image renderer

The image renderer converts a `FlyteFile` or `PIL.Image.Image` object into an HTML-displayable image, where the image data is encoded as a base64 string.

```python
import flytekit as fl
from flytekitplugins.deck.renderer import ImageRenderer

@fl.task(enable_deck=True, container_image=custom_image)
def image_renderer(image: fl.FlyteFile) -> None:
    fl.Deck("Image Renderer", ImageRenderer().to_html(image_src=image))

@fl.workflow
def image_renderer_wf(image: fl.FlyteFile = "https://bit.ly/3KZ95q4",) -> None:
    image_renderer(image=image)
```

![Image renderer](../../_static/images/user-guide/development-cycle/decks/flyte-decks-image-renderer.png)

### Table renderer

The table renderer converts a Pandas DataFrame into an HTML table.

```python
import flytekit as fl
from flytekitplugins.deck.renderer import TableRenderer

@fl.task(enable_deck=True, container_image=custom_image)
def table_renderer() -> None:
    fl.Deck(
        "Table Renderer",
        TableRenderer().to_html(df=pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]}), table_width=50),
    )
```

![Table renderer](../../_static/images/user-guide/development-cycle/decks/flyte-decks-table-renderer.png)

### Contribute to renderers

Don't hesitate to integrate a new renderer into [renderer.py](https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-deck-standard/flytekitplugins/deck/renderer.py) if your deck renderers can enhance data visibility. Feel encouraged to open a pull request and play a part in enhancing the Flyte deck renderer ecosystem!

### Custom renderers

You can also create your own custom renderer. A renderer is essentially a class with a `to_html` method. Here we create a custom renderer that summarizes the data from a Pandas `DataFrame` instead of showing raw values.
```python
class DataFrameSummaryRenderer:
    def to_html(self, df: pd.DataFrame) -> str:
        assert isinstance(df, pd.DataFrame)
        return df.describe().to_html()
```

Then we can use the `Annotated` type to override the default renderer of the `pandas.DataFrame` type:

```python
try:
    from typing import Annotated
except ImportError:
    from typing_extensions import Annotated

@fl.task(enable_deck=True)
def iris_data(
    sample_frac: Optional[float] = None,
    random_state: Optional[int] = None,
) -> Annotated[pd.DataFrame, DataFrameSummaryRenderer()]:
    data = px.data.iris()
    if sample_frac is not None:
        data = data.sample(frac=sample_frac, random_state=random_state)

    md_text = (
        "# Iris Dataset\n"
        "This task loads the iris dataset using the `plotly` package."
    )
    fl.current_context().default_deck.append(MarkdownRenderer().to_html(md_text))
    fl.Deck("box plot", BoxRenderer("sepal_length").to_html(data))
    return data
```

## Streaming Decks

You can stream a Deck directly using `Deck.publish()`:

```python
import flytekit as fl

@fl.task(enable_deck=True)
def t_deck():
    fl.Deck.publish()
```

This will create a live deck where you can click the refresh button and see the deck update until the task succeeds.

### Union Deck Succeed Video

📺 [Watch on YouTube](https://www.youtube.com/watch?v=LJaBP0mdFeE)

### Union Deck Fail Video

📺 [Watch on YouTube](https://www.youtube.com/watch?v=xaBF6Jlzjq0)

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/migrating_from_airflow_to_flyte ===

# Migrating from Airflow to Flyte

> [!WARNING]
> Many Airflow operators and sensors have been tested on Flyte, but some may not work as expected. If you encounter any issues, please file an [issue](https://github.com/flyteorg/flyte/issues) or reach out to the Flyte community on [Slack](https://slack.flyte.org/).

Flyte can compile Airflow tasks into Flyte tasks without changing code, which allows you to migrate your Airflow DAGs to Flyte with minimal effort.

In addition to migration capabilities, Flyte users can seamlessly integrate Airflow tasks into their workflows, leveraging the ecosystem of Airflow operators and sensors. By combining the robust Airflow ecosystem with Flyte's capabilities such as scalability, versioning, and reproducibility, users can run more complex data and machine learning workflows with ease. For more information, see the **Connectors > Airflow connector**.

Even if you're already using Flyte and have no intentions of migrating from Airflow, you can still incorporate Airflow tasks into your Flyte workflows. For instance, Airflow offers support for Google Cloud [Dataproc Operators](https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/dataproc.html), facilitating the execution of Spark jobs on Google Cloud Dataproc clusters. Rather than developing a custom plugin in Flyte, you can seamlessly integrate Airflow's Dataproc Operators into your Flyte workflows to execute Spark jobs.

## Prerequisites

- Install `flytekitplugins-airflow` in your Python environment.
- Enable the **Connectors > Airflow connector** in your Flyte cluster.

## Steps

### 1. Define your Airflow tasks in a Flyte workflow

Flytekit compiles Airflow tasks into Flyte tasks, so you can use any Airflow sensor or operator in a Flyte workflow:

```python
from flytekit import task, workflow
from airflow.sensors.filesystem import FileSensor

@task
def say_hello() -> str:
    return "Hello, World!"
@workflow
def airflow_wf():
    flyte_task = say_hello()
    airflow_task = FileSensor(task_id="sensor", filepath="/")
    airflow_task >> flyte_task

if __name__ == "__main__":
    print(f"Running airflow_wf() {airflow_wf()}")
```

### 2. Test your workflow locally

> [!NOTE]
> Before running your workflow locally, you must configure the [Airflow connection](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html) by setting the `AIRFLOW_CONN_{CONN_ID}` environment variable. For example,

```bash
export AIRFLOW_CONN_MY_PROD_DATABASE='my-conn-type://login:password@host:port/schema?param1=val1&param2=val2'
```

Although Airflow doesn't support local execution, you can run your workflow that contains Airflow tasks locally, which is helpful for testing and debugging your tasks before moving to production.

```bash
AIRFLOW_CONN_FS_DEFAULT="/" pyflyte run workflows.py airflow_wf
```

> [!WARNING]
> Some Airflow operators may require certain permissions to execute. For instance, `DataprocCreateClusterOperator` requires the `dataproc.clusters.create` permission.
> When running Airflow tasks locally, you may need to set the necessary permissions locally for the task to execute successfully.

### 3. Move your workflow to production

> [!NOTE]
> In production, we recommend storing connections in a [secrets backend](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html).
> Make sure the connector pod has the right permission (IAM role) to access the secret from the external secrets backend.

After you have tested your workflow locally, you can execute it on a Flyte cluster using the `--remote` flag. In this case, Flyte creates a pod in the Kubernetes cluster to run the `say_hello` task, and then runs your Airflow `FileSensor` task on the Airflow connector.

```bash
pyflyte run --remote workflows.py airflow_wf
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/remote-management ===

# FlyteRemote

The `FlyteRemote` Python API supports functionality similar to that of the Pyflyte CLI, enabling you to manage Flyte workflows, tasks, launch plans and artifacts from within your Python code.

> [!NOTE]
> The primary use case of `FlyteRemote` is to automate the deployment of Flyte entities. As such, it is intended for use within scripts *external* to actual Flyte workflow and task code, for example CI/CD pipeline scripts.
>
> In other words: _Do not use `FlyteRemote` within task code._

## Creating a `FlyteRemote` object

Ensure that you have the Flytekit SDK installed, import the `FlyteRemote` class and create the object like this:

```python
import flytekit as fl

remote = fl.FlyteRemote()
```

By default, when created with a no-argument constructor, `FlyteRemote` will use the prevailing configuration in the local environment to connect to Flyte, that is, the same configuration as would be used by the Pyflyte CLI in that environment (see **Development cycle > FlyteRemote > Pyflyte CLI configuration search path**).

In the default case, as with the Pyflyte CLI, all operations will be applied to the default project, `flytesnacks`, and default domain, `development`.

Alternatively, you can initialize `FlyteRemote` by explicitly specifying a `flytekit.configuration.Config` object with connection information to a Flyte instance, a project, and a domain.
Additionally, the constructor supports specifying a file upload location (equivalent to a default raw data prefix):

```python
import flytekit as fl
from flytekit.configuration import Config

remote = fl.FlyteRemote(
    config=Config.for_endpoint(endpoint="union.example.com"),
    default_project="my-project",
    default_domain="my-domain",
    data_upload_location="://my-bucket/my-prefix",
)
```

Here we use the `Config.for_endpoint` method to specify the URL to connect to. There are other ways to configure the `Config` object. In general, you have all the same options as you would when specifying a connection for the Pyflyte CLI using a `config.yaml` file.

### Authenticating using a client secret

In some cases, you may be running a script with `FlyteRemote` in a CI/CD pipeline or via SSH, where you don't have access to a browser for the default authentication flow. In such scenarios, you can use the **Development cycle > Authentication > 3. ClientSecret (Best for CI/CD and Automation)** authentication method to establish a connection to Flyte.

After **Development cycle > Managing API keys**, you can initialize `FlyteRemote` as follows:

```python
import flytekit as fl
from flytekit.configuration import Config, PlatformConfig

remote = fl.FlyteRemote(
    config=Config(
        platform=PlatformConfig(
            endpoint="union.example.com",
            insecure=False,
            client_id="",  # this is the api-key name
            client_credentials_secret="",  # this is the api-key
            auth_mode="client_credentials",
        )
    ),
)
```

For details, see **Development cycle > FlyteRemote > the API docs for `flytekit.configuration.Config`**.

## Subpages

- **Development cycle > FlyteRemote > FlyteRemote examples**

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/remote-management/remote-examples ===

# FlyteRemote examples

## Registering and running a workflow

In the following example we register and run a workflow and retrieve its output:

```shell
├── remote.py
└── workflow
    ├── __init__.py
    └── example.py
```

The workflow code that will be registered and run on Flyte resides in the `workflow` directory and consists of an empty `__init__.py` file and the workflow and task code in `example.py`:

```python
import os

import flytekit as fl

@fl.task()
def create_file(message: str) -> fl.FlyteFile:
    with open("data.txt", "w") as f:
        f.write(message)
    return fl.FlyteFile(path="data.txt")

@fl.workflow
def my_workflow(message: str) -> fl.FlyteFile:
    f = create_file(message)
    return f
```

The file `remote.py` contains the `FlyteRemote` logic. It is not part of the workflow code, and is meant to be run on your local machine.

```python
import flytekit as fl

from workflow.example import my_workflow

def run_workflow():
    remote = fl.FlyteRemote()
    remote.fast_register_workflow(entity=my_workflow)
    execution = remote.execute(
        entity=my_workflow,
        inputs={"message": "Hello, world!"},
        wait=True)
    output = execution.outputs["o0"]
    print(output)
    with open(output, "r") as f:
        read_lines = f.readlines()
    print(read_lines)
```

The `my_workflow` workflow and the `create_file` task are registered and run. Once the workflow completes, the output is passed back to the `run_workflow` function and printed out.

The output is also available via the UI, in the **Outputs** tab of the `create_file` task details view:

![Outputs](../../../_static/images/user-guide/development-cycle/union-remote/outputs.png)

The steps above demonstrate the simplest way of registering and running a workflow with `FlyteRemote`.
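To actually invoke this driver, you could add a standard entry point to `remote.py` (a minimal, hypothetical addition not shown in the layout above) and run it with `python remote.py`:

```python
# Hypothetical entry point for remote.py: register the workflow, run it on Flyte,
# and print its output by calling the function defined above.
if __name__ == "__main__":
    run_workflow()
```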
For more options and details see [Reference > FlyteRemote](../../../api-reference/flytekit-sdk/packages/flytekit.remote.remote).

## Fetching outputs

By default, `FlyteRemote.execute` is non-blocking, but you can also pass in `wait=True` to make it synchronously wait for the task or workflow to complete, as we did above.

You can print out the Flyte console URL corresponding to your execution with:

```python
print(f"Execution url: {remote.generate_console_url(execution)}")
```

And you can synchronize the state of the execution object with the remote state with the `sync()` method:

```python
synced_execution = remote.sync(execution)
print(synced_execution.inputs)  # print out the inputs
```

You can also wait for the execution after you've launched it and access the outputs:

```python
completed_execution = remote.wait(execution)
print(completed_execution.outputs)  # print out the outputs
```

## Terminating all running executions for a workflow

This example shows how to terminate all running executions of a given workflow.

```python
import json
from dataclasses import dataclass

import flytekit as fl
from flytekit.configuration import Config
from flytekit.models.core.execution import NodeExecutionPhase

@dataclass
class Execution:
    name: str
    link: str

SOME_LARGE_LIMIT = 5000
PHASE = NodeExecutionPhase.RUNNING
WF_NAME = "your_workflow_name"
EXECUTIONS_TO_IGNORE = ["some_execution_name_to_ignore"]
PROJECT = "your_project"
DOMAIN = "production"
ENDPOINT = "union.example.com"

remote = fl.FlyteRemote(
    config=Config.for_endpoint(endpoint=ENDPOINT),
    default_project=PROJECT,
    default_domain=DOMAIN,
)

executions_of_interest = []
executions = remote.recent_executions(limit=SOME_LARGE_LIMIT)
for e in executions:
    if e.closure.phase == PHASE:
        if e.spec.launch_plan.name == WF_NAME:
            if e.id.name not in EXECUTIONS_TO_IGNORE:
                execution_of_interest = Execution(
                    name=e.id.name,
                    link=f"https://{ENDPOINT}/console/projects/{PROJECT}/domains/{DOMAIN}/executions/{e.id.name}",
                )
                executions_of_interest.append(execution_of_interest)
                remote.terminate(e, cause="Terminated manually via script.")

with open('terminated_executions.json', 'w') as f:
    json.dump([{'name': e.name, 'link': e.link} for e in executions_of_interest], f, indent=2)

print(f"Terminated {len(executions_of_interest)} executions.")
```

## Rerunning all failed executions of a workflow

This example shows how to identify all failed executions of a given workflow since a certain time, and re-run them with the same inputs and a pinned workflow version.
```python
import datetime

import pytz
import flytekit as fl
from flytekit.configuration import Config
from flytekit.models.core.execution import NodeExecutionPhase

SOME_LARGE_LIMIT = 5000
WF_NAME = "your_workflow_name"
PROJECT = "your_project"
DOMAIN = "production"
ENDPOINT = "union.example.com"
VERSION = "your_target_workflow_version"

remote = fl.FlyteRemote(
    config=Config.for_endpoint(endpoint=ENDPOINT),
    default_project=PROJECT,
    default_domain=DOMAIN,
)

executions = remote.recent_executions(limit=SOME_LARGE_LIMIT)

failures = [
    NodeExecutionPhase.FAILED,
    NodeExecutionPhase.ABORTED,
    NodeExecutionPhase.FAILING,
]

# time of the last successful execution
date = datetime.datetime(2024, 10, 30, tzinfo=pytz.UTC)

# filter executions by name
filtered = [execution for execution in executions if execution.spec.launch_plan.name == WF_NAME]

# filter executions by phase
failed = [execution for execution in filtered if execution.closure.phase in failures]

# filter executions by time
windowed = [execution for execution in failed if execution.closure.started_at > date]

# get inputs for each execution
inputs = [remote.sync(execution).inputs for execution in windowed]

# get new workflow version entity
workflow = remote.fetch_workflow(name=WF_NAME, version=VERSION)

# execute new workflow for each failed previous execution
[remote.execute(workflow, inputs=X) for X in inputs]
```

## Filtering for executions using a `Filter`

This example shows how to use a `Filter` to only query for the executions you want.

```python
import flytekit as fl
from flytekit.models import filters

WF_NAME = "your_workflow_name"
LP_NAME = "your_launchplan_name"
PROJECT = "your_project"
DOMAIN = "production"
ENDPOINT = "union.example.com"

remote = fl.FlyteRemote.for_endpoint(ENDPOINT)

# Only query executions of the given workflow
project_filter = filters.Filter.from_python_std(f"eq(workflow.name,{WF_NAME})")
project_executions = remote.recent_executions(project=PROJECT, domain=DOMAIN, filters=[project_filter])

# Query for the latest execution that succeeded and was between 8 and 16 minutes
latest_success = remote.recent_executions(
    limit=1,
    filters=[
        filters.Equal("launch_plan.name", LP_NAME),
        filters.Equal("phase", "SUCCEEDED"),
        filters.GreaterThan("duration", 8 * 60),
        filters.LessThan("duration", 16 * 60),
    ],
)
```

## Launch task via FlyteRemote with a new version

```python
import flytekit as fl
from flytekit.remote import FlyteRemote
from flytekit.configuration import Config, SerializationSettings

# FlyteRemote object is the main entrypoint to the API
remote = fl.FlyteRemote(
    config=Config.for_endpoint(endpoint="flyte.example.net"),
    default_project="flytesnacks",
    default_domain="development",
)

# Get Task
task = remote.fetch_task(name="workflows.example.generate_normal_df", version="v1")

# Register the fetched task under a new version
task = remote.register_task(
    entity=task,
    serialization_settings=SerializationSettings(image_config=None),
    version="v2",
)

# Run Task
execution = remote.execute(
    task, inputs={"n": 200, "mean": 0.0, "sigma": 1.0}, execution_name="task-execution", wait=True
)

# Or use execution_name_prefix to avoid repeated execution names
execution = remote.execute(
    task, inputs={"n": 200, "mean": 0.0, "sigma": 1.0}, execution_name_prefix="flyte", wait=True
)

# Inspecting execution
# The 'inputs' and 'outputs' correspond to the task execution.
input_keys = execution.inputs.keys()
output_keys = execution.outputs.keys()
```

## Launch workflow via FlyteRemote

Workflows can be executed with `FlyteRemote` because under the hood it fetches and triggers a default launch plan.
```python
import flytekit as fl
from flytekit.configuration import Config

# FlyteRemote object is the main entrypoint to the API
remote = fl.FlyteRemote(
    config=Config.for_endpoint(endpoint="flyte.example.net"),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch workflow
workflow = remote.fetch_workflow(name="workflows.example.wf", version="v1")

# Execute
execution = remote.execute(
    workflow, inputs={"mean": 1}, execution_name="workflow-execution", wait=True
)

# Or use execution_name_prefix to avoid repeated execution names
execution = remote.execute(
    workflow, inputs={"mean": 1}, execution_name_prefix="flyte", wait=True
)
```

## Launch launch plan via FlyteRemote

A launch plan can be launched via `FlyteRemote` programmatically.

```python
import flytekit as fl
from flytekit.configuration import Config

# FlyteRemote object is the main entrypoint to the API
remote = fl.FlyteRemote(
    config=Config.for_endpoint(endpoint="flyte.example.net"),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch launch plan
lp = remote.fetch_launch_plan(
    name="workflows.example.wf", version="v1", project="flytesnacks", domain="development"
)

# Execute
execution = remote.execute(
    lp, inputs={"mean": 1}, execution_name="lp-execution", wait=True
)

# Or use execution_name_prefix to avoid repeated execution names
execution = remote.execute(
    lp, inputs={"mean": 1}, execution_name_prefix="flyte", wait=True
)
```

## Inspecting executions

With `FlyteRemote`, you can fetch the inputs and outputs of executions and inspect them.

```python
import flytekit as fl
from flytekit.configuration import Config

# FlyteRemote object is the main entrypoint to the API
remote = fl.FlyteRemote(
    config=Config.for_endpoint(endpoint="flyte.example.net"),
    default_project="flytesnacks",
    default_domain="development",
)

execution = remote.fetch_execution(
    name="fb22e306a0d91e1c6000", project="flytesnacks", domain="development"
)

input_keys = execution.inputs.keys()
output_keys = execution.outputs.keys()

# The inputs and outputs correspond to the top-level execution or the workflow itself.
# To fetch a specific output, say, a model file:
model_file = execution.outputs["model_file"]
with open(model_file) as f:
    ...

# You can use FlyteRemote.sync() to sync the entity object's state with the remote state during the execution run.
synced_execution = remote.sync(execution, sync_nodes=True)
node_keys = synced_execution.node_executions.keys()

# node_executions will fetch all the underlying node executions recursively.
# To fetch the output of a specific node execution:
node_execution_output = synced_execution.node_executions["n1"].outputs["model_file"]
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/testing ===

# Testing

The `flytekit` Python SDK provides a few utilities to make it easier to test your tasks and workflows in your test suite. For more details, you can also refer to the [`flytekit.testing`](../../../api-reference/flytekit-sdk/packages/flytekit.core.testing) module in the Reference section.

## Subpages

- **Development cycle > Testing > Mocking tasks**

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/development-cycle/testing/mocking-tasks ===

# Mocking tasks

Many of the tasks you write can be run locally, but some cannot, usually because they depend on a third-party service that is only available on the backend. Hive tasks are a common example, as most users will not have access to the service that executes Hive queries from their development environment.
However, it's still useful to be able to locally run a workflow that calls such a task. In these instances, flytekit provides a couple of utilities to help navigate this.

For example, the following is a generic SQL task (which is by default not hooked up to any datastore nor handled by any plugin), and it must be mocked if it is to be used in a unit test, like so:

```python
import datetime

import pandas
from flytekit import SQLTask, TaskMetadata, kwtypes, task, workflow
from flytekit.testing import patch, task_mock
from flytekit.types.schema import FlyteSchema

sql = SQLTask(
    "my-query",
    query_template="SELECT * FROM hive.city.fact_airport_sessions WHERE ds = '{{ .Inputs.ds }}' LIMIT 10",
    inputs=kwtypes(ds=datetime.datetime),
    outputs=kwtypes(results=FlyteSchema),
    metadata=TaskMetadata(retries=2),
)

@task
def t1() -> datetime.datetime:
    return datetime.datetime.now()
```

Suppose you have a workflow that uses these two tasks:

```python
@workflow
def my_wf() -> FlyteSchema:
    dt = t1()
    return sql(ds=dt)
```

Without a mock, calling the workflow would typically raise an exception, but with the `task_mock` construct, which returns a `MagicMock` object, we can override the return value.

```python
def test_demonstrate_mock():
    with task_mock(sql) as mock:
        mock.return_value = pandas.DataFrame(data={"x": [1, 2], "y": ["3", "4"]})
        assert (my_wf().open().all() == pandas.DataFrame(data={"x": [1, 2], "y": ["3", "4"]})).all().all()
```

There is another utility called `patch` which offers the same functionality, but in the traditional Python patching style, where the first argument is the `MagicMock` object.

```python
@patch(sql)
def test_demonstrate_patch(mock_sql):
    mock_sql.return_value = pandas.DataFrame(data={"x": [1, 2], "y": ["3", "4"]})
    assert (my_wf().open().all() == pandas.DataFrame(data={"x": [1, 2], "y": ["3", "4"]})).all().all()
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output ===

# Data input/output

Since Flyte is a data-aware orchestration platform, types play a vital role within it. This section provides an introduction to the wide range of data types that Flyte supports. These types serve a dual purpose by not only validating the data but also enabling seamless transfer of data between local and cloud storage. They enable:

- Data lineage
- Memoization
- Auto parallelization
- Simplifying access to data
- Auto generated CLI and launch UI

For a more comprehensive understanding of how Flyte manages data, refer to **Data handling**.

## Mapping Python to Flyte types

Flytekit automatically translates most Python types into Flyte types. Here's a breakdown of these mappings:

| Python Type | Flyte Type | Conversion | Comment |
|-------------|------------|------------|---------|
| `int` | `Integer` | Automatic | Use Python 3 type hints. |
| `float` | `Float` | Automatic | Use Python 3 type hints. |
| `str` | `String` | Automatic | Use Python 3 type hints. |
| `bool` | `Boolean` | Automatic | Use Python 3 type hints. |
| `bytes`/`bytearray` | `Binary` | Not Supported | You have the option to employ your own custom type transformer. |
| `complex` | NA | Not Supported | You have the option to employ your own custom type transformer. |
| `datetime.timedelta` | `Duration` | Automatic | Use Python 3 type hints. |
| `datetime.datetime` | `Datetime` | Automatic | Use Python 3 type hints. |
| `datetime.date` | `Datetime` | Automatic | Use Python 3 type hints. |
| `typing.List[T]` / `list[T]` | `Collection [T]` | Automatic | Use `typing.List[T]` or `list[T]`, where `T` can represent one of the other supported types listed in the table. |
| `typing.Iterator[T]` | `Collection [T]` | Automatic | Use `typing.Iterator[T]`, where `T` can represent one of the other supported types listed in the table. |
| File / file-like / `os.PathLike` | `FlyteFile` | Automatic | If you're using `file` or `os.PathLike` objects, Flyte will default to the binary protocol for the file. When using `FlyteFile["protocol"]`, it is assumed that the file is in the specified protocol, such as 'jpg', 'png', 'hdf5', etc. |
| Directory | `FlyteDirectory` | Automatic | When using `FlyteDirectory["protocol"]`, it is assumed that all the files belong to the specified protocol. |
| `typing.Dict[str, V]` / `dict[str, V]` | `Map[str, V]` | Automatic | Use `typing.Dict[str, V]` or `dict[str, V]`, where `V` can represent one of the other supported types listed in the table. |

To add support for additional Python types, see **Extending Flyte > Custom types**.

## Subpages

- **Data input/output > FlyteFile and FlyteDirectory**
- **Data input/output > Downloading with FlyteFile and FlyteDirectory**
- **Data input/output > Task input and output**
- **Data input/output > Accessing attributes**
- **Data input/output > Dataclass**
- **Data input/output > Enum type**
- **Data input/output > Pickle type**
- **Data input/output > Pydantic BaseModel**
- **Data input/output > PyTorch type**
- **Data input/output > StructuredDataset**
- **Data input/output > TensorFlow types**

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/flyte-file-and-flyte-directory ===

# FlyteFile and FlyteDirectory

## FlyteFile

Files are one of the most fundamental entities that users of Python work with, and they are fully supported by Flyte. In the IDL, they are known as [Blob](https://github.com/flyteorg/flyteidl/blob/master/protos/flyteidl/core/literals.proto#L33) literals which are backed by the [blob type](https://github.com/flyteorg/flyteidl/blob/master/protos/flyteidl/core/types.proto#L47).

Let's assume our mission here is pretty simple. We download a few CSV file links, read them with the Python built-in `csv.DictReader` function, normalize some pre-specified columns, and output the normalized columns to another CSV file.

> [!NOTE]
> To clone and run the example code on this page, see the
> [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).

First, import the libraries:

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import List

import flytekit as fl
```

Define a task that accepts `FlyteFile` as an input. The following is a task that accepts a `FlyteFile`, a list of column names, and a list of column names to normalize. The task then outputs a CSV file containing only the normalized columns. For this example, we use z-score normalization, which involves mean-centering and standard-deviation-scaling.

> [!NOTE]
> The `FlyteFile` literal can be scoped with a string, which gets inserted
> into the format of the Blob type (`"jpeg"` is the string in
> `FlyteFile[typing.TypeVar("jpeg")]`). The format is entirely optional,
> and if not specified, defaults to `""`.
> Predefined aliases for commonly used flyte file formats are also available.
> You can find them [here](https://github.com/flyteorg/flytekit/blob/master/flytekit/types/file/__init__.py).
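For illustration only, a format-scoped `FlyteFile` annotation might look like the following. This is a minimal sketch using `"csv"` as the format string; the main example below uses a plain, unscoped `fl.FlyteFile`:

```python
import typing

import flytekit as fl

# Scope the FlyteFile type to a "csv" format; this only annotates the Blob format,
# it does not change how the file is read.
CSVFile = fl.FlyteFile[typing.TypeVar("csv")]

@fl.task
def first_line(f: CSVFile) -> str:
    # FlyteFile is os.PathLike, so opening it triggers the download if the file is remote.
    with open(f, newline="") as fh:
        return fh.readline()
```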
```python
@fl.task
def normalize_columns(
    csv_url: fl.FlyteFile,
    column_names: List[str],
    columns_to_normalize: List[str],
    output_location: str,
) -> fl.FlyteFile:
    # read the data from the raw csv file
    parsed_data = defaultdict(list)
    with open(csv_url, newline="\n") as input_file:
        reader = csv.DictReader(input_file, fieldnames=column_names)
        next(reader)  # Skip header
        for row in reader:
            for column in columns_to_normalize:
                parsed_data[column].append(float(row[column].strip()))

    # normalize the data
    normalized_data = defaultdict(list)
    for colname, values in parsed_data.items():
        mean = sum(values) / len(values)
        std = (sum([(x - mean) ** 2 for x in values]) / len(values)) ** 0.5
        normalized_data[colname] = [(x - mean) / std for x in values]

    # write to local path
    out_path = str(Path(fl.current_context().working_directory) / f"normalized-{Path(csv_url.path).stem}.csv")
    with open(out_path, mode="w") as output_file:
        writer = csv.DictWriter(output_file, fieldnames=columns_to_normalize)
        writer.writeheader()
        for row in zip(*normalized_data.values()):
            writer.writerow({k: row[i] for i, k in enumerate(columns_to_normalize)})

    if output_location:
        return fl.FlyteFile(path=str(out_path), remote_path=output_location)
    else:
        return fl.FlyteFile(path=str(out_path))
```

When the CSV file's URL is sent to the task, the system translates it into a `FlyteFile` object on the local drive (but doesn't download it). Calling the `download()` method triggers the download, and the `path` attribute enables you to `open` the file.

If the `output_location` argument is specified, it will be passed to the `remote_path` argument of `FlyteFile`, which will use that path as the storage location instead of a random location (Flyte's object store).

When this task finishes, the system returns the `FlyteFile` instance, uploads the file to the location, and creates a blob literal pointing to it.

Lastly, define a workflow. The `normalize_csv_file` workflow has an `output_location` argument which is passed to the `output_location` input of the task. If it's not an empty string, the task attempts to upload its file to that location.

```python
@fl.workflow
def normalize_csv_file(
    csv_url: fl.FlyteFile,
    column_names: List[str],
    columns_to_normalize: List[str],
    output_location: str = "",
) -> fl.FlyteFile:
    return normalize_columns(
        csv_url=csv_url,
        column_names=column_names,
        columns_to_normalize=columns_to_normalize,
        output_location=output_location,
    )
```

You can run the workflow locally as follows:

```python
if __name__ == "__main__":
    default_files = [
        (
            "https://raw.githubusercontent.com/flyteorg/flytesnacks/refs/heads/master/examples/data_types_and_io/test_data/biostats.csv",
            ["Name", "Sex", "Age", "Heights (in)", "Weight (lbs)"],
            ["Age"],
        ),
        (
            "https://raw.githubusercontent.com/flyteorg/flytesnacks/refs/heads/master/examples/data_types_and_io/test_data/faithful.csv",
            ["Index", "Eruption length (mins)", "Eruption wait (mins)"],
            ["Eruption length (mins)"],
        ),
    ]
    print(f"Running {__file__} main...")
    for index, (csv_url, column_names, columns_to_normalize) in enumerate(default_files):
        normalized_columns = normalize_csv_file(
            csv_url=csv_url,
            column_names=column_names,
            columns_to_normalize=columns_to_normalize,
        )
        print(f"Running normalize_csv_file workflow on {csv_url}: " f"{normalized_columns}")
```

You can enable type validation if you have the [python-magic](https://pypi.org/project/python-magic/) package installed.
### Mac OS

```shell
$ brew install libmagic
```

### Linux

```shell
$ sudo apt-get install libmagic1
```

> [!NOTE]
> Currently, type validation is only supported on the `Mac OS` and `Linux` platforms.

## Streaming support

`FlyteFile` supports streaming via the `fsspec` library. This integration enables efficient, on-demand access to remote files, eliminating the need to fully download them to local storage.

Here is a simple example of removing some rows from a CSV file and writing the result to a new file:

```python
import pandas as pd


@fl.task()
def remove_some_rows(ff: fl.FlyteFile) -> fl.FlyteFile:
    """
    Remove the rows where the value of the City column is 'Seattle'.
    This is an example with streaming support.
    """
    new_file = fl.FlyteFile.new_remote_file("data_without_seattle.csv")
    with ff.open("r") as r:
        with new_file.open("w") as w:
            df = pd.read_csv(r)
            df = df[df["City"] != "Seattle"]
            df.to_csv(w, index=False)
    return new_file
```

## FlyteDirectory

In addition to files, folders are another fundamental operating system primitive. Flyte supports folders in the form of [multi-part blobs](https://github.com/flyteorg/flyteidl/blob/master/protos/flyteidl/core/types.proto#L73).

> [!NOTE]
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).

To begin, import the libraries:

```python
import csv
import urllib.request
from collections import defaultdict
from pathlib import Path
from typing import List

import flytekit as fl
```

Building upon the previous example demonstrated in **Data input/output > FlyteFile and FlyteDirectory > FlyteFile**, let's continue by considering the normalization of columns in a CSV file.

The following task downloads a list of URLs pointing to CSV files and returns the folder path in a `FlyteDirectory` object.

```python
@fl.task
def download_files(csv_urls: List[str]) -> fl.FlyteDirectory:
    working_dir = fl.current_context().working_directory
    local_dir = Path(working_dir) / "csv_files"
    local_dir.mkdir(exist_ok=True)

    # get the number of digits needed to preserve the order of files in the local directory
    zfill_len = len(str(len(csv_urls)))
    for idx, remote_location in enumerate(csv_urls):
        # prefix the file name with the index location of the file in the original csv_urls list
        local_image = Path(local_dir) / f"{str(idx).zfill(zfill_len)}_{Path(remote_location).name}"
        urllib.request.urlretrieve(remote_location, local_image)
    return fl.FlyteDirectory(path=str(local_dir))
```

> [!NOTE]
> You can annotate a `FlyteDirectory` when you want to download or upload the contents of the directory in batches.
> For example,
>
> ```python
> @fl.task
> def t1(directory: Annotated[fl.FlyteDirectory, BatchSize(10)]) -> Annotated[fl.FlyteDirectory, BatchSize(100)]:
>     ...
>     return fl.FlyteDirectory(...)
> ```
>
> Flytekit efficiently downloads files from the specified input directory in 10-file chunks.
> It then loads these chunks into memory before writing them to the local disk.
> The process repeats for subsequent sets of 10 files.
> Similarly, for outputs, Flytekit uploads the resulting directory in chunks of 100.

We define a helper function to normalize the columns in-place.

> [!NOTE]
> This is a plain Python function that will be called in a subsequent Flyte task. This example
> demonstrates how Flyte tasks are simply entrypoints of execution, which can themselves call
> other functions and routines that are written in pure Python.
```python
def normalize_columns(
    local_csv_file: str,
    column_names: List[str],
    columns_to_normalize: List[str],
):
    # read the data from the raw csv file
    parsed_data = defaultdict(list)
    with open(local_csv_file, newline="\n") as input_file:
        reader = csv.DictReader(input_file, fieldnames=column_names)
        for row in (x for i, x in enumerate(reader) if i > 0):
            for column in columns_to_normalize:
                parsed_data[column].append(float(row[column].strip()))

    # normalize the data
    normalized_data = defaultdict(list)
    for colname, values in parsed_data.items():
        mean = sum(values) / len(values)
        std = (sum([(x - mean) ** 2 for x in values]) / len(values)) ** 0.5
        normalized_data[colname] = [(x - mean) / std for x in values]

    # overwrite the csv file with the normalized columns
    with open(local_csv_file, mode="w") as output_file:
        writer = csv.DictWriter(output_file, fieldnames=columns_to_normalize)
        writer.writeheader()
        for row in zip(*normalized_data.values()):
            writer.writerow({k: row[i] for i, k in enumerate(columns_to_normalize)})
```

We then define a task that accepts the previously downloaded folder, along with some metadata about the column names of each file in the directory and the column names that we want to normalize.

```python
@fl.task
def normalize_all_files(
    csv_files_dir: fl.FlyteDirectory,
    columns_metadata: List[List[str]],
    columns_to_normalize_metadata: List[List[str]],
) -> fl.FlyteDirectory:
    for local_csv_file, column_names, columns_to_normalize in zip(
        # make sure we sort the files in the directory to preserve the original order of the csv urls
        list(sorted(Path(csv_files_dir).iterdir())),
        columns_metadata,
        columns_to_normalize_metadata,
    ):
        normalize_columns(local_csv_file, column_names, columns_to_normalize)
    return fl.FlyteDirectory(path=csv_files_dir.path)
```

Compose all the above tasks into a workflow. This workflow accepts a list of URL strings pointing to a remote location containing a CSV file, a list of column names associated with each CSV file, and a list of columns that we want to normalize.

```python
@fl.workflow
def download_and_normalize_csv_files(
    csv_urls: List[str],
    columns_metadata: List[List[str]],
    columns_to_normalize_metadata: List[List[str]],
) -> fl.FlyteDirectory:
    directory = download_files(csv_urls=csv_urls)
    return normalize_all_files(
        csv_files_dir=directory,
        columns_metadata=columns_metadata,
        columns_to_normalize_metadata=columns_to_normalize_metadata,
    )
```

You can run the workflow locally as follows:

```python
if __name__ == "__main__":
    csv_urls = [
        "https://raw.githubusercontent.com/flyteorg/flytesnacks/refs/heads/master/examples/data_types_and_io/test_data/biostats.csv",
        "https://raw.githubusercontent.com/flyteorg/flytesnacks/refs/heads/master/examples/data_types_and_io/test_data/faithful.csv",
    ]
    columns_metadata = [
        ["Name", "Sex", "Age", "Heights (in)", "Weight (lbs)"],
        ["Index", "Eruption length (mins)", "Eruption wait (mins)"],
    ]
    columns_to_normalize_metadata = [
        ["Age"],
        ["Eruption length (mins)"],
    ]

    print(f"Running {__file__} main...")
    directory = download_and_normalize_csv_files(
        csv_urls=csv_urls,
        columns_metadata=columns_metadata,
        columns_to_normalize_metadata=columns_to_normalize_metadata,
    )
    print(f"Running download_and_normalize_csv_files on {csv_urls}: " f"{directory}")
```

## Changing the data upload location

> [!NOTE] Upload location
> With Union.ai Serverless, the remote location to which `FlyteFile` and `FlyteDirectory` upload container-local
> files is always a randomly generated (universally unique) location in Flyte's internal object store.
> It cannot be changed.
>
> With Union.ai BYOC, the upload location is configurable.

By default, Flyte uploads local files or directories to the default **raw data store** (Flyte's dedicated internal object store). However, you can change the upload location by setting the raw data prefix to your own bucket or specifying the `remote_path` for a `FlyteFile` or `FlyteDirectory`.

> [!NOTE] Setting up your own object store bucket
> For details on how to set up your own object store bucket, consult the directions for your cloud provider:
>
> * **Enabling AWS resources > Enabling AWS S3**
> * **Enabling GCP resources > Enabling Google Cloud Storage**
> * **Enabling Azure resources > Enabling Azure Blob Storage**

### Changing the raw data prefix

If you would like files or directories to be uploaded to your own bucket, you can specify the AWS, GCS, or Azure bucket in the **raw data prefix** parameter. This setting can be made at the workflow level on registration, or per execution on the command line or in the UI.

Flyte will create a directory with a unique, random name in your bucket for each `FlyteFile` or `FlyteDirectory` data write to guarantee that you never overwrite your data.

### Specifying `remote_path` for a `FlyteFile` or `FlyteDirectory`

If you specify the `remote_path` when initializing your `FlyteFile` (or `FlyteDirectory`), the underlying data is written to that precise location with no randomization.

> [!NOTE] Using remote_path will overwrite data
> If you set `remote_path` to a static string, subsequent runs of the same task will overwrite the file.
> If you want to use a dynamically generated path, you will have to generate it yourself.

## Remote examples

### Remote file example

In the example above, we started with a local file. To preserve that file across the task boundary, Flyte uploaded it to the Flyte object store before passing it to the next task.

You can also _start with a remote file_, simply by initializing the `FlyteFile` object with a URI pointing to a remote source. For example:

```python
@fl.task
def task_1() -> fl.FlyteFile:
    remote_path = "https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv"
    return fl.FlyteFile(path=remote_path)
```

In this case, no uploading is needed because the source file is already in a remote location. When the object is passed out of the task, it is converted into a `Blob` with the remote path as the URI. After the `FlyteFile` is passed to the next task, you can call `FlyteFile.open()` on it, just as before.

If you don't intend to pass the `FlyteFile` to the next task, but rather intend to open the contents of the remote file within the task, you can use `from_source`.

```python
import json


@fl.task
def load_json():
    uri = "gs://my-bucket/my-directory/example.json"
    my_json = fl.FlyteFile.from_source(uri)

    # Load the JSON file into a dictionary and print it
    with open(my_json, "r") as json_file:
        data = json.load(json_file)
    print(data)
```

When initializing a `FlyteFile` with a remote file location, all URI schemes supported by `fsspec` are supported, including `http` and `https` (web), `gs` (Google Cloud Storage), `s3` (AWS S3), and `abfs` and `abfss` (Azure Blob Filesystem).

### Remote directory example

Below is an equivalent remote example for `FlyteDirectory`. The process of passing the `FlyteDirectory` between tasks is essentially identical to the `FlyteFile` example above.
```python
@fl.task
def task1() -> fl.FlyteDirectory:
    p = "https://people.sc.fsu.edu/~jburkardt/data/csv/"
    return fl.FlyteDirectory(p)


@fl.task
def task2(fd: fl.FlyteDirectory):
    # Get a list of the directory contents and display the first csv
    files = fl.FlyteDirectory.listdir(fd)
    with open(files[0], mode="r") as f:
        d = f.read()
    print(f"The first csv is: \n{d}")


@fl.workflow
def workflow():
    fd = task1()
    task2(fd=fd)
```

## Streaming

In the above examples, we showed how to access the contents of a `FlyteFile` by calling `FlyteFile.open()`. The object returned by `FlyteFile.open()` is a stream. In the above examples, the files were small, so a simple `read()` was used. But for large files, you can iterate through the contents of the stream:

```python
@fl.task
def task_1() -> fl.FlyteFile:
    remote_path = "https://sample-videos.com/csv/Sample-Spreadsheet-100000-rows.csv"
    return fl.FlyteFile(path=remote_path)


@fl.task
def task_2(ff: fl.FlyteFile):
    with ff.open(mode="r") as f:
        for row in f:
            do_something(row)
```

## Downloading

Alternatively, you can download the contents of a `FlyteFile` object to a local file in the task container. There are two ways to do this: **implicitly** and **explicitly**.

### Implicit downloading

The source file of a `FlyteFile` object is downloaded to the local container file system automatically whenever a function is called that takes the `FlyteFile` object and then calls `FlyteFile`'s `__fspath__()` method.

`FlyteFile` implements the `os.PathLike` interface and therefore the `__fspath__()` method. `FlyteFile`'s implementation of `__fspath__()` performs a download of the source file to the local container storage and returns the path to that local file. This enables many common file-related operations in Python to be performed on the `FlyteFile` object.

The most prominent example of such an operation is calling Python's built-in `open()` method with a `FlyteFile`:

```python
@fl.task
def task_2(ff: fl.FlyteFile):
    with open(ff, mode="r") as f:
        file_contents = f.read()
```

> [!NOTE] open() vs ff.open()
> Note the difference between
>
> `ff.open(mode="r")`
>
> and
>
> `open(ff, mode="r")`
>
> The former calls the `FlyteFile.open()` method and returns an iterator without downloading the file.
> The latter calls the built-in Python function `open()`, downloads the specified `FlyteFile` to the local container file system,
> and returns a handle to that file.
>
> Many other Python file operations (essentially, any that accept an `os.PathLike` object) can also be performed on a `FlyteFile`
> object and result in an automatic download.
>
> See **Data input/output > Downloading with FlyteFile and FlyteDirectory** for more information.

### Explicit downloading

You can also explicitly download a `FlyteFile` to the local container file system by calling `FlyteFile.download()`:

```python
@fl.task
def task_2(ff: fl.FlyteFile):
    local_path = ff.download()
```

This method is typically used when you want to download the file without immediately reading it.

## Typed aliases

The **Flytekit SDK** defines some aliases of `FlyteFile` with specific type annotations.
Specifically, `FlyteFile` has the following aliases (see **Flytekit SDK > Packages > flytekit.types.file**):

* `HDF5EncodedFile`
* `HTMLPage`
* `JoblibSerializedFile`
* `JPEGImageFile`
* `PDFFile`
* `PNGImageFile`
* `PythonPickledFile`
* `PythonNotebook`
* `SVGImageFile`

Similarly, `FlyteDirectory` has the following aliases (see **Flytekit SDK > Packages > flytekit.types.directory**):

* `TensorboardLogs`
* `TFRecordsDirectory`

These aliases can optionally be used when handling a file or directory of the specified type, although the object itself will still be a `FlyteFile` or `FlyteDirectory`. The aliased versions of the classes are syntactic markers that enforce agreement between type annotations in the signatures of task functions, but they do not perform any checks on the actual contents of the file.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/downloading-with-ff-and-fd ===

# Downloading with FlyteFile and FlyteDirectory

The basic idea behind `FlyteFile` and `FlyteDirectory` is that they represent files and directories in remote storage. When you work with these objects in your tasks, you are working with references to the remote files and directories.

Of course, at some point you will need to access the actual contents of these files and directories, which means that they have to be downloaded to the local file system of the task container.

The actual files and directories of a `FlyteFile` or `FlyteDirectory` are downloaded to the local file system of the task container in two ways:

* Explicitly, through a call to the `download` method.
* Implicitly, through automatic downloading. This occurs when an external function is called on the `FlyteFile` or `FlyteDirectory` that itself calls the `__fspath__` method.

To write efficient and performant task and workflow code, it is particularly important to have a solid understanding of when exactly downloading occurs. Let's look at some examples showing when the contents of `FlyteFile` objects and `FlyteDirectory` objects are downloaded to the local task container file system.

## FlyteFile

**Calling `download` on a FlyteFile**

```python
@fl.task
def my_task(ff: FlyteFile):
    print(os.path.isfile(ff.path))  # This will print False as nothing has been downloaded
    ff.download()
    print(os.path.isfile(ff.path))  # This will print True as the FlyteFile was downloaded
```

Note that we use `ff.path`, which is of type `typing.Union[str, os.PathLike]`, rather than using `ff` in `os.path.isfile` directly. In the next example, we will see that using `os.path.isfile(ff)` invokes `__fspath__`, which downloads the file.

**Implicit downloading by `__fspath__`**

In order to make use of some functions like `os.path.isfile` that you may be used to using with regular file paths, `FlyteFile` implements a `__fspath__` method that downloads the remote contents to the `path` of the `FlyteFile` local to the container.

```python
@fl.task
def my_task(ff: FlyteFile):
    print(os.path.isfile(ff.path))  # This will print False as nothing has been downloaded
    print(os.path.isfile(ff))  # This will print True as os.path.isfile(ff) downloads via __fspath__
    print(os.path.isfile(ff.path))  # This will again print True as the file was downloaded
```

It is important to be aware of any operations on your `FlyteFile` that might call `__fspath__` and result in downloading. Some examples include calling `open(ff, mode="r")` directly on a `FlyteFile` (rather than on the `path` attribute) to get the contents of the path, or similarly calling `shutil.copy` or `pathlib.Path` directly on a `FlyteFile`.
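For instance, here is a minimal sketch (the task name and the local file name are illustrative) of how passing a `FlyteFile` directly to `shutil.copy` triggers an implicit download:

```python
import os
import shutil

import flytekit as fl
from flytekit.types.file import FlyteFile


@fl.task
def copy_file(ff: FlyteFile):
    # shutil.copy() accepts os.PathLike objects, so passing ff directly
    # invokes __fspath__, which downloads the remote file first.
    local_copy = os.path.join(fl.current_context().working_directory, "copy_of_input")
    shutil.copy(ff, local_copy)
    print(os.path.isfile(ff.path))  # True: the source was downloaded as a side effect
```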
## FlyteDirectory

**Calling `download` on a FlyteDirectory**

```python
@fl.task
def my_task(fd: FlyteDirectory):
    print(os.listdir(fd.path))  # This will print nothing as the directory has not been downloaded
    fd.download()
    print(os.listdir(fd.path))  # This will print the files present in the directory as it has been downloaded
```

Similar to how the `path` attribute was used above for the `FlyteFile`, note that we use `fd.path`, which is of type `typing.Union[str, os.PathLike]`, rather than using `fd` in `os.listdir` directly. Again, we will see that this is because of the invocation of `__fspath__` when `os.listdir(fd)` is called.

**Implicit downloading by `__fspath__`**

In order to make use of some functions like `os.listdir` that you may be used to using with directories, `FlyteDirectory` implements a `__fspath__` method that downloads the remote contents to the `path` of the `FlyteDirectory` local to the container.

```python
@fl.task
def my_task(fd: FlyteDirectory):
    print(os.listdir(fd.path))  # This will print nothing as the directory has not been downloaded
    print(os.listdir(fd))  # This will print the files present in the directory as os.listdir(fd) downloads via __fspath__
    print(os.listdir(fd.path))  # This will again print the files present in the directory as it has been downloaded
```

It is important to be aware of any operations on your `FlyteDirectory` that might call `__fspath__` and result in downloading. Some other examples include calling `os.stat` directly on a `FlyteDirectory` (rather than on the `path` attribute) to get the status of the path, or similarly calling `os.path.isdir` on a `FlyteDirectory` to check if a directory exists.

**Inspecting the contents of a directory without downloading using `crawl`**

As we saw above, using `os.listdir` on a `FlyteDirectory` to view the contents in remote blob storage results in the contents being downloaded to the task container. If this should be avoided, the `crawl` method offers a means of inspecting the contents of the directory without calling `__fspath__` and therefore downloading the directory contents.

```python
@fl.task
def task1() -> FlyteDirectory:
    p = os.path.join(fl.current_context().working_directory, "my_new_directory")
    os.makedirs(p)

    # Create and write to two files
    with open(os.path.join(p, "file_1.txt"), 'w') as file1:
        file1.write("This is file 1.")
    with open(os.path.join(p, "file_2.txt"), 'w') as file2:
        file2.write("This is file 2.")

    return FlyteDirectory(p)


@fl.task
def task2(fd: FlyteDirectory):
    print(os.listdir(fd.path))  # This will print nothing as the directory has not been downloaded

    print(list(fd.crawl()))  # This will print the files present in the remote blob storage
    # e.g. [('s3://union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80', 'file_1.txt'), ('s3://union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80', 'file_2.txt')]

    print(list(fd.crawl(detail=True)))  # This will print the files present in the remote blob storage with details including type, the time it was created, and more
    # e.g.
    # [('s3://union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80', {'file_1.txt': {'Key': 'union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80/file_1.txt', 'LastModified': datetime.datetime(2024, 7, 9, 16, 16, 21, tzinfo=tzlocal()), 'ETag': '"cfb2a3740155c041d2c3e13ad1d66644"', 'Size': 15, 'StorageClass': 'STANDARD', 'type': 'file', 'size': 15, 'name': 'union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80/file_1.txt'}}), ('s3://union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80', {'file_2.txt': {'Key': 'union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80/file_2.txt', 'LastModified': datetime.datetime(2024, 7, 9, 16, 16, 21, tzinfo=tzlocal()), 'ETag': '"500d703f270d4bc034e159480c83d329"', 'Size': 15, 'StorageClass': 'STANDARD', 'type': 'file', 'size': 15, 'name': 'union-contoso/ke/fe503def6ebe04fa7bba-n0-0/160e7266dcaffe79df85489771458d80/file_2.txt'}})]

    print(os.listdir(fd.path))  # This will again print nothing as the directory has not been downloaded
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/task-input-and-output ===

# Task input and output

The Flyte workflow engine automatically manages the passing of data from task to task, and to the workflow output. This mechanism relies on enforcing strong typing of task function parameters and return values. This enables the workflow engine to efficiently marshal and unmarshal values from one task container to the next.

The actual data is temporarily stored in Flyte's internal object store within your data plane (AWS S3, Google Cloud Storage, or Azure Blob Storage, depending on your cloud provider).

## Metadata and raw data

Flyte distinguishes between metadata and raw data. Primitive values (`int`, `str`, etc.) are stored directly in the metadata store, while complex data objects (`pandas.DataFrame`, `FlyteFile`, etc.) are stored by reference, with the reference pointer in the metadata store and the actual data in the raw data store.

## Metadata store

The metadata store is located in the dedicated Flyte object store in your data plane. Depending on your cloud provider, this may be an AWS S3, Google Cloud Storage, or Azure Blob Storage bucket. This data is accessible to the control plane. It is used to run and manage workflows and is surfaced in the UI.

## Raw data store

The raw data store is, by default, also located in the dedicated Flyte object store in your data plane. However, this location can be overridden per workflow or per execution using the **raw data prefix** parameter. The data in the raw data store is not accessible to the control plane and will only be surfaced in the UI if your code explicitly does so (for example, in a Deck).

For more details, see **Data handling**.

## Changing the raw data storage location

There are a number of ways to change the raw data location:

* When registering your workflow:
  * With `uctl register`, use the flag `--files.outputLocationPrefix`.
  * With `pyflyte register`, use the flag `--raw-data-prefix`.
* At the execution level:
  * In the UI, set the **Raw output data config** parameter in the execution dialog.

These options change the raw data location for **all large types** (`FlyteFile`, `FlyteDirectory`, `DataFrame`, any other large data object).
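For example, here is a minimal sketch of registering with a custom raw data prefix using the `pyflyte register` flag mentioned above (the bucket name and module path are illustrative):

```shell
$ pyflyte register --raw-data-prefix s3://my-own-bucket/raw-data workflows/
```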
If you are only concerned with controlling where raw data used by `FlyteFile` or `FlyteDirectory` is stored, you can set the `remote_path` parameter in your task code when initializing objects of those types (see **Data input/output > FlyteFile and FlyteDirectory**).

### Setting up your own object store

By default, when Flyte marshals values across tasks, it stores both metadata and raw data in its own dedicated object store bucket. While this bucket is located in your data plane and is therefore under your control, it is part of the Flyte implementation and should not be accessed or modified directly by your task code.

When changing the default raw data location, the target should therefore be a bucket that you set up, separate from the Flyte-implemented bucket. For information on setting up your own bucket and enabling access to it, see [Enabling AWS S3](../integrations/enabling-aws-resources/enabling-aws-s3), [Enabling Google Cloud Storage](../integrations/enabling-gcp-resources/enabling-google-cloud-storage), or [Enabling Azure Blob Storage](../integrations/enabling-azure-resources/enabling-azure-blob-storage), depending on your cloud provider.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/accessing-attributes ===

# Accessing attributes

You can directly access attributes on output promises for lists, dictionaries, dataclasses, and combinations of these types in Flyte. Note that while this functionality may appear to be the normal behavior of Python, code in `@workflow` functions is not actually Python, but rather a Python-like DSL that is compiled by Flyte. Consequently, accessing attributes in this manner is, in fact, a specially implemented feature.

This functionality facilitates the direct passing of output attributes within workflows, enhancing the convenience of working with complex data structures.

> [!NOTE]
> Flytekit version >= v1.14.0 supports Pydantic BaseModel V2, so you can do attribute access on Pydantic BaseModel V2 objects as well.
>
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).

To begin, import the required dependencies and define a common task for subsequent use:

```python
from dataclasses import dataclass

import flytekit as fl


@fl.task
def print_message(message: str):
    print(message)
    return
```

## List

You can access an output list using index notation.

> [!NOTE]
> Flyte currently does not support output promise access through list slicing.

```python
@fl.task
def list_task() -> list[str]:
    return ["apple", "banana"]


@fl.workflow
def list_wf():
    items = list_task()
    first_item = items[0]
    print_message(message=first_item)
```

## Dictionary

Access the output dictionary by specifying the key.

```python
@fl.task
def dict_task() -> dict[str, str]:
    return {"fruit": "banana"}


@fl.workflow
def dict_wf():
    fruit_dict = dict_task()
    print_message(message=fruit_dict["fruit"])
```

## Data class

Directly access an attribute of a dataclass.

```python
@dataclass
class Fruit:
    name: str


@fl.task
def dataclass_task() -> Fruit:
    return Fruit(name="banana")


@fl.workflow
def dataclass_wf():
    fruit_instance = dataclass_task()
    print_message(message=fruit_instance.name)
```

## Complex type

Combinations of list, dict and dataclass also work effectively.
```python
@fl.task
def advance_task() -> (dict[str, list[str]], list[dict[str, str]], dict[str, Fruit]):
    return {"fruits": ["banana"]}, [{"fruit": "banana"}], {"fruit": Fruit(name="banana")}


@fl.task
def print_list(fruits: list[str]):
    print(fruits)


@fl.task
def print_dict(fruit_dict: dict[str, str]):
    print(fruit_dict)


@fl.workflow
def advanced_workflow():
    dictionary_list, list_dict, dict_dataclass = advance_task()
    print_message(message=dictionary_list["fruits"][0])
    print_message(message=list_dict[0]["fruit"])
    print_message(message=dict_dataclass["fruit"].name)

    print_list(fruits=dictionary_list["fruits"])
    print_dict(fruit_dict=list_dict[0])
```

You can run all the workflows locally as follows:

```python
if __name__ == "__main__":
    list_wf()
    dict_wf()
    dataclass_wf()
    advanced_workflow()
```

## Failure scenario

The following workflow fails because it attempts to access indices and keys that are out of range:

```python
from flytekit import WorkflowFailurePolicy


@fl.task
def failed_task() -> (list[str], dict[str, str], Fruit):
    return ["apple", "banana"], {"fruit": "banana"}, Fruit(name="banana")


@fl.workflow(
    # The workflow remains unaffected if one of the nodes encounters an error, as long as other executable nodes are still available
    failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE
)
def failed_workflow():
    fruits_list, fruit_dict, fruit_instance = failed_task()
    print_message(message=fruits_list[100])  # Accessing an index that doesn't exist
    print_message(message=fruit_dict["fruits"])  # Accessing a non-existent key
    print_message(message=fruit_instance.fruit)  # Accessing a non-existent param
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/dataclass ===

# Dataclass

When you have multiple values that you want to send across Flyte entities, you can use a `dataclass`. Flytekit uses the [Mashumaro library](https://github.com/Fatal1ty/mashumaro) to serialize and deserialize dataclasses.

With the 1.14 release, `flytekit` adopted `MessagePack` as the serialization format for dataclasses, addressing a major limitation of previous versions that serialized data into a JSON string within a Protobuf `struct`. In earlier versions, Protobuf's `struct` converted integer types to floats, requiring users to write boilerplate code to work around this issue.

> [!NOTE]
> If you're using Flytekit version < v1.11.1, you will need to add `from dataclasses_json import dataclass_json` to your imports and decorate your dataclass with `@dataclass_json`.

> [!NOTE]
> Flytekit version < v1.14.0 will produce protobuf `struct` literal for dataclasses.
>
> Flytekit version >= v1.14.0 will produce msgpack bytes literal for dataclasses.
>
> If you're using Flytekit version >= v1.14.0 and you want to produce protobuf `struct` literal for dataclasses, you can
> set environment variable `FLYTE_USE_OLD_DC_FORMAT` to `true`.
>
> For more details, you can refer to the MSGPACK IDL RFC: https://github.com/flyteorg/flyte/blob/master/rfc/system/5741-binary-idl-with-message-pack.md

> [!NOTE]
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).
To begin, import the necessary dependencies:

```python
import os
import tempfile
from dataclasses import dataclass

import pandas as pd
import flytekit as fl
from flytekit.types.structured import StructuredDataset
```

Build your custom image with ImageSpec:

```python
image_spec = fl.ImageSpec(
    registry="ghcr.io/flyteorg",
    packages=["pandas", "pyarrow"],
)
```

## Python types

We define a `dataclass` with `int`, `str` and `dict` as the data types.

```python
@dataclass
class Datum:
    x: int
    y: str
    z: dict[int, str]
```

You can send a `dataclass` between different tasks written in various languages, and input it through the Flyte UI as raw JSON.

> [!NOTE]
> All variables in a data class should be **annotated with their type**. Failure to do so will result in an error.

Once declared, a dataclass can be returned as an output or accepted as an input.

```python
@fl.task(container_image=image_spec)
def stringify(s: int) -> Datum:
    """
    A dataclass return will be treated as a single complex JSON return.
    """
    return Datum(x=s, y=str(s), z={s: str(s)})


@fl.task(container_image=image_spec)
def add(x: Datum, y: Datum) -> Datum:
    x.z.update(y.z)
    return Datum(x=x.x + y.x, y=x.y + y.y, z=x.z)
```

## Flyte types

We also define a data class that accepts `StructuredDataset`, `FlyteFile` and `FlyteDirectory`.

```python
@dataclass
class FlyteTypes:
    dataframe: StructuredDataset
    file: fl.FlyteFile
    directory: fl.FlyteDirectory


@fl.task(container_image=image_spec)
def upload_data() -> FlyteTypes:
    df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})

    temp_dir = tempfile.mkdtemp(prefix="flyte-")
    df.to_parquet(temp_dir + "/df.parquet")

    file_path = tempfile.NamedTemporaryFile(delete=False)
    file_path.write(b"Hello, World!")
    file_path.close()

    fs = FlyteTypes(
        dataframe=StructuredDataset(dataframe=df),
        file=fl.FlyteFile(file_path.name),
        directory=fl.FlyteDirectory(temp_dir),
    )
    return fs


@fl.task(container_image=image_spec)
def download_data(res: FlyteTypes):
    assert pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]}).equals(res.dataframe.open(pd.DataFrame).all())
    f = open(res.file, "r")
    assert f.read() == "Hello, World!"
    assert os.listdir(res.directory) == ["df.parquet"]
```

A data class supports the usage of data associated with Python types, data classes, FlyteFile, FlyteDirectory and structured datasets. We define a workflow that calls the tasks created above.

```python
@fl.workflow
def dataclass_wf(x: int, y: int) -> (Datum, FlyteTypes):
    o1 = add(x=stringify(s=x), y=stringify(s=y))
    o2 = upload_data()
    download_data(res=o2)
    return o1, o2
```

To trigger the above task that accepts a dataclass as an input with `pyflyte run`, you can provide a JSON file as an input:

```shell
$ pyflyte run dataclass.py add --x dataclass_input.json --y dataclass_input.json
```

Here is another example of triggering the same task remotely with `pyflyte run`, again providing a JSON file as input:

```shell
$ pyflyte run \
  https://raw.githubusercontent.com/flyteorg/flytesnacks/69dbe4840031a85d79d9ded25f80397c6834752d/examples/data_types_and_io/data_types_and_io/dataclass.py \
  add --x dataclass_input.json --y dataclass_input.json
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/enum ===

# Enum type

At times, you might need to limit the acceptable values for inputs or outputs to a predefined set. This common requirement is usually met by using `Enum` types in programming languages.

You can create a Python `Enum` type and utilize it as an input or output for a task.
Flytekit will automatically convert it and constrain the inputs and outputs to the predefined set of values.

> [!NOTE]
> Currently, only string values are supported as valid `Enum` values.
> Flyte assumes the first value in the list is the default, and `Enum` types cannot be optional.
> Therefore, when defining `Enum`s, it's important to design them with the first value as a valid default.

We define an `Enum` and a simple coffee maker workflow that accepts an order and brews coffee ☕️ accordingly. The assumption is that the coffee maker only understands `Enum` inputs:

```python
# coffee_maker.py
from enum import Enum

import flytekit as fl


class Coffee(Enum):
    ESPRESSO = "espresso"
    AMERICANO = "americano"
    LATTE = "latte"
    CAPPUCCINO = "cappuccino"


@fl.task
def take_order(coffee: str) -> Coffee:
    return Coffee(coffee)


@fl.task
def prep_order(coffee_enum: Coffee) -> str:
    return f"Preparing {coffee_enum.value} ..."


@fl.workflow
def coffee_maker(coffee: str) -> str:
    coffee_enum = take_order(coffee=coffee)
    return prep_order(coffee_enum=coffee_enum)


# The workflow can also accept an enum value
@fl.workflow
def coffee_maker_enum(coffee_enum: Coffee) -> str:
    return prep_order(coffee_enum=coffee_enum)
```

You can specify a value for the parameter `coffee_enum` at run time:

```shell
$ pyflyte run coffee_maker.py coffee_maker_enum --coffee_enum="latte"
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/pickle ===

# Pickle type

Flyte enforces type safety by utilizing type information for compiling tasks and workflows, enabling various features such as static analysis and conditional branching.

However, we also strive to offer flexibility to end-users, so they don't have to invest heavily in understanding their data structures upfront before experiencing the value Flyte has to offer.

Flyte supports the `FlytePickle` transformer, which converts any unrecognized type hint into `FlytePickle`, enabling the serialization/deserialization of Python values to/from a pickle file.

> [!NOTE]
> Pickle can only be used to send objects between the exact same Python version.
> For optimal performance, it's advisable to either employ Python types that are supported by Flyte
> or register a custom transformer, as using pickle types can result in lower performance.

This example demonstrates how you can utilize custom objects without registering a transformer.

> [!NOTE]
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).

```python
import flytekit as fl
```

`Superhero` represents a user-defined complex type that can be serialized to a pickle file by Flytekit and transferred between tasks as both input and output data.

> [!NOTE]
> Alternatively, you can use a **Data input/output > Dataclass** for improved performance.
> We have used a simple object here for demonstration purposes.

```python
class Superhero:
    def __init__(self, name, power):
        self.name = name
        self.power = power


@fl.task
def welcome_superhero(name: str, power: str) -> Superhero:
    return Superhero(name, power)


@fl.task
def greet_superhero(superhero: Superhero) -> str:
    return f"👋 Hello {superhero.name}! Your superpower is {superhero.power}."

@fl.workflow
def superhero_wf(name: str = "Thor", power: str = "Flight") -> str:
    superhero = welcome_superhero(name=name, power=power)
    return greet_superhero(superhero=superhero)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/pydantic ===

# Pydantic BaseModel

`flytekit` version >= 1.14 natively supports the `JSON` format that Pydantic `BaseModel` produces, enhancing the interoperability of Pydantic BaseModels with the Flyte type system.

> [!WARNING]
> Pydantic BaseModel V2 only works when you are using flytekit version >= v1.14.0.

With the 1.14 release, `flytekit` adopted `MessagePack` as the serialization format for Pydantic `BaseModel`, overcoming a major limitation of the previous versions' serialization into a JSON string within a Protobuf `struct` datatype: to store `int` types, Protobuf's `struct` converts them to `float`, forcing users to write boilerplate code to work around this issue.

> [!WARNING]
> By default, `flytekit >= 1.14` will produce `msgpack` bytes literals when serializing, preserving the types defined in your `BaseModel` class.
> If you're serializing `BaseModel` using `flytekit` version >= v1.14.0 and you want to produce a Protobuf `struct` literal instead, you can set the environment variable `FLYTE_USE_OLD_DC_FORMAT` to `true`.
>
> For more details, you can refer to the MESSAGEPACK IDL RFC: [https://github.com/flyteorg/flyte/blob/master/rfc/system/5741-binary-idl-with-message-pack.md](https://github.com/flyteorg/flyte/blob/master/rfc/system/5741-binary-idl-with-message-pack.md)

> [!NOTE]
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).

> [!NOTE]
> You can put Dataclass and FlyteTypes (FlyteFile, FlyteDirectory, FlyteSchema, and StructuredDataset) in a pydantic BaseModel.

To begin, import the necessary dependencies:

```python
import os
import tempfile

import pandas as pd
import flytekit as fl
from flytekit.types.structured import StructuredDataset
from pydantic import BaseModel
```

Build your custom image with ImageSpec:

```python
image_spec = fl.ImageSpec(
    registry="ghcr.io/flyteorg",
    packages=["pandas", "pyarrow", "pydantic"],
)
```

## Python types

We define a Pydantic `BaseModel` with `int`, `str` and `dict` as the data types.

```python
class Datum(BaseModel):
    x: int
    y: str
    z: dict[int, str]
```

You can send a Pydantic `BaseModel` between different tasks written in various languages, and input it through the Flyte console as raw JSON.

> [!NOTE]
> All variables in a `BaseModel` should be **annotated with their type**. Failure
> to do so will result in an error.

Once declared, a `BaseModel` can be returned as an output or accepted as an input.

```python
@fl.task(container_image=image_spec)
def stringify(s: int) -> Datum:
    """
    A Pydantic model return will be treated as a single complex JSON return.
    """
    return Datum(x=s, y=str(s), z={s: str(s)})


@fl.task(container_image=image_spec)
def add(x: Datum, y: Datum) -> Datum:
    x.z.update(y.z)
    return Datum(x=x.x + y.x, y=x.y + y.y, z=x.z)
```

## Flyte types

We also define a `BaseModel` that accepts `StructuredDataset`, `FlyteFile` and `FlyteDirectory`.
```python
class FlyteTypes(BaseModel):
    dataframe: StructuredDataset
    file: fl.FlyteFile
    directory: fl.FlyteDirectory


@fl.task(container_image=image_spec)
def upload_data() -> FlyteTypes:
    df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})

    temp_dir = tempfile.mkdtemp(prefix="flyte-")
    df.to_parquet(os.path.join(temp_dir, "df.parquet"))

    file_path = tempfile.NamedTemporaryFile(delete=False)
    file_path.write(b"Hello, World!")
    file_path.close()

    fs = FlyteTypes(
        dataframe=StructuredDataset(dataframe=df),
        file=fl.FlyteFile(file_path.name),
        directory=fl.FlyteDirectory(temp_dir),
    )
    return fs


@fl.task(container_image=image_spec)
def download_data(res: FlyteTypes):
    expected_df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})
    actual_df = res.dataframe.open(pd.DataFrame).all()
    assert expected_df.equals(actual_df), "DataFrames do not match!"

    with open(res.file, "r") as f:
        assert f.read() == "Hello, World!", "File contents do not match!"

    assert os.listdir(res.directory) == ["df.parquet"], "Directory contents do not match!"
```

A `BaseModel` supports the usage of data associated with Python types, data classes, FlyteFile, FlyteDirectory and StructuredDataset. We define a workflow that calls the tasks created above.

```python
@fl.workflow
def basemodel_wf(x: int, y: int) -> tuple[Datum, FlyteTypes]:
    o1 = add(x=stringify(s=x), y=stringify(s=y))
    o2 = upload_data()
    download_data(res=o2)
    return o1, o2
```

To run the `basemodel_wf` workflow with `pyflyte run`:

```shell
$ pyflyte run pydantic_basemodel.py basemodel_wf --x 1 --y 2
```

You can also run it directly from the remote example file in the Flytesnacks repo:

```shell
$ pyflyte run \
  https://raw.githubusercontent.com/flyteorg/flytesnacks/b71e01d45037cea883883f33d8d93f258b9a5023/examples/data_types_and_io/data_types_and_io/pydantic_basemodel.py \
  basemodel_wf --x 1 --y 2
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/pytorch ===

# PyTorch type

Flyte advocates for the use of strongly-typed data to simplify the development of robust and testable pipelines. In addition to its application in data engineering, Flyte is primarily used for machine learning. To streamline the communication between Flyte tasks, particularly when dealing with tensors and models, we have introduced support for PyTorch types.

## Tensors and modules

At times, you may find the need to pass tensors and modules (models) within your workflow. Without native support for PyTorch tensors and modules, Flytekit relies on the **Data input/output > Pickle type** for serializing and deserializing these entities, as well as any unknown types. However, this approach isn't the most efficient. As a result, we've integrated PyTorch's serialization and deserialization support into the Flyte type system.

> [!NOTE]
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).
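The examples below rely on imports along these lines (a minimal sketch; the full example in the Flytesnacks repo declares its own imports at the top of the module):

```python
import torch
import flytekit as fl
```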
```python @fl.task def generate_tensor_2d() -> torch.Tensor: return torch.tensor([[1.0, -1.0, 2], [1.0, -1.0, 9], [0, 7.0, 3]]) @fl.task def reshape_tensor(tensor: torch.Tensor) -> torch.Tensor: # convert 2D to 3D tensor.unsqueeze_(-1) return tensor.expand(3, 3, 2) @fl.task def generate_module() -> torch.nn.Module: bn = torch.nn.BatchNorm1d(3, track_running_stats=True) return bn @fl.task def get_model_weight(model: torch.nn.Module) -> torch.Tensor: return model.weight class MyModel(torch.nn.Module): def __init__(self): super(MyModel, self).__init__() self.l0 = torch.nn.Linear(4, 2) self.l1 = torch.nn.Linear(2, 1) def forward(self, input): out0 = self.l0(input) out0_relu = torch.nn.functional.relu(out0) return self.l1(out0_relu) @fl.task def get_l1() -> torch.nn.Module: model = MyModel() return model.l1 @fl.workflow def pytorch_native_wf(): reshape_tensor(tensor=generate_tensor_2d()) get_model_weight(model=generate_module()) get_l1() ``` Passing around tensors and modules is no more a hassle! ## Checkpoint `PyTorchCheckpoint` is a specialized checkpoint used for serializing and deserializing PyTorch models. It checkpoints `torch.nn.Module`'s state, hyperparameters and optimizer state. This module checkpoint differs from the standard checkpoint as it specifically captures the module's `state_dict`. Therefore, when restoring the module, the module's `state_dict` must be used in conjunction with the actual module. According to the PyTorch [docs](https://pytorch.org/tutorials/beginner/saving_loading_models.html#save-load-entire-model), it's recommended to store the module's `state_dict` rather than the module itself, although the serialization should work in either case. ```python from dataclasses import dataclass import torch.nn as nn import torch.nn.functional as F import torch.optim as optim from dataclasses_json import dataclass_json from flytekit.extras.pytorch import PyTorchCheckpoint @dataclass_json @dataclass class Hyperparameters: epochs: int loss: float class Net(nn.Module): def __init__(self): super(Net, self).__init__() self.conv1 = nn.Conv2d(3, 6, 5) self.pool = nn.MaxPool2d(2, 2) self.conv2 = nn.Conv2d(6, 16, 5) self.fc1 = nn.Linear(16 * 5 * 5, 120) self.fc2 = nn.Linear(120, 84) self.fc3 = nn.Linear(84, 10) def forward(self, x): x = self.pool(F.relu(self.conv1(x))) x = self.pool(F.relu(self.conv2(x))) x = x.view(-1, 16 * 5 * 5) x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) return x @fl.task def generate_model(hyperparameters: Hyperparameters) -> PyTorchCheckpoint: bn = Net() optimizer = optim.SGD(bn.parameters(), lr=0.001, momentum=0.9) return PyTorchCheckpoint(module=bn, hyperparameters=hyperparameters, optimizer=optimizer) @fl.task def load(checkpoint: PyTorchCheckpoint): new_bn = Net() new_bn.load_state_dict(checkpoint["module_state_dict"]) optimizer = optim.SGD(new_bn.parameters(), lr=0.001, momentum=0.9) optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) @fl.workflow def pytorch_checkpoint_wf(): checkpoint = generate_model(hyperparameters=Hyperparameters(epochs=10, loss=0.1)) load(checkpoint=checkpoint) ``` > [!NOTE] > `PyTorchCheckpoint` supports serializing hyperparameters of types `dict`, `NamedTuple` and `dataclass`. ## Auto GPU to CPU and CPU to GPU conversion Not all PyTorch computations require a GPU. In some cases, it can be advantageous to transfer the computation to a CPU, especially after training the model on a GPU. To utilize the power of a GPU, the typical construct to use is: `to(torch.device("cuda"))`. 
When working with GPU variables on a CPU, variables need to be transferred to the CPU using the `to(torch.device("cpu"))` construct. However, this manual conversion recommended by PyTorch may not be very user-friendly. To address this, we added support for automatic GPU to CPU conversion (and vice versa) for PyTorch types.

```python
import flytekit as fl
from typing import Tuple


@fl.task(requests=fl.Resources(gpu="1"))
def train() -> Tuple[PyTorchCheckpoint, torch.Tensor, torch.Tensor, torch.Tensor]:
    ...
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = Model(X_train.shape[1])
    model.to(device)
    ...
    X_train, X_test = X_train.to(device), X_test.to(device)
    y_train, y_test = y_train.to(device), y_test.to(device)
    ...
    return PyTorchCheckpoint(module=model), X_train, X_test, y_test


@fl.task
def predict(
    checkpoint: PyTorchCheckpoint,
    X_train: torch.Tensor,
    X_test: torch.Tensor,
    y_test: torch.Tensor,
):
    new_bn = Model(X_train.shape[1])
    new_bn.load_state_dict(checkpoint["module_state_dict"])

    accuracy_list = np.zeros((5,))

    with torch.no_grad():
        y_pred = new_bn(X_test)
        correct = (torch.argmax(y_pred, dim=1) == y_test).type(torch.FloatTensor)
        accuracy_list = correct.mean()
```

The `predict` task will run on a CPU, and the device conversion from GPU to CPU will be automatically handled by Flytekit.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/structured-dataset ===

# StructuredDataset

As with most type systems, Python has primitives, container types like maps and tuples, and support for user-defined structures. However, while there's a rich variety of DataFrame classes (Pandas, Spark, Pandera, etc.), there's no native Python type that represents a DataFrame in the abstract. This is the gap that the `StructuredDataset` type is meant to fill. It offers the following benefits:

- Eliminate boilerplate code you would otherwise need to write to serialize/deserialize from file objects into DataFrame instances,
- Eliminate additional inputs/outputs that convey metadata around the format of the tabular data held in those files,
- Add flexibility around how DataFrame files are loaded,
- Offer a range of DataFrame specific functionality - enforce compatibility of different schemas (not only at compile time, but also runtime since type information is carried along in the literal), store third-party schema definitions, and potentially in the future, render sample data, provide summary statistics, etc.

## Usage

To use the `StructuredDataset` type, import `pandas` and define a task that returns a Pandas DataFrame. Flytekit will detect the Pandas DataFrame return signature and convert the interface for the task to the `StructuredDataset` type.

## Example

This example demonstrates how to work with a structured dataset using Flyte entities.

> [!NOTE]
> To use the `StructuredDataset` type, you only need to import `pandas`. The other imports specified below are only necessary for this specific example.

> [!NOTE]
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/).
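Before walking through the full example, here is a minimal sketch of the automatic conversion described under **Usage** above (the task and workflow names are illustrative):

```python
import pandas as pd
import flytekit as fl


@fl.task
def make_df() -> pd.DataFrame:
    # The pd.DataFrame return type is carried across the task boundary as a StructuredDataset.
    return pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})


@fl.task
def total_age(df: pd.DataFrame) -> int:
    # On input, the StructuredDataset literal is converted back into a pandas DataFrame.
    return int(df["Age"].sum())


@fl.workflow
def df_wf() -> int:
    return total_age(df=make_df())
```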
To begin, import the dependencies for the example:

```python
import typing
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import flytekit as fl
from flytekit.models import literals
from flytekit.models.literals import StructuredDatasetMetadata
from flytekit.models.types import StructuredDatasetType
from flytekit.types.structured.structured_dataset import (
    PARQUET,
    StructuredDataset,
    StructuredDatasetDecoder,
    StructuredDatasetEncoder,
    StructuredDatasetTransformerEngine,
)
from typing_extensions import Annotated

# Define the custom image referenced by the tasks below (assumed to follow the same
# ImageSpec pattern used on the other pages in this section).
image_spec = fl.ImageSpec(registry="ghcr.io/flyteorg", packages=["pandas", "pyarrow", "numpy"])
```

Define a task that returns a Pandas DataFrame.

```python
@fl.task(container_image=image_spec)
def generate_pandas_df(a: int) -> pd.DataFrame:
    return pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [a, 22], "Height": [160, 178]})
```

Using this simplest form, however, the user is not able to set the additional DataFrame information alluded to above:

- Column type information
- Serialized byte format
- Storage driver and location
- Additional third party schema information

This is by design, as we wanted the default case to suffice for the majority of use-cases, and to require as few changes to existing code as possible. Specifying these is simple, however, and relies on Python variable annotations, which are designed explicitly to supplement types with arbitrary metadata.

## Column type information

If you want to extract a subset of actual columns of the DataFrame and specify their types for type validation, you can just specify the column names and their types in the structured dataset type annotation.

First, initialize the column types you want to extract from the `StructuredDataset`.

```python
all_cols = fl.kwtypes(Name=str, Age=int, Height=int)
col = fl.kwtypes(Age=int)
```

Define a task that opens a structured dataset by calling `all()`. When you invoke `all()` with `pandas.DataFrame`, the Flyte engine downloads the parquet file on S3 and deserializes it to `pandas.DataFrame`. Keep in mind that you can invoke `open()` with any DataFrame type that's supported or added to structured dataset. For instance, you can use `pa.Table` to convert the Pandas DataFrame to a PyArrow table.

```python
@fl.task(container_image=image_spec)
def get_subset_pandas_df(df: Annotated[StructuredDataset, all_cols]) -> Annotated[StructuredDataset, col]:
    df = df.open(pd.DataFrame).all()
    df = pd.concat([df, pd.DataFrame([[30]], columns=["Age"])])
    return StructuredDataset(dataframe=df)


@fl.workflow
def simple_sd_wf(a: int = 19) -> Annotated[StructuredDataset, col]:
    pandas_df = generate_pandas_df(a=a)
    return get_subset_pandas_df(df=pandas_df)
```

The code may result in runtime failures if the columns do not match. The input `df` has `Name`, `Age` and `Height` columns, whereas the output structured dataset will only have the `Age` column.

## Serialized byte format

You can use a custom serialization format to serialize your DataFrames.
Here's how you can register the Pandas to CSV handler, which is already available, and enable CSV serialization by annotating the structured dataset with the CSV format:

```python
from flytekit.types.structured import register_csv_handlers
from flytekit.types.structured.structured_dataset import CSV

register_csv_handlers()


@fl.task(container_image=image_spec)
def pandas_to_csv(df: pd.DataFrame) -> Annotated[StructuredDataset, CSV]:
    return StructuredDataset(dataframe=df)


@fl.workflow
def pandas_to_csv_wf() -> Annotated[StructuredDataset, CSV]:
    pandas_df = generate_pandas_df(a=19)
    return pandas_to_csv(df=pandas_df)
```

## Storage driver and location

By default, the data will be written to the same place that all other pointer-types (FlyteFile, FlyteDirectory, etc.) are written to. This is controlled by the output data prefix option in Flyte, which is configurable on multiple levels.

That is to say, in the simple default case, Flytekit will:

- Look up the default format for, say, Pandas DataFrames,
- Look up the default storage location based on the raw output prefix setting,
- Use these two settings to select an encoder and invoke it.

So what's an encoder? To understand that, let's look into how the structured dataset plugin works.

## Inner workings of a structured dataset plugin

Two things need to happen with any DataFrame instance when interacting with Flyte:

- Serialization/deserialization from/to the Python instance to bytes (in the format specified above).
- Transmission/retrieval of those bits to/from somewhere.

Each structured dataset plugin (called encoder or decoder) needs to perform both of these steps. Flytekit decides which of the loaded plugins to invoke based on three attributes:

- The byte format
- The storage location
- The Python type in the task or workflow signature.

These three keys uniquely identify which encoder (used when converting a DataFrame in Python memory to a Flyte value, e.g. when a task finishes and returns a DataFrame) or decoder (used when hydrating a DataFrame in memory from a Flyte value, e.g. when a task starts and has a DataFrame input) to invoke.

However, it is awkward to require users to use `typing.Annotated` on every signature. Therefore, Flytekit has a default byte format for every registered Python DataFrame type.

## The `uri` argument

The `uri` argument allows you to load data from and write data to cloud storage. The `uri` consists of the bucket name and the filename, prefixed with `gs://` (or the scheme of your object store). If you specify a BigQuery `uri` for a structured dataset, BigQuery creates a table in the location specified by the `uri`. The `uri` in a structured dataset can read from or write to S3, GCS, BigQuery, or any other supported storage.

Before writing a DataFrame to a BigQuery table:

1. Create a [GCP account](https://cloud.google.com/docs/authentication/getting-started) and create a service account.
2. Create a project and add the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to your `.bashrc` file.
3. Create a dataset in your project.

Here's how you can define a task that converts a pandas DataFrame to a BigQuery table:

```python
@fl.task
def pandas_to_bq() -> StructuredDataset:
    df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})
    return StructuredDataset(dataframe=df, uri="gs://<BUCKET_NAME>/<FILE_NAME>")
```

Replace `BUCKET_NAME` with the name of your GCS bucket and `FILE_NAME` with the name of the file the DataFrame should be copied to.

Note that no format was specified in the structured dataset constructor, or in the signature. So how did the BigQuery encoder get invoked?
This is because the stock BigQuery encoder is loaded into Flytekit with an empty format. The Flytekit `StructuredDatasetTransformerEngine` interprets that to mean that it is a generic encoder (or decoder) and can work across formats, if a more specific format is not found.

And here's how you can define a task that converts the BigQuery table to a pandas DataFrame:

```python
@fl.task
def bq_to_pandas(sd: StructuredDataset) -> pd.DataFrame:
    return sd.open(pd.DataFrame).all()
```

> [!NOTE]
> Flyte creates a table inside the dataset in the project upon BigQuery query execution.

## How to return multiple DataFrames from a task?

For instance, how would a task return, say, two DataFrames where:

- The first DataFrame needs to be written to BigQuery and serialized by one of its libraries, and
- The second needs to be serialized to CSV and written to a specific location in GCS different from the generic pointer-data bucket?

If you want the default behavior (which is itself configurable based on which plugins are loaded), you can work just with your current raw DataFrame classes.

```python
@fl.task
def t1() -> typing.Tuple[StructuredDataset, StructuredDataset]:
    ...
    return StructuredDataset(df1, uri="bq://project:flyte.table"), \
           StructuredDataset(df2, uri="gs://auxiliary-bucket/data")
```

If you want to customize the Flyte interaction behavior, you'll need to wrap your DataFrame in a `StructuredDataset` wrapper object.

## How to define a custom structured dataset plugin?

`StructuredDataset` ships with an encoder and a decoder that handle the conversion of a Python value to a Flyte literal and vice versa, respectively. Here is a quick demo showcasing how one might build a NumPy encoder and decoder, enabling the use of a 2D NumPy array as a valid type within structured datasets.

### NumPy encoder

Extend `StructuredDatasetEncoder` and implement the `encode` function. The `encode` function converts a NumPy array to an intermediate format (Parquet file format in this case).

```python
class NumpyEncodingHandler(StructuredDatasetEncoder):
    def encode(
        self,
        ctx: fl.FlyteContext,
        structured_dataset: StructuredDataset,
        structured_dataset_type: StructuredDatasetType,
    ) -> literals.StructuredDataset:
        df = typing.cast(np.ndarray, structured_dataset.dataframe)
        name = ["col" + str(i) for i in range(len(df))]
        table = pa.Table.from_arrays(df, name)
        path = ctx.file_access.get_random_remote_directory()
        local_dir = ctx.file_access.get_random_local_directory()
        local_path = Path(local_dir) / f"{0:05}"
        pq.write_table(table, str(local_path))
        ctx.file_access.upload_directory(local_dir, path)
        return literals.StructuredDataset(
            uri=path,
            metadata=StructuredDatasetMetadata(structured_dataset_type=StructuredDatasetType(format=PARQUET)),
        )
```

### NumPy decoder

Extend `StructuredDatasetDecoder` and implement the `StructuredDatasetDecoder.decode` function. The `StructuredDatasetDecoder.decode` function converts the Parquet file to a `numpy.ndarray`.

```python
class NumpyDecodingHandler(StructuredDatasetDecoder):
    def decode(
        self,
        ctx: fl.FlyteContext,
        flyte_value: literals.StructuredDataset,
        current_task_metadata: StructuredDatasetMetadata,
    ) -> np.ndarray:
        local_dir = ctx.file_access.get_random_local_directory()
        ctx.file_access.get_data(flyte_value.uri, local_dir, is_multipart=True)
        table = pq.read_table(local_dir)
        return table.to_pandas().to_numpy()
```

### NumPy renderer

Create a default renderer for the NumPy array; Flytekit will use this renderer to display the schema of the NumPy array on a Flyte Deck.
```python
class NumpyRenderer:
    def to_html(self, df: np.ndarray) -> str:
        assert isinstance(df, np.ndarray)
        name = ["col" + str(i) for i in range(len(df))]
        table = pa.Table.from_arrays(df, name)
        return pd.DataFrame(table.schema).to_html(index=False)
```

In the end, register the encoder, decoder and renderer with the `StructuredDatasetTransformerEngine`. Specify the Python type you want to register this encoder with (`np.ndarray`), the storage engine to register this against (if not specified, it is assumed to work for all the storage backends), and the byte format, which in this case is `PARQUET`.

```python
StructuredDatasetTransformerEngine.register(NumpyEncodingHandler(np.ndarray, None, PARQUET))
StructuredDatasetTransformerEngine.register(NumpyDecodingHandler(np.ndarray, None, PARQUET))
StructuredDatasetTransformerEngine.register_renderer(np.ndarray, NumpyRenderer())
```

You can now use `numpy.ndarray` to deserialize the Parquet file to NumPy and serialize a task's output (NumPy array) to a Parquet file.

```python
@fl.task(container_image=image_spec)
def generate_pd_df_with_str() -> pd.DataFrame:
    return pd.DataFrame({"Name": ["Tom", "Joseph"]})


@fl.task(container_image=image_spec)
def to_numpy(sd: StructuredDataset) -> Annotated[StructuredDataset, None, PARQUET]:
    numpy_array = sd.open(np.ndarray).all()
    return StructuredDataset(dataframe=numpy_array)


@fl.workflow
def numpy_wf() -> Annotated[StructuredDataset, None, PARQUET]:
    return to_numpy(sd=generate_pd_df_with_str())
```

> [!NOTE]
> `pyarrow` raises an `Expected bytes, got a 'int' object` error when the DataFrame contains integers.

You can run the code locally as follows:

```python
if __name__ == "__main__":
    sd = simple_sd_wf()
    print(f"A simple Pandas DataFrame workflow: {sd.open(pd.DataFrame).all()}")
    print(f"Using CSV as the serializer: {pandas_to_csv_wf().open(pd.DataFrame).all()}")
    print(f"NumPy encoder and decoder: {numpy_wf().open(np.ndarray).all()}")
```

### Nested typed columns

Like most storage formats (e.g. Avro, Parquet, and BigQuery), `StructuredDataset` supports nested field structures.

> [!NOTE]
> Nested field `StructuredDataset` requires a flytekit version later than 1.11.0.
```python
data = [
    {
        "company": "XYZ pvt ltd",
        "location": "London",
        "info": {"president": "Rakesh Kapoor", "contacts": {"email": "contact@xyz.com", "tel": "9876543210"}},
    },
    {
        "company": "ABC pvt ltd",
        "location": "USA",
        "info": {"president": "Kapoor Rakesh", "contacts": {"email": "contact@abc.com", "tel": "0123456789"}},
    },
]


@dataclass
class ContactsField:
    email: str
    tel: str


@dataclass
class InfoField:
    president: str
    contacts: ContactsField


@dataclass
class CompanyField:
    location: str
    info: InfoField
    company: str


MyArgDataset = Annotated[StructuredDataset, fl.kwtypes(company=str)]
MyTopDataClassDataset = Annotated[StructuredDataset, CompanyField]
MyTopDictDataset = Annotated[StructuredDataset, {"company": str, "location": str}]
MyDictDataset = Annotated[StructuredDataset, fl.kwtypes(info={"contacts": {"tel": str}})]
MyDictListDataset = Annotated[StructuredDataset, fl.kwtypes(info={"contacts": {"tel": str, "email": str}})]
MySecondDataClassDataset = Annotated[StructuredDataset, fl.kwtypes(info=InfoField)]
MyNestedDataClassDataset = Annotated[StructuredDataset, fl.kwtypes(info=fl.kwtypes(contacts=ContactsField))]

image = fl.ImageSpec(packages=["pandas", "pyarrow", "tabulate"], registry="ghcr.io/flyteorg")


@fl.task(container_image=image)
def create_parquet_file() -> StructuredDataset:
    from tabulate import tabulate

    df = pd.json_normalize(data, max_level=0)
    print("original DataFrame: \n", tabulate(df, headers="keys", tablefmt="psql"))
    return StructuredDataset(dataframe=df)


@fl.task(container_image=image)
def print_table_by_arg(sd: MyArgDataset) -> pd.DataFrame:
    from tabulate import tabulate

    t = sd.open(pd.DataFrame).all()
    print("MyArgDataset DataFrame: \n", tabulate(t, headers="keys", tablefmt="psql"))
    return t


@fl.task(container_image=image)
def print_table_by_dict(sd: MyDictDataset) -> pd.DataFrame:
    from tabulate import tabulate

    t = sd.open(pd.DataFrame).all()
    print("MyDictDataset DataFrame: \n", tabulate(t, headers="keys", tablefmt="psql"))
    return t


@fl.task(container_image=image)
def print_table_by_list_dict(sd: MyDictListDataset) -> pd.DataFrame:
    from tabulate import tabulate

    t = sd.open(pd.DataFrame).all()
    print("MyDictListDataset DataFrame: \n", tabulate(t, headers="keys", tablefmt="psql"))
    return t


@fl.task(container_image=image)
def print_table_by_top_dataclass(sd: MyTopDataClassDataset) -> pd.DataFrame:
    from tabulate import tabulate

    t = sd.open(pd.DataFrame).all()
    print("MyTopDataClassDataset DataFrame: \n", tabulate(t, headers="keys", tablefmt="psql"))
    return t


@fl.task(container_image=image)
def print_table_by_top_dict(sd: MyTopDictDataset) -> pd.DataFrame:
    from tabulate import tabulate

    t = sd.open(pd.DataFrame).all()
    print("MyTopDictDataset DataFrame: \n", tabulate(t, headers="keys", tablefmt="psql"))
    return t


@fl.task(container_image=image)
def print_table_by_second_dataclass(sd: MySecondDataClassDataset) -> pd.DataFrame:
    from tabulate import tabulate

    t = sd.open(pd.DataFrame).all()
    print("MySecondDataClassDataset DataFrame: \n", tabulate(t, headers="keys", tablefmt="psql"))
    return t


@fl.task(container_image=image)
def print_table_by_nested_dataclass(sd: MyNestedDataClassDataset) -> pd.DataFrame:
    from tabulate import tabulate

    t = sd.open(pd.DataFrame).all()
    print("MyNestedDataClassDataset DataFrame: \n", tabulate(t, headers="keys", tablefmt="psql"))
    return t


@fl.workflow
def contacts_wf():
    sd = create_parquet_file()
    print_table_by_arg(sd=sd)
    print_table_by_dict(sd=sd)
    print_table_by_list_dict(sd=sd)
    print_table_by_top_dataclass(sd=sd)
    print_table_by_top_dict(sd=sd)
    print_table_by_second_dataclass(sd=sd)
    print_table_by_nested_dataclass(sd=sd)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/data-input-output/tensorflow ===

# TensorFlow types

This document outlines the TensorFlow types available in Flyte, which facilitate the integration of TensorFlow models and datasets in Flyte workflows.

### Import necessary libraries and modules

```python
import flytekit as fl
from flytekit.types.directory import TFRecordsDirectory
from flytekit.types.file import TFRecordFile

custom_image = fl.ImageSpec(
    packages=["tensorflow", "tensorflow-datasets", "flytekitplugins-kftensorflow"],
    registry="ghcr.io/flyteorg",
)

import tensorflow as tf
```

## TensorFlow model

Flyte supports the TensorFlow SavedModel format for serializing and deserializing `tf.keras.Model` instances. The `TensorFlowModelTransformer` is responsible for handling these transformations.

### Transformer

- **Name:** TensorFlow Model
- **Class:** `TensorFlowModelTransformer`
- **Python Type:** `tf.keras.Model`
- **Blob Format:** `TensorFlowModel`
- **Dimensionality:** `MULTIPART`

### Usage

The `TensorFlowModelTransformer` allows you to save a TensorFlow model to a remote location and retrieve it later in your Flyte workflows.

> [!NOTE]
> To clone and run the example code on this page, see the [Flytesnacks repo](https://github.com/flyteorg/flytesnacks/tree/master/examples/data_types_and_io/data_types_and_io/tensorflow_type.py).

```python
@fl.task(container_image=custom_image)
def train_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(128, activation="relu"), tf.keras.layers.Dense(10, activation="softmax")]
    )
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model


@fl.task(container_image=custom_image)
def evaluate_model(model: tf.keras.Model, x: tf.Tensor, y: tf.Tensor) -> float:
    loss, accuracy = model.evaluate(x, y)
    return accuracy


@fl.workflow
def training_workflow(x: tf.Tensor, y: tf.Tensor) -> float:
    model = train_model()
    return evaluate_model(model=model, x=x, y=y)
```

## TFRecord files

Flyte supports TFRecord files through the `TFRecordFile` type, which can handle serialized TensorFlow records. The `TensorFlowRecordFileTransformer` manages the conversion of TFRecord files to and from Flyte literals.

### Transformer

- **Name:** TensorFlow Record File
- **Class:** `TensorFlowRecordFileTransformer`
- **Blob Format:** `TensorFlowRecord`
- **Dimensionality:** `SINGLE`

### Usage

The `TensorFlowRecordFileTransformer` enables you to work with single TFRecord files, making it easy to read and write data in TensorFlow's TFRecord format.

```python
@fl.task(container_image=custom_image)
def process_tfrecord(file: TFRecordFile) -> int:
    count = 0
    for record in tf.data.TFRecordDataset(file):
        count += 1
    return count


@fl.workflow
def tfrecord_workflow(file: TFRecordFile) -> int:
    return process_tfrecord(file=file)
```

## TFRecord directories

Flyte supports directories containing multiple TFRecord files through the `TFRecordsDirectory` type. The `TensorFlowRecordsDirTransformer` manages the conversion of TFRecord directories to and from Flyte literals.
### Transformer

- **Name:** TensorFlow Record Directory
- **Class:** `TensorFlowRecordsDirTransformer`
- **Python Type:** `TFRecordsDirectory`
- **Blob Format:** `TensorFlowRecord`
- **Dimensionality:** `MULTIPART`

### Usage

The `TensorFlowRecordsDirTransformer` allows you to work with directories of TFRecord files, which is useful for handling large datasets that are split across multiple files.

#### Example

```python
@fl.task(container_image=custom_image)
def process_tfrecords_dir(dir: TFRecordsDirectory) -> int:
    count = 0
    for record in tf.data.TFRecordDataset(dir.path):
        count += 1
    return count


@fl.workflow
def tfrecords_dir_workflow(dir: TFRecordsDirectory) -> int:
    return process_tfrecords_dir(dir=dir)
```

## Configuration class: `TFRecordDatasetConfig`

The `TFRecordDatasetConfig` class is a data structure used to configure the parameters for creating a `tf.data.TFRecordDataset`, which allows for efficient reading of TFRecord files. This class uses the `DataClassJsonMixin` for easy JSON serialization.

### Attributes

- **compression_type**: (Optional) Specifies the compression method used for the TFRecord files. Possible values include an empty string (no compression), "ZLIB", or "GZIP".
- **buffer_size**: (Optional) Defines the size of the read buffer in bytes. If not set, defaults will be used based on the local or remote file system.
- **num_parallel_reads**: (Optional) Determines the number of files to read in parallel. A value greater than one outputs records in an interleaved order.
- **name**: (Optional) Assigns a name to the operation for easier identification in the pipeline.

This configuration is crucial for optimizing the reading process of TFRecord datasets, especially when dealing with large datasets or when specific performance tuning is required.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming ===

# Programming

This section covers the general programming of Flyte.

## Subpages

- **Programming > Chaining Entities**
- **Programming > Conditionals**
- **Programming > Decorating tasks**
- **Programming > Decorating workflows**
- **Programming > Intratask checkpoints**
- **Programming > Waiting for external inputs**
- **Programming > Nested parallelism**
- **Programming > Failure node**

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/chaining-entities ===

# Chaining Entities

Flyte offers a mechanism for chaining entities using the `>>` operator. This is particularly valuable when chaining tasks and subworkflows without the need for data flow between the entities.

## Tasks

Let's establish a sequence where `t1()` occurs after `t0()`, and `t2()` follows `t1()`.

```python
import flytekit as fl


@fl.task
def t2():
    print("Running t2")
    return


@fl.task
def t1():
    print("Running t1")
    return


@fl.task
def t0():
    print("Running t0")
    return


# Chaining tasks
@fl.workflow
def chain_tasks_wf():
    t2_promise = t2()
    t1_promise = t1()
    t0_promise = t0()

    t0_promise >> t1_promise
    t1_promise >> t2_promise
```

## Subworkflows

Just like tasks, you can chain subworkflows.

```python
@fl.workflow
def sub_workflow_1():
    t1()


@fl.workflow
def sub_workflow_0():
    t0()


@fl.workflow
def chain_workflows_wf():
    sub_wf1 = sub_workflow_1()
    sub_wf0 = sub_workflow_0()

    sub_wf0 >> sub_wf1
```

> [!NOTE]
> Chaining tasks and subworkflows is not supported in local Python environments.
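Tasks and subworkflows can also be mixed in a single chain. The following is a minimal sketch (reusing the `t0`, `t1`, and `sub_workflow_1` entities defined above; it is not part of the original example) in which a subworkflow runs first, followed by two tasks:

```python
@fl.workflow
def chain_mixed_entities_wf():
    # Run the subworkflow first, then t0, then t1.
    sub_wf1 = sub_workflow_1()
    t0_promise = t0()
    t1_promise = t1()

    sub_wf1 >> t0_promise
    t0_promise >> t1_promise
```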
=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/conditionals === # Conditionals Flytekit elevates conditions to a first-class construct named `conditional`, providing a powerful mechanism for selectively executing branches in a workflow. Conditions leverage static or dynamic data generated by tasks or received as workflow inputs. While conditions are highly performant in their evaluation, it's important to note that they are restricted to specific binary and logical operators and are applicable only to primitive values. To begin, import the necessary libraries. ```python import random import flytekit as fl from flytekit import conditional from flytekit.core.task import Echo ``` ## Simple branch In this example, we introduce two tasks, `calculate_circle_circumference` and `calculate_circle_area`. The workflow dynamically chooses between these tasks based on whether the input falls within the fraction range (0-1) or not. ```python @fl.task def calculate_circle_circumference(radius: float) -> float: return 2 * 3.14 * radius # Task to calculate the circumference of a circle @fl.task def calculate_circle_area(radius: float) -> float: return 3.14 * radius * radius # Task to calculate the area of a circle @fl.workflow def shape_properties(radius: float) -> float: return ( conditional("shape_properties") .if_((radius >= 0.1) & (radius < 1.0)) .then(calculate_circle_circumference(radius=radius)) .else_() .then(calculate_circle_area(radius=radius)) ) if __name__ == "__main__": radius_small = 0.5 print(f"Circumference of circle (radius={radius_small}): {shape_properties(radius=radius_small)}") radius_large = 3.0 print(f"Area of circle (radius={radius_large}): {shape_properties(radius=radius_large)}") ``` ## Multiple branches We establish an `if` condition with multiple branches, which will result in a failure if none of the conditions is met. It's important to note that any `conditional` statement in Flyte is expected to be complete, meaning that all possible branches must be accounted for. ```python @fl.workflow def shape_properties_with_multiple_branches(radius: float) -> float: return ( conditional("shape_properties_with_multiple_branches") .if_((radius >= 0.1) & (radius < 1.0)) .then(calculate_circle_circumference(radius=radius)) .elif_((radius >= 1.0) & (radius <= 10.0)) .then(calculate_circle_area(radius=radius)) .else_() .fail("The input must be within the range of 0 to 10.") ) ``` > [!NOTE] > Take note of the usage of bitwise operators (`&`). Due to Python's PEP-335, > the logical `and`, `or` and `not` operators cannot be overloaded. > Flytekit employs bitwise `&` and `|` as equivalents for logical `and` and `or` operators, > a convention also observed in other libraries. ## Consuming the output of a conditional Here, we write a task that consumes the output returned by a `conditional`. 
```python
@fl.workflow
def shape_properties_accept_conditional_output(radius: float) -> float:
    result = (
        conditional("shape_properties_accept_conditional_output")
        .if_((radius >= 0.1) & (radius < 1.0))
        .then(calculate_circle_circumference(radius=radius))
        .elif_((radius >= 1.0) & (radius <= 10.0))
        .then(calculate_circle_area(radius=radius))
        .else_()
        .fail("The input must exist between 0 and 10.")
    )
    return calculate_circle_area(radius=result)


if __name__ == "__main__":
    radius_small = 0.5
    print(
        f"Circumference of circle (radius={radius_small}) x Area of circle (radius={calculate_circle_circumference(radius=radius_small)}): {shape_properties_accept_conditional_output(radius=radius_small)}"
    )
```

## Using the output of a previous task in a conditional

You can check if a boolean returned from the previous task is `True`, but unary operations are not supported directly. Instead, use the `is_true`, `is_false` and `is_none` methods on the result.

```python
@fl.task
def coin_toss(seed: int) -> bool:
    """
    Mimic a condition to verify the successful execution of an operation
    """
    r = random.Random(seed)
    if r.random() < 0.5:
        return True
    return False


@fl.task
def failed() -> int:
    """
    Mimic a task that handles failure
    """
    return -1


@fl.task
def success() -> int:
    """
    Mimic a task that handles success
    """
    return 0


@fl.workflow
def boolean_wf(seed: int = 5) -> int:
    result = coin_toss(seed=seed)
    return conditional("coin_toss").if_(result.is_true()).then(success()).else_().then(failed())
```

> [!NOTE]
> *How do output values acquire these methods?* In a workflow, direct access to outputs is not permitted.
> Inputs and outputs are automatically encapsulated in a special object known as `flytekit.extend.Promise`.

## Using boolean workflow inputs in a conditional

You can directly pass a boolean to a workflow.

```python
@fl.workflow
def boolean_input_wf(boolean_input: bool) -> int:
    return conditional("boolean_input_conditional").if_(boolean_input.is_true()).then(success()).else_().then(failed())
```

> [!NOTE]
> Observe that the passed boolean possesses a method called `is_true`.
> This boolean resides within the workflow context and is encapsulated in a specialized Flytekit object.
> This special object enables it to exhibit additional behavior.

You can run the workflows locally as follows:

```python
if __name__ == "__main__":
    print("Running boolean_wf a few times...")
    for index in range(0, 5):
        print(f"The output generated by boolean_wf = {boolean_wf(seed=index)}")
        print(
            f"Boolean input: {True if index < 2 else False}; workflow output: {boolean_input_wf(boolean_input=True if index < 2 else False)}"
        )
```

## Nested conditionals

You can nest conditional sections arbitrarily inside other conditional sections. However, these nested sections can only be in the `then` part of a `conditional` block.
```python
@fl.workflow
def nested_conditions(radius: float) -> float:
    return (
        conditional("nested_conditions")
        .if_((radius >= 0.1) & (radius < 1.0))
        .then(
            conditional("inner_nested_conditions")
            .if_(radius < 0.5)
            .then(calculate_circle_circumference(radius=radius))
            .elif_((radius >= 0.5) & (radius < 0.9))
            .then(calculate_circle_area(radius=radius))
            .else_()
            .fail("0.9 is an outlier.")
        )
        .elif_((radius >= 1.0) & (radius <= 10.0))
        .then(calculate_circle_area(radius=radius))
        .else_()
        .fail("The input must be within the range of 0 to 10.")
    )


if __name__ == "__main__":
    print(f"nested_conditions(0.4): {nested_conditions(radius=0.4)}")
```

## Using the output of a task in a conditional

Let's write a fun workflow that triggers the `calculate_circle_circumference` task in the event of a "heads" outcome, and alternatively, runs the `calculate_circle_area` task in the event of a "tails" outcome.

```python
@fl.workflow
def consume_task_output(radius: float, seed: int = 5) -> float:
    is_heads = coin_toss(seed=seed)
    return (
        conditional("double_or_square")
        .if_(is_heads.is_true())
        .then(calculate_circle_circumference(radius=radius))
        .else_()
        .then(calculate_circle_area(radius=radius))
    )
```

You can run the workflow locally as follows:

```python
if __name__ == "__main__":
    default_seed_output = consume_task_output(radius=0.4)
    print(
        f"Executing consume_task_output(0.4) with default seed=5. Expected output: calculate_circle_area => {default_seed_output}"
    )

    custom_seed_output = consume_task_output(radius=0.4, seed=7)
    print(
        f"Executing consume_task_output(0.4, seed=7). Expected output: calculate_circle_circumference => {custom_seed_output}"
    )
```

## Running a noop task in a conditional

In some cases, you may want to skip the execution of a conditional workflow if a certain condition is not met. You can achieve this by using the `echo` task, which simply returns the input value.

> [!NOTE]
> To enable the echo plugin in the backend, add the plugin to Flyte's configuration file.
> ```yaml > task-plugins: > enabled-plugins: > - echo > ``` ```python echo = Echo(name="echo", inputs={"radius": float}) @fl.workflow def noop_in_conditional(radius: float, seed: int = 5) -> float: is_heads = coin_toss(seed=seed) return ( conditional("noop_in_conditional") .if_(is_heads.is_true()) .then(calculate_circle_circumference(radius=radius)) .else_() .then(echo(radius=radius)) ) ``` ## Run the example on the Flyte cluster To run the provided workflows on the Flyte cluster, use the following commands: ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ shape_properties --radius 3.0 ``` ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ shape_properties_with_multiple_branches --radius 11.0 ``` ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ shape_properties_accept_conditional_output --radius 0.5 ``` ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ boolean_wf ``` ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ boolean_input_wf --boolean_input ``` ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ nested_conditions --radius 0.7 ``` ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ consume_task_output --radius 0.4 --seed 7 ``` ```shell $ pyflyte run --remote \ https://raw.githubusercontent.com/flyteorg/flytesnacks/656e63d1c8dded3e9e7161c7af6425e9fcd43f56/examples/advanced_composition/advanced_composition/conditional.py \ noop_in_conditional --radius 0.4 --seed 5 ``` === PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/decorating_tasks === # Decorating tasks You can easily change how tasks behave by using decorators to wrap your task functions. In order to make sure that your decorated function contains all the type annotation and docstring information that Flyte needs, you will need to use the built-in `functools.wraps` decorator. To begin, create a file called `decorating_tasks.py`. Add the imports: ```python import logging import flytekit as fl from functools import partial, wraps ``` Create a logger to monitor the execution's progress. ```python logger = logging.getLogger(__file__) ``` ## Using a single decorator We define a decorator that logs the input and output details for a decorated task. ```python def log_io(fn): @wraps(fn) def wrapper(*args, **kwargs): logger.info(f"task {fn.__name__} called with args: {args}, kwargs: {kwargs}") out = fn(*args, **kwargs) logger.info(f"task {fn.__name__} output: {out}") return out return wrapper ``` We create a task named `t1` that is decorated with `log_io`. 
> [!NOTE]
> The order of invoking the decorators is important. `@fl.task` should always be the outer-most decorator.

```python
@fl.task
@log_io
def t1(x: int) -> int:
    return x + 1
```

## Stacking multiple decorators

You can also stack multiple decorators on top of each other as long as `@fl.task` is the outer-most decorator.

We define a decorator that verifies if the output from the decorated function is a positive number before it's returned. If this assumption is violated, it raises a `ValueError` exception.

```python
def validate_output(fn=None, *, floor=0):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if out <= floor:
            raise ValueError(f"output of task {fn.__name__} must be a positive number, found {out}")
        return out

    if fn is None:
        return partial(validate_output, floor=floor)

    return wrapper
```

> [!NOTE]
> The `validate_output` decorator above uses `functools.partial` to implement parameterized decorators.

We define a function that uses both the logging and validator decorators.

```python
@fl.task
@log_io
@validate_output(floor=10)
def t2(x: int) -> int:
    return x + 10
```

Finally, we compose a workflow that calls `t1` and `t2`.

```python
@fl.workflow
def decorating_task_wf(x: int) -> int:
    return t2(x=t1(x=x))
```

## Run the example on Flyte

To run the workflow, execute the following command:

```bash
pyflyte run --remote \
  https://raw.githubusercontent.com/flyteorg/flytesnacks/69dbe4840031a85d79d9ded25f80397c6834752d/examples/advanced_composition/advanced_composition/decorating_tasks.py \
  decorating_task_wf --x 10
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/decorating_workflows ===

# Decorating workflows

The behavior of workflows can be modified in a lightweight fashion by using the built-in `functools.wraps` decorator pattern, similar to using decorators to customize task behavior (see **Programming > Decorating tasks**). However, unlike in the case of tasks, we need to do a little extra work to make sure that the DAG underlying the workflow executes tasks in the correct order.

## Setup-teardown pattern

The main use case of decorating `@fl.workflow`-decorated functions is to establish a setup-teardown pattern that executes tasks before and after your main workflow logic. This is useful when integrating with other external services like [wandb](https://wandb.ai/site) or [clearml](https://clear.ml/), which enable you to track metrics of model training runs.

To begin, create a file called `decorating_workflows.py`. Import the necessary libraries:

```python
from functools import partial, wraps
from unittest.mock import MagicMock

import flytekit as fl
from flytekit import FlyteContextManager
from flytekit.core.node_creation import create_node
```

Let's define the tasks we need for setup and teardown. In this example, we use the `unittest.mock.MagicMock` class to create a fake external service that we want to initialize at the beginning of our workflow and finish at the end.

```python
external_service = MagicMock()


@fl.task
def setup():
    print("initializing external service")
    external_service.initialize(id=fl.current_context().execution_id)


@fl.task
def teardown():
    print("finish external service")
    external_service.complete(id=fl.current_context().execution_id)
```

As you can see, you can even use Flytekit's current context to access the `execution_id` of the current workflow if you need to link Flyte with the external service so that you reference the same unique identifier in both the external service and Flyte.
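As a concrete illustration of that idea, a setup task for an experiment tracker such as wandb might use the execution ID as the run name, so the tracker run and the Flyte execution can be cross-referenced. This is a hypothetical sketch, not part of the example above; it assumes the `wandb` package is installed and credentials are already configured:

```python
import flytekit as fl
import wandb  # assumed to be installed and configured (e.g. via WANDB_API_KEY)


@fl.task
def setup_wandb():
    # Name the tracker run after the Flyte execution ID so both systems
    # reference the same unique identifier. "my-project" is a placeholder.
    execution_id = fl.current_context().execution_id
    wandb.init(project="my-project", name=execution_id.name)


@fl.task
def teardown_wandb():
    # Mark the tracker run as finished at the end of the workflow.
    wandb.finish()
```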
## Workflow decorator

We create a decorator that we want to use to wrap our workflow function.

```python
def setup_teardown(fn=None, *, before, after):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        # get the current flyte context to obtain access to the compilation state of the workflow DAG.
        ctx = FlyteContextManager.current_context()

        # defines before node
        before_node = create_node(before)
        # ctx.compilation_state.nodes == [before_node]

        # under the hood, flytekit compiler defines and threads
        # together nodes within the `my_workflow` function body
        outputs = fn(*args, **kwargs)
        # ctx.compilation_state.nodes == [before_node, *nodes_created_by_fn]

        # defines the after node
        after_node = create_node(after)
        # ctx.compilation_state.nodes == [before_node, *nodes_created_by_fn, after_node]

        # compile the workflow correctly by making sure `before_node`
        # runs before the first workflow node and `after_node`
        # runs after the last workflow node.
        if ctx.compilation_state is not None:
            # ctx.compilation_state.nodes is a list of nodes defined in the
            # order of execution above
            workflow_node0 = ctx.compilation_state.nodes[1]
            workflow_node1 = ctx.compilation_state.nodes[-2]
            before_node >> workflow_node0
            workflow_node1 >> after_node
        return outputs

    if fn is None:
        return partial(setup_teardown, before=before, after=after)

    return wrapper
```

There are a few key pieces to note in the `setup_teardown` decorator above:

1. It takes a `before` and `after` argument, both of which need to be `@fl.task`-decorated functions. These tasks will run before and after the main workflow function body.
2. It uses the [create_node](https://github.com/flyteorg/flytekit/blob/9e156bb0cf3d1441c7d1727729e8f9b4bbc3f168/flytekit/core/node_creation.py#L18) function to create nodes associated with the `before` and `after` tasks.
3. When `fn` is called, under the hood the system creates all the nodes associated with the workflow function body.
4. The code within the `if ctx.compilation_state is not None:` conditional is executed at compile time, which is where we extract the first and last nodes associated with the workflow function body at indices `1` and `-2`.
5. The `>>` right shift operator ensures that `before_node` executes before the first node and `after_node` executes after the last node of the main workflow function body.

## Defining the DAG

We define two tasks that will constitute the workflow.

```python
@fl.task
def t1(x: float) -> float:
    return x - 1


@fl.task
def t2(x: float) -> float:
    return x**2
```

And then create our decorated workflow:

```python
@fl.workflow
@setup_teardown(before=setup, after=teardown)
def decorating_workflow(x: float) -> float:
    return t2(x=t1(x=x))
```

## Run the example on the Flyte cluster

To run the provided workflow on the Flyte cluster, use the following command:

```bash
pyflyte run --remote \
  https://raw.githubusercontent.com/flyteorg/flytesnacks/69dbe4840031a85d79d9ded25f80397c6834752d/examples/advanced_composition/advanced_composition/decorating_workflows.py \
  decorating_workflow --x 10.0
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/intratask_checkpoints ===

# Intratask checkpoints

A checkpoint in Flyte serves to recover a task from a previous failure by preserving the task's state before the failure and resuming from the latest recorded state.

## Why intratask checkpoints?

The inherent design of Flyte, being a workflow engine, allows users to break down operations, programs or ideas into smaller tasks within workflows.
In the event of a task failure, the workflow doesn't need to rerun the previously completed tasks. Instead, it can retry the specific task that encountered an issue. Once the problematic task succeeds, it won't be rerun. Consequently, the natural boundaries between tasks act as implicit checkpoints. However, there are scenarios where breaking a task into smaller tasks is either challenging or undesirable due to the associated overhead. This is especially true when running a substantial computation in a tight loop. In such cases, users may consider splitting each loop iteration into individual tasks using dynamic workflows. Yet, the overhead of spawning new tasks, recording intermediate results, and reconstructing the state can incur additional expenses. ### Use case: Model training An exemplary scenario illustrating the utility of intra-task checkpointing is during model training. In situations where executing multiple epochs or iterations with the same dataset might be time-consuming, setting task boundaries can incur a high bootstrap time and be costly. Flyte addresses this challenge by providing a mechanism to checkpoint progress within a task execution, saving it as a file or set of files. In the event of a failure, the checkpoint file can be re-read to resume most of the state without rerunning the entire task. This feature opens up possibilities to leverage alternate, more cost-effective compute systems, such as [AWS spot instances](https://aws.amazon.com/ec2/spot/), [GCP pre-emptible instances](https://cloud.google.com/compute/docs/instances/preemptible) and others. These instances offer great performance at significantly lower price points compared to their on-demand or reserved counterparts. This becomes feasible when tasks are constructed in a fault-tolerant manner. For tasks running within a short duration, e.g., less than 10 minutes, the likelihood of failure is negligible, and task-boundary-based recovery provides substantial fault tolerance for successful completion. However, as the task execution time increases, the cost of re-running it also increases, reducing the chances of successful completion. This is precisely where Flyte's intra-task checkpointing proves to be highly beneficial. Here's an example illustrating how to develop tasks that leverage intra-task checkpointing. It's important to note that Flyte currently offers the low-level API for checkpointing. Future integrations aim to incorporate higher-level checkpointing APIs from popular training frameworks like Keras, PyTorch, Scikit-learn, and big-data frameworks such as Spark and Flink, enhancing their fault-tolerance capabilities. Create a file called `checkpoint.py`: Import the required libraries: ```python import flytekit as fl from flytekit.exceptions.user import FlyteRecoverableException RETRIES = 3 ``` We define a task to iterate precisely `n_iterations`, checkpoint its state, and recover from simulated failures: ```python # Define a task to iterate precisely `n_iterations`, checkpoint its state, and recover from simulated failures. @fl.task(retries=RETRIES) def use_checkpoint(n_iterations: int) -> int: cp = fl.current_context().checkpoint prev = cp.read() start = 0 if prev: start = int(prev.decode()) # Create a failure interval to simulate failures across 'n' iterations and then succeed after configured retries failure_interval = n_iterations // RETRIES index = 0 for index in range(start, n_iterations): # Simulate a deterministic failure for demonstration. 
        # Showcasing how it eventually completes within the given retries
        if index > start and index % failure_interval == 0:
            raise FlyteRecoverableException(f"Failed at iteration {index}, failure_interval {failure_interval}.")
        # Save progress state. It is also entirely possible to save state every few intervals
        cp.write(f"{index + 1}".encode())
    return index
```

The checkpoint system offers additional APIs; the code can be found in the [checkpointer code](https://github.com/flyteorg/flytekit/blob/master/flytekit/core/checkpointer.py).

Create a workflow that invokes the task. The task will automatically be retried in the event of a [FlyteRecoverableException](../../api-reference/flytekit-sdk/packages/flytekit.exceptions.base#flytekitexceptionsbaseflyterecoverableexception).

```python
@fl.workflow
def checkpointing_example(n_iterations: int) -> int:
    return use_checkpoint(n_iterations=n_iterations)
```

The local checkpoint is not utilized here because retries are not supported in local executions:

```python
if __name__ == "__main__":
    try:
        checkpointing_example(n_iterations=10)
    except RuntimeError as e:  # noqa : F841
        # Since no retries are performed, an exception is expected when run locally
        pass
```

## Run the example on the Flyte cluster

To run the provided workflow on the Flyte cluster, use the following command:

```bash
pyflyte run --remote \
  https://raw.githubusercontent.com/flyteorg/flytesnacks/69dbe4840031a85d79d9ded25f80397c6834752d/examples/advanced_composition/advanced_composition/checkpoint.py \
  checkpointing_example --n_iterations 10
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/waiting_for_external_inputs ===

# Waiting for external inputs

There are use cases where you may want a workflow execution to pause, only to continue when some time has passed or when it receives some inputs that are external to the workflow execution inputs. You can think of these as execution-time inputs, since they need to be supplied to the workflow after it's launched. Examples of this use case would be:

1. **Model Deployment**: A hyperparameter-tuning workflow that trains `n` models, where a human needs to inspect a report before approving the model for downstream deployment to some serving layer.
2. **Data Labeling**: A workflow that iterates through an image dataset, presenting individual images to a human annotator for them to label.
3. **Active Learning**: An [active learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning)) workflow that trains a model, then shows examples to a human annotator to label, based on which examples the model is least/most certain about or which would provide the most information to the model.

These use cases can be achieved in Flyte with the `flytekit.sleep`, `flytekit.wait_for_input`, and `flytekit.approve` workflow nodes. Although all of the examples above are human-in-the-loop processes, these constructs allow you to pass inputs into a workflow from some arbitrary external process (human or machine) in order to continue.

> [!NOTE]
> These functions can only be used inside `@fl.workflow`-decorated
> functions, `@fl.dynamic`-decorated functions, or
> imperative workflows.

## Pause executions with the `sleep` node

The simplest case is when you want your workflow to `flytekit.sleep` for some specified amount of time before continuing.
Though this type of node may not be used often in a production setting, you might want to use it, for example, if you want to simulate a delay in your workflow to mock out the behavior of some long-running computation.

```python
from datetime import timedelta

import flytekit as fl
from flytekit import sleep


@fl.task
def long_running_computation(num: int) -> int:
    """A mock task pretending to be a long-running computation."""
    return num


@fl.workflow
def sleep_wf(num: int) -> int:
    """Simulate a "long-running" computation with sleep."""

    # increase the sleep duration to actually make it long-running
    sleeping = sleep(timedelta(seconds=10))
    result = long_running_computation(num=num)
    sleeping >> result
    return result
```

As you can see above, we define a simple `long_running_computation` task and a `sleep_wf` workflow. We first create the `sleeping` and `result` nodes, then order the dependencies with the `>>` operator such that the workflow sleeps for 10 seconds before kicking off the `result` computation. Finally, we return the `result`.

> [!NOTE]
> You can learn more about the `>>` chaining operator in **Programming > Chaining Entities**.

Now that you have a general sense of how this works, let's move on to the `flytekit.wait_for_input` workflow node.

## Supply external inputs with `wait_for_input`

With the `flytekit.wait_for_input` node, you can pause a workflow execution that requires some external input signal. For example, suppose that you have a workflow that publishes an automated analytics report, but before publishing it you want to give it a custom title. You can achieve this by defining a `wait_for_input` node that takes a `str` input and finalizes the report:

```python
import typing

from flytekit import wait_for_input


@fl.task
def create_report(data: typing.List[float]) -> dict:  # o0
    """A toy report task."""
    return {
        "mean": sum(data) / len(data),
        "length": len(data),
        "max": max(data),
        "min": min(data),
    }


@fl.task
def finalize_report(report: dict, title: str) -> dict:
    return {"title": title, **report}


@fl.workflow
def reporting_wf(data: typing.List[float]) -> dict:
    report = create_report(data=data)
    title_input = wait_for_input("title", timeout=timedelta(hours=1), expected_type=str)
    return finalize_report(report=report, title=title_input)
```

Let's break down what's happening in the code above:

- In `reporting_wf` we first create the raw `report`.
- Then, we define a `title` node that will wait for a string to be provided through the Flyte API, which can be done through the Flyte UI or through `FlyteRemote` (more on that later). This node will time out after 1 hour.
- Finally, we pass the `title_input` promise into `finalize_report`, which attaches the custom title to the report.

> [!NOTE]
> The `create_report` task is just a toy example. In a realistic example, this
> report might be an HTML file or set of visualizations. This can be rendered
> in the Flyte UI with **Development cycle > Decks**.

As mentioned in the beginning of this page, this construct can be used for selecting the best-performing model in cases where there isn't a clear single metric to determine the best model, or if you're doing data labeling using a Flyte workflow.

## Continue executions with `approve`

Finally, the `flytekit.approve` workflow node allows you to wait on an explicit approval signal before continuing execution. Going back to our report-publishing use case, suppose that we want to block the publishing of a report for some reason (e.g.
if it doesn't appear to be valid):

```python
from flytekit import approve


@fl.workflow
def reporting_with_approval_wf(data: typing.List[float]) -> dict:
    report = create_report(data=data)
    title_input = wait_for_input("title", timeout=timedelta(hours=1), expected_type=str)
    final_report = finalize_report(report=report, title=title_input)

    # approve the final report, where the output of approve is the final_report
    # dictionary.
    return approve(final_report, "approve-final-report", timeout=timedelta(hours=2))
```

The `approve` node will pass the `final_report` promise through as the output of the workflow, provided that the `approve-final-report` node gets an approval input via the Flyte UI or Flyte API.

You can also use the output of the `approve` function as a promise, feeding it to a subsequent task. Let's create a version of our report-publishing workflow where the approval happens after `create_report`:

```python
@fl.workflow
def approval_as_promise_wf(data: typing.List[float]) -> dict:
    report = create_report(data=data)
    title_input = wait_for_input("title", timeout=timedelta(hours=1), expected_type=str)

    # wait for report to run so that the user can view it before adding a custom
    # title to the report
    report >> title_input

    final_report = finalize_report(
        report=approve(report, "raw-report-approval", timeout=timedelta(hours=2)),
        title=title_input,
    )
    return final_report
```

## Working with conditionals

The node constructs by themselves are useful, but they become even more useful when we combine them with other Flyte constructs, like **Programming > Conditionals**.

To illustrate this, let's extend the report-publishing use case so that we produce an "invalid report" output in case we don't approve the final report:

```python
from flytekit import conditional


@fl.task
def invalid_report() -> dict:
    return {"invalid_report": True}


@fl.workflow
def conditional_wf(data: typing.List[float]) -> dict:
    report = create_report(data=data)
    title_input = wait_for_input("title-input", timeout=timedelta(hours=1), expected_type=str)

    # Define a "review-passes" wait_for_input node so that a human can review
    # the report before finalizing it.
    review_passed = wait_for_input("review-passes", timeout=timedelta(hours=2), expected_type=bool)
    report >> review_passed

    # This conditional returns the finalized report if the review passes,
    # otherwise it returns an invalid report output.
    return (
        conditional("final-report-condition")
        .if_(review_passed.is_true())
        .then(finalize_report(report=report, title=title_input))
        .else_()
        .then(invalid_report())
    )
```

Here, the `review-passes` gate node supplies the boolean that the `conditional` uses to determine which branch to execute: if the review passes, the finalized report is returned; otherwise, the `invalid_report` output is returned.

## Sending inputs to `wait_for_input` and `approve` nodes

Assuming that you've registered the above workflows on a Flyte cluster that's been started with **Programming > Waiting for external inputs > flytectl demo start**, there are two ways of using `wait_for_input` and `approve` nodes:

### Using the Flyte UI

If you launch the `reporting_wf` workflow on the Flyte UI, you'll see a **Graph** view of the workflow execution like this:

![Reporting workflow wait for input graph](../../_static/images/user-guide/programming/waiting-for-external-inputs/wait-for-input-graph.png)

Clicking on the play-circle icon of the `title` task node or the **Resume** button on the sidebar will create a modal form that you can use to provide the custom title input.
![Reporting workflow wait for input form](../../_static/images/user-guide/programming/waiting-for-external-inputs/wait-for-input-form.png)

### Using `FlyteRemote`

For many cases it's enough to use the Flyte UI to provide inputs/approvals on gate nodes. However, if you want to pass inputs to `wait_for_input` and `approve` nodes programmatically, you can use the `FlyteRemote.set_signal` method. Using the `conditional_wf` workflow defined above, the example below allows you to set values for the `title-input` and `review-passes` nodes.

```python
from flytekit.remote.remote import FlyteRemote
from flytekit.configuration import Config

remote = FlyteRemote(
    Config.for_sandbox(),
    default_project="flytesnacks",
    default_domain="development",
)

# First kick off the workflow
flyte_workflow = remote.fetch_workflow(
    name="core.control_flow.waiting_for_external_inputs.conditional_wf"
)

# Execute the workflow
execution = remote.execute(flyte_workflow, inputs={"data": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Get a list of signals available for the execution
signals = remote.list_signals(execution.id.name)

# Set a signal value for the "title" node. Make sure that the "title-input"
# node is in the `signals` list above
remote.set_signal("title-input", execution.id.name, "my report")

# Set signal value for the "review-passes" node. Make sure that the "review-passes"
# node is in the `signals` list above
remote.set_signal("review-passes", execution.id.name, True)
```

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/nested-parallelism ===

# Nested parallelism

For exceptionally large or complicated workflows that can't be adequately implemented as dynamic workflows or map tasks, it can be beneficial to have multiple levels of workflow parallelization.

This is useful for multiple reasons:

- Better code organization
- Better code reuse
- Better testing
- Better debugging
- Better monitoring, since each subworkflow can be run independently and monitored independently
- Better performance and scale, since each subworkflow is executed as a separate workflow and thus can be distributed among different propeller workers and shards. This allows for better parallelism and scale.

## Nested dynamic workflows

You can use nested dynamic workflows to break down a large workflow into smaller workflows and then compose them together to form a hierarchy. In this example, a top-level workflow uses two levels of dynamic workflows to process a list through some simple addition tasks and then flatten the list again.
### Example code

```python
"""
A core workflow parallelized as six items with a chunk size of two will be structured as follows:

multi_wf -> level1 -> level2 -> core_wf -> step1 -> step2
                             -> core_wf -> step1 -> step2
                      level2 -> core_wf -> step1 -> step2
                             -> core_wf -> step1 -> step2
                      level2 -> core_wf -> step1 -> step2
                             -> core_wf -> step1 -> step2
"""
import flytekit as fl


@fl.task
def step1(a: int) -> int:
    return a + 1


@fl.task
def step2(a: int) -> int:
    return a + 2


@fl.workflow
def core_wf(a: int) -> int:
    return step2(a=step1(a=a))


core_wf_lp = fl.LaunchPlan.get_or_create(core_wf)


@fl.dynamic
def level2(l: list[int]) -> list[int]:
    return [core_wf_lp(a=a) for a in l]


@fl.task
def reduce(l: list[list[int]]) -> list[int]:
    f = []
    for i in l:
        f.extend(i)
    return f


@fl.dynamic
def level1(l: list[int], chunk: int) -> list[int]:
    v = []
    for i in range(0, len(l), chunk):
        v.append(level2(l=l[i:i + chunk]))
    return reduce(l=v)


@fl.workflow
def multi_wf(l: list[int], chunk: int) -> list[int]:
    return level1(l=l, chunk=chunk)
```

Overrides let you add additional arguments to the launch plan you are looping over in the dynamic workflow. Here we add caching:

```python
@fl.task
def increment(num: int) -> int:
    return num + 1


@fl.workflow
def child(num: int) -> int:
    return increment(num=num)


child_lp = fl.LaunchPlan.get_or_create(child)


@fl.dynamic
def spawn(n: int) -> list[int]:
    l = []
    for i in range(1, n + 1):
        l.append(child_lp(num=i).with_overrides(cache=True, cache_version="1.0.0"))
    # you can also pass l to another task if you want
    return l
```

## Mixed parallelism

This example is similar to nested dynamic workflows, but instead of using a dynamic workflow to parallelize a core workflow with serial tasks, we use a core workflow to call a map task, which processes the items of each chunk in parallel. This workflow has one less layer of parallelism, so the outputs won't be the same as those of the nested parallelization example, but it does still demonstrate how you can mix these different approaches to achieve concurrency.

### Example code

```python
"""
A core workflow parallelized as six items with a chunk size of two will be structured as follows:

multi_wf -> level1 -> level2 -> mappable
                             -> mappable
                      level2 -> mappable
                             -> mappable
                      level2 -> mappable
                             -> mappable
"""
import flytekit as fl


@fl.task
def mappable(a: int) -> int:
    return a + 2


@fl.workflow
def level2(l: list[int]) -> list[int]:
    return fl.map_task(mappable)(a=l)


@fl.task
def reduce(l: list[list[int]]) -> list[int]:
    f = []
    for i in l:
        f.extend(i)
    return f


@fl.dynamic
def level1(l: list[int], chunk: int) -> list[int]:
    v = []
    for i in range(0, len(l), chunk):
        v.append(level2(l=l[i : i + chunk]))
    return reduce(l=v)


@fl.workflow
def multi_wf(l: list[int], chunk: int) -> list[int]:
    return level1(l=l, chunk=chunk)
```

## Design considerations

While you can nest even further if needed, or incorporate map tasks if your inputs are all the same type, the design of your workflow should be informed by the actual data you're processing. For example, if you have a big library of music from which you'd like to extract the lyrics, the first level could loop through all the albums, and the second level could process each song.

If you're just processing an enormous list of the same input, it's best to keep your code simple and let the scheduler handle optimizing the execution. Additionally, unless you need dynamic workflow features like mixing and matching inputs and outputs, it's usually most efficient to use a map task, which has the added benefit of keeping the UI clean.
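For instance, if the items in `l` are homogeneous and there is no real hierarchy in the data, the nested `multi_wf` above could arguably be collapsed into a single map task. A minimal sketch, reusing the `mappable` task from the mixed-parallelism example:

```python
@fl.workflow
def flat_map_wf(l: list[int]) -> list[int]:
    # A single level of fan-out; the scheduler handles the parallel execution.
    return fl.map_task(mappable)(a=l)
```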
You can also choose to limit the scale of parallel execution at a few levels. The `max_parallelism` attribute can be applied at the workflow level and will limit the number of parallel tasks being executed (this is set to 25 by default). Within map tasks, you can specify a `concurrency` argument, which will limit the number of mapped tasks that can run in parallel at any given time.

=== PAGE: https://www.union.ai/docs/v1/flyte/user-guide/programming/failure-node ===

# Failure node

The failure node feature enables you to designate a specific node to execute in the event of a failure within your workflow.

For example, a workflow might create a cluster at the beginning, execute tasks, and conclude with the deletion of the cluster once all tasks are completed. However, if any task within the workflow encounters an error, the system will abort the entire workflow and won't delete the cluster. This poses a challenge if you still need to clean up the cluster even in the event of a task failure.

To address this issue, you can add a failure node to your workflow. This ensures that critical actions, such as deleting the cluster, are executed even in the event of failures occurring throughout the workflow execution.

```python
import typing

import flytekit as fl
from flytekit import WorkflowFailurePolicy
from flytekit.types.error.error import FlyteError


@fl.task
def create_cluster(name: str):
    print(f"Creating cluster: {name}")
```

Create a task that will fail during execution:

```python
# Create a task that will fail during execution
@fl.task
def t1(a: int, b: str):
    print(f"{a} {b}")
    raise ValueError("Fail!")
```

Create a task that will be executed if any of the tasks in the workflow fail:

```python
@fl.task
def clean_up(name: str, err: typing.Optional[FlyteError] = None):
    print(f"Deleting cluster {name} due to {err}")
```

Set `on_failure` to the cleanup task. This task will be executed if any of the tasks in the workflow fail. The inputs of `clean_up` must exactly match the workflow's inputs. Additionally, the `err` parameter will be populated with the error message encountered during execution.

```python
@fl.workflow(on_failure=clean_up)
def wf(name: str = "my_cluster"):
    create_cluster(name=name)
    t1(a=1, b="2")
```

Set the failure policy to `FAIL_AFTER_EXECUTABLE_NODES_COMPLETE` to ensure that the remaining nodes of `wf1` are executed even if the subworkflow fails. In this case, both parent and child workflows will fail, resulting in the `clean_up` task being executed twice:

```python
# In this case, both parent and child workflows will fail,
# resulting in the `clean_up` task being executed twice.
@fl.workflow(on_failure=clean_up, failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE)
def wf1(name: str = "my_cluster"):
    c = create_cluster(name=name)
    subwf(name="another_cluster")
    t = t1(a=1, b="2")
    d = delete_cluster(name=name)
    c >> t >> d
```

You can also set the `on_failure` to a workflow. This workflow will be executed if any of the tasks in the workflow fail (the `delete_cluster`, `subwf`, and `clean_up_wf` entities referenced on this page are sketched below):

```python
@fl.workflow(on_failure=clean_up_wf)
def wf2(name: str = "my_cluster"):
    c = create_cluster(name=name)
    t = t1(a=1, b="2")
    d = delete_cluster(name=name)
    c >> t >> d
```
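The sketches referred to above follow. They are illustrative only, not the exact definitions from the source example, and in a real module they would need to be defined before the workflows that use them:

```python
@fl.task
def delete_cluster(name: str):
    print(f"Deleting cluster {name}")


# A child workflow that also cleans up on failure, so a failure here
# triggers `clean_up` for both the parent and this subworkflow.
@fl.workflow(on_failure=clean_up)
def subwf(name: str):
    c = create_cluster(name=name)
    t = t1(a=1, b="2")
    d = delete_cluster(name=name)
    c >> t >> d


# A failure handler can also be a workflow; its inputs must match the
# inputs of the workflow it is attached to.
@fl.workflow
def clean_up_wf(name: str):
    delete_cluster(name=name)
```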
## Subpages

- **Bioinformatics**
- **Feature engineering**
- **Flytelab**
- **Model training**

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/bioinformatics ===

# Bioinformatics

Bioinformatics encompasses all the ways we aim to solve biological problems by computational means. Flyte provides a number of excellent abstractions and features for solving such problems in a reliable, reproducible, and ergonomic way.

## Subpages

- **Bioinformatics > Nucleotide Sequence Querying with BLASTX**

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/bioinformatics/blast ===

# Nucleotide Sequence Querying with BLASTX

This tutorial demonstrates the integration of computational biology and Flyte. The focus will be on searching a nucleotide sequence against a local protein database to identify possible homologues. The steps include:

- Data loading
- Creation of a `ShellTask` to execute the BLASTX search command
- Loading of BLASTX results and plotting a graph of "e-value" vs "pc identity"

This tutorial is based on the reference guide ["Using BLAST+ Programmatically with Biopython"](https://widdowquinn.github.io/2018-03-06-ibioic/02-sequence_databases/03-programming_for_blast.html).

## BLAST

The Basic Local Alignment Search Tool (BLAST) is a program that identifies similar regions between sequences. It compares nucleotide or protein sequences with sequence databases and evaluates the statistical significance of the matches. BLAST can be used to deduce functional and evolutionary relationships between sequences and identify members of gene families.

For additional information, visit the [BLAST Homepage](https://blast.ncbi.nlm.nih.gov/Blast.cgi).

### BLASTX

BLASTx is a useful tool for searching genes and predicting their functions or relationships with other gene sequences. It is commonly employed to find protein-coding genes in genomic DNA or cDNA, as well as to determine whether a new nucleotide sequence encodes a protein or to identify proteins encoded by transcripts or transcript variants.

This tutorial will demonstrate how to perform a BLASTx search.

## Data

The database used in this example consists of predicted gene products from five Kitasatospora genomes. The query is a single nucleotide sequence of a predicted penicillin-binding protein from Kitasatospora sp. CB01950.

> [!NOTE]
> To run the example locally, you need to download BLAST.
> You can find OS-specific installation instructions in the [user manual](https://www.ncbi.nlm.nih.gov/books/NBK569861/).
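Before moving on to the Dockerfile, here is a rough sketch of how the `ShellTask` mentioned in the steps above might wrap the `blastx` command. The input names, output handling, and `blastx` flags here are assumptions for illustration; the full example on the next page differs in its details:

```python
from flytekit import kwtypes
from flytekit.extras.tasks.shell import OutputLocation, ShellTask
from flytekit.types.file import FlyteFile

# Illustrative sketch: the input names, flags, and output location are
# assumptions, not the exact task used in the full example.
blastx_on_shell = ShellTask(
    name="blastx",
    debug=True,
    script="""
    blastx -query {inputs.query} -db {inputs.db} -outfmt 6 -out {inputs.blast_output}
    """,
    inputs=kwtypes(query=str, db=str, blast_output=str),
    output_locs=[
        OutputLocation(var="results", var_type=FlyteFile, location="{inputs.blast_output}"),
    ],
)
```

Because the task simply shells out to `blastx`, the same pattern applies to any command-line tool that is available in the task's container image.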
## Dockerfile ```dockerfile FROM ubuntu:focal ENV VENV /opt/venv ENV LANG C.UTF-8 ENV LC_ALL C.UTF-8 ENV PYTHONPATH /root RUN apt-get update \ && apt-get install -y software-properties-common \ && add-apt-repository ppa:deadsnakes/ppa \ && apt-get install -y \ && apt-get update \ && apt-get install -y \ cmake \ curl \ python3.8 \ python3.8-venv \ python3.8-dev \ make \ build-essential \ libssl-dev \ libffi-dev \ python3-pip \ zlib1g-dev \ vim \ wget # Install the AWS cli separately to prevent issues with boto being written over RUN pip3 install awscli WORKDIR /opt RUN curl https://sdk.cloud.google.com > install.sh RUN bash /opt/install.sh --install-dir=/opt ENV PATH $PATH:/opt/google-cloud-sdk/bin WORKDIR /root # Virtual environment ENV VENV /opt/venv RUN python3 -m venv ${VENV} ENV PATH="${VENV}/bin:$PATH" # Download BLAST RUN wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.13.0/ncbi-blast-2.13.0+-x64-linux.tar.gz && \ tar -xzf ncbi-blast-2.13.0+-x64-linux.tar.gz # Set the working directory WORKDIR /root # Install Python dependencies COPY requirements.in /root RUN ${VENV}/bin/pip install -r /root/requirements.in # Copy data # COPY blast/kitasatospora /root/kitasatospora # Copy the actual code COPY . /root/ # Copy over the helper script that the SDK relies on RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/ RUN chmod a+x /usr/local/bin/flytekit_venv # Check if BLAST is installed ENV PATH=$PATH:/root/ncbi-blast-2.13.0+/bin RUN echo $PATH RUN output="$(which blastx)" && echo $output # This tag is supplied by the build script and will be used to determine the version # when registering tasks, workflows, and launch plans ARG tag ENV FLYTE_INTERNAL_IMAGE $tag ``` Initiate the workflow on the Flyte backend by executing the following two commands in the "bioinformatics" directory: ```shell $ pyflyte --pkgs blast package --image ghcr.io/flyteorg/flytecookbook:blast-latest $ flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version v1 ``` ## Subpages - **Bioinformatics > Nucleotide Sequence Querying with BLASTX > Blastx Example** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/bioinformatics/blast/blastx-example === --- **Source**: tutorials/bioinformatics/blast/blastx-example.md **URL**: /docs/v1/flyte/tutorials/bioinformatics/blast/blastx-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering === # Feature engineering **Feature Engineering** is an essential part of Machine Learning. It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Explore how features can be engineered with the power of Flyte. 
| Feature Engineering Task | Description | |---------------------------|-------------| | [EDA and Feature Engineering With Papermill](exploratory_data_analysis) | How to use Jupyter notebook within Flyte | | [Data Cleaning and Feature Serving With Feast](feast_integration) | How to use Feast to serve data in Flyte | ## Subpages - **Feature engineering > EDA, Feature Engineering, and Modeling With Papermill** - **Feature engineering > Feast Integration** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis === # EDA, Feature Engineering, and Modeling With Papermill Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. EDA cannot be solely implemented within Flyte as it requires visual analysis of the data. In such scenarios, we are inclined towards using a Jupyter notebook as it helps visualize and feature engineer the data. **Now the question is, how do we leverage the power of Jupyter Notebook within Flyte to perform EDA on the data?** ## Papermill [Papermill](https://papermill.readthedocs.io/en/latest/) is a tool for parameterizing and executing Jupyter Notebooks. Papermill lets you: - parameterize notebooks - execute notebooks We have a pre-packaged version of Papermill with Flyte that lets you leverage the power of Jupyter Notebook within Flyte pipelines. To install the plugin, run the following command: ```shell $ pip install flytekitplugins-papermill ``` ## Examples There are three code examples that you can refer to in this tutorial: - Run the whole pipeline (EDA + Feature Engineering + Modeling) in one notebook - Run EDA and feature engineering in one notebook, fetch the result (EDA'ed and feature engineered-dataset), and model the data as a Flyte task by sending the dataset as an argument - Run EDA and feature engineering in one notebook, fetch the result (EDA'ed and feature engineered-dataset), and model the data in another notebook by sending the dataset as an argument ### Notebook Etiquette - If you want to send inputs and receive outputs, your Jupyter notebook has to have `parameters` and `outputs` tags, respectively. To set up tags in a notebook, follow this [guide](https://jupyterbook.org/content/metadata.html#adding-tags-using-notebook-interfaces). - `parameters` cell must only have the input variables. - `outputs` cell looks like the following: ```python from flytekitplugins.papermill import record_outputs record_outputs(variable_name=variable_name) ``` Of course, you can have any number of variables! - The `inputs` and `outputs` variable names in the `NotebookTask` must match the variable names in the notebook. > [!NOTE] > You will see three outputs on running the Python code files, although a single output is returned. > One output is the executed notebook, and the other is the rendered HTML of the notebook. 
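To make the etiquette above concrete, here is a minimal sketch of wrapping a notebook as a `NotebookTask`. The notebook path and the input and output names are hypothetical; they only need to match the `parameters` and `record_outputs` cells of your own notebook:

```python
from flytekit import kwtypes, workflow
from flytekitplugins.papermill import NotebookTask

# Hypothetical notebook: it must contain a cell tagged `parameters` that
# declares `n_estimators`, and an `outputs` cell calling record_outputs(mae_score=...).
regression_nb = NotebookTask(
    name="supermarket_regression_nb",
    notebook_path="supermarket_regression.ipynb",
    inputs=kwtypes(n_estimators=int),
    outputs=kwtypes(mae_score=float),
)


@workflow
def notebook_wf(n_estimators: int = 150) -> float:
    # Besides `mae_score`, the task also returns the executed notebook and its
    # rendered HTML, which is why three outputs appear when it runs.
    return regression_nb(n_estimators=n_estimators).mae_score
```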
## Subpages - **Feature engineering > EDA, Feature Engineering, and Modeling With Papermill > Notebook And Task** - **Feature engineering > EDA, Feature Engineering, and Modeling With Papermill > Notebooks As Tasks** - **Feature engineering > EDA, Feature Engineering, and Modeling With Papermill > Notebook** - **Feature engineering > EDA, Feature Engineering, and Modeling With Papermill > Supermarket Regression 2 Notebook** - **Feature engineering > EDA, Feature Engineering, and Modeling With Papermill > Supermarket Regression 2 Notebook** - **Feature engineering > EDA, Feature Engineering, and Modeling With Papermill > Supermarket Regression Notebook** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/notebook-and-task === --- **Source**: tutorials/feature-engineering/exploratory-data-analysis/notebook-and-task.md **URL**: /docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/notebook-and-task/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/notebooks-as-tasks === --- **Source**: tutorials/feature-engineering/exploratory-data-analysis/notebooks-as-tasks.md **URL**: /docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/notebooks-as-tasks/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/notebook === --- **Source**: tutorials/feature-engineering/exploratory-data-analysis/notebook.md **URL**: /docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/notebook/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/supermarket_regression_1 === # Supermarket Regression 2 Notebook ```python dataset = "" import json import numpy as np import pandas as pd from sklearn.model_selection import train_test_split dataset = pd.DataFrame.from_dict(json.loads(dataset)) y_target = dataset['Product_Supermarket_Sales'] dataset.drop(['Product_Supermarket_Sales'], axis=1, inplace=True) X_train, X_test, y_train, y_test = train_test_split(dataset, y_target, test_size = 0.3) print("Training data is", X_train.shape) print("Training target is", y_train.shape) print("test data is", X_test.shape) print("test target is", y_test.shape) from sklearn.preprocessing import RobustScaler, StandardScaler scaler = RobustScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test) X_train[:5, :5] from sklearn.metrics import mean_absolute_error from sklearn.model_selection import KFold, cross_val_score def cross_validate(model, nfolds, feats, targets): score = -1 * (cross_val_score(model, feats, targets, cv=nfolds, scoring='neg_mean_absolute_error')) return np.mean(score) n_estimators=150 max_depth=3 max_features='sqrt' min_samples_split=4 random_state=2 from sklearn.ensemble import GradientBoostingRegressor gb_model = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features, min_samples_split=min_samples_split, random_state=random_state) mae_score = cross_validate(gb_model, 10, X_train, y_train) print("MAE Score: ", mae_score) from flytekitplugins.papermill import record_outputs record_outputs(mae_score=float(mae_score)) ``` === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/supermarket_regression_2 === {{< right mb="2rem" >}} ๐Ÿ“ฅ [Download this 
notebook](https://github.com/unionai/unionai-examples/blob/main/v1/flyte-tutorials/exploratory_data_analysis/exploratory_data_analysis/supermarket_regression_2.ipynb) {{< /right >}} # Supermarket Regression 2 Notebook ```python dataset = "" ``` ```python import json import numpy as np import pandas as pd from sklearn.model_selection import train_test_split dataset = pd.DataFrame.from_dict(json.loads(dataset)) y_target = dataset['Product_Supermarket_Sales'] dataset.drop(['Product_Supermarket_Sales'], axis=1, inplace=True) X_train, X_test, y_train, y_test = train_test_split(dataset, y_target, test_size = 0.3) print("Training data is", X_train.shape) print("Training target is", y_train.shape) print("test data is", X_test.shape) print("test target is", y_test.shape) ``` ```python from sklearn.preprocessing import RobustScaler, StandardScaler scaler = RobustScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test) X_train[:5, :5] ``` ```python from sklearn.metrics import mean_absolute_error from sklearn.model_selection import KFold, cross_val_score def cross_validate(model, nfolds, feats, targets): score = -1 * (cross_val_score(model, feats, targets, cv=nfolds, scoring='neg_mean_absolute_error')) return np.mean(score) ``` ```python n_estimators=150 max_depth=3 max_features='sqrt' min_samples_split=4 random_state=2 ``` ```python from sklearn.ensemble import GradientBoostingRegressor gb_model = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features, min_samples_split=min_samples_split, random_state=random_state) mae_score = cross_validate(gb_model, 10, X_train, y_train) print("MAE Score: ", mae_score) ``` ```python from flytekitplugins.papermill import record_outputs record_outputs(mae_score=float(mae_score)) ``` === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/exploratory-data-analysis/supermarket_regression === {{< right mb="2rem" >}} ๐Ÿ“ฅ [Download this notebook](https://github.com/unionai/unionai-examples/blob/main/v1/flyte-tutorials/exploratory_data_analysis/exploratory_data_analysis/supermarket_regression.ipynb) {{< /right >}} # Supermarket Regression Notebook ```python # reference: https://github.com/risenW/medium_tutorial_notebooks/blob/master/supermarket_regression.ipynb import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # makes graph display in notebook %matplotlib inline ``` ```python supermarket_data = pd.read_csv('https://raw.githubusercontent.com/risenW/medium_tutorial_notebooks/master/train.csv') ``` ```python supermarket_data.head() ``` | | Product_Identifier | Supermarket_Identifier | Product_Supermarket_Identifier | Product_Weight | Product_Fat_Content | Product_Shelf_Visibility | Product_Type | Product_Price | Supermarket_Opening_Year | Supermarket _Size | Supermarket_Location_Type | Supermarket_Type | Product_Supermarket_Sales | | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | | DRA12 | CHUKWUDI010 | DRA12_CHUKWUDI010 | 11.6 | Low Fat | 0.068535 | Soft Drinks | 357.54 | 2005 | NaN | Cluster 3 | Grocery Store | 709.08 | | DRA12 | CHUKWUDI013 | DRA12_CHUKWUDI013 | 11.6 | Low Fat | 0.040912 | Soft Drinks | 355.79 | 1994 | High | Cluster 3 | Supermarket Type1 | 6381.69 | | DRA12 | CHUKWUDI017 | DRA12_CHUKWUDI017 | 11.6 | Low Fat | 0.041178 | Soft Drinks | 350.79 | 2014 | NaN | Cluster 2 | Supermarket Type1 | 6381.69 | | DRA12 | CHUKWUDI018 | DRA12_CHUKWUDI018 | 11.6 | Low Fat | 
0.041113 | Soft Drinks | 355.04 | 2016 | Medium | Cluster 3 | Supermarket Type2 | 2127.23 | | DRA12 | CHUKWUDI035 | DRA12_CHUKWUDI035 | 11.6 | Ultra Low fat | 0.000000 | Soft Drinks | 354.79 | 2011 | Small | Cluster 2 | Supermarket Type1 | 2481.77 | ```python supermarket_data.describe() ``` | | Product_Weight | Product_Shelf_Visibility | Product_Price | Supermarket_Opening_Year | Product_Supermarket_Sales | | :--- | :--- | :--- | :--- | :--- | :--- | | 4188.000000 | 4990.000000 | 4990.000000 | 4990.000000 | 4990.000000 | | 12.908838 | 0.066916 | 391.803796 | 2004.783567 | 6103.520164 | | 4.703256 | 0.053058 | 119.378259 | 8.283151 | 4447.333835 | | 4.555000 | 0.000000 | 78.730000 | 1992.000000 | 83.230000 | | 8.767500 | 0.027273 | 307.890000 | 1994.000000 | 2757.660000 | | 12.600000 | 0.053564 | 393.860000 | 2006.000000 | 5374.675000 | | 17.100000 | 0.095358 | 465.067500 | 2011.000000 | 8522.240000 | | 21.350000 | 0.328391 | 667.220000 | 2016.000000 | 32717.410000 | ```python # remove ID columns cols_2_remove = ['Product_Identifier', 'Supermarket_Identifier', 'Product_Supermarket_Identifier'] newdata = supermarket_data.drop(cols_2_remove, axis=1) ``` ```python newdata.head() ``` | | Product_Weight | Product_Fat_Content | Product_Shelf_Visibility | Product_Type | Product_Price | Supermarket_Opening_Year | Supermarket _Size | Supermarket_Location_Type | Supermarket_Type | Product_Supermarket_Sales | | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | | 11.6 | Low Fat | 0.068535 | Soft Drinks | 357.54 | 2005 | NaN | Cluster 3 | Grocery Store | 709.08 | | 11.6 | Low Fat | 0.040912 | Soft Drinks | 355.79 | 1994 | High | Cluster 3 | Supermarket Type1 | 6381.69 | | 11.6 | Low Fat | 0.041178 | Soft Drinks | 350.79 | 2014 | NaN | Cluster 2 | Supermarket Type1 | 6381.69 | | 11.6 | Low Fat | 0.041113 | Soft Drinks | 355.04 | 2016 | Medium | Cluster 3 | Supermarket Type2 | 2127.23 | | 11.6 | Ultra Low fat | 0.000000 | Soft Drinks | 354.79 | 2011 | Small | Cluster 2 | Supermarket Type1 | 2481.77 | ```python cat_cols = ['Product_Fat_Content','Product_Type', 'Supermarket _Size', 'Supermarket_Location_Type', 'Supermarket_Type' ] num_cols = ['Product_Weight', 'Product_Shelf_Visibility', 'Product_Price', 'Supermarket_Opening_Year', 'Product_Supermarket_Sales'] ``` ```python # bar plot for categorial features for col in cat_cols: fig = plt.figure(figsize=(6,6)) # define plot area ax = fig.gca() # define axis counts = newdata[col].value_counts() # find the counts for each unique category counts.plot.bar(ax = ax) # use the plot.bar method on the counts data frame ax.set_title('Bar plot for ' + col) ``` ![png](./supermarket_regression.gen_files/supermarket_regression.gen_8_0.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_8_1.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_8_2.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_8_3.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_8_4.png) ```python # scatter plot for numerical features for col in num_cols: fig = plt.figure(figsize=(6,6)) # define plot area ax = fig.gca() # define axis newdata.plot.scatter(x = col, y = 'Product_Supermarket_Sales', ax = ax) ``` ![png](./supermarket_regression.gen_files/supermarket_regression.gen_9_0.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_9_1.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_9_2.png) 
![png](./supermarket_regression.gen_files/supermarket_regression.gen_9_3.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_9_4.png) ```python # box plot for categorial features for col in cat_cols: sns.boxplot(x=col, y='Product_Supermarket_Sales', data=newdata) plt.xlabel(col) plt.ylabel('Product Supermarket Sales') plt.show() ``` ![png](./supermarket_regression.gen_files/supermarket_regression.gen_10_0.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_10_1.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_10_2.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_10_3.png) ![png](./supermarket_regression.gen_files/supermarket_regression.gen_10_4.png) ```python # correlation matrix corrmat = newdata.corr() f,ax = plt.subplots(figsize=(5,4)) sns.heatmap(corrmat, square=True) ``` ![png](./supermarket_regression.gen_files/supermarket_regression.gen_11_1.png) ```python # pair plot of columns without missing values import warnings warnings.filterwarnings('ignore') cat_cols_pair = ['Product_Fat_Content','Product_Type','Supermarket_Location_Type'] cols_2_pair = ['Product_Fat_Content', 'Product_Shelf_Visibility', 'Product_Type', 'Product_Price', 'Supermarket_Opening_Year', 'Supermarket_Location_Type', 'Supermarket_Type', 'Product_Supermarket_Sales'] for col in cat_cols_pair: sns.set() plt.figure() sns.pairplot(data=newdata[cols_2_pair], height=3.0, hue=col) plt.show() ```
![png](./supermarket_regression.gen_files/supermarket_regression.gen_12_1.png)
![png](./supermarket_regression.gen_files/supermarket_regression.gen_12_3.png)
![png](./supermarket_regression.gen_files/supermarket_regression.gen_12_5.png) ```python # FEATURE ENGINEERING # print all unique values newdata['Product_Fat_Content'].unique() ``` array(['Low Fat', 'Ultra Low fat', 'Normal Fat'], dtype=object) ```python fat_content_dict = {'Low Fat': 0, 'Ultra Low fat': 0, 'Normal Fat': 1} newdata['is_normal_fat'] = newdata['Product_Fat_Content'].map(fat_content_dict) # preview the values newdata['is_normal_fat'].value_counts() ``` 0 3217 1 1773 Name: is_normal_fat, dtype: int64 ```python # assign year 2000 and above as 1, 1996 and below as 0 def cluster_open_year(year): if year <= 1996: return 0 else: return 1 newdata['open_in_the_2000s'] = newdata['Supermarket_Opening_Year'].apply(cluster_open_year) ``` ```python # preview feature newdata[['Supermarket_Opening_Year', 'open_in_the_2000s']].head(4) ``` | | Supermarket_Opening_Year | open_in_the_2000s | | :--- | :--- | :--- | | 2005 | 1 | | 1994 | 0 | | 2014 | 1 | | 2016 | 1 | ```python # get the unique categories in the column as a list prod_type_cats = list(newdata['Product_Type'].unique()) # remove the class 1 categories prod_type_cats.remove('Health and Hygiene') prod_type_cats.remove('Household') prod_type_cats.remove('Others') def cluster_prod_type(product): if product in prod_type_cats: return 0 else: return 1 newdata['Product_type_cluster'] = newdata['Product_Type'].apply(cluster_prod_type) ``` ```python newdata[['Product_Type', 'Product_type_cluster']].tail(10) ``` | | Product_Type | Product_type_cluster | | :--- | :--- | :--- | | Health and Hygiene | 1 | | Health and Hygiene | 1 | | Health and Hygiene | 1 | | Household | 1 | | Household | 1 | | Household | 1 | | Household | 1 | | Household | 1 | | Household | 1 | | Household | 1 | ```python # transforming skewed features fig, ax = plt.subplots(1,2) # plot of normal Product_Supermarket_Sales on the first axis sns.histplot(data=newdata['Product_Supermarket_Sales'], bins=15, ax=ax[0]) # transform the Product_Supermarket_Sales and plot on the second axis newdata['Product_Supermarket_Sales'] = np.log1p(newdata['Product_Supermarket_Sales']) sns.histplot(data=newdata['Product_Supermarket_Sales'], bins=15, ax=ax[1]) plt.tight_layout() plt.title("Transformation of Product_Supermarket_Sales feature") ``` Text(0.5, 1.0, 'Transformation of Product_Supermarket_Sales feature') ![png](./supermarket_regression.gen_files/supermarket_regression.gen_19_1.png) ```python # next, let's transform Product_Shelf_Visibility fig, ax = plt.subplots(1,2) # plot of normal Product_Supermarket_Sales on the first axis sns.histplot(data=newdata['Product_Shelf_Visibility'], bins=15, ax=ax[0]) # transform the Product_Supermarket_Sales and plot on the second axis newdata['Product_Shelf_Visibility'] = np.log1p(newdata['Product_Shelf_Visibility']) sns.histplot(data=newdata['Product_Shelf_Visibility'], bins=15, ax=ax[1]) plt.tight_layout() plt.title("Transformation of Product_Shelf_Visibility feature") ``` Text(0.5, 1.0, 'Transformation of Product_Shelf_Visibility feature') ![png](./supermarket_regression.gen_files/supermarket_regression.gen_20_1.png) ```python # feature encoding for col in cat_cols: print('Value Count for', col) print(newdata[col].value_counts()) print("---------------------------") ``` Value Count for Product_Fat_Content Low Fat 3039 Normal Fat 1773 Ultra Low fat 178 Name: Product_Fat_Content, dtype: int64 --------------------------- Value Count for Product_Type Snack Foods 758 Fruits and Vegetables 747 Household 567 Frozen Foods 457 Canned 376 Dairy 350 Baking Goods 
344 Health and Hygiene 307 Meat 264 Soft Drinks 261 Breads 137 Hard Drinks 134 Others 100 Starchy Foods 81 Breakfast 66 Seafood 41 Name: Product_Type, dtype: int64 --------------------------- Value Count for Supermarket _Size Medium 1582 Small 1364 High 594 Name: Supermarket _Size, dtype: int64 --------------------------- Value Count for Supermarket_Location_Type Cluster 3 1940 Cluster 2 1581 Cluster 1 1469 Name: Supermarket_Location_Type, dtype: int64 --------------------------- Value Count for Supermarket_Type Supermarket Type1 3304 Grocery Store 724 Supermarket Type2 500 Supermarket Type3 462 Name: Supermarket_Type, dtype: int64 --------------------------- ```python # save the target value to a new variable y_target = newdata['Product_Supermarket_Sales'] newdata.drop(['Product_Supermarket_Sales'], axis=1, inplace=True) # one hot encode using pandas dummy() function dummified_data = pd.get_dummies(newdata) dummified_data.head() ``` | | Product_Weight | Product_Shelf_Visibility | Product_Price | Supermarket_Opening_Year | is_normal_fat | open_in_the_2000s | Product_type_cluster | Product_Fat_Content_Low Fat | Product_Fat_Content_Normal Fat | Product_Fat_Content_Ultra Low fat | ... | Supermarket _Size_High | Supermarket _Size_Medium | Supermarket _Size_Small | Supermarket_Location_Type_Cluster 1 | Supermarket_Location_Type_Cluster 2 | Supermarket_Location_Type_Cluster 3 | Supermarket_Type_Grocery Store | Supermarket_Type_Supermarket Type1 | Supermarket_Type_Supermarket Type2 | Supermarket_Type_Supermarket Type3 | | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | | 11.6 | 0.066289 | 357.54 | 2005 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | | 11.6 | 0.040097 | 355.79 | 1994 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | | 11.6 | 0.040352 | 350.79 | 2014 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | | 11.6 | 0.040290 | 355.04 | 2016 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | | 11.6 | 0.000000 | 354.79 | 2011 | 0 | 1 | 0 | 0 | 0 | 1 | ... 
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 5 rows ร— 36 columns ```python # fill-in missing values # print null columns dummified_data.isnull().sum() ``` Product_Weight 802 Product_Shelf_Visibility 0 Product_Price 0 Supermarket_Opening_Year 0 is_normal_fat 0 open_in_the_2000s 0 Product_type_cluster 0 Product_Fat_Content_Low Fat 0 Product_Fat_Content_Normal Fat 0 Product_Fat_Content_Ultra Low fat 0 Product_Type_Baking Goods 0 Product_Type_Breads 0 Product_Type_Breakfast 0 Product_Type_Canned 0 Product_Type_Dairy 0 Product_Type_Frozen Foods 0 Product_Type_Fruits and Vegetables 0 Product_Type_Hard Drinks 0 Product_Type_Health and Hygiene 0 Product_Type_Household 0 Product_Type_Meat 0 Product_Type_Others 0 Product_Type_Seafood 0 Product_Type_Snack Foods 0 Product_Type_Soft Drinks 0 Product_Type_Starchy Foods 0 Supermarket _Size_High 0 Supermarket _Size_Medium 0 Supermarket _Size_Small 0 Supermarket_Location_Type_Cluster 1 0 Supermarket_Location_Type_Cluster 2 0 Supermarket_Location_Type_Cluster 3 0 Supermarket_Type_Grocery Store 0 Supermarket_Type_Supermarket Type1 0 Supermarket_Type_Supermarket Type2 0 Supermarket_Type_Supermarket Type3 0 dtype: int64 ```python # compute the mean mean_pw = dummified_data['Product_Weight'].mean() # fill the missing values with calculated mean dummified_data['Product_Weight'].fillna(mean_pw, inplace=True) ``` ```python # check if filling is successful dummified_data.isnull().sum() ``` Product_Weight 0 Product_Shelf_Visibility 0 Product_Price 0 Supermarket_Opening_Year 0 is_normal_fat 0 open_in_the_2000s 0 Product_type_cluster 0 Product_Fat_Content_Low Fat 0 Product_Fat_Content_Normal Fat 0 Product_Fat_Content_Ultra Low fat 0 Product_Type_Baking Goods 0 Product_Type_Breads 0 Product_Type_Breakfast 0 Product_Type_Canned 0 Product_Type_Dairy 0 Product_Type_Frozen Foods 0 Product_Type_Fruits and Vegetables 0 Product_Type_Hard Drinks 0 Product_Type_Health and Hygiene 0 Product_Type_Household 0 Product_Type_Meat 0 Product_Type_Others 0 Product_Type_Seafood 0 Product_Type_Snack Foods 0 Product_Type_Soft Drinks 0 Product_Type_Starchy Foods 0 Supermarket _Size_High 0 Supermarket _Size_Medium 0 Supermarket _Size_Small 0 Supermarket_Location_Type_Cluster 1 0 Supermarket_Location_Type_Cluster 2 0 Supermarket_Location_Type_Cluster 3 0 Supermarket_Type_Grocery Store 0 Supermarket_Type_Supermarket Type1 0 Supermarket_Type_Supermarket Type2 0 Supermarket_Type_Supermarket Type3 0 dtype: int64 ```python from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(dummified_data, y_target, test_size = 0.3) print("Training data is", X_train.shape) print("Training target is", y_train.shape) print("test data is", X_test.shape) print("test target is", y_test.shape) ``` Training data is (3493, 36) Training target is (3493,) test data is (1497, 36) test target is (1497,) ```python from sklearn.preprocessing import RobustScaler, StandardScaler scaler = RobustScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test) X_train[:5, :5] ``` array([[ 1.11222151, 0.77329048, -0.10167541, -0.05882353, 1. ], [ 1.03420733, 0.64446093, 0.29696892, 0.58823529, 1. ], [ 1.10512931, -0.19777034, -0.09898964, 0. , 0. ], [-0.94948062, -0.03939268, 1.10116383, 0.47058824, 1. ], [ 0. , 0.5364253 , 0.00690625, -0.82352941, 0. 
]]) ```python from sklearn.metrics import mean_absolute_error from sklearn.model_selection import KFold, cross_val_score def cross_validate(model, nfolds, feats, targets): score = -1 * (cross_val_score(model, feats, targets, cv=nfolds, scoring='neg_mean_absolute_error')) return np.mean(score) ``` ```python n_estimators=150 max_depth=3 max_features='sqrt' min_samples_split=4 random_state=2 ``` ```python from sklearn.ensemble import GradientBoostingRegressor gb_model = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features, min_samples_split=min_samples_split, random_state=random_state) mae_score = cross_validate(gb_model, 10, X_train, y_train) print("MAE Score: ", mae_score) ``` MAE Score: 0.4078268922230158 ```python from flytekitplugins.papermill import record_outputs record_outputs(mae_score=float(mae_score)) ``` literals { key: "mae_score" value { scalar { primitive { float_value: 0.4078268922230158 } } } } === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/feast-integration === # Feast Integration [Feast](https://feast.dev/) is an operational data system for managing and serving machine learning features to models in production. Flyte provides a way to train models and perform feature engineering as a single pipeline. But it provides no way to serve these features to production when the model matures and is ready to be served in production. This is where Feast can be helpful. On leveraging the collective capabilities, Flyte enables engineering the features, and Feast provides the feature registry and online feature serving system. Moreover, Flyte can help ensure incremental development of features and enables to turn on the sync to online stores only when one is confident about the features. In this tutorial, we'll walk through how Feast can be used to store and retrieve features to train and test the model through a pipeline curated using Flyte. ## Dataset We'll use the horse colic dataset to determine if the lesion of the horse is surgical or not. This is a modified version of the original dataset. The dataset will have the following columns: * surgery * age * hospital number * rectal temperature * pulse * respiratory rate * temperature of extremities * peripheral pulse * mucous membranes * capillary refill time * pain * peristalsis * abdominal distension * nasogastric tube * nasogastric reflux * nasogastric reflux PH * rectal examination * abdomen * packed cell volume * total protein * abdominocentesis appearance * abdomcentesis total protein * outcome * surgical lesion * timestamp The horse colic dataset will be a compressed zip file consisting of the SQLite DB. For this example, we wanted a dataset available online, but this could be easily plugged into another dataset/data management system like Snowflake, Athena, Hive, BigQuery, or Spark, all of which are supported by Flyte. ## Takeaways 1. Source data is from SQL-like data sources 2. Procreated feature transforms 3. 
Serve features to production using Feast ## Subpages - **Feature engineering > Feast Integration > Feature Eng Tasks** - **Feature engineering > Feast Integration > Feast Workflow** - **Feature engineering > Feast Integration > How to Trigger the Feast Workflow using FlyteRemote** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/feast-integration/feature_eng_tasks === --- **Source**: tutorials/feature-engineering/feast-integration/feature_eng_tasks.md **URL**: /docs/v1/flyte/tutorials/feature-engineering/feast-integration/feature_eng_tasks/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/feast-integration/feast_workflow === --- **Source**: tutorials/feature-engineering/feast-integration/feast_workflow.md **URL**: /docs/v1/flyte/tutorials/feature-engineering/feast-integration/feast_workflow/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/feature-engineering/feast-integration/feast_flyte_remote === {{< right mb="2rem" >}} ๐Ÿ“ฅ [Download this notebook](https://github.com/unionai/unionai-examples/blob/main/v1/flyte-tutorials/feast_integration/feast_integration/feast_flyte_remote.ipynb) {{< /right >}} # How to Trigger the Feast Workflow using FlyteRemote The goal of this notebook is to train a simple [Gaussian Naive Bayes model using sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) on a modified [Horse-Colic dataset from UCI](https://archive.ics.uci.edu/ml/datasets/Horse+Colic). The model aims to classify if the lesion of the horse is surgical or not. Let's get started! Set the AWS environment variables before importing Flytekit. ```python import os os.environ["FLYTE_AWS_ENDPOINT"] = os.environ["FEAST_S3_ENDPOINT_URL"] = "http://localhost:30084/" os.environ["FLYTE_AWS_ACCESS_KEY_ID"] = os.environ["AWS_ACCESS_KEY_ID"] = "minio" os.environ["FLYTE_AWS_SECRET_ACCESS_KEY"] = os.environ["AWS_SECRET_ACCESS_KEY"] = "miniostorage" ``` ## 01. Register the code We've used Flytekit to express the pipeline in pure Python. You can use [FlyteConsole](https://github.com/flyteorg/flyteconsole) to launch, monitor, and introspect Flyte executions. However here, let's use `flytekit.remote` to interact with the Flyte backend. ```python from flytekit.remote import FlyteRemote from flytekit.configuration import Config # The `for_sandbox` method instantiates a connection to the demo cluster. remote = FlyteRemote( config=Config.for_sandbox(), default_project="flytesnacks", default_domain="development" ) ``` /Users/samhitaalla/.pyenv/versions/3.9.9/envs/flytesnacks/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm The ``register_script`` method can be used to register the workflow. ```python from flytekit.configuration import ImageConfig from feast_workflow import feast_workflow wf = remote.register_script( feast_workflow, image_config=ImageConfig.from_images( "ghcr.io/flyteorg/flytecookbook:feast_integration-latest" ), version="v2", source_path="../", module_name="feast_workflow", ) ``` ## 02: Launch an execution FlyteRemote provides convenient methods to retrieve version of the pipeline from the remote server. **NOTE**: It is possible to get a specific version of the workflow and trigger a launch for that, but let's just get the latest. 
```python
lp = remote.fetch_launch_plan(name="feast_integration.feast_workflow.feast_workflow")
lp.id.version
```

'v1'

The `execute` method can be used to execute a Flyte entity, in our case a launch plan.

```python
execution = remote.execute(
    lp,
    inputs={"num_features_univariate": 5},
    wait=True
)
```

## 03. Sync an execution

You can sync an execution to retrieve the workflow's outputs. `sync_nodes` is set to True to retrieve the intermediary nodes' outputs as well.

**NOTE**: It is possible to fetch an existing, already commenced execution instead of launching a new one. Also, if you launch an execution with the same name, Flyte will respect that and not restart a new execution!

```python
from flytekit.models.core.execution import WorkflowExecutionPhase

synced_execution = remote.sync(execution, sync_nodes=True)
print(f"Execution {synced_execution.id.name} is in {WorkflowExecutionPhase.enum_to_string(synced_execution.closure.phase)} phase")
```

Execution f218aba055ba34a3fb75 is in SUCCEEDED phase

## 04. Retrieve the output

Fetch the model and the model prediction.

```python
model = synced_execution.outputs["o0"]
prediction = synced_execution.outputs["o1"]
prediction
```

/var/folders/6r/9pdkgpkd5nx1t34ndh1f_3q80000gn/T/flyteaqx6tlyu/control_plane_metadata/local_flytekit/e1a690494fe33da04a4dca7737096234/0c81c76dc3a029267a96f275431b5bc5.npy

**NOTE**: The output model is available locally as a JobLibSerialized file, which can be downloaded and loaded.

```python
model
```

/var/folders/6r/9pdkgpkd5nx1t34ndh1f_3q80000gn/T/flyteaqx6tlyu/control_plane_metadata/local_flytekit/91246ef2160dde99a7512ab3aa9aa2ce/model.joblib.dat

Fetch the `repo_config`.

```python
repo_config = synced_execution.node_executions["n0"].outputs["o0"]
```

## 05. Generate predictions

Re-use the `predict` function from the workflow to generate predictions; Flytekit will automatically manage the IO for you!

### Load features from the online feature store

```python
import os

from feast_workflow import predict, FEAST_FEATURES, retrieve_online

inference_point = retrieve_online(
    repo_config=repo_config,
    online_store=synced_execution.node_executions["n4"].outputs["o0"],
    data_point=533738,
)
inference_point
```

{'total protein': [70.0], 'peripheral pulse': [3.0], 'nasogastric reflux PH': [4.718545454545455], 'surgical lesion': ['1'], 'rectal temperature': [38.17717842323652], 'nasogastric tube': ['1.751269035532995'], 'Hospital Number': ['533738'], 'packed cell volume': [43.0], 'outcome': ['1'], 'abdominal distension': [4.0]}

### Generate a prediction

```python
predict(model_ser=model, features=inference_point)
```

array(['2'], dtype='<U1')

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/flytelab ===

# Flytelab

## Subpages

- **Flytelab > Weather Forecasting**

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/flytelab/weather-forecasting ===

# Weather Forecasting

Learn how to build an online weather forecasting system that updates a model daily and displays hourly forecasts on a web UI, using Pandera and Streamlit.

The video below briefly shows how the Weather Forecasting app is made, a few launch plans, and a Streamlit demo. Find the [complete video](https://youtu.be/c-X1u42uK-g) on YouTube.

📺 [Watch on YouTube](https://www.youtube.com/watch?v=aovn_01bzwU)

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training ===

# Model training

Understand how machine learning models can be trained from within Flyte, with an added advantage of orchestration benefits.
## Subpages - **Model training > Diabetes Classification** - **Model training > Forecasting Rossman Store Sales with Horovod and Spark** - **Model training > House Price Regression** - **Model training > MNIST Classification With PyTorch and W&B** - **Model training > NLP Processing** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/pima-diabetes === # Diabetes Classification The workflow demonstrates how to train an XGBoost model. The workflow is designed for the [Pima Indian Diabetes dataset](https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.names). An example dataset is available [here](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv). ## Why a Workflow? One common question when you read through the example would be whether it is really required to split the training of XGBoost into multiple steps. The answer is complicated, but let us try and understand the pros and cons of doing so. ### Pros: - Each task/step is standalone and can be used for other pipelines - Each step can be unit tested - Data splitting, cleaning and processing can be done using a more scalable system like Spark - State is always saved between steps, so it is cheap to recover from failures, especially if ``caching=True`` - High visibility ### Cons: - Performance for small datasets is a concern because the intermediate data is durably stored and the state is recorded, and each step is essentially a checkpoint ## Steps of the Pipeline 1. Gather data and split it into training and validation sets 2. Fit the actual model 3. Run a set of predictions on the validation set. The function is designed to be more generic, it can be used to simply predict given a set of observations (dataset) 4. Calculate the accuracy score for the predictions ## Takeaways - Usage of FlyteSchema Type. Schema type allows passing a type safe vector from one task to task. The vector is directly loaded into a pandas dataframe. We could use an unstructured Schema (By simply omitting the column types). This will allow any data to be accepted by the training algorithm. - We pass the file (that is auto-loaded) as a CSV input. Run workflows in this directory with the custom-built base image: ```shell $ pyflyte run --remote --image ghcr.io/flyteorg/flytecookbook:pima_diabetes-latest diabetes.py diabetes_xgboost_model ``` ## Subpages - **Model training > Diabetes Classification > Diabetes** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/pima-diabetes/diabetes === --- **Source**: tutorials/model-training/pima-diabetes/diabetes.md **URL**: /docs/v1/flyte/tutorials/model-training/pima-diabetes/diabetes/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/forecasting-sales === # Forecasting Rossman Store Sales with Horovod and Spark The problem statement we will be looking at is forecasting sales using [rossmann store sales](https://www.kaggle.com/c/rossmann-store-sales) data. Our example is an adaptation of the [Horovod-Spark example](https://github.com/horovod/horovod/blob/master/examples/spark/keras/keras_spark_rossmann_estimator.py). 
Here's how the code is streamlined:

- Fetch the Rossmann sales data
- Perform complicated data pre-processing using the Flyte-Spark plugin
- Define a Keras model and perform distributed training using Horovod on Spark
- Generate predictions and store them in a submission file

## About Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. It uses the all-reduce algorithm for fast distributed training instead of a parameter server approach. It builds on top of low-level frameworks like MPI and NCCL and provides optimized algorithms for sharing data between parallel training processes.

![Parameter server vs. all-reduce](https://raw.githubusercontent.com/flyteorg/static-resources/main/flytesnacks/tutorials/horovod/all_reduce.png)

## About Spark

Spark is a data processing and analytics engine for dealing with large-scale data.

Here's an interesting fact about Spark integration: **Spark integrates with both Horovod and Flyte**.

### Horovod and Spark

Horovod implicitly supports Spark: it provides the `horovod.spark` package, which facilitates running distributed jobs on a Spark cluster. In our example, we use the Estimator API. The Estimator API abstracts data processing, model training and checkpointing, and distributed training, which makes it easy to integrate and run our example code.

Since we use the Keras deep learning library, here's how we install the relevant Horovod packages:

```shell
HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod[spark,tensorflow]==0.22.1
```

The installation includes enabling MPI and TensorFlow environments.

### Flyte and Spark

Flyte can execute Spark jobs natively on a Kubernetes cluster, managing the virtual cluster's lifecycle, spin-up, and tear-down. It leverages the open-source Spark on K8s Operator and can be enabled without signing up for any service. This is like running a transient Spark cluster, a type of cluster spun up for a specific Spark job and torn down after completion.

To install the Spark plugin on Flyte, we use the following command:

```shell
$ pip install flytekitplugins-spark
```

![Flyte-Spark plugin](https://raw.githubusercontent.com/flyteorg/static-resources/main/flytesnacks/tutorials/horovod/flyte_spark.png)

The plugin requires configuring the Flyte backend as well. Refer to **Native backend plugins > Kubernetes cluster Spark jobs** for setup instructions.

In a nutshell, here's how Horovod-Spark-Flyte can be beneficial: Horovod provides the distributed framework, Spark enables extracting, preprocessing, and partitioning data, and Flyte stitches the two together, e.g., by connecting the data output of a Spark transform to a training system using Horovod while ensuring high utilization of GPUs!
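To give a sense of the Flyte side of this integration, here is a minimal, hedged sketch of a Spark task declared with `flytekitplugins-spark`. The configuration values and the task body are illustrative; the actual example configures a much richer Spark job:

```python
import flytekit as fl
from flytekitplugins.spark import Spark


@fl.task(
    task_config=Spark(
        # Illustrative settings only; tune these for your cluster.
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.instances": "2",
        }
    )
)
def count_rows(csv_uri: str) -> int:
    # Flyte provisions an ephemeral Spark cluster for this task and
    # injects a Spark session into the execution context.
    sess = fl.current_context().spark_session
    return sess.read.csv(csv_uri, header=True).count()
```

Flyte spins up the ephemeral Spark cluster for the task and tears it down afterwards, which is exactly the transient-cluster behavior described above.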
Run workflows in this directory with the custom-built base image like so:

```shell
$ pyflyte run --remote forecasting_sales/keras_spark_rossmann_estimator.py horovod_spark_wf --image ghcr.io/flyteorg/flytecookbook:spark_horovod-latest
```

## Subpages

- **Model training > Forecasting Rossman Store Sales with Horovod and Spark > Keras Spark Rossmann Estimator**

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/forecasting-sales/keras-spark-rossmann-estimator ===

---
**Source**: tutorials/model-training/forecasting-sales/keras-spark-rossmann-estimator.md
**URL**: /docs/v1/flyte/tutorials/model-training/forecasting-sales/keras-spark-rossmann-estimator/
**Weight**: 1

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/house-price-prediction ===

# House Price Regression

House price regression refers to the prediction of house prices based on various factors; in our case, we use an XGBoost regression model. In this example, we will train an XGBoost model on our data to predict house prices in multiple regions.

## Where Does Flyte Fit In?

- Orchestrates the machine learning pipeline.
- Helps cache the output state between tasks.
- Makes it easier to backtrack to the error source.
- Provides a rich UI to view and manage the pipeline.

A house price prediction pipeline for a single region doesn't require a dynamic workflow. When multiple regions are involved, however, the workflow has to be dynamic (`@dynamic`) so that it can iterate through the regions at run time and build the DAG accordingly.

## Dataset

We will create a custom dataset to build our model by referring to the [SageMaker example](https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.ipynb).

The dataset will have the following columns:

- Price
- House Size
- Number of Bedrooms
- Year Built
- Number of Bathrooms
- Number of Garage Spaces
- Lot Size

## Takeaways

- An in-depth dive into dynamic workflows
- How the Flyte type-system works

## Subpages

- **Model training > House Price Regression > House Price Predictor**
- **Model training > House Price Regression > Multiregion House Price Predictor**

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/house-price-prediction/house-price-predictor ===

---
**Source**: tutorials/model-training/house-price-prediction/house-price-predictor.md
**URL**: /docs/v1/flyte/tutorials/model-training/house-price-prediction/house-price-predictor/
**Weight**: 1

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/house-price-prediction/multiregion-house-price-predictor ===

---
**Source**: tutorials/model-training/house-price-prediction/multiregion-house-price-predictor.md
**URL**: /docs/v1/flyte/tutorials/model-training/house-price-prediction/multiregion-house-price-predictor/
**Weight**: 1

=== PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/mnist-classifier ===

# MNIST Classification With PyTorch and W&B

## PyTorch

[Pytorch](https://pytorch.org) is a machine learning framework that accelerates the path from research prototyping to production deployment. You can build *Tensors* and *Dynamic neural networks* in Python with strong GPU acceleration using PyTorch. In a nutshell, it is a Python package that provides two high-level features:

- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system

Flyte has no special, built-in understanding of PyTorch.
To Flyte, PyTorch is just another Python library. However, running PyTorch within Flyte lets you bootstrap and fully utilize the underlying infrastructure and ensures that things work well. It also brings the usual benefits of tasks and workflows: checkpointing, separation of concerns, and auto-memoization.

## Model Development

The following video covers some basics of model development, including:

- The bias-variance trade-off
- Model families
- Data parallelism
- Model parallelism
- PyTorch parallelism

📺 [Watch on YouTube](https://www.youtube.com/watch?v=FuMtJOMh5uQ)

## Specify GPU Requirement

When working on deep learning models, you typically need to explicitly request one or more GPUs. This is done with a simple directive in the task declaration:

```python
from flytekit import Resources, task


@task(requests=Resources(gpu="1"), limits=Resources(gpu="1"))
def my_deep_learning_task():
    ...
```

It is recommended to use the same `requests` and `limits` for a GPU, as automatic GPU scaling is not supported. Moreover, to utilize the power of a GPU, ensure that your Flyte backend has GPU nodes provisioned.

## Distributed Data-Parallel Training

Flyte also supports distributed training for PyTorch models via **Native backend plugins > PyTorch Distributed**.

## Weights & Biases Integration

[Weights & Biases](https://wandb.ai/site), or simply `wandb`, helps you build better models faster with experiment tracking, dataset versioning, and model management. We'll use `wandb` alongside PyTorch to track our ML experiment and its model parameters.

> [!NOTE]
> Before running the example, create a `wandb` account and log in to access the API.
> If you're running the code locally, run the command `wandb login`.
> If it's a remote cluster, you have to include the API key in the Dockerfile.

## PyTorch Dockerfile for Deployment

It is essential to build the Dockerfile with GPU support to use a GPU within PyTorch. The example in this section uses a GPU-enabled PyTorch image as the base, and the rest of the construction is similar to the other Dockerfiles.

```dockerfile
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks

WORKDIR /root
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root

# Set your wandb API key and user name. Get the API key from https://wandb.ai/authorize.
# ENV WANDB_API_KEY
# ENV WANDB_USERNAME

# Install the AWS cli for AWS support
RUN pip install awscli

# Install gcloud for GCP
RUN apt-get update && apt-get install -y make build-essential libssl-dev curl

# Virtual environment
ENV VENV /opt/venv
RUN python3 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"

# Install Python dependencies
COPY requirements.in /root
RUN pip install -r /root/requirements.in

# Copy the actual code
COPY . /root/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
```

> [!NOTE]
> Run your code in the `ml_training` directory, both locally and within the sandbox.
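As a rough sketch of how these pieces fit together, a training task might request a GPU and log metrics to `wandb` as shown below. The project name, metric names, and the elided training loop are placeholders rather than the code from this example:

```python
import flytekit as fl
import wandb


@fl.task(requests=fl.Resources(gpu="1"), limits=fl.Resources(gpu="1"))
def train_mnist(epochs: int) -> float:
    # Placeholder sketch: the real example builds and trains a PyTorch model here.
    # Assumes WANDB_API_KEY is available in the container environment.
    wandb.init(project="mnist-flyte", config={"epochs": epochs})
    accuracy = 0.0
    for epoch in range(epochs):
        # ... run one epoch of training and evaluation, updating `accuracy` ...
        wandb.log({"epoch": epoch, "accuracy": accuracy})
    wandb.finish()
    return accuracy
```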
## Subpages - **Model training > MNIST Classification With PyTorch and W&B > Pytorch Single Node Multi Gpu** - **Model training > MNIST Classification With PyTorch and W&B > Pytorch Single Node And Gpu** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/mnist-classifier/pytorch-single-node-multi-gpu === --- **Source**: tutorials/model-training/mnist-classifier/pytorch-single-node-multi-gpu.md **URL**: /docs/v1/flyte/tutorials/model-training/mnist-classifier/pytorch-single-node-multi-gpu/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/mnist-classifier/pytorch-single-node-and-gpu === --- **Source**: tutorials/model-training/mnist-classifier/pytorch-single-node-and-gpu.md **URL**: /docs/v1/flyte/tutorials/model-training/mnist-classifier/pytorch-single-node-and-gpu/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/nlp-processing === # NLP Processing This tutorial will demonstrate how to process text data and generate word embeddings and visualizations as part of a Flyte workflow. It's an adaptation of the official Gensim [Word2Vec tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html). ## About Gensim Gensim is a popular open-source natural language processing (NLP) library used to process large corpora (can be larger than RAM). It has efficient multicore implementations of a number of algorithms such as [Latent Semantic Analysis](http://lsa.colorado.edu/papers/dp1.LSAintro.pdf), [Latent Dirichlet Allocation (LDA)](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), [Word2Vec deep learning](https://arxiv.org/pdf/1301.3781.pdf) to perform complex tasks including understanding document relationships, topic modeling, learning word embeddings, and more. You can read more about Gensim [here](https://radimrehurek.com/gensim/). ## Data The dataset used for this tutorial is the open-source [Lee Background Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor) that comes with the Gensim library. ## Step-by-Step Process The following points outline the modelling process: - Returns a preprocessed (tokenized, stop words excluded, lemmatized) corpus from the custom iterator. - Trains the Word2vec model on the preprocessed corpus. - Generates a bag of words from the corpus and trains the LDA model. - Saves the LDA and Word2Vec models to disk. - Deserializes the Word2Vec model, runs word similarity and computes word movers distance. - Reduces the dimensionality (using tsne) and plots the word embeddings. Let's dive into the code! ## Subpages - **Model training > NLP Processing > Word2Vec And Lda** === PAGE: https://www.union.ai/docs/v1/flyte/tutorials/model-training/nlp-processing/word2vec-and-lda === --- **Source**: tutorials/model-training/nlp-processing/word2vec-and-lda.md **URL**: /docs/v1/flyte/tutorials/model-training/nlp-processing/word2vec-and-lda/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations === # Integrations Flyte is designed to be highly extensible and can be customized in multiple ways. > [!NOTE] > Want to contribute an integration example? Check out the [contribution guide](../community/contribute/contribute-examples). ## Connectors Flyte supports [the following connectors out-of-the-box](./connectors/_index). If you don't see the connector you need below, have a look at **Connectors > Creating a new connector**. 
| Connector | Description |
|-----------|-------------|
| [SageMaker connector](./connectors/sagemaker-inference-connector/_index) | Deploy models and create and trigger inference endpoints on AWS SageMaker. |
| [Airflow connector](./connectors/airflow-connector/_index) | Run Airflow jobs in your workflows with the Airflow connector. |
| [BigQuery connector](./connectors/bigquery-connector/_index) | Run BigQuery jobs in your workflows with the BigQuery connector. |
| [ChatGPT connector](./connectors/chatgpt-connector/_index) | Run ChatGPT jobs in your workflows with the ChatGPT connector. |
| [Databricks connector](./connectors/databricks-connector/_index) | Run Databricks jobs in your workflows with the Databricks connector. |
| [Memory Machine Cloud connector](./connectors/mmcloud-connector/_index) | Execute tasks using the MemVerge Memory Machine Cloud connector. |
| [OpenAI Batch connector](./connectors/openai-batch-connector/_index) | Submit requests for asynchronous batch processing on OpenAI. |
| [Perian connector](./connectors/perian-connector/_index) | Execute tasks on the Perian Job Platform. |
| [Sensor connector](./connectors/sensor/_index) | Run sensor jobs in your workflows with the sensor connector. |
| [Slurm connector](./connectors/slurm-connector/_index) | Run Slurm jobs in your workflows with the Slurm connector. |
| [Snowflake connector](./connectors/snowflake-connector/_index) | Run Snowflake jobs in your workflows with the Snowflake connector. |

## Flytekit plugins

Flytekit plugins can be implemented purely in Python, unit tested locally, and allow extending Flytekit functionality. For comparison, these plugins can be thought of like [Airflow operators](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/index.html).

| Plugin | Description |
|--------|-------------|
| [Comet](./flytekit-plugins/comet-ml-plugin/_index) | `comet-ml`: Comet's machine learning platform. |
| [DBT](./flytekit-plugins/dbt-plugin/_index) | Run and test your `dbt` pipelines in Flyte. |
| [Dolt](./flytekit-plugins/dolt-plugin/_index) | Version your SQL database with `dolt`. |
| [DuckDB](./flytekit-plugins/duckdb-plugin/_index) | Run analytical queries using DuckDB. |
| [Great Expectations](./flytekit-plugins/greatexpectations-plugin/_index) | Validate data with `great_expectations`. |
| [Memray](./flytekit-plugins/memray-plugin/_index) | `memray`: Memory profiling with memray. |
| [MLFlow](./flytekit-plugins/mlflow-plugin/_index) | `mlflow`: the open standard for model tracking. |
| [Modin](./flytekit-plugins/modin-plugin/_index) | Scale pandas workflows with `modin`. |
| [Neptune](./flytekit-plugins/neptune-plugin/_index) | `neptune`: Neptune is the MLOps stack component for experiment tracking. |
| [NIM](./flytekit-plugins/nim-plugin/_index) | Serve optimized model containers with NIM. |
| [Ollama](./flytekit-plugins/ollama-plugin/_index) | Serve fine-tuned LLMs with Ollama in a Flyte workflow. |
| [ONNX](./flytekit-plugins/onnx-plugin/_index) | Convert ML models to ONNX models seamlessly. |
| [Pandera](./flytekit-plugins/pandera-plugin/_index) | Validate pandas dataframes with `pandera`. |
| [Papermill](./flytekit-plugins/papermill-plugin/_index) | Execute Jupyter Notebooks with `papermill`. |
| [SQL](./flytekit-plugins/sql-plugin/_index) | Execute SQL queries as tasks. |
| [Weights and Biases](./flytekit-plugins/wandb-plugin/_index) | `wandb`: Machine learning platform to build better models faster. |
| [WhyLogs](./flytekit-plugins/whylogs-plugin/_index) | `whylogs`: the open standard for data logging. | ### Using Flytekit plugins Data is automatically marshalled and unmarshalled in and out of the plugin. Users should mostly implement the `flytekit.core.base_task.PythonTask` API defined in Flytekit. Flytekit plugins are lazily loaded and can be released independently like libraries. The naming convention is `flytekitplugins-*`, where `*` indicates the package to be integrated into Flytekit. For example, `flytekitplugins-papermill` enables users to author Flytekit tasks using [Papermill](https://papermill.readthedocs.io/en/latest/). You can find the plugins maintained by the core Flyte team [here](https://github.com/flyteorg/flytekit/tree/master/plugins). ## Native backend plugins Native backend plugins can be executed without any external service dependencies because the compute is orchestrated by Flyte itself, within its provisioned Kubernetes clusters. | Plugin | Description | |--------|-------------| | [Kubeflow PyTorch](./native-backend-plugins/kfpytorch-plugin/_index) | Run distributed PyTorch training jobs using `Kubeflow`. | | [Kubeflow TensorFlow](./native-backend-plugins/kftensorflow-plugin/_index) | Run distributed TensorFlow training jobs using `Kubeflow`. | | [Kubernetes cluster Dask jobs](./native-backend-plugins/k8s-dask-plugin/_index) | Run Dask jobs on a Kubernetes cluster. | | [Kubernetes cluster Spark jobs](./native-backend-plugins/k8s-spark-plugin/_index) | Run Spark jobs on a Kubernetes cluster. | | [MPI Operator](./native-backend-plugins/kfmpi-plugin/_index) | Run distributed deep learning training jobs using Horovod and MPI. | | [Ray](./native-backend-plugins/ray-plugin/_index) | Run Ray jobs on a Kubernetes cluster. | ## External service backend plugins As the term suggests, these plugins rely on external services to handle the workload defined in the Flyte task that uses the plugin. | Plugin | Description | |--------|-------------| | [AWS Athena](./external-service-backend-plugins/athena-plugin/_index) | Execute queries using AWS Athena. | | [AWS Batch](./external-service-backend-plugins/aws-batch-plugin/_index) | Run tasks and workflows on the AWS Batch service. | | [Flyte Interactive](./external-service-backend-plugins/flyteinteractive-plugin/_index) | Execute tasks using Flyte Interactive to debug. | | [Hive](./external-service-backend-plugins/hive-plugin/_index) | Run Hive jobs in your workflows. | ## Enabling backend plugins To enable a backend plugin, you must add the `ID` of the plugin to the enabled plugins list. The `enabled-plugins` list is available under the `tasks > task-plugins` section of FlytePropeller's configuration. The plugin configuration structure is defined [here](https://pkg.go.dev/github.com/flyteorg/flytepropeller@v0.6.1/pkg/controller/nodes/task/config#TaskPluginConfig). An example of the config follows: ```yaml tasks: task-plugins: enabled-plugins: - container - sidecar - k8s-array default-for-task-types: container: container sidecar: sidecar container_array: k8s-array ``` **Finding the `ID` of the backend plugin** To find the `ID` of the backend plugin, look at the source code of the plugin. For example, in the case of Spark, the value of `ID` is used [here](https://github.com/flyteorg/flyteplugins/blob/v0.5.25/go/tasks/plugins/k8s/spark/spark.go#L424), defined as [spark](https://github.com/flyteorg/flyteplugins/blob/v0.5.25/go/tasks/plugins/k8s/spark/spark.go#L41).
## Flyte operators Flyte can be integrated with other orchestrators to help you leverage Flyte's constructs natively within other orchestration tools. | Operator | Description | |----------|-------------| | [Airflow](./flyte-operators/airflow-plugin/_index) | Trigger Flyte executions from Airflow. | ## Subpages - **Connectors** - **Flytekit plugins** - **Native backend plugins** - **External service backend plugins** - **Flyte operators** - **Deprecated integrations** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors === # Connectors Connectors are long-running, stateless services that receive execution requests via gRPC and initiate jobs with appropriate external or internal services. Each connector service is a Kubernetes deployment that receives gRPC requests when users trigger a particular type of task. (For example, the BigQuery connector is triggered by the invocation of a BigQuery task.) The connector service then initiates a job with the appropriate service. Connectors can be run locally as long as the appropriate connection secrets are locally available, since they are spawned in-process. Connectors are designed to be scalable, handle large workloads efficiently, and decrease load on the core system, since they run outside of it. You can also test connectors locally without having to change the backend configuration, streamlining workflow development. Connectors enable two key use cases: * **Asynchronously** launching jobs on hosted platforms (e.g. Databricks or Snowflake). * Calling external **synchronous** services, such as access control, data retrieval, or model inferencing. This section covers all currently available connectors: * [Airflow connector](./airflow-connector/_index) * [BigQuery connector](./bigquery-connector/_index) * [OpenAI ChatGPT connector](./chatgpt-connector/_index) * [OpenAI Batch connector](./openai-batch-connector/_index) * [Databricks connector](./databricks-connector/_index) * [Memory Machine Cloud connector](./mmcloud-connector/_index) * [Perian connector](./perian-connector/_index) * [SageMaker connector](./sagemaker-inference-connector/_index) * [Sensor connector](./sensor/_index) * [Slurm connector](./slurm-connector/_index) * [Snowflake connector](./snowflake-connector/_index) ## Creating a new connector If none of the existing connectors meet your needs, you can implement your own connector. There are two types of connectors: **async** and **sync**. * **Async connectors** enable long-running jobs that execute on an external platform over time. They communicate with external services that have asynchronous APIs that support `create`, `get`, and `delete` operations. The vast majority of connectors are async connectors. * **Sync connectors** enable request/response services that return immediate outputs (e.g. calling an internal API to fetch data or communicating with the OpenAI API). > [!NOTE] > While connectors can be written in any programming language since they use a protobuf interface, > we currently only support Python connectors. > We may support other languages in the future. ### Async connector interface specification To create a new async connector, extend the `AsyncConnectorBase` class and implement `create`, `get`, and `delete` methods. These methods must be idempotent. - `create`: This method is used to initiate a new job. Users have the flexibility to use gRPC, REST, or an SDK to create a job.
- `get`: This method retrieves the job resource (job ID or output literal) associated with the task, such as a BigQuery job ID or Databricks task ID. - `delete`: Invoking this method will send a request to delete the corresponding job. For an example implementation, see the [BigQuery connector code](https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-bigquery/flytekitplugins/bigquery/connector.py). ### Sync connector interface specification To create a new sync connector, extend the `SyncConnectorBase` class and implement a `do` method. This method must be idempotent. - `do`: This method is used to execute the synchronous task, and the worker in Flyte will be blocked until the method returns. For an example implementation, see the [ChatGPT connector code](https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-openai/flytekitplugins/openai/chatgpt/connector.py). ### Testing your connector locally To test your connector locally, create a class for the connector task that inherits from [`AsyncConnectorExecutorMixin`](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). This mixin can handle both asynchronous and synchronous tasks, and allows Flytekit to mimic the system's behavior in calling the connector. For testing examples, see the **Connectors > BigQuery connector > Local testing** and **Connectors > Databricks connector > Local testing** documentation. ## Enabling a connector in your Flyte deployment For information on setting up a connector in your Flyte deployment, see [Deployment > Connector setup](../../deployment/flyte-connectors/_index). ## Subpages - **Connectors > Airflow connector** - **Connectors > BigQuery connector** - **Connectors > ChatGPT connector** - **Connectors > Databricks connector** - **Connectors > Memory Machine Cloud connector** - **Connectors > OpenAI Batch connector** - **Connectors > Perian connector** - **Connectors > SageMaker connector** - **Connectors > Sensor connector** - **Connectors > Slurm connector** - **Connectors > Snowflake connector** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/airflow-connector === # Airflow connector [Apache Airflow](https://airflow.apache.org) is a widely used open source platform for managing workflows with a robust ecosystem. Flyte provides an Airflow plugin that allows you to run Airflow tasks as Flyte tasks, enabling you to use the Airflow plugin ecosystem in conjunction with Flyte's powerful task execution and orchestration capabilities. > [!NOTE] > The Airflow connector does not support all [Airflow operators](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/operators.html). > We have tested many, but if you run into issues, > please [file a bug report](https://github.com/flyteorg/flyte/issues/new?assignees=&labels=bug%2Cuntriaged&projects=&template=bug_report.yaml&title=%5BBUG%5D+). ## Installation To install the plugin, run the following command: `pip install flytekitplugins-airflow` This plugin has two components: * **Airflow compiler:** This component compiles Airflow tasks to Flyte tasks, so Airflow tasks can be used directly inside a Flyte workflow. * **Airflow connector:** This component allows you to execute Airflow tasks either locally or on a Flyte cluster. > [!NOTE] > You don't need an Airflow cluster to run Airflow tasks, since flytekit will > automatically compile Airflow tasks to Flyte tasks and execute them on the Airflow connector.
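As a rough illustration of what the compiler makes possible, the sketch below drops an Airflow operator directly into a Flyte workflow. It assumes `flytekitplugins-airflow` and Apache Airflow are installed; the operator choice, `task_id`, and command are illustrative placeholders rather than the official example (see **Example usage** below for that).

```python
from airflow.operators.bash import BashOperator
from flytekit import task, workflow


@task
def report() -> str:
    # An ordinary Flyte task running alongside the compiled Airflow task.
    return "done"


@workflow
def wf() -> str:
    # The Airflow compiler turns this operator into a Flyte task, which the
    # Airflow connector then executes (no Airflow cluster required).
    BashOperator(task_id="say_hello", bash_command="echo 'hello from Airflow'")
    return report()
```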
## Example usage For example usage, see **Connectors > Airflow connector > Airflow Connector Example Usage**. ## Local testing Airflow doesn't support local execution natively. However, Flyte compiles Airflow tasks to Flyte tasks, which enables you to test Airflow tasks locally in flytekit's local execution mode. > [!NOTE] > In some cases, you will need to store credentials in your local environment when testing locally. ## Flyte deployment configuration To enable the Airflow connector in your Flyte deployment, see **Connector setup > Airflow connector**. ## Subpages - **Connectors > Airflow connector > Airflow Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/airflow-connector/airflow-connector-example-usage === --- **Source**: integrations/connectors/airflow-connector/airflow-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/airflow-connector/airflow-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/bigquery-connector === # BigQuery connector ## Installation To install the BigQuery connector, run the following command: `pip install flytekitplugins-bigquery` This connector is purely a spec. Since SQL is completely portable, there is no need to build a Docker container. ## Example usage For an example query, see **Connectors > BigQuery connector > Bigquery Connector Example Usage**. ## Local testing To test the BigQuery connector locally, create a class for the connector task that inherits from [AsyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). This mixin can handle asynchronous tasks and allows the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. > [!NOTE] > In some cases, you will need to store credentials in your local environment when testing locally. ## Flyte deployment configuration To enable the BigQuery connector in your Flyte deployment, see **Connector setup > Google BigQuery connector**. ## Subpages - **Connectors > BigQuery connector > Bigquery Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/bigquery-connector/bigquery-connector-example-usage === --- **Source**: integrations/connectors/bigquery-connector/bigquery-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/bigquery-connector/bigquery-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/chatgpt-connector === # ChatGPT connector ## Installation To install the ChatGPT connector, run the following command: ```shell $ pip install flytekitplugins-openai ``` ## Example usage For example usage, see **Connectors > ChatGPT connector > Chatgpt Connector Example Usage**. ## Local testing To test the ChatGPT connector locally, create a class for the connector task that inherits from [SyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L304). This mixin can handle synchronous tasks and allows the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. > [!NOTE] > In some cases, you will need to store credentials in your local environment when testing locally.
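For orientation, here is a minimal sketch of how a ChatGPT task can be declared and called from a workflow. The `ChatGPTTask` parameters shown (`openai_organization`, `chatgpt_config`) are assumptions based on the plugin's example; treat the values as placeholders and verify the exact signature against the Chatgpt Connector Example Usage page above.

```python
from flytekit import workflow
from flytekitplugins.openai import ChatGPTTask

# Placeholder organization ID and model settings; adjust for your account.
chatgpt_job = ChatGPTTask(
    name="chatgpt",
    openai_organization="org-<YOUR_ORG_ID>",
    chatgpt_config={
        "model": "gpt-3.5-turbo",
        "temperature": 0.7,
    },
)


@workflow
def wf(message: str) -> str:
    # The connector sends the message to OpenAI and returns the reply.
    return chatgpt_job(message=message)
```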
## Flyte deployment configuration To enable the ChatGPT connector in your Flyte deployment, see **Connector setup > ChatGPT connector**. ## Subpages - **Connectors > ChatGPT connector > Chatgpt Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/chatgpt-connector/chatgpt-connector-example-usage === --- **Source**: integrations/connectors/chatgpt-connector/chatgpt-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/chatgpt-connector/chatgpt-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/databricks-connector === # Databricks connector Flyte can be integrated with the [Databricks](https://www.databricks.com/) service, enabling you to submit Spark jobs to the Databricks platform. ## Installation The Databricks connector comes bundled with the Spark plugin. To install the Spark plugin, run the following command: ```shell $ pip install flytekitplugins-spark ``` ## Example usage For example usage, see **Connectors > Databricks connector > Databricks Connector Example Usage**. ## Local testing To test the Databricks connector locally, create a class for the connector task that inherits from [AsyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). This mixin can handle asynchronous tasks and allows the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. > [!NOTE] > In some cases, you will need to store credentials in your local environment when testing locally. ## Flyte deployment configuration To enable the Databricks connector in your Flyte deployment, see **Connector setup > Databricks connector**. ## Subpages - **Connectors > Databricks connector > Databricks Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/databricks-connector/databricks-connector-example-usage === --- **Source**: integrations/connectors/databricks-connector/databricks-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/databricks-connector/databricks-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/mmcloud-connector === # Memory Machine Cloud connector [MemVerge](https://memverge.com/) [Memory Machine Cloud](https://www.mmcloud.io/) (MMCloud), available on AWS, GCP, and AliCloud, empowers users to continuously optimize cloud resources during runtime, safely execute stateful tasks on spot instances, and monitor resource usage in real time. These capabilities make it an excellent fit for long-running batch workloads. Flyte can be integrated with MMCloud, allowing you to execute Flyte tasks using MMCloud. ## Installation To install the connector, run the following command: ```shell $ pip install flytekitplugins-mmcloud ``` To get started with Memory Machine Cloud, see the [Memory Machine Cloud user guide](https://docs.memverge.com/MMCloud/latest/User%20Guide/about).
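As a minimal sketch of how a task is routed to MMCloud, the example below uses the plugin's task configuration class (assumed here to be `MMCloudConfig`, as exposed by `flytekitplugins-mmcloud`; check the example usage page below for the authoritative version). The resource requests are placeholders.

```python
from flytekit import Resources, task, workflow
from flytekitplugins.mmcloud import MMCloudConfig


@task(
    task_config=MMCloudConfig(),  # routes this task's execution to MMCloud
    requests=Resources(cpu="1", mem="1Gi"),
)
def to_str(i: int) -> str:
    return str(i)


@workflow
def wf(i: int = 42) -> str:
    return to_str(i=i)
```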
## Example usage For example usage, see **Connectors > Memory Machine Cloud connector > Mmcloud Connector Example Usage**. ## Local testing To test the MMCloud connector locally, create a class for the connector task that inherits from [AsyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). This mixin can handle asynchronous tasks and allows the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. > [!NOTE] > In some cases, you will need to store credentials in your local environment when testing locally. ## Flyte deployment configuration To enable the Memory Machine Cloud connector in your Flyte deployment, see **Connector setup > MMCloud Connector**. ## Subpages - **Connectors > Memory Machine Cloud connector > Mmcloud Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/mmcloud-connector/mmcloud-connector-example-usage === --- **Source**: integrations/connectors/mmcloud-connector/mmcloud-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/mmcloud-connector/mmcloud-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/openai-batch-connector === # OpenAI Batch connector The Batch API connector allows you to submit requests for asynchronous batch processing on OpenAI. You can provide either a JSONL file or a JSON iterator, and the connector handles the upload to OpenAI, creation of the batch, and downloading of the output and error files. ## Installation To use the OpenAI Batch connector, run the following command: ```shell $ pip install flytekitplugins-openai ``` ## Example usage For example usage, see **Connectors > OpenAI Batch connector > Openai Batch Connector Example Usage**. ## Local testing To test a connector locally, create a class for the connector task that inherits from [SyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L304) or [AsyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). These mixins can handle synchronous and asynchronous tasks, respectively, and allow the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. ## Flyte deployment configuration To enable the OpenAI Batch connector in your Flyte deployment, refer to **Connector setup > OpenAI Batch Connector**. ## Subpages - **Connectors > OpenAI Batch connector > Openai Batch Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/openai-batch-connector/openai-batch-connector-example-usage === --- **Source**: integrations/connectors/openai-batch-connector/openai-batch-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/openai-batch-connector/openai-batch-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/perian-connector === # Perian connector The Perian connector enables you to execute Flyte tasks on the [Perian Sky Platform](https://perian.io/). Perian allows the execution of any task on servers aggregated from multiple cloud providers.
To get started with Perian, see the [Perian documentation](https://perian.io/docs/overview) and the [Perian connector documentation](https://perian.io/docs/flyte-getting-started). ## Example usage For an example, see **Connectors > Perian connector > Example**. ## Connector setup Consult the [Perian connector setup guide](https://perian.io/docs/flyte-setup-guide). ## Subpages - **Connectors > Perian connector > Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/perian-connector/example === --- **Source**: integrations/connectors/perian-connector/example.md **URL**: /docs/v1/flyte/integrations/connectors/perian-connector/example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/sagemaker-inference-connector === # SageMaker connector The SageMaker connector allows you to deploy models and create and trigger inference endpoints. You can also fully remove the SageMaker deployment. ## Installation To use the SageMaker connector, run the following command: ```shell $ pip install flytekitplugins-awssagemaker ``` ## Example usage For example usage, see **Connectors > SageMaker connector > Sagemaker Inference Connector Example Usage**. ## Local testing To test a connector locally, create a class for the connector task that inherits from [SyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L304) or [AsyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). These mixins can handle synchronous and asynchronous tasks, respectively, and allow the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. ## Flyte deployment configuration To enable the AWS SageMaker inference connector in your Flyte deployment, refer to **Connector setup > SageMaker Inference Connector**. ## Subpages - **Connectors > SageMaker connector > Sagemaker Inference Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/sagemaker-inference-connector/sagemaker-inference-connector-example-usage === --- **Source**: integrations/connectors/sagemaker-inference-connector/sagemaker-inference-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/sagemaker-inference-connector/sagemaker-inference-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/sensor === # Sensor connector ## Example usage For example usage, see **Connectors > Sensor connector > File Sensor Example**. ## Flyte deployment configuration To enable the Sensor connector in your Flyte deployment, see **Connector setup > Sensor connector**.
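The Sensor connector lets a workflow wait for an external condition, such as a file appearing in blob storage, before downstream tasks run. As a minimal sketch (the import path follows flytekit's sensor module, and the bucket path is a placeholder; see the File Sensor Example referenced above for the authoritative version), a file sensor can be chained ahead of a task like this:

```python
from flytekit import task, workflow
from flytekit.sensor.file_sensor import FileSensor

# The sensor polls until the given file exists.
sensor = FileSensor(name="wait_for_input_file")


@task
def process():
    print("input file is available")


@workflow
def wf():
    # process() only runs once the file has been detected.
    sensor(path="s3://my-bucket/input.csv") >> process()
```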
## Subpages - **Connectors > Sensor connector > File Sensor Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/sensor/file-sensor-example === --- **Source**: integrations/connectors/sensor/file-sensor-example.md **URL**: /docs/v1/flyte/integrations/connectors/sensor/file-sensor-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/slurm-connector === # Slurm connector ## Installation To install the Slurm connector, run the following command: ```shell $ pip install flytekitplugins-slurm ``` ## Example usage For example usage, see **Connectors > Slurm connector > Slurm Connector Example Usage**. ## Local testing To test the Slurm connector locally, create a class for the connector task that inherits from [AsyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). This mixin can handle asynchronous tasks and allows the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. > [!NOTE] > In some cases, you will need to store credentials in your local environment when testing locally. ## Flyte deployment configuration To enable the Slurm connector in your Flyte deployment, see **Connector setup > Slurm connector**. ## Subpages - **Connectors > Slurm connector > Slurm Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/slurm-connector/slurm-connector-example-usage === --- **Source**: integrations/connectors/slurm-connector/slurm-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/slurm-connector/slurm-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/snowflake-connector === # Snowflake connector Flyte can be seamlessly integrated with the [Snowflake](https://www.snowflake.com) service, providing you with a straightforward means to query data in Snowflake. ## Installation To use the Snowflake connector, run the following command: ```shell $ pip install flytekitplugins-snowflake ``` ## Example usage For an example query, see **Connectors > Snowflake connector > Snowflake Connector Example Usage**. ## Local testing To test the Snowflake connector locally, create a class for the connector task that inherits from [AsyncConnectorExecutorMixin](https://github.com/flyteorg/flytekit/blob/1bc8302bb7a6cf4c7048a7f93627ee25fc6b88c4/flytekit/extend/backend/base_connector.py#L354). This mixin can handle asynchronous tasks and allows the SDK to mimic the system's behavior in calling the connector. For more information, see **Connectors > Creating a new connector > Testing your connector locally**. > [!NOTE] > In some cases, you will need to store credentials in your local environment when testing locally. ## Flyte deployment configuration To enable the Snowflake connector in your Flyte deployment, see **Connector setup > Snowflake connector**.
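To show the general shape of a Snowflake task, here is a rough sketch. The class names (`SnowflakeTask`, `SnowflakeConfig`) follow the plugin's naming, but the exact set of configuration fields is an assumption and may differ between plugin versions; the connection values are placeholders. Rely on the Snowflake Connector Example Usage page above for the authoritative version.

```python
from flytekit import workflow
from flytekitplugins.snowflake import SnowflakeConfig, SnowflakeTask

# A task defined entirely by its SQL query; the connector runs it on Snowflake.
snowflake_query = SnowflakeTask(
    name="sql.snowflake.sample",
    query_template="SELECT 1",
    task_config=SnowflakeConfig(
        account="<SNOWFLAKE_ACCOUNT>",  # placeholder connection details;
        database="<DATABASE>",          # the field set may vary by version
        schema="<SCHEMA>",
        warehouse="<WAREHOUSE>",
    ),
)


@workflow
def wf():
    snowflake_query()
```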
## Subpages - **Connectors > Snowflake connector > Snowflake Connector Example Usage** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/connectors/snowflake-connector/snowflake-connector-example-usage === --- **Source**: integrations/connectors/snowflake-connector/snowflake-connector-example-usage.md **URL**: /docs/v1/flyte/integrations/connectors/snowflake-connector/snowflake-connector-example-usage/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins === # Flytekit plugins This section covers Flytekit plugins. ## Subpages - **Flytekit plugins > Comet ML** - **Flytekit plugins > DBT** - **Flytekit plugins > Dolt** - **Flytekit plugins > Great Expectations** - **Flytekit plugins > Memray Profiling** - **Flytekit plugins > MLflow** - **Flytekit plugins > Modin** - **Flytekit plugins > Neptune** - **Flytekit plugins > NIM** - **Flytekit plugins > Ollama** - **Flytekit plugins > ONNX** - **Flytekit plugins > Pandera** - **Flytekit plugins > Papermill** - **Flytekit plugins > DuckDB** - **Flytekit plugins > SQL** - **Flytekit plugins > Weights and Biases** - **Flytekit plugins > WhyLogs** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/comet-ml-plugin === # Comet ML Comet's machine learning platform integrates with your existing infrastructure and tools so you can manage, visualize, and optimize models from training runs to production monitoring. This plugin integrates Flyte with Comet by configuring links between the two platforms. To install the plugin, run: ```shell $ pip install flytekitplugins-comet-ml ``` Comet requires an API key to authenticate with its platform. In the example on the subpage below, a secret is created using **Platform configuration > Secrets**. To enable linking from the Flyte side panel to Comet.ml, add the following to Flyte's configuration: ```yaml plugins: logs: dynamic-log-links: - comet-ml-execution-id: displayName: Comet templateUris: "{{ .taskConfig.host }}/{{ .taskConfig.workspace }}/{{ .taskConfig.project_name }}/{{ .executionName }}{{ .nodeId }}{{ .taskRetryAttempt }}{{ .taskConfig.link_suffix }}" - comet-ml-custom-id: displayName: Comet templateUris: "{{ .taskConfig.host }}/{{ .taskConfig.workspace }}/{{ .taskConfig.project_name }}/{{ .taskConfig.experiment_key }}" ``` ## Subpages - **Flytekit plugins > Comet ML > Comet Ml Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/comet-ml-plugin/comet-ml-example === --- **Source**: integrations/flytekit-plugins/comet-ml-plugin/comet-ml-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/comet-ml-plugin/comet-ml-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/dbt-plugin === # DBT [dbt](https://www.getdbt.com/) is one of the most widely used data transformation tools for working with data directly in a data warehouse. It's optimized for analytics use cases and can be used for business intelligence, operational analytics, and even machine learning. The Flytekit `dbt` plugin is a Python module that provides an easy way to invoke basic `dbt` CLI commands from within a Flyte task. The plugin supports the commands [dbt run](https://docs.getdbt.com/reference/commands/run), [dbt test](https://docs.getdbt.com/reference/commands/test), and [dbt source freshness](https://docs.getdbt.com/reference/commands/source). ## Prerequisites To use the `dbt` plugin, you'll need to install the `flytekitplugins-dbt` package.
> [!NOTE] > See [the PyPI page here](https://pypi.org/project/flytekitplugins-dbt/). ```shell $ pip install flytekitplugins-dbt ``` Then install dbt itself. You will have to install `dbt-core` as well as the correct adapter for the database that you are accessing. For example, if you are using a Postgres database you would run `pip install dbt-postgres`. This will install `dbt-core` and `dbt-postgres`, but not any of the other adapters (`dbt-redshift`, `dbt-snowflake`, or `dbt-bigquery`). See [the official installation page](https://docs.getdbt.com/docs/get-started/pip-install) for details. ## Running the Example We use a Postgres database installed on the cluster and an example project from dbt, called [jaffle-shop](https://github.com/dbt-labs/jaffle_shop). To run the example on your local machine, do the following. > [!IMPORTANT] > The example below is not designed to run directly in your local > Python environment. It must be run in a Kubernetes cluster, either locally on > your machine using the `flytectl demo start` command or on a cloud cluster. Start up the demo cluster on your local machine: ```shell $ flytectl demo start ``` Pull the pre-built image for this example: ```shell $ docker pull ghcr.io/flyteorg/flytecookbook:dbt_example-latest ``` This image is built using the following `Dockerfile` and contains: - The `flytekitplugins-dbt` and `dbt-postgres` Python dependencies. - The `jaffle-shop` example. - A Postgres database. This Dockerfile can be found in the `flytesnacks/examples` directory under the filepath listed in the code block title below. ```dockerfile FROM python:3.8-slim-buster WORKDIR /root ENV VENV /opt/venv ENV LANG C.UTF-8 ENV LC_ALL C.UTF-8 ENV PYTHONPATH /root RUN apt-get update && apt-get install -y build-essential git postgresql-client libpq-dev # Install the AWS cli separately to prevent issues with boto being written over RUN pip3 install awscli ENV VENV /opt/venv # Virtual environment RUN python3 -m venv ${VENV} ENV PATH="${VENV}/bin:$PATH" # Install Python dependencies COPY requirements.in /root/ RUN pip install -r /root/requirements.in # psycopg2-binary is a dependency of the dbt-postgres adapter, but that doesn't work on mac M1s. # As per https://github.com/psycopg/psycopg2/issues/1360, we install psycopg to circumvent this. RUN pip uninstall -y psycopg2-binary && pip install psycopg2 # Copy the actual code COPY . /root/ # Copy dbt-specific files COPY profiles.yml /root/dbt-profiles/ RUN git clone https://github.com/dbt-labs/jaffle_shop.git # This tag is supplied by the build script and will be used to determine the version # when registering tasks, workflows, and launch plans ARG tag ENV FLYTE_INTERNAL_IMAGE $tag ENV FLYTE_SDK_LOGGING_LEVEL 10 ``` To run this example, copy the code in the **dbt example** below into a file called `dbt_example.py`, then run it on your local container using the provided image: ```shell $ pyflyte run --remote \ --image ghcr.io/flyteorg/flytecookbook:dbt_example-latest \ dbt_plugin/dbt_example.py wf ``` Alternatively, you can clone the `flytesnacks` repo and run the example directly: ```shell $ git clone https://github.com/flyteorg/flytesnacks $ cd flytesnacks/examples/dbt_example $ pyflyte run --remote \ --image ghcr.io/flyteorg/flytecookbook:dbt_example-latest \ dbt_plugin/dbt_example.py wf ``` ## Subpages - **Flytekit plugins > DBT > Dbt Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/dbt-plugin/dbt-example === --- **Source**: integrations/flytekit-plugins/dbt-plugin/dbt-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/dbt-plugin/dbt-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/dolt-plugin === # Dolt The `DoltTable` plugin is a wrapper that uses [Dolt](https://github.com/dolthub/dolt) to move data between `pandas.DataFrame` objects at execution time and database tables at rest. ## Installation The dolt plugin and dolt command line tool are required to run these examples: ```shell $ pip install flytekitplugins.dolt $ sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | sudo bash' ``` Dolt requires a user configuration to run `init`: ```shell $ dolt config --global --add user.email <your-email> $ dolt config --global --add user.name <your-name> ``` These demos assume a `foo` database has been created locally: ```shell $ mkdir foo $ cd foo $ dolt init ``` ## Subpages - **Flytekit plugins > Dolt > Dolt Branch Example** - **Flytekit plugins > Dolt > Dolt Quickstart Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/dolt-plugin/dolt-branch-example === --- **Source**: integrations/flytekit-plugins/dolt-plugin/dolt-branch-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/dolt-plugin/dolt-branch-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/dolt-plugin/dolt-quickstart-example === --- **Source**: integrations/flytekit-plugins/dolt-plugin/dolt-quickstart-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/dolt-plugin/dolt-quickstart-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/greatexpectations-plugin === # Great Expectations **Great Expectations** is a Python-based open-source library for validating, documenting, and profiling your data. It helps maintain data quality and improve communication about data between teams. Great Expectations data validation can be integrated with Flyte to validate the data moving in and out of the pipeline entities you have defined in Flyte. This helps establish stricter boundaries around your data, ensuring that everything is as expected and that unexpected data no longer crashes your pipelines.
## How to Define the Integration Great Expectations supports native execution of expectations against various [Datasources](https://docs.greatexpectations.io/docs/terms/datasource/), such as Pandas dataframes, Spark dataframes, and SQL databases via SQLAlchemy. The plugin supports two Flyte types that suit Great Expectations' `Datasources`: - `flytekit.types.file.FlyteFile`: `FlyteFile` represents an automatic persistence object in Flyte. It can represent files in remote storage, and Flyte transparently materializes them in every task execution. - `flytekit.types.structured.StructuredDataset`: `StructuredDataset` supports pandas dataframes, which the plugin converts into a Parquet file and validates using Great Expectations. > [!NOTE] > Flyte types are added because Great Expectations accepts a non-string value (a Pandas/Spark DataFrame) when using a `RuntimeDataConnector`, > but not when using an `InferredAssetFilesystemDataConnector` or a `ConfiguredAssetFilesystemDataConnector`. > For the latter case, with the integration of Flyte types, we can give a Pandas/Spark DataFrame or a remote URI as the dataset. Datasources can be integrated with the plugin using the following two modes: - **Flyte Task**: A task prototype that you can use within a task or a workflow to validate data using Great Expectations. - **Flyte Type**: Attaches the `GreatExpectationsType` to any dataset. Under the hood, `GreatExpectationsType` can be thought of as a combination of Great Expectations and Flyte types, where all data is validated against the expectations, much like the OpenAPI spec or a gRPC validator. ### Data Validation Failure If the data validation fails, the plugin will raise a `GreatExpectationsValidationError`. For example, this is how the error message looks on the Flyte UI: ```shell Traceback (most recent call last): ... great_expectations.marshmallow__shade.exceptions.ValidationError: Validation failed! COLUMN FAILED EXPECTATION passenger_count -> expect_column_min_to_be_between passenger_count -> expect_column_mean_to_be_between passenger_count -> expect_column_quantile_values_to_be_between passenger_count -> expect_column_values_to_be_in_set passenger_count -> expect_column_proportion_of_unique_values_to_be_between trip_distance -> expect_column_max_to_be_between trip_distance -> expect_column_mean_to_be_between trip_distance -> expect_column_median_to_be_between trip_distance -> expect_column_quantile_values_to_be_between trip_distance -> expect_column_proportion_of_unique_values_to_be_between rate_code_id -> expect_column_max_to_be_between rate_code_id -> expect_column_mean_to_be_between rate_code_id -> expect_column_proportion_of_unique_values_to_be_between ``` ## Plugin Parameters - **datasource_name**: The datasource name used in the Great Expectations config file. A Datasource brings together a way of interacting with data (like a database or Spark cluster) and some specific data (like a CSV file, or a database table). The datasource also assists in building batches out of data (for validation). - **expectation_suite_name**: The expectation suite that defines the data validation. - **data_connector_name**: Specifies how the data batches are identified. ### Optional Parameters - **context_root_dir**: Sets the path of the Great Expectations config directory. - **checkpoint_params**: Optional `SimpleCheckpoint` class parameters. - **batch_request_config**: Additional batch request configuration parameters.
- data_connector_query: Query to request a data batch. - runtime_parameters: Parameters to be sent at runtime. - batch_identifiers: Batch identifiers. - batch_spec_passthrough: Reader method if your file doesn't have an extension. - **data_asset_name**: Name of the data asset (to be used for `RuntimeBatchRequest`). - **local_file_path**: Local path to which the given dataset is downloaded. > [!NOTE] > It is a good idea to always set the **context_root_dir** parameter, since providing the path explicitly does no harm. > Moreover, **local_file_path** is essential when using `FlyteFile` and `FlyteSchema`. ## Plugin Installation To use the Great Expectations Flyte plugin, run the following command: ```shell $ pip install flytekitplugins-great_expectations ``` > [!NOTE] > Make sure to run workflows from the `flytekit_plugins` directory. ## Subpages - **Flytekit plugins > Great Expectations > Task Example** - **Flytekit plugins > Great Expectations > Type Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/greatexpectations-plugin/task-example === --- **Source**: integrations/flytekit-plugins/greatexpectations-plugin/task-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/greatexpectations-plugin/task-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/greatexpectations-plugin/type-example === --- **Source**: integrations/flytekit-plugins/greatexpectations-plugin/type-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/greatexpectations-plugin/type-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/memray-plugin === # Memray Profiling Memray tracks and reports memory allocations, both in Python code and in compiled extension modules. This Memray Profiling plugin enables memory tracking at the Flyte task level and renders a memory profiling graph on Flyte Deck. First, install the Memray plugin: ```bash $ pip install flytekitplugins-memray ``` ## Subpages - **Flytekit plugins > Memray Profiling > Memray Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/memray-plugin/memray-example === --- **Source**: integrations/flytekit-plugins/memray-plugin/memray-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/memray-plugin/memray-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/mlflow-plugin === # MLflow The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results. First, install the Flyte MLflow plugin: ```shell $ pip install flytekitplugins-mlflow ``` To log metrics and parameters to a Flyte Deck, add `@mlflow_autolog` to the task. For example: ```python @task(enable_deck=True) @mlflow_autolog(framework=mlflow.keras) def train_model(epochs: int): ... ``` To log metrics and parameters to a remote MLflow server, add the default environment variable [MLFLOW_TRACKING_URI](https://mlflow.org/docs/latest/tracking.html#logging-to-a-tracking-server) to the flytepropeller config map.
```shell $ kubectl edit cm flyte-propeller-config ``` ```yaml plugins: k8s: default-cpus: 100m default-env-vars: - MLFLOW_TRACKING_URI: postgresql+psycopg2://postgres:@postgres.flyte.svc.cluster.local:5432/flyteadmin ``` ![MLflow UI](../../../_static/images/integrations/flytekit-plugins/mlflow-plugin/mlflow-ui.png) ## Subpages - **Flytekit plugins > MLflow > Mlflow Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/mlflow-plugin/mlflow-example === --- **Source**: integrations/flytekit-plugins/mlflow-plugin/mlflow-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/mlflow-plugin/mlflow-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/modin-plugin === # Modin Modin is a pandas accelerator that helps handle large datasets. Pandas works gracefully with small datasets, but it is inherently single-threaded and designed to work on a single CPU core. With large datasets, the performance of pandas drops (it becomes slow or runs out of memory) due to single-core usage. This is where Modin can be helpful. Instead of optimizing pandas workflows for a specific setup, we can speed up pandas workflows by utilizing all the resources (cores) available in the system using the concept of `parallelism`, which is possible through Modin. [Here](https://modin.readthedocs.io/en/stable/getting_started/why_modin/pandas.html#scalablity-of-implementation) is a visual representation of how the cores are utilized in the case of pandas and Modin. ## Installation ```shell $ pip install flytekitplugins-modin ``` ## How is Modin different? Modin **scales** pandas workflows by changing only a **single line of code**. The plugin supports using a Modin DataFrame as an input to and output of a task/workflow, similar to how a pandas DataFrame can be used. ## Subpages - **Flytekit plugins > Modin > Knn Classifier** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/modin-plugin/knn-classifier === --- **Source**: integrations/flytekit-plugins/modin-plugin/knn-classifier.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/modin-plugin/knn-classifier/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/neptune-plugin === # Neptune [Neptune](https://neptune.ai/) is an experiment tracker for large-scale model training. It allows AI researchers to monitor their model training in real time, visualize and compare experiments, and collaborate on them with a team. This plugin enables seamless use of Neptune within Flyte by configuring links between the two platforms. You can find more information about how to use Neptune in their [documentation](https://docs.neptune.ai/). ## Installation To install the Flyte Neptune plugin, run the following command: ```shell $ pip install flytekitplugins-neptune ``` ## Local testing To run the **Neptune Example** (see the subpage below) locally: 1. Neptune Scale is available to select customers. You can access it [here](https://neptune.ai/free-trial). 2. Create a project on Neptune. 3. In the example, set `NEPTUNE_PROJECT` to your project name. 4. Add a secret using **Platform configuration > Secrets** with `key="neptune-api-token"` and `group="neptune-api-group"`. 5. If you want to see the dynamic log links in the UI, then add the configuration available in the next section.
## Flyte deployment configuration To enable dynamic log links, add the plugin to Flyte's configuration file as follows: ```yaml plugins: logs: dynamic-log-links: - neptune-scale-run: displayName: Neptune Run templateUris: - "https://scale.neptune.ai/{{ .taskConfig.project }}/-/run/?customId={{ .podName }}" - neptune-scale-custom-id: displayName: Neptune Run templateUris: - "https://scale.neptune.ai/{{ .taskConfig.project }}/-/run/?customId={{ .taskConfig.id }}" ``` ## Subpages - **Flytekit plugins > Neptune > Neptune Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/neptune-plugin/neptune-example === --- **Source**: integrations/flytekit-plugins/neptune-plugin/neptune-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/neptune-plugin/neptune-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/nim-plugin === # NIM Serve optimized model containers with NIM in a Flyte task. [NVIDIA NIM](https://www.nvidia.com/en-in/ai/), part of NVIDIA AI Enterprise, provides a streamlined path for developing AI-powered enterprise applications and deploying AI models in production. It includes an out-of-the-box optimization suite, enabling AI model deployment across any cloud, data center, or workstation. Since NIM can be self-hosted, you get greater control over cost and data privacy, as well as more visibility into behind-the-scenes operations. With NIM, you can invoke the model's endpoint as if it were hosted locally, minimizing network overhead. ## Installation To use the NIM plugin, run the following command: ```shell $ pip install flytekitplugins-inference ``` ## Example usage For a usage example, see **Flytekit plugins > NIM > Serve Nim Container**. > [!NOTE] > NIM can only be run in a Flyte cluster as it must be deployed as a sidecar service in a Kubernetes pod. ## Subpages - **Flytekit plugins > NIM > Serve Nim Container** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/nim-plugin/serve-nim-container === --- **Source**: integrations/flytekit-plugins/nim-plugin/serve-nim-container.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/nim-plugin/serve-nim-container/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/ollama-plugin === # Ollama Serve large language models (LLMs) in a Flyte task. [Ollama](https://ollama.com/) simplifies the process of serving fine-tuned LLMs. Whether you're generating predictions from a customized model or deploying it across different hardware setups, Ollama enables you to encapsulate the entire workflow in a single pipeline. ## Installation To use the Ollama plugin, run the following command: ```shell $ pip install flytekitplugins-inference ``` ## Example usage For a usage example, see **Flytekit plugins > Ollama > Serve Llm**. > [!NOTE] > Ollama can only be run in a Flyte cluster as it must be deployed as a sidecar service in a Kubernetes pod. ## Subpages - **Flytekit plugins > Ollama > Serve Llm** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/ollama-plugin/serve-llm === --- **Source**: integrations/flytekit-plugins/ollama-plugin/serve-llm.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/ollama-plugin/serve-llm/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/onnx-plugin === # ONNX Open Neural Network Exchange ([ONNX](https://github.com/onnx/onnx)) is an open standard format for representing machine learning and deep learning models.
It enables interoperability between different frameworks and streamlines the path from research to production. The flytekit ONNX type plugin comes in three flavors, one each for PyTorch, ScikitLearn, and TensorFlow models (see the subpages below). > [!NOTE] > If you'd like to add support for a new framework, please create an issue and submit a pull request to the flytekit repo. > You can find the ONNX plugin source code [here](https://github.com/flyteorg/flytekit/tree/master/plugins). ## Subpages - **Flytekit plugins > ONNX > Pytorch Onnx** - **Flytekit plugins > ONNX > Scikitlearn Onnx** - **Flytekit plugins > ONNX > Tensorflow Onnx** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/onnx-plugin/pytorch-onnx === --- **Source**: integrations/flytekit-plugins/onnx-plugin/pytorch-onnx.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/onnx-plugin/pytorch-onnx/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/onnx-plugin/scikitlearn-onnx === --- **Source**: integrations/flytekit-plugins/onnx-plugin/scikitlearn-onnx.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/onnx-plugin/scikitlearn-onnx/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/onnx-plugin/tensorflow-onnx === --- **Source**: integrations/flytekit-plugins/onnx-plugin/tensorflow-onnx.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/onnx-plugin/tensorflow-onnx/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/pandera-plugin === # Pandera Flytekit natively supports many data types, including a `FlyteSchema` type for type-annotating pandas dataframes. The flytekit pandera plugin provides an alternative for defining dataframe schemas by integrating with [pandera](https://pandera.readthedocs.io/en/stable/), which is a runtime data validation tool for pandas dataframes. ## Installation ```shell $ pip install flytekitplugins-pandera ``` ## Quick Start Pandera provides a flexible and expressive interface for defining schemas for tabular data, where you can define the types and other statistical properties of a column. ```python import pandas as pd import pandera as pa from pandera.typing import DataFrame, Series class Schema(pa.SchemaModel): column_1: Series[int] = pa.Field(ge=0) column_2: Series[float] = pa.Field(gt=0, lt=100) column_3: Series[str] = pa.Field(str_startswith="prefix") @pa.check("column_3") def check_str_length(cls, series): return series.str.len() > 5 @pa.check_types def processing_fn(df: DataFrame[Schema]) -> DataFrame[Schema]: df["column_1"] = df["column_1"] * 2 df["column_2"] = df["column_2"] * 0.5 df["column_3"] = df["column_3"] + "_suffix" return df raw_df = pd.DataFrame({ "column_1": [1, 2, 3], "column_2": [1.5, 2.21, 3.9], "column_3": ["prefix_a", "prefix_b", "prefix_c"], }) processed_df = processing_fn(raw_df) print(processed_df) ``` ```shell column_1 column_2 column_3 0 2 0.750 prefix_a_suffix 1 4 1.105 prefix_b_suffix 2 6 1.950 prefix_c_suffix ``` Informative errors are raised if invalid data is passed into `processing_fn`, indicating the failure cases and the indexes where they were found in the dataframe: ```python invalid_df = pd.DataFrame({ "column_1": [-1, 2, -3], "column_2": [1.5, 2.21, 3.9], "column_3": ["prefix_a", "prefix_b", "prefix_c"], }) processing_fn(invalid_df) ``` ```shell Traceback (most recent call last): ...
pandera.errors.SchemaError: error in check_types decorator of function 'processing_fn': failed element-wise validator 0: failure cases: index failure_case 0 0 -1 1 2 -3 ``` ## Subpages - **Flytekit plugins > Pandera > Basic Schema Example** - **Flytekit plugins > Pandera > Validating And Testing Ml Pipelines** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/pandera-plugin/basic-schema-example === --- **Source**: integrations/flytekit-plugins/pandera-plugin/basic-schema-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/pandera-plugin/basic-schema-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/pandera-plugin/validating-and-testing-ml-pipelines === --- **Source**: integrations/flytekit-plugins/pandera-plugin/validating-and-testing-ml-pipelines.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/pandera-plugin/validating-and-testing-ml-pipelines/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/papermill-plugin === # Papermill It is possible to run a Jupyter notebook as a Flyte task using [papermill](https://github.com/nteract/papermill). Papermill executes the notebook as a whole, so before using this plugin, it is essential to construct your notebook as recommended by papermill. When using this plugin, there are a few important things to keep in mind: 1. This plugin can be used for any task type: - It can be Python code, such as a TensorFlow model or a data transformation, i.e., things that run in a container and that you would typically write in a `@task`. - It can be a `flytekit.dynamic` workflow. - It can be any other plugin like `Spark`, `SageMaker`, etc. (**ensure that the plugin is installed as well**). 2. Flytekit will execute the notebook and capture the output notebook as an *.ipynb* file, as well as an HTML-rendered notebook. 3. Flytekit will pass the inputs into the notebook, as long as you have the first cell annotated as `parameters` and inputs are specified. 4. Flytekit will read the outputs from the notebook, as long as you annotate the notebook with `outputs` and outputs are specified. ## Installation To use the flytekit papermill plugin, simply run the following: ```shell $ pip install flytekitplugins-papermill ``` ## Subpages - **Flytekit plugins > Papermill > Simple** - **Flytekit plugins > Papermill > Simple Papermill Notebook** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/papermill-plugin/simple === --- **Source**: integrations/flytekit-plugins/papermill-plugin/simple.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/papermill-plugin/simple/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/papermill-plugin/nb-simple === 📥 [Download this notebook](https://github.com/unionai/unionai-examples/blob/main/v1/flyte-integrations/flytekit-plugins/papermill_plugin/papermill_plugin/nb_simple.ipynb) # Simple Papermill Notebook This notebook is used in the previous example as a `NotebookTask` to demonstrate how to use the papermill plugin. The cell below has the `parameters` tag, defining the inputs to the notebook. ```python v = 3.14 ``` Then, we do some computation: ```python square = v*v print(square) ``` Finally, we use the `flytekitplugins.papermill` package to record the outputs so that Flyte understands which state to serialize and pass into a downstream task.
```python from flytekitplugins.papermill import record_outputs record_outputs(square=square) ``` === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/duckdb-plugin === # DuckDB [DuckDB](https://duckdb.org/) is an in-process SQL OLAP database management system that is explicitly designed to achieve high performance in analytics. The Flytekit DuckDB plugin facilitates the efficient execution of intricate analytical queries within your workflow. To install the Flytekit DuckDB plugin, run the following command: ```shell $ pip install flytekitplugins-duckdb ``` The Flytekit DuckDB plugin includes the `flytekitplugins.duckdb.DuckDBQuery` task, which allows you to specify the following parameters: - `query`: The DuckDB query to execute. - `inputs`: The query parameters to be used during query execution. This can be a StructuredDataset, a string, or a list. ## Subpages - **Flytekit plugins > DuckDB > Duckdb Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/duckdb-plugin/duckdb-example === --- **Source**: integrations/flytekit-plugins/duckdb-plugin/duckdb-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/duckdb-plugin/duckdb-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/sql-plugin === # SQL Flyte tasks are not always restricted to running user-supplied containers, nor even containers at all. Indeed, this is one of the most important design decisions in Flyte. Non-container tasks can have arbitrary execution targets, such as an API that executes SQL queries (for example, Snowflake or BigQuery), a synchronous web API, etc. ## Subpages - **Flytekit plugins > SQL > Sql Alchemy** - **Flytekit plugins > SQL > Sqlite3 Integration** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/sql-plugin/sql-alchemy === --- **Source**: integrations/flytekit-plugins/sql-plugin/sql-alchemy.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/sql-plugin/sql-alchemy/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/sql-plugin/sqlite3-integration === --- **Source**: integrations/flytekit-plugins/sql-plugin/sqlite3-integration.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/sql-plugin/sqlite3-integration/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/wandb-plugin === # Weights and Biases The Weights and Biases MLOps platform helps AI developers streamline their ML workflows from end to end. This plugin enables seamless use of Weights & Biases within Flyte by configuring links between the two platforms. First, install the Flyte Weights & Biases plugin: ```shell $ pip install flytekitplugins-wandb ``` ## Subpages - **Flytekit plugins > Weights and Biases > Wandb Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/wandb-plugin/wandb-example === --- **Source**: integrations/flytekit-plugins/wandb-plugin/wandb-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/wandb-plugin/wandb-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/whylogs-plugin === # WhyLogs whylogs is open source software that allows you to log and inspect different aspects of your data and ML models. It creates efficient and mergeable statistical summaries of your datasets, called profiles, that have similar properties to logs produced by regular software applications.
## whylogs with Flyte The integration we've built consists of a type registration and renderers. ### whylogs Flyte Type The first part of the integration consists of the ability to pass a `DatasetProfileView` in and out of the desired tasks. whylogs' DatasetProfileView is the representation of a snapshot of your dataset. With this integration in Flyte's type system, users can profile their dataset once and use the statistical representation for many validations and drift detection systems downstream. To use it, pass a `pandas.DataFrame` to a task and call: ```python @task def profiling_task(data: pd.DataFrame) -> DatasetProfileView: results = why.log(data) return results.view() ``` This will grant any downstream task the ability to ingest the profiled dataset and use essentially anything from whylogs' API, such as transforming it back to a pandas DataFrame: ```python @task def consume_profile_view(profile_view: DatasetProfileView) -> pd.DataFrame: return profile_view.to_pandas() ``` ## Renderers The Summary Drift Report is a neat HTML report containing information on the distribution and drift detection of a target and a reference profile. It makes it easy for users to compare a newly read dataset against the one that was used to train the model that is in production. To use it, simply take the two desired `pandas.DataFrame` objects and call: ```python renderer = WhylogsSummaryDriftRenderer() report = renderer.to_html(target_data=new_data, reference_data=reference_data) flytekit.Deck("summary drift", report) ``` The other report that can be generated with our integration is the Constraints Report. With it, users get a view in a Flyte Deck showing which constraints passed and which failed, enabling them to act more quickly on potentially wrong results. ```python from whylogs.core.constraints.factories import greater_than_number @task def constraints_report(profile_view: DatasetProfileView) -> bool: builder = ConstraintsBuilder(dataset_profile_view=profile_view) builder.add_constraint(greater_than_number(column_name="my_column", number=10.0)) constraints = builder.build() renderer = WhylogsConstraintsRenderer() flytekit.Deck("constraints", renderer.to_html(constraints=constraints)) return constraints.validate() ``` Since a `Constraints` object is needed to create this report, users can also return a boolean indicating whether their dataset passed the validations, and take actions based on this result downstream, as the code snippet above shows. Another use case would be to return the constraints report itself and parse it to provide more information to other systems automatically. ```python constraints = builder.build() constraints.report() >> [('my_column greater than number 10.0', 0, 1)] ``` ## Installing the plugin To install the whylogs plugin, simply run: ```shell $ pip install flytekitplugins-whylogs ``` It will then be available in your environment. ## Subpages - **Flytekit plugins > WhyLogs > Whylogs Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flytekit-plugins/whylogs-plugin/whylogs-example === --- **Source**: integrations/flytekit-plugins/whylogs-plugin/whylogs-example.md **URL**: /docs/v1/flyte/integrations/flytekit-plugins/whylogs-plugin/whylogs-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins === # Native backend plugins This section covers native backend plugins.
## Subpages - **Native backend plugins > Dask** - **Native backend plugins > MPI** - **Native backend plugins > PyTorch Distributed** - **Native backend plugins > Ray** - **Native backend plugins > Spark** - **Native backend plugins > TensorFlow Distributed** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/k8s-dask-plugin === # Dask Flyte can natively execute [Dask](https://www.dask.org/) jobs on a Kubernetes Cluster, effortlessly managing the lifecycle of a virtual Dask cluster. This functionality is achieved by leveraging the open-sourced [Dask Kubernetes Operator](https://kubernetes.dask.org/en/latest/operator.html), and no additional sign-ups for services are required. The process is akin to running an ephemeral Dask cluster, which is created specifically for the Flyte task and subsequently torn down upon completion. In the Flyte Kubernetes environment, the cost is amortized due to faster pod creation compared to machines. However, the performance may be affected by the penalty of downloading Docker images. Additionally, it's essential to keep in mind that starting a pod is not as swift as running a process. Flytekit enables writing Dask code natively as a task, with the `Dask()` config automatically configuring the Dask cluster. The example provided in this section offers a hands-on tutorial for writing Dask Flyte tasks. ## Why use Kubernetes Dask? Managing Python dependencies can be challenging, but Flyte simplifies the process by enabling easy versioning and management of dependencies through containers. The Kubernetes Dask plugin extends the benefits of containerization to Dask without requiring the management of specialized Dask clusters. Pros: 1. Simple to get started, providing complete isolation between workloads. 2. Each job runs in isolation with its own virtual cluster, eliminating the complexities of dependency management. 3. Flyte takes care of all the management tasks. Cons: 1. Short-running, bursty jobs may not be the best fit due to container overhead. 2. Interactive Dask capabilities are not available with Flyte Kubernetes Dask; instead, it is better suited for running adhoc and scheduled jobs. ## Install the plugin Install `flytekitplugins-dask` using `pip` in your environment. ```shell $ pip install flytekitplugins-dask ``` > [!NOTE] > To enable Flyte to build the Docker image for you using `ImageSpec`, install `flytekitplugins-envd`. ## Implementation details ### Local execution When executing the Dask task on your local machine, it will utilize a local [distributed client](https://distributed.dask.org/en/stable/client.html). If you intend to link to a remote cluster during local development, simply define the `DASK_SCHEDULER_ADDRESS` environment variable with the URL of the remote scheduler. The `Client()` will then automatically connect to the cluster. ### Remote execution #### Step 1: Deploy Dask plugin in the Flyte backend Flyte Dask utilizes the [Dask Kubernetes operator](https://kubernetes.dask.org/en/latest/operator.html) in conjunction with a custom-built [Flyte Dask plugin](https://pkg.go.dev/github.com/flyteorg/flyteplugins@v1.0.28/go/tasks/plugins/k8s/dask). To leverage this functionality, you need to enable the backend plugin in your deployment. You can follow the steps mentioned in the {ref}`deployment-plugin-setup-k8s` section to enable the Flyte Dask plugin for your deployment. #### Step 2: Compute setup Ensure that your Kubernetes cluster has sufficient resources available. 
Depending on the resource requirements of your Dask job (including the job runner, scheduler and workers), you may need to adjust the resource quotas for the namespace accordingly. > [!NOTE] > When working with [Dask's custom resources](https://kubernetes.dask.org/en/latest/operator_resources.html#custom-resources), > your Flyte service account needs explicit > permissions. To that end, you need to create and bind a Cluster role. ##### Sample Cluster Role ```yaml
# Reconstructed sample; values such as the apiVersion, label keys, and apiGroups follow the
# standard dask-kubernetes operator Helm chart. Adjust names and namespaces to your deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dask-dask-kubernetes-operator-role-cluster
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    meta.helm.sh/release-name: dask
    meta.helm.sh/release-namespace: dask
rules:
  - verbs:
      - list
      - watch
    apiGroups:
      - apiextensions.k8s.io
    resources:
      - customresourcedefinitions
  - verbs:
      - get
      - list
      - watch
      - patch
      - create
      - delete
    apiGroups:
      - kubernetes.dask.org
    resources:
      - daskclusters
      - daskworkergroups
      - daskjobs
      - daskjobs/status
      - daskautoscalers
      - daskworkergroups/scale
``` ##### Binding command ```shell $ kubectl create clusterrolebinding flyte-dask-cluster-role-binding --clusterrole=dask-dask-kubernetes-operator-role-cluster --serviceaccount=<namespace>:<service-account-name> ``` ### Resource specification It's recommended to define `limits` as this will establish the `--nthreads` and `--memory-limit` parameters for the workers, in line with the practices suggested by Dask (refer to [best practices](https://kubernetes.dask.org/en/latest/kubecluster.html?highlight=--nthreads#best-practices)). When configuring resources, the following hierarchy is observed across all components of the Dask job (the job-runner pod, scheduler pod, and worker pods): 1. In the absence of specified resources, the [platform resources](https://github.com/flyteorg/flyte/blob/1e3d515550cb338c2edb3919d79c6fa1f0da5a19/charts/flyte-core/values.yaml#L520-L531) will be used. 2. When employing task resources, those will be enforced across all segments of the Dask job. You can achieve this using the following code snippet: ```python from flytekit import Resources, task from flytekitplugins.dask import Dask @task( task_config=Dask(), limits=Resources(cpu="1", mem="10Gi") # Applied to all components ) def my_dask_task(): ... ``` 3. When resources are designated for individual components, they hold the highest precedence. ```python from flytekit import Resources, task from flytekitplugins.dask import Dask, Scheduler, WorkerGroup @task( task_config=Dask( scheduler=Scheduler( limits=Resources(cpu="1", mem="2Gi"), # Applied to the scheduler pod ), workers=WorkerGroup( limits=Resources(cpu="4", mem="10Gi"), # Applied to all worker pods ), ), ) def my_dask_task(): ... ``` ### Images By default, all components of the deployed `dask` job (job runner pod, scheduler pod and worker pods) will use the image that was used while registering (this image should have `dask[distributed]` installed in its Python environment). This helps keep the Python environments of all cluster components in sync. However, it is possible to specify different images for the components. This allows for use cases such as using different images between tasks of the same workflow. While this is possible, it is not advised, as it can quickly lead to the Python environments getting out of sync.
```python from flytekit import Resources, task from flytekitplugins.dask import Dask, Scheduler, WorkerGroup @task( task_config=Dask( scheduler=Scheduler( image="my_image:0.1.0", # Will be used by the scheduler pod ), workers=WorkerGroup( image="my_image:0.1.0", # Will be used by the worker pods ), ), ) def my_dask_task(): ... ``` ### Environment variables Environment variables configured within the `@task` decorator will be propagated to all components of the Dask job, encompassing the job runner pod, scheduler pod and worker pods. ```python from flytekit import Resources, task from flytekitplugins.dask import Dask @task( task_config=Dask(), env={"FOO": "BAR"} # Will be applied to all components ) def my_dask_task(): ... ``` ### Labels and annotations Labels and annotations specified within a launch plan will be inherited by all components of the Dask job, which include the job runner pod, scheduler pod and worker pods. ```python from flytekit import Resources, task, workflow, Labels, Annotations from flytekitplugins.dask import Dask @task(task_config=Dask()) def my_dask_task(): ... @workflow def my_dask_workflow(): my_dask_task() # Labels and annotations will be passed on to all dask cluster components my_launch_plan = my_dask_workflow.create_launch_plan( labels=Labels({"myexecutionlabel": "bar", ...}), annotations=Annotations({"region": "SEA", ...}), ) ``` ### Interruptible tasks The Dask backend plugin offers support for execution on interruptible nodes. When `interruptible==True`, the plugin will incorporate the specified tolerations and node selectors into all worker pods. It's important to be aware that neither the job runner nor the scheduler will be deployed on interruptible nodes. ```python from flytekit import Resources, task, workflow, Labels, Annotations from flytekitplugins.dask import Dask @task( task_config=Dask(), interruptible=True, ) def my_dask_task(): ... ``` ## Run the example on the Flyte cluster To run the provided example on the Flyte cluster, use the following command: ```shell $ pyflyte run --remote dask_example.py \ hello_dask --size 1000 ``` ## Subpages - **Native backend plugins > Dask > Dask Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/k8s-dask-plugin/dask-example === --- **Source**: integrations/native-backend-plugins/k8s-dask-plugin/dask-example.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/k8s-dask-plugin/dask-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/kfmpi-plugin === # MPI In this section, you'll find a demonstration of running Horovod code with the Kubeflow MPI API. ## Horovod [Horovod](http://horovod.ai/) stands as a distributed deep learning training framework compatible with TensorFlow, Keras, PyTorch and Apache MXNet.
Its primary objective is to enhance the speed and usability of distributed deep learning through the implementation of ring-allreduce. This technique necessitates just a few minimal modifications to the user's code, thereby simplifying the process of distributed deep learning. ## MPI (Message Passing Interface) The Flyte platform employs the [Kubeflow training operator](https://github.com/kubeflow/training-operator), to facilitate streamlined execution of all-reduce-style distributed training on Kubernetes. This integration offers a straightforward interface for conducting distributed training through the utilization of MPI. The combined power of MPI and Horovod can be harnessed to streamline the complexities of distributed training. The MPI API serves as a convenient encapsulation to execute Horovod scripts, thereby enhancing the overall efficiency of the process. ## Install the plugin Install the MPI plugin by running the following command: ```shell $ pip install flytekitplugins-kfmpi ``` ## Build a Docker image The Dockerfile should include installation commands for various components, including MPI and Horovod. ```dockerfile FROM ubuntu:focal LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks WORKDIR /root ENV VENV /opt/venv ENV LANG C.UTF-8 ENV LC_ALL C.UTF-8 ENV PYTHONPATH /root ENV DEBIAN_FRONTEND=noninteractive # Install Python3.10 and other libraries RUN apt-get update \ && apt-get install -y software-properties-common \ && add-apt-repository ppa:ubuntu-toolchain-r/test \ && add-apt-repository -y ppa:deadsnakes/ppa \ && apt-get install -y \ build-essential \ cmake \ g++-7 \ curl \ git \ wget \ python3.10 \ python3.10-venv \ python3.10-dev \ make \ libssl-dev \ python3-pip \ python3-wheel \ libuv1 # Virtual environment ENV VENV /opt/venv RUN python3.10 -m venv ${VENV} ENV PATH="${VENV}/bin:$PATH" ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python # Install wheel after venv is activated RUN pip3 install wheel # Install Open MPI RUN wget --progress=dot:mega -O /tmp/openmpi-4.1.4-bin.tar.gz https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && \ cd /tmp && tar -zxf /tmp/openmpi-4.1.4-bin.tar.gz && \ mkdir openmpi-4.1.4/build && cd openmpi-4.1.4/build && ../configure --prefix=/usr/local && \ make -j all && make install && ldconfig && \ mpirun --version # Allow OpenSSH to talk to containers without asking for confirmation RUN mkdir -p /var/run/sshd RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \ echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \ mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Install Python dependencies COPY requirements.in /root RUN pip install -r /root/requirements.in # Install TensorFlow # In case you encounter the "The TensorFlow library was compiled to use AVX instructions, which are not present on your machine" error, # you can resolve it by installing TensorFlow using the following RUN instruction: # RUN wget https://tf.novaal.de/westmere/tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl && pip install tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl # Otherwise: RUN pip install tensorflow==2.8.0 # Enable GPU # ENV HOROVOD_GPU_OPERATIONS NCCL RUN HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod==0.28.1 # Copy the actual code COPY . 
/root/ # This tag is supplied by the build script and will be used to determine the version # when registering tasks, workflows, and launch plans ARG tag ENV FLYTE_INTERNAL_IMAGE $tag ``` ## Run the example on the Flyte cluster To run the provided example on the Flyte cluster, use the following command: ```shell $ pyflyte run --remote \ --image ghcr.io/flyteorg/flytecookbook:kfmpi_plugin-latest \ https://raw.githubusercontent.com/flyteorg/flytesnacks/master/examples/kfmpi_plugin/kfmpi_plugin/mpi_mnist.py \ horovod_training_wf ``` ## MPI Plugin Troubleshooting Guide This section covers common issues encountered during the setup of the MPI operator for distributed training jobs on Flyte. **Worker Pods Failing to Start (Insufficient Resources)** MPI worker pods may fail to start or exhibit scheduling issues, leading to job timeouts or failures. This often occurs due to resource constraints (CPU, memory, or GPU) in the cluster. 1. Adjust Resource Requests: Ensure that each worker pod has sufficient resources. You can adjust the resource requests in your task definition: ```python requests=Resources(cpu="", mem="") ``` Modify the CPU and memory values according to your cluster's available resources. This helps prevent pod scheduling failures caused by resource constraints. 2. Check Pod Logs for Errors: If the worker pods still fail to start, check the logs for any related errors: ```shell $ kubectl logs -n ``` Look for resource allocation or worker communication errors. **Workflow Registration Method Errors (Timeouts or Deadlocks)** If your MPI workflow hangs or times out, it may be caused by an incorrect workflow registration method. 1. Verify Registration Method: When using a custom image, refer to the Flyte documentation on **Development cycle > Running your code** to ensure you're following the correct registration method. ## Subpages - **Native backend plugins > MPI > Mpi Mnist** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/kfmpi-plugin/mpi-mnist === --- **Source**: integrations/native-backend-plugins/kfmpi-plugin/mpi-mnist.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/kfmpi-plugin/mpi-mnist/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/kfpytorch-plugin === # PyTorch Distributed The Kubeflow PyTorch plugin leverages the [Kubeflow training operator](https://github.com/kubeflow/training-operator) to offer a highly streamlined interface for conducting distributed training using different PyTorch backends. ## Install the plugin To use the PyTorch plugin, run the following command: ```shell $ pip install flytekitplugins-kfpytorch ``` To enable the plugin in the backend, follow instructions outlined in the {ref}`deployment-plugin-setup-k8s` guide. 
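For orientation, the snippet below is a minimal sketch of what a task configured for distributed PyTorch training can look like. It assumes the `PyTorch` task config exposed by `flytekitplugins-kfpytorch`; depending on your plugin version, the worker count may instead be expressed as `PyTorch(worker=Worker(replicas=...))`, and the training body here is only a placeholder.

```python
from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch


@task(
    task_config=PyTorch(num_workers=2),  # Requests 2 worker replicas from the Kubeflow training operator
    requests=Resources(cpu="2", mem="4Gi"),
    limits=Resources(cpu="4", mem="8Gi"),
)
def train() -> float:
    # Placeholder training body: real code would initialize
    # torch.distributed and run the training loop here.
    return 0.0
```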
## Run the example on the Flyte cluster To run the provided examples, use the following commands: Distributed PyTorch training: ```shell $ pyflyte run --remote pytorch_mnist.py pytorch_training_wf ``` PyTorch Lightning training: ```shell $ pyflyte run --remote pytorch_lightning_mnist_autoencoder.py train_workflow ``` ## Subpages - **Native backend plugins > PyTorch Distributed > Pytorch Mnist** - **Native backend plugins > PyTorch Distributed > Pytorch Lightning Mnist Autoencoder** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/kfpytorch-plugin/pytorch-mnist === --- **Source**: integrations/native-backend-plugins/kfpytorch-plugin/pytorch-mnist.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/kfpytorch-plugin/pytorch-mnist/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/kfpytorch-plugin/pytorch-lightning-mnist-autoencoder === --- **Source**: integrations/native-backend-plugins/kfpytorch-plugin/pytorch-lightning-mnist-autoencoder.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/kfpytorch-plugin/pytorch-lightning-mnist-autoencoder/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/ray-plugin === # Ray [KubeRay](https://github.com/ray-project/kuberay) is an open-source toolkit designed to facilitate the execution of Ray applications on Kubernetes. It offers a range of tools that enhance the operational aspects of running and overseeing Ray on Kubernetes. Key components include: - Ray Operator - Backend services for cluster resource creation and deletion - Kubectl plugin/CLI for CRD object management - Seamless integration of Jobs and Serving functionality with Clusters ## Install the plugin To install the Ray plugin, run the following command: ```shell $ pip install flytekitplugins-ray ``` To enable the plugin in the backend, refer to the instructions provided in the **Plugins > Kubernetes Plugins** guide. ## Implementation details ### Submit a Ray job to an existing cluster ```python import typing import ray from flytekit import task from flytekitplugins.ray import RayJobConfig @ray.remote def f(x): return x * x @task( task_config=RayJobConfig( address="<RAY_CLUSTER_ADDRESS>", # Placeholder: address of the existing Ray cluster runtime_env={"pip": ["numpy", "pandas"]} ) ) def ray_task() -> typing.List[int]: futures = [f.remote(i) for i in range(5)] return ray.get(futures) ``` ### Create a Ray cluster managed by Flyte and run a Ray Job on the cluster ```python import typing import ray from flytekit import task from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig, HeadNodeConfig @task(task_config=RayJobConfig(worker_node_config=[WorkerNodeConfig(group_name="test-group", replicas=10)])) def ray_task() -> typing.List[int]: futures = [f.remote(i) for i in range(5)] return ray.get(futures) ``` ## Run the example on the Flyte cluster To run the provided example on the Flyte cluster, use the following command: ```shell $ pyflyte run --remote ray_example.py \ ray_workflow --n 10 ``` ## Subpages - **Native backend plugins > Ray > Ray Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/ray-plugin/ray-example === --- **Source**: integrations/native-backend-plugins/ray-plugin/ray-example.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/ray-plugin/ray-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/k8s-spark-plugin === # Spark Flyte has the capability to directly execute Spark jobs on a Kubernetes Cluster.
The cluster handles the lifecycle, initiation and termination of these virtual clusters. It harnesses the open-source [Spark on Kubernetes operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and can be enabled without requiring any service subscription. This functionality is akin to operating a transient Spark cluster: a cluster established specifically for a Spark job and torn down upon completion. While these clusters are optimal for production workloads, they do come with the additional cost of setup and teardown. In the Flyte environment, this cost is spread out over time due to the swiftness of creating pods compared to a full machine. However, keep in mind that the performance might be impacted by the need to download Docker images, and starting a pod is not as immediate as running a process. With Flytekit, you can compose PySpark code natively as a task. The Spark cluster will be automatically configured using the specified Spark configuration. The examples provided in this section offer a hands-on tutorial for writing PySpark tasks. > [!NOTE] > This plugin has been rigorously tested at scale, successfully managing more than 100,000 Spark jobs through Flyte at Lyft. > However, please bear in mind that this functionality requires significant Kubernetes capacity and meticulous configuration. For optimal results, we highly recommend adopting the **Platform configuration > Optimizing Performance > Multi-Cluster mode**. Additionally, consider enabling resource quotas for Spark jobs that are both large in scale and executed frequently. Nonetheless, it is important to note that extremely short-duration jobs might not be the best fit for this setup. In such cases, utilizing a pre-spawned cluster could be more advantageous. A job can be considered "short" if its runtime is less than 2 to 3 minutes. In these situations, the cost of initializing pods might outweigh the actual execution cost. ## Why use Kubernetes Spark? Managing Python dependencies can be challenging, but Flyte simplifies the process by enabling easy versioning and management of dependencies through containers. The Kubernetes Spark plugin extends the benefits of containerization to Spark without requiring the management of specialized Spark clusters. Pros: 1. Simple to get started, providing complete isolation between workloads. 2. Each job runs in isolation with its own virtual cluster, eliminating the complexities of dependency management. 3. Flyte takes care of all the management tasks. Cons: 1. Short-running, bursty jobs may not be the best fit due to container overhead. 2. Interactive Spark capabilities are not available with Flyte Kubernetes Spark; instead, it is better suited for running ad hoc and scheduled jobs. ## Implementation details ### Step 1: Deploy Spark plugin in the Flyte backend Flyte Spark employs the Spark on K8s operator in conjunction with a bespoke [Flyte Spark Plugin](https://pkg.go.dev/github.com/flyteorg/flyteplugins@v0.5.25/go/tasks/plugins/k8s/spark). This plugin serves as a backend component and necessitates activation within your deployment. To enable it, follow the instructions outlined in **Plugins > Kubernetes Plugins**. > [!NOTE] > Refer to [this guide](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/gcp.md) to use GCP instead of AWS. ### Step 2: Environment Setup Install `flytekitplugins-spark` using `pip` in your environment.
```shell $ pip install flytekitplugins-spark ``` > [!NOTE] > To enable Flyte to build the Docker image for you using `ImageSpec`, install `flytekitplugins-envd`. Ensure that your Kubernetes cluster has sufficient resources available. Depending on the resource requirements of your Spark job across the driver and executors, you may need to adjust the resource quotas for the namespace accordingly. ### Step 3: Optionally, set up visibility Whenever a Spark job is executed, you have the opportunity to access a Spark application UI link for real-time job monitoring. Additionally, for past executions, you can leverage the Spark history server to access the stored history of Spark executions. Furthermore, Flyte offers the capability to generate direct links to both the Spark driver logs and individual Spark executor logs. These Spark-related features, including the Spark history server and Spark UI links, are seamlessly displayed on the Flyte Console. Their availability is contingent upon the following configuration settings: #### Configure the Spark history link within the UI To access the Spark history UI link within the Flyte Console, it's necessary to configure a variable in the Spark section of the Flyteplugins configuration. Here's an example of how to set it up: ```yaml plugins: spark: spark-history-server-url: ``` You can explore various configuration options by referring to [this link](https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/k8s/spark/config.go). #### Configure the Spark application UI To obtain a link for the ongoing Spark drivers and the Spark application UI, you must set up Kubernetes to allow wildcard ingress access using `*.my-domain.net`. Additionally, you should configure the Spark on Kubernetes operator to establish a new ingress route for each application. This can be achieved through the `ingress-url-format` command-line option of the Spark Operator. You can find more details about this option in the source code [here](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/d38c904a4dd84e849408153cdf4d7a30a7be5a07/main.go#L62). #### Configure the Spark driver and executor logs The logs can be configured by adjusting the `logs` configuration within the Spark plugin settings. The Spark plugin utilizes the same default log configuration outlined in the section on **Platform configuration > Configuring logging links in the UI**. The SparkPlugin offers the capability to segregate user (Spark user code) and system (Spark core logs) logs, thus enhancing visibility into Spark operations. This is, however, feasible only if you can route the spark user logs separately from the core logs. It's important to note that Flyte does not perform automatic log separation. You can review the configuration structure [here](https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/k8s/spark/config.go#L31-L36). - _Mixed_: Provides unseparated logs from the Spark driver (combining both user and system logs), following the standard structure of all log plugins. You can obtain links to the Kubernetes dashboard or a preferred log aggregator as long as it can generate standardized links. - _User_: Offers logs from the driver with separation (subject to log separation availability). - _System_: Covers logs from executors, typically without individual links for each executor; instead, it provides a prefix where all executor logs are accessible. - _AllUser_: Encompasses all user logs across spark-submit, driver and executor. 
Log configuration example: ```yaml plugins: spark: logs: user: kubernetes-enabled: true kubernetes-url: mixed: cloudwatch-enabled: true cloudwatch-template-uri: "https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=;prefix=var.log.containers.{{.podName}};streamFilter=typeLogStreamPrefix" system: cloudwatch-enabled: true cloudwatch-template-uri: "https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=;prefix=system_log.var.log.containers.{{.podName}};streamFilter=typeLogStreamPrefix" all-user: cloudwatch-enabled: true cloudwatch-template-uri: "https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=;prefix=var.log.containers.{{.podName}};streamFilter=typeLogStreamPrefix" ``` #### Additional configuration The Spark plugin provides support for a range of extended configuration options. For instance, if you wish to enable specific Spark features as defaults for all Spark applications, you can apply default Spark configurations. For more comprehensive information, please consult the [configuration structure](https://github.com/flyteorg/flyteplugins/blob/c528bb88937b4732c9cb5537ed8ea6943ff4fb56/go/tasks/plugins/k8s/spark/config.go#L24-L29). ## Run the examples on the Flyte cluster To run the provided examples on the Flyte cluster, use any of the following commands: ```shell $ pyflyte run --remote pyspark_pi.py my_spark ``` ```shell $ pyflyte run --remote dataframe_passing.py my_smart_structured_dataset ``` ## Subpages - **Native backend plugins > Spark > Dataframe Passing** - **Native backend plugins > Spark > Pyspark Pi** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/k8s-spark-plugin/dataframe-passing === --- **Source**: integrations/native-backend-plugins/k8s-spark-plugin/dataframe-passing.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/k8s-spark-plugin/dataframe-passing/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/k8s-spark-plugin/pyspark-pi === --- **Source**: integrations/native-backend-plugins/k8s-spark-plugin/pyspark-pi.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/k8s-spark-plugin/pyspark-pi/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/kftensorflow-plugin === # TensorFlow Distributed TensorFlow operator is useful to natively run distributed TensorFlow training jobs on Flyte. It leverages the [Kubeflow training operator](https://github.com/kubeflow/training-operator). ## Install the plugin To install the Kubeflow TensorFlow plugin, run the following command: ```shell $ pip install flytekitplugins-kftensorflow ``` To enable the plugin in the backend, follow instructions outlined in the {ref}`deployment-plugin-setup-k8s` guide. 
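As a rough illustration, a task configured for distributed TensorFlow training might look like the sketch below. It assumes the `TfJob` config from `flytekitplugins-kftensorflow`; newer plugin versions express the topology as `TfJob(worker=Worker(replicas=...), ps=PS(replicas=...), chief=Chief(replicas=...))` instead of the count arguments shown here, and the training body is only a placeholder.

```python
from flytekit import Resources, task
from flytekitplugins.kftensorflow import TfJob


@task(
    task_config=TfJob(num_workers=2, num_ps_replicas=1, num_chief_replicas=1),  # TFJob topology
    requests=Resources(cpu="2", mem="4Gi"),
)
def train() -> float:
    # Placeholder: real code would build a tf.distribute strategy
    # and run model.fit() here.
    return 0.0
```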
## Run the example on the Flyte cluster To run the provided example on the Flyte cluster, use the following command: ```shell $ pyflyte run --remote tf_mnist.py \ mnist_tensorflow_workflow ``` ## Subpages - **Native backend plugins > TensorFlow Distributed > Tf Mnist** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/native-backend-plugins/kftensorflow-plugin/tf-mnist === --- **Source**: integrations/native-backend-plugins/kftensorflow-plugin/tf-mnist.md **URL**: /docs/v1/flyte/integrations/native-backend-plugins/kftensorflow-plugin/tf-mnist/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins === # External service backend plugins This section covers external service backend plugins. ## Subpages - **External service backend plugins > AWS Athena** - **External service backend plugins > AWS Batch** - **External service backend plugins > FlyteInteractive** - **External service backend plugins > Hive** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/athena-plugin === # AWS Athena ## Executing Athena Queries The Flyte backend can be connected with Athena. Once enabled, it allows you to query the AWS Athena service (Presto + ANSI SQL support) and optionally retrieve a typed schema. This plugin is purely a spec and, since SQL is completely portable, there is no need to build a container. Thus this plugin example does not have any Dockerfile. ### Installation To use the flytekit Athena plugin, simply run the following: ```shell $ pip install flytekitplugins-athena ``` Now let's dive into the code. ## Subpages - **External service backend plugins > AWS Athena > Athena** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/athena-plugin/athena === --- **Source**: integrations/external-service-backend-plugins/athena-plugin/athena.md **URL**: /docs/v1/flyte/integrations/external-service-backend-plugins/athena-plugin/athena/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/aws-batch-plugin === # AWS Batch ## Executing Batch Job The Flyte backend can be connected with AWS Batch. Once enabled, it allows you to run regular tasks on AWS Batch. This section provides a guide on how to use the AWS Batch plugin with flytekit in Python. ### Installation To use the flytekit AWS Batch plugin, simply run the following: ```shell $ pip install flytekitplugins-awsbatch ``` ### Configuring the backend to get AWS Batch working See **Plugins > AWS Batch**. ### Quick Start This plugin allows you to run batch tasks on AWS and only requires you to change a few lines of code. We can then move workflow execution from Kubernetes to AWS.
```python from flytekit import task from flytekitplugins.awsbatch import AWSBatchConfig config = AWSBatchConfig( parameters={"codec": "mp4"}, platformCapabilities="EC2", propagateTags=True, retryStrategy={"attempts": 10}, tags={"hello": "world"}, timeout={"attemptDurationSeconds": 60}, ) @task(task_config=config) def t1(a: int) -> str: return str(a) ``` ## Subpages - **External service backend plugins > AWS Batch > Batch** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/aws-batch-plugin/batch === --- **Source**: integrations/external-service-backend-plugins/aws-batch-plugin/batch.md **URL**: /docs/v1/flyte/integrations/external-service-backend-plugins/aws-batch-plugin/batch/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/flyteinteractive-plugin === # FlyteInteractive FlyteInteractive provides interactive task development in a remote environment. This allows developers to leverage remote environment capabilities while accessing features like debugging, code inspection, and Jupyter Notebook, traditionally available in local IDEs. Flyte tasks, designed as one-off jobs, require users to wait until completion to view results. These tasks are developed locally in a virtual environment before being deployed remotely. However, differences in data access, GPU availability, and dependencies between local and remote environments often lead to discrepancies, making local success an unreliable indicator of remote success. This results in frequent, tedious debugging cycles. ## Installation To use the FlyteInteractive plugin, run the following command: ```shell $ pip install flytekitplugins-flyteinteractive ``` ## Acknowledgement This feature was created at LinkedIn and later donated to Flyte. ## Subpages - **External service backend plugins > FlyteInteractive > Jupyter** - **External service backend plugins > FlyteInteractive > Vscode** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/flyteinteractive-plugin/jupyter === --- **Source**: integrations/external-service-backend-plugins/flyteinteractive-plugin/jupyter.md **URL**: /docs/v1/flyte/integrations/external-service-backend-plugins/flyteinteractive-plugin/jupyter/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/flyteinteractive-plugin/vscode === --- **Source**: integrations/external-service-backend-plugins/flyteinteractive-plugin/vscode.md **URL**: /docs/v1/flyte/integrations/external-service-backend-plugins/flyteinteractive-plugin/vscode/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/hive-plugin === # Hive The Flyte backend can be connected with various Hive services. Once enabled, it allows you to query a Hive service (e.g. Qubole) and optionally retrieve a typed schema. This section shows how to use the Hive Query Plugin with flytekit in Python. ## Installation To use the flytekit Hive plugin, simply run the following: ```shell $ pip install flytekitplugins-hive ``` ## No need for a Dockerfile This plugin is purely a spec. Since SQL is completely portable, there is no need to build a Docker container.
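Since the Hive plugin is spec-only, a query is declared rather than written as Python code that runs in a container. The sketch below is a hypothetical example of what such a declaration can look like; the `HiveTask`/`HiveConfig` parameter names and the `cluster_label` value are assumptions based on typical usage of the plugin and may differ in your version.

```python
from flytekit import kwtypes, workflow
from flytekit.types.schema import FlyteSchema
from flytekitplugins.hive import HiveConfig, HiveTask

# Declare a Hive query as a task: the query template is rendered with the
# task inputs and submitted to the configured Hive service.
hive_task = HiveTask(
    name="my_hive_query",
    inputs=kwtypes(ds=str),
    task_config=HiveConfig(cluster_label="flyte"),  # Assumed cluster label
    query_template="SELECT 1 AS col WHERE '{{ .inputs.ds }}' = '2024-01-01'",
    output_schema_type=FlyteSchema,
)


@workflow
def wf(ds: str) -> FlyteSchema:
    return hive_task(ds=ds)
```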
## Subpages - **External service backend plugins > Hive > Hive** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/external-service-backend-plugins/hive-plugin/hive === --- **Source**: integrations/external-service-backend-plugins/hive-plugin/hive.md **URL**: /docs/v1/flyte/integrations/external-service-backend-plugins/hive-plugin/hive/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flyte-operators === # Flyte operators This section covers Flyte operators. ## Subpages - **Flyte operators > Airflow Provider** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flyte-operators/airflow-plugin === # Airflow Provider The `airflow-provider-flyte` package provides an operator, a sensor, and a hook that integrates Flyte into Apache Airflow. `FlyteOperator` is helpful to trigger a task/workflow in Flyte and `FlyteSensor` enables monitoring a Flyte execution status for completion. The primary use case of this provider is to **scale Airflow for machine learning tasks using Flyte**. With the Flyte Airflow provider, you can construct your ETL pipelines in Airflow and machine learning pipelines in Flyte and use the provider to trigger machine learning or Flyte pipelines from within Airflow. ## Installation ```shell $ pip install airflow-provider-flyte ``` All the configuration options for the provider are available in the provider repo's [README](https://github.com/flyteorg/airflow-provider-flyte#readme). ## Subpages - **Flyte operators > Airflow Provider > Airflow** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/flyte-operators/airflow-plugin/airflow === --- **Source**: integrations/flyte-operators/airflow-plugin/airflow.md **URL**: /docs/v1/flyte/integrations/flyte-operators/airflow-plugin/airflow/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations === # Deprecated integrations This section covers deprecated integrations. ## Subpages - **Deprecated integrations > BigQuery plugin** - **Deprecated integrations > Databricks plugin** - **Deprecated integrations > Kubernetes Pods** - **Deprecated integrations > Snowflake plugin** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/bigquery-plugin === # BigQuery plugin > [!WARNING] > This example code uses the legacy implementation of the BigQuery integration. We recommend using the [BigQuery connector](../../connectors/bigquery-connector/_index) instead. This directory contains example code for the deprecated BigQuery plugin. ## Subpages - **Deprecated integrations > BigQuery plugin > Bigquery Plugin Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/bigquery-plugin/bigquery-plugin-example === --- **Source**: integrations/deprecated-integrations/bigquery-plugin/bigquery-plugin-example.md **URL**: /docs/v1/flyte/integrations/deprecated-integrations/bigquery-plugin/bigquery-plugin-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/databricks-plugin === # Databricks plugin > [!WARNING] > This example code uses a legacy implementation of the Databricks integration. We recommend using the [Databricks connector](../../connectors/databricks-connector/_index) instead. This directory contains example code for the deprecated Databricks plugin. 
## Subpages - **Deprecated integrations > Databricks plugin > Databricks Plugin Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/databricks-plugin/databricks-plugin-example === --- **Source**: integrations/deprecated-integrations/databricks-plugin/databricks-plugin-example.md **URL**: /docs/v1/flyte/integrations/deprecated-integrations/databricks-plugin/databricks-plugin-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/k8s-pod-plugin === # Kubernetes Pods > [!IMPORTANT] > This plugin is no longer needed and is here only for backwards compatibility. No new versions will be published after v1.13.x Please use the `pod_template` and `pod_template_name` arguments to `@task`. Flyte tasks, represented by the `@task` decorator, are essentially single functions that run in one container. However, there may be situations where you need to run a job with more than one container or require additional capabilities, such as: - Running a hyper-parameter optimizer that stores state in a Redis database - Simulating a service locally - Running a sidecar container for logging and monitoring purposes - Running a pod with additional capabilities, such as mounting volumes To support these use cases, Flyte provides a Pod configuration that allows you to customize the pod specification used to run the task. This simplifies the process of implementing the Kubernetes pod abstraction for running multiple containers. > [!NOTE] > A Kubernetes pod will not exit if it contains any sidecar containers (containers that do not exit automatically). > You do not need to write any additional code to handle this, as Flyte automatically manages pod tasks. ## Installation To use the Flytekit pod plugin, run the following command: ```shell $ pip install flytekitplugins-pod ``` ## Subpages - **Deprecated integrations > Kubernetes Pods > Pod** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/k8s-pod-plugin/pod === --- **Source**: integrations/deprecated-integrations/k8s-pod-plugin/pod.md **URL**: /docs/v1/flyte/integrations/deprecated-integrations/k8s-pod-plugin/pod/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/snowflake-plugin === # Snowflake plugin > [!WARNING] > This example code uses a legacy implementation of the Snowflake integration. We recommend using the [Snowflake connector](../../connectors/snowflake-connector/_index) instead. This directory contains example code for the deprecated Snowflake plugin. ## Subpages - **Deprecated integrations > Snowflake plugin > Snowflake Plugin Example** === PAGE: https://www.union.ai/docs/v1/flyte/integrations/deprecated-integrations/snowflake-plugin/snowflake-plugin-example === --- **Source**: integrations/deprecated-integrations/snowflake-plugin/snowflake-plugin-example.md **URL**: /docs/v1/flyte/integrations/deprecated-integrations/snowflake-plugin/snowflake-plugin-example/ **Weight**: 1 === PAGE: https://www.union.ai/docs/v1/flyte/api-reference === # Reference This section provides the reference material for all Flyte APIs, SDKs and CLIs. To get started, add `flytekit` to your project ```shell $ uv add flytekit ``` This will install the Flytekit SDKs and the `pyflyte` CLI. ### ๐Ÿ”— **Flytekit SDK** The Flytekit SDK provides the core Python API for building Flyte workflows. ### ๐Ÿ”— **Pyflyte CLI** The Pyflyte CLI is the command-line interface for interacting with your Flyte instance. 
### ๐Ÿ”— **Flytectl CLI** The Flytectl CLI is an alternative CLI for performing administrative tasks and for use in CI/CD environments. ### ๐Ÿ”— **Flyteidl** Flyteidl is the specification for the Flyte language in protobuf. ## Subpages - **LLM context documents** - **Pyflyte CLI** - **Flytectl CLI** - **Flyteidl** - **FlyteKit Plugins** - **Flytekit SDK** === PAGE: https://www.union.ai/docs/v1/flyte/api-reference/flyte-context === # LLM context documents The following documents provide LLM context for authoring and running Flyte/Union workflows. They can serve as a reference for LLM-based AI assistants to understand how to properly write, configure, and execute Flyte/Union workflows. * **Full documentation content**: The entire documentation (this site) for Flyte version 1.0 in a single text file. * ๐Ÿ“ฅ [llms-full.txt](/_static/public/llms-full.txt) * **Concise context document**: A concise overview of Flyte 1.0 concepts. * ๐Ÿ“ฅ [llms-concise.txt](/_static/public/llms-concise.txt) You can then add either or both to the context window of your LLM-based AI assistant to help it better understand Flyte/Union development. === PAGE: https://www.union.ai/docs/v1/flyte/api-reference/pyflyte-cli === # Pyflyte CLI The `pyflyte` CLI is the main tool developers use to interact with Flyte on the command line. ## Installation The recommended way to install the `pyflyte` CLI outside a workflow project is to use [`uv`](https://docs.astral.sh/uv/): ```shell $ uv tool install flytekit ``` This will install the `pyflyte` CLI globally on your system [as a `uv` tool](https://docs.astral.sh/uv/concepts/tools/). ## Configure the `pyflyte` CLI To configure the `pyflyte` CLI, create the file `~/.flyte/config.yaml` with the configuration information needed to connect to your Flyte instance. See **Getting started > Local setup** for more details. ## Overriding the configuration file location By default, the `pyflyte` CLI will look for a configuration file at `~/.flyte/config.yaml`. You can override this behavior and specify a different configuration file by setting the `FLYTECTL_CONFIG` environment variable: ```shell export FLYTECTL_CONFIG=~/.my-config-location/my-config.yaml ``` Alternatively, you can always specify the configuration file on the command line when invoking `pyflyte` by using the `--config` flag: ```shell $ pyflyte --config ~/.my-config-location/my-config.yaml run my_script.py my_workflow ``` ## `pyflyte` CLI configuration search path The `pyflyte` CLI will check for configuration files as follows: First, if a `--config` option is used, it will use the specified config file. Second, the config file pointed to by the `FLYTECTL_CONFIG` environment variable. Third, the hard-coded location `~/.flyte/config.yaml`. If none of these are present, the CLI will raise an error. ## `pyflyte` CLI commands Entrypoint for all the user commands. `pyflyte [OPTIONS] COMMAND [ARGS]...` **Options** **`-v, --verbose`** Show verbose messages and exception traces **`-k, --pkgs`** Dot-delineated Python packages to operate on. Multiple may be specified (you can use commas, or specify the switch multiple times). Please note that this option will override the option specified in the configuration file, or environment variable **`-c, --config`** Path to config file for use within container ### backfill The backfill command generates and registers a new workflow based on the input launchplan to run an automated backfill.
The workflow can be managed using the Flyte UI and can be canceled, relaunched, and recovered. > `launchplan` refers to the name of the launch plan. `launchplan_version` is optional and should be a valid version of that launch plan. ``` pyflyte backfill [OPTIONS] LAUNCHPLAN [LAUNCHPLAN_VERSION] ``` **Options** **`-p, --project`** Project for workflow/launchplan. Can also be set through envvar `FLYTE_DEFAULT_PROJECT` **Default:** `'flytesnacks'` **`-d, --domain`** Domain for workflow/launchplan, can also be set through envvar `FLYTE_DEFAULT_DOMAIN` **Default:** `'development'` **`-v, --version`** Version for the registered workflow. If not specified it is auto-derived using the start and end date **`-n, --execution-name`** Create a named execution for the backfill. This can prevent launching multiple executions. **`--dry-run`** Just generate the workflow - do not register or execute **Default:** `False` **`--parallel, --serial`** All backfill steps can be run in parallel (limited by max-parallelism) if using `--parallel`. Else all steps will be run sequentially [`--serial`]. **Default:** `False` **`--execute, --do-not-execute`** Generate the workflow and register, do not execute **Default:** `True` **`--from-date`** Date from which the backfill should begin. Start date is inclusive. **`--to-date`** Date until which the backfill should run. End date is inclusive **`--backfill-window`** Timedelta for number of days, minutes or hours after the from-date or before the to-date to compute the backfills between. This is needed with from-date / to-date. Optional if both from-date and to-date are provided **`--fail-fast, --no-fail-fast`** If set to true, the backfill will fail immediately (WorkflowFailurePolicy.FAIL_IMMEDIATELY) if any of the backfill steps fail. If set to false, the backfill will continue to run even if some of the backfill steps fail (WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE). **Default:** `True` **`--overwrite-cache`** Whether to overwrite the cache if it already exists. **Default:** `False` **Arguments** **LAUNCHPLAN** Required argument **LAUNCHPLAN_VERSION** Optional argument ### build This command can build an image for a workflow or a task from the command line, for fully self-contained scripts. ``` pyflyte build [OPTIONS] COMMAND [ARGS]... ``` **Options** **`-p, --project`** Project to register and run this workflow in. Can also be set through envvar `FLYTE_DEFAULT_PROJECT` **Default:** `'flytesnacks'` **`-d, --domain`** Domain to register and run this workflow in, can also be set through envvar `FLYTE_DEFAULT_DOMAIN` **Default:** `'development'` **`--destination-dir`** Directory inside the image where the tar file containing the code will be copied to **Default:** `'.'` **`--copy-all`** [Deprecated, see --copy] Copy all files in the source root directory to the destination directory. You can specify --copy all instead **Default:** `False` **`--copy`** Specifies how to detect which files to copy into image. 'all' will behave as the deprecated copy-all flag, 'auto' copies only loaded Python modules **Default:** `'auto'` **Options:** all | auto **`-i, --image`** Multiple values allowed. Image used to register and run.
**Default:** `'cr.flyte.org/flyteorg/flytekit:py3.9-latest'` **`--service-account`** Service account used when executing this workflow **`--wait, --wait-execution`** Whether to wait for the execution to finish **Default:** `False` **`-i, --poll-interval`** Poll interval in seconds to check the status of the execution **`--dump-snippet`** Whether to dump a code snippet instructing how to load the workflow execution using flyteremote **Default:** `False` **`--overwrite-cache`** Whether to overwrite the cache if it already exists **Default:** `False` **`--envvars, --env`** Multiple values allowed. Environment variables to set in the container, of the format ENV_NAME=ENV_VALUE **`--tags, --tag`** Multiple values allowed. Tags to set for the execution **`--name`** Name to assign to this execution **`--labels, --label`** Multiple values allowed. Labels to be attached to the execution of the format label_key=label_value. **`--annotations, --annotation`** Multiple values allowed. Annotations to be attached to the execution of the format key=value. **`--raw-output-data-prefix, --raw-data-prefix`** File path prefix to store raw output data. Examples are file://, s3://, gs:// etc. as supported by fsspec. If not specified, raw data will be stored in the default configured location remotely, or locally in a temp file system. Note, this is not metadata, but only the raw data location used to store FlyteFile, FlyteDirectory, StructuredDataset, dataframes **`--max-parallelism`** Number of nodes of a workflow that can be executed in parallel. If not specified, project/domain defaults are used. If 0 then it is unlimited. **`--disable-notifications`** Should notifications be disabled for this execution. **Default:** `False` **`-r, --remote`** Whether to register and run the workflow on a Flyte deployment **Default:** `False` **`--limit`** Use this to limit number of entities to fetch **Default:** `50` **`--cluster-pool`** Assign newly created execution to a given cluster pool **`--execution-cluster-label, --ecl`** Assign newly created execution to a given execution cluster label **`--fast`** Use fast serialization. The image won't contain the source code. The value is false by default. **Default:** `False` #### conf.py Build an image for [workflow|task] from conf.py ``` pyflyte build conf.py [OPTIONS] COMMAND [ARGS]... ``` ### execution The execution command allows you to interact with Flyte's execution system, such as recovering/relaunching a failed execution. ``` pyflyte execution [OPTIONS] COMMAND [ARGS]... ``` **Options** **`-p, --project`** Project for workflow/launchplan. Can also be set through envvar `FLYTE_DEFAULT_PROJECT` **Default:** `'flytesnacks'` **`-d, --domain`** Domain for workflow/launchplan, can also be set through envvar `FLYTE_DEFAULT_DOMAIN` **Default:** `'development'` **`--execution-id`** **Required** The execution id #### recover Recover a failed execution ``` pyflyte execution recover [OPTIONS] ``` #### relaunch Relaunch a failed execution ``` pyflyte execution relaunch [OPTIONS] ``` ### fetch Retrieve Inputs/Outputs for a Flyte Execution or any of the inner node executions from the remote server. The URI can be retrieved from the Flyte Console, or by invoking the get_data API. ``` pyflyte fetch [OPTIONS] FLYTE-DATA-URI (format flyte://...) DOWNLOAD-TO Local path (optional) ``` **Options** **`-r, --recursive`** Fetch recursively, all variables in the URI. This is not needed for directories as they are automatically recursively downloaded.
**Arguments** **FLYTE-DATA-URI (format flyte://...)** Required argument **DOWNLOAD-TO Local path (optional)** Optional argument ### get Get a single or multiple remote objects. ``` pyflyte get [OPTIONS] COMMAND [ARGS]... ``` #### launchplan Interact with launchplans. ``` pyflyte get launchplan [OPTIONS] LAUNCHPLAN-NAME LAUNCHPLAN-VERSION ``` **Options** **`--active-only, --scheduled`** Only return active launchplans. **`-p, --project`** Project for workflow/launchplan. Can also be set through envvar `FLYTE_DEFAULT_PROJECT` **Default:** `'flytesnacks'` **`-d, --domain`** Domain for workflow/launchplan, can also be set through envvar `FLYTE_DEFAULT_DOMAIN` **Default:** `'development'` **`-l, --limit`** Limit the number of launchplans returned. **Arguments** **LAUNCHPLAN-NAME** Optional argument **LAUNCHPLAN-VERSION** Optional argument ### info Print out information about the current Flyte Python CLI environment - like the version of Flytekit, the Flyte backend version, the backend endpoint currently configured, etc. ``` pyflyte info [OPTIONS] ``` ### init Create flyte-ready projects. ``` pyflyte init [OPTIONS] PROJECT_NAME ``` **Options** **`--template