Version 2.0.9

flyteplugins.databricks

Databricks connector plugin for Flyte.

This plugin integrates Flyte with Databricks, letting you run PySpark jobs on Databricks clusters as Flyte tasks with full observability, retries, and caching.

Key features:

  • Run PySpark tasks natively on Databricks clusters
  • Configurable cluster spec via the Databricks Jobs API
  • Automatic job lifecycle management: create, poll, cancel
  • Automatic links to the Databricks job run UI in the Flyte UI

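The lifecycle management above follows a create/poll/cancel pattern: the connector submits a run, then polls its state until it becomes terminal. The sketch below illustrates that polling pattern only; the state names and function names are hypothetical stand-ins, not the connector's actual internals.

```python
import time

# Hypothetical terminal states, modeled loosely on Databricks run life-cycle
# states; the real connector's state handling is not shown in these docs.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def poll_run(get_state, interval_s=0.0, max_polls=100):
    """Poll a job-state callable until it reports a terminal state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval_s)
    raise TimeoutError("job did not reach a terminal state")

# Simulate a run that passes through two states before finishing.
states = iter(["PENDING", "RUNNING", "TERMINATED"])
print(poll_run(lambda: next(states)))  # TERMINATED
```

On a cancel request, a real connector would additionally call the Jobs API's cancel endpoint for the tracked run instead of waiting for a terminal state.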
Basic usage example:

import flyte
from flyteplugins.databricks import Databricks

databricks_config = Databricks(
    spark_conf={"spark.executor.memory": "4g"},
    databricks_conf={
        "run_name": "my_job",
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    },
    databricks_instance="myorg.cloud.databricks.com",
    databricks_token="databricks_token_secret",
)

env = flyte.TaskEnvironment(
    name="databricks_env",
    plugin_config=databricks_config,
    image=flyte.Image.from_debian_base(name="pyspark").with_pip_packages(
        "flyteplugins-databricks"
    ),
)

@env.task
def process_data(input_path: str) -> int:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(input_path)
    return df.count()
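In the example above, spark_conf supplies per-task Spark properties such as executor memory. A minimal sketch of how such overrides typically combine with cluster-level defaults (plain dict merge, task values winning); the default values and the merge semantics here are illustrative assumptions, not the plugin's documented behavior:

```python
# Hypothetical cluster-level defaults; not taken from the plugin docs.
defaults = {
    "spark.executor.memory": "2g",
    "spark.sql.shuffle.partitions": "200",
}

# Per-task overrides, as passed via spark_conf in the example above.
overrides = {"spark.executor.memory": "4g"}

# Later keys win, so the task-level setting replaces the default.
merged = {**defaults, **overrides}
print(merged["spark.executor.memory"])  # 4g
```

Keys not overridden (here, spark.sql.shuffle.partitions) keep their default values.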

Directory

Classes

Class                 Description
Databricks            Configuration for a Databricks task.
DatabricksConnector