Databricks connector
Flyte can be integrated with the Databricks service, enabling you to submit Spark jobs to the Databricks platform.
Installation
The Databricks connector comes bundled with the Spark plugin. To install the Spark plugin, run the following command:
pip install flytekitplugins-spark
Example usage
For a usage example, see Databricks connector example.
Local testing
To test the Databricks connector, copy the following code to a file called databricks_task.py, modifying it as needed.
import random
from operator import add
import flytekit as fl
from flytekitplugins.spark import Databricks

def f(_):
    # Count points that fall inside the unit circle (Monte Carlo pi estimation).
    x, y = random.random() * 2 - 1, random.random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0

@fl.task(task_config=Databricks(...))
def hello_spark(partitions: int) -> float:
    print("Starting Spark with Partitions: {}".format(partitions))
    n = 100000 * partitions
    sess = fl.current_context().spark_session
    count = (
        sess.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    )
    pi_val = 4.0 * count / n
    print("Pi val is: {}".format(pi_val))
    return pi_val
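As a rough sketch of what a filled-in task_config might look like, assuming the spark_conf, databricks_conf, and databricks_instance fields of flytekitplugins.spark.Databricks: the instance hostname, node type, and cluster spec below are placeholders that you would replace with values from your own Databricks workspace.
@fl.task(
    task_config=Databricks(
        # Spark settings passed to the job.
        spark_conf={"spark.driver.memory": "1000M"},
        # Databricks job/cluster spec; adjust to your workspace and cloud.
        databricks_conf={
            "run_name": "flytekit databricks example",
            "new_cluster": {
                "spark_version": "12.2.x-scala2.12",
                "node_type_id": "<node-type>",
                "num_workers": 4,
            },
            "timeout_seconds": 3600,
            "max_retries": 1,
        },
        databricks_instance="<workspace-id>.cloud.databricks.com",
    )
)
def hello_spark(partitions: int) -> float:
    ...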
To execute the Spark task on the connector, you must configure the raw-output-data-prefix with a remote path. This configuration ensures that Flyte transfers the input data to blob storage and allows the Spark job running on Databricks to access the input data directly from the designated bucket. The Spark task will run locally if the raw-output-data-prefix is not set.
$ pyflyte run --raw-output-data-prefix s3://my-s3-bucket/databricks databricks_task.py hello_spark --partitions 10
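Once you have a Flyte cluster with the Databricks connector enabled, the same task can be submitted for remote execution by adding the --remote flag (a sketch; project, domain, and input values depend on your deployment):
$ pyflyte run --remote databricks_task.py hello_spark --partitions 10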