Building Large-Scale Xarray Datasets for Geospatial Computing with Union.ai and Flyte

As geospatial machine learning engineers, we often face a common challenge: how do we effectively build, manage, and process large-scale geospatial datasets that can reach terabytes or even petabytes in size? Traditional approaches using notebooks or simple scripts quickly break down at scale, and manually babysitting these processes becomes unsustainable.
This post explores how to build scalable mosaic workflows using Union.ai and the Flyte orchestration platform combined with GDAL, Xarray, and Dask. This approach has been tested at multiple organizations, enabling us to build scalable, end-to-end remote sensing pipelines. Teams across data-intensive industries like geospatial, autonomous vehicles, and biotech rely on pipelines like these for mission-critical projects. Let’s learn how.
The Foundation: Standards for Remote Sensing and Machine Learning
Before diving into the technical implementation, let's establish some standards that make large-scale geospatial computing more manageable:
1. Data Models with Xarray
Xarray provides an excellent foundation for modeling mosaic datasets. Think of it as representing one large image with multiple dimensions:
- Dimensions: time, bands, y, x
- Coordinates: timestamps, band names, latitude, longitude
- Data: The actual pixel values
This data model allows us to:
- Represent arbitrarily large arrays
- Chunk data in optimal ways for processing
- Lazy-load specific regions of interest
- Apply operations to the entire dataset
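To make this concrete, here is a minimal sketch of a mosaic modeled this way (the sizes, chunk shapes, and coordinate ranges are illustrative):

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative mosaic: 4 timestamps x 3 bands over a 100,000 x 100,000 pixel
# grid. Dask-backed arrays keep this lazy; no pixels are materialized until
# a computation is triggered.
ny, nx = 100_000, 100_000
data = da.zeros((4, 3, ny, nx), chunks=(1, 1, 2048, 2048), dtype="uint16")

mosaic = xr.Dataset(
    {"reflectance": (("time", "band", "y", "x"), data)},
    coords={
        "time": pd.date_range("2024-01-01", periods=4, freq="MS"),
        "band": ["red", "green", "blue"],
        "y": np.linspace(50.0, 40.0, ny),  # latitude, descending
        "x": np.linspace(-10.0, 5.0, nx),  # longitude
    },
)

# Lazy-load a region of interest: only the chunks that intersect it are read.
roi = mosaic.sel(y=slice(45.0, 44.0), x=slice(-1.0, 0.0))
```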
2. Compute with Dask
Dask works exceptionally well with Xarray for distributed computing, particularly for:
- Operations on reasonably sized and properly chunked datasets
- Short-lived tasks like interpolation
- Distributing work across clusters
However, Dask has limitations:
- Very small chunks can create too many tasks
- It's not a workflow orchestration tool
- You still need something to manage the overall process
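Despite those caveats, the combination is hard to beat for well-chunked work. A minimal sketch of a short-lived, distributed gap-filling step (the store paths are hypothetical):

```python
import xarray as xr
from dask.distributed import Client

# Connect to a Dask cluster (a local one here; in production this would
# point at a cluster endpoint).
client = Client()

# Open a Zarr store lazily. Each chunk becomes one Dask task, so chunks
# that are too small explode the task graph.
ds = xr.open_zarr("s3://my-bucket/mosaic.zarr")  # hypothetical store

# Short-lived per-pixel work like gap-filling distributes cleanly, as long
# as the interpolation dimension lives in a single chunk.
filled = ds["reflectance"].chunk({"time": -1}).interpolate_na(dim="time")

filled.to_dataset(name="reflectance").to_zarr(
    "s3://my-bucket/filled.zarr", mode="w"  # hypothetical store
)
```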
3. Mosaics with GDAL
For building mosaics (stitching together multiple scenes into one large image), GDAL offers powerful tools:
- VRT (Virtual Raster): Maps and combines multiple files with the same coordinate reference system
- Warped VRT: Handles reprojection between different coordinate reference systems
- GTI (GDAL Raster Tile Index): A newer approach that indexes rasters along with their metadata for faster access
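A hedged sketch of the first two in Python (the file names and target CRS are illustrative):

```python
from osgeo import gdal

gdal.UseExceptions()

# Combine many same-CRS scenes into one virtual mosaic; no pixels are copied,
# the VRT is just an XML file referencing the sources.
scenes = ["scene_001.tif", "scene_002.tif", "scene_003.tif"]  # hypothetical files
vrt = gdal.BuildVRT("mosaic.vrt", scenes)
vrt = None  # close to flush the VRT to disk

# Reproject on the fly with a warped VRT when sources disagree on CRS.
warped = gdal.Warp("mosaic_3857.vrt", "mosaic.vrt", format="VRT", dstSRS="EPSG:3857")
warped = None
```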
Implementing Scalable Workflows with Flyte and Union.ai
Now let's see how to combine these standards with Flyte orchestration to build truly scalable workflows. Flyte is the leading open-source orchestration platform, backed by Union.ai.
Building a Complete Mosaic Workflow
Let's look at a more complete example from the FlyteMosaic repo. The main workflow looks roughly like this (a simplified sketch; the task names and signatures below are illustrative, so see the repo for the full implementation):
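```python
from flytekit import task, workflow

@task
def ingest_scenes(aoi: str, start: str, end: str) -> list[str]:
    # Query a catalog for scenes covering the AOI and download them.
    return []  # URIs of raw scenes (stubbed)

@task
def build_scene_features(scene_uris: list[str]) -> list[str]:
    # Convert raw scenes into standardized, optimized COGs.
    return []  # URIs of processed COGs (stubbed)

@task
def build_target_mosaic(cog_uris: list[str], target: str) -> str:
    # Assemble the final mosaic as a Zarr store.
    return target

@workflow
def mosaic_workflow(aoi: str, start: str, end: str, target: str) -> str:
    scenes = ingest_scenes(aoi=aoi, start=start, end=end)
    cogs = build_scene_features(scene_uris=scenes)
    return build_target_mosaic(cog_uris=cogs, target=target)
```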
This workflow consists of three main stages:
- Ingest Scenes: Find and download all necessary scenes covering our area of interest
- Build Scene Features: Process raw scenes into standardized, optimized COGs (Cloud-Optimized GeoTIFFs)
- Build Target Mosaic: Create the final mosaic dataset in Zarr format
The Power and Limits of GTI for Large Mosaics
For the mosaic building step, we can use GDAL's GTI driver to handle large collections of files. A hedged sketch (the GTI driver requires GDAL >= 3.8, the `gdaltindex` flags below assume GDAL >= 3.9, and the file names and chunk sizes are illustrative):
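```python
import subprocess

import rioxarray

cog_paths = ["cogs/scene_001.tif", "cogs/scene_002.tif"]  # hypothetical files

# Build a vector tile index over the COGs; the GTI driver reads this index
# instead of probing every file. Run once per dataset.
subprocess.run(
    ["gdaltindex", "-f", "FlatGeobuf", "scenes.gti.fgb", *cog_paths],
    check=True,
)

# Open the entire indexed collection as a single lazy, chunked array.
da = rioxarray.open_rasterio("GTI:scenes.gti.fgb", chunks={"x": 4096, "y": 4096})
```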
However, GTI doesn't support temporal dimensions natively, so we need to:
- Build one GTI per dataset per timestamp
- Concatenate them along the time dimension using Xarray:
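For instance (the index file names and timestamps are hypothetical):

```python
import pandas as pd
import rioxarray
import xarray as xr

# One GTI index per timestamp, following a hypothetical naming scheme.
timestamps = pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"])
arrays = [
    rioxarray.open_rasterio(
        f"GTI:scenes_{ts:%Y%m%d}.gti.fgb", chunks={"x": 4096, "y": 4096}
    )
    for ts in timestamps
]

# Stack the per-time mosaics into one array with a real time dimension.
mosaic = xr.concat(arrays, dim=xr.DataArray(timestamps, dims="time", name="time"))
```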
Handling Very Large Datasets with Partitioning
For extremely large datasets that won't fit into memory, or whose chunk counts would overwhelm Dask's task graph, we use a partitioning approach:
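A minimal sketch of the idea (dimension sizes and partition shape are illustrative; the helper below is not the repo's actual code):

```python
import itertools

def partition_regions(
    sizes: dict[str, int], steps: dict[str, int]
) -> list[dict[str, slice]]:
    """Split a target grid into rectangular regions of at most `steps` size.

    Each region is a dict of dimension name -> slice, suitable for
    `Dataset.isel` and for `to_zarr(region=...)`.
    """
    per_dim = []
    for dim, size in sizes.items():
        step = steps[dim]
        per_dim.append(
            [(dim, slice(i, min(i + step, size))) for i in range(0, size, step)]
        )
    return [dict(combo) for combo in itertools.product(*per_dim)]

# e.g. a 100,000 x 100,000 grid split into 8,192-pixel square partitions
regions = partition_regions({"y": 100_000, "x": 100_000}, {"y": 8192, "x": 8192})
```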
Then we can process these partitions in parallel using Flyte's map_task (complete code):
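A sketch of what that stage might look like; the `Region` dataclass, store paths, and variable names are assumptions rather than the repo's actual code (recent flytekit versions can serialize plain dataclasses):

```python
from dataclasses import dataclass

from flytekit import map_task, task, workflow

@dataclass
class Region:
    # Pixel bounds of one partition in the target grid.
    y0: int
    y1: int
    x0: int
    x1: int

@task
def write_partition(region: Region) -> None:
    import rioxarray

    # Read only this window through the GTI index, materialize it, and write
    # it into the matching region of a Zarr store initialized ahead of time
    # (e.g. with to_zarr(..., compute=False)).
    src = rioxarray.open_rasterio("GTI:scenes.gti.fgb", chunks={"x": 4096, "y": 4096})
    window = src.isel(y=slice(region.y0, region.y1), x=slice(region.x0, region.x1))
    window.to_dataset(name="reflectance").drop_vars(
        ["band", "spatial_ref"], errors="ignore"  # region writes reject non-region coords
    ).to_zarr(
        "s3://my-bucket/target.zarr",  # hypothetical store
        region={"y": slice(region.y0, region.y1), "x": slice(region.x0, region.x1)},
    )

@workflow
def write_all_partitions(regions: list[Region]) -> None:
    map_task(write_partition)(region=regions)
```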
This approach has several key advantages:
- We never materialize the entire dataset in memory, saving compute resources
- We process chunks in a way that leverages GDAL's internal caching, which speeds up subsequent reads
- We can parallelize across partitions using Flyte's map_task without having to deal with other parallel computing frameworks
- The final output is a single, coherent Zarr store/Xarray dataset
Key Lessons and Caveats
Based on experience implementing these workflows, here are some important considerations:
- Map Task Limits: Try to stay below 5,000 tasks, as each Flyte task corresponds to a Kubernetes pod, and there are limits on the object sizes Kubernetes' backing database (etcd) supports. For higher limits, explore Union-specific Map over Launchplans
- GDAL Configuration: Proper GDAL configuration makes a huge difference in performance; see the sketch after this list
- Single-threaded Partition Processing: Use a single-threaded Dask scheduler for partition processing to better leverage GDAL's internal caching
- Chunk Size Considerations: Balance chunk sizes to avoid memory issues while minimizing the total number of tasks
- GTI Metadata: Provide complete metadata in your GTI files to avoid GDAL needing to fetch it
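As a starting point, here is an illustrative configuration sketch; the exact values depend on your instance sizes and data layout, so treat them as assumptions to tune rather than recommendations:

```python
import dask
from osgeo import gdal

# Illustrative GDAL settings for cloud-hosted COG workloads (tune per workload).
gdal.SetConfigOption("GDAL_CACHEMAX", "1024")  # raster block cache, in MB
gdal.SetConfigOption("GDAL_DISABLE_READDIR_ON_OPEN", "EMPTY_DIR")  # skip dir listings
gdal.SetConfigOption("CPL_VSIL_CURL_CACHE_SIZE", "200000000")  # HTTP range-request cache
gdal.SetConfigOption("VSI_CACHE", "TRUE")
gdal.SetConfigOption("GDAL_HTTP_MAX_RETRY", "5")

# Process each partition single-threaded so repeated reads hit GDAL's block
# cache instead of competing across threads.
with dask.config.set(scheduler="single-threaded"):
    ...  # per-partition read/compute/write goes here
```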
Conclusion
By combining the power of Flyte and Union.ai's orchestration capabilities with geospatial tools like GDAL and data models like Xarray, we can build truly scalable workflows for processing massive geospatial datasets. This approach allows us to:
- Process hundreds of terabytes of geospatial data
- Build end-to-end pipelines that are reproducible and scalable
- Separate concerns between data ingestion, processing, and analysis
- Enable data scientists to focus on analysis rather than infrastructure
The code examples in this post are available in the FlyteMosaic repository, where you can find complete implementations of these patterns.
A recently contributed Flyte plugin enables you to easily persist Xarray Datasets and DataArrays to Zarr between tasks.
Whether you're building canopy height models, soil carbon predictions, or any other large-scale geospatial application, these patterns can help you move from notebook experiments to production-ready data pipelines.
Next Steps
- Get started with Flyte
- Check out the Flyte repo
- Watch the companion talk
- Join the Flyte Slack community
Start building your own geospatial workflows with Flyte and Union.ai today.