Pandera Joins Union.ai
I created Pandera at the beginning of 2019 with the initial kernel of an idea: “What if I could type-check my dataframes?” At the time, I was an ML engineer and recovering data scientist who worked with pandas dataframes all day, every day. I would explore, analyze and train models from tabular data, and I was burned many times by the inconsistencies and idiosyncrasies of real-world, enterprise data.
This kind of data is heterogeneous and messy: dates that would show up as strings; fields with null values that shouldn’t be there; and arrays of numbers that should be positive, but somehow a -1000 would inexplicably pop up now and again. This happened so frequently that I set out to fashion some oven mitts so that I wouldn’t be burned again.
As many projects do, Pandera started off as something that scratched my own itch, and I’ve been maintaining it in my spare time with an amazing community of open source contributors. Pandera community: If any of you are reading this, I’d like to express my gratitude for all your contributions, questions, and feedback, which have allowed Pandera to grow beyond its original scope and scale!
But now, I’m happy to say, three years, 60+ contributors, 1,600+ GitHub stars, and 4 million downloads later, Pandera is joining Union.ai as an open-source ecosystem project 🎉.
On this front, I’m also excited to announce that I’ll be taking on the role of Chief ML Engineer at Union.ai, where I’ll be shepherding projects like Pandera and UnionML to help realize Union.ai’s mission.
I joined Union.ai at the beginning of 2021 because I fundamentally believe that intuitive, flexible, and expressive tools manifest and expand the latent capabilities of a technology, broadening its user base beyond the specialists who originally developed it.
Union.ai is the principal maintainer of Flyte, an open-source data- and machine learning-aware orchestration platform, and its mission is to enable builders to unlock value from their data through unified orchestration.
What does this mission mean exactly? To me, it speaks to the untapped potential of combining data with machine learning in order to achieve things like digitizing the world into 3D maps, coordinating deep financial analytics at an enterprise scale, and facilitating scientific discoveries in biology with protein folding models.
This mission resonates with me so deeply that bringing Pandera into Union.ai’s ecosystem of products felt like a natural fit. If you agree that machine learning is already catalyzing a paradigm shift in how we build software, then high-quality data is the most precious resource for fully realizing its potential.
Data quality as a first-class concern
But high-quality data doesn’t just happen. There’s an entropic force that finds its way into the process of collecting, persisting, and transforming data for some specific use case. Practitioners in the data and ML field have observed this very same thing, and in recent years many companies have emerged to tame this entropy. Among them are WhyLabs, SuperConductive, DeepChecks, Evidently.ai, and many others, each one specializing in a particular set of use cases in the data and model lifecycle.
As a part of the ecosystem of data quality tools, Pandera provides an intuitive, flexible, and expressive framework for statistical data testing in Python. Basically, it gives you the tools to answer the question, “Are my data as I expect them to be, and are my data-processing functions behaving correctly?”
Pandera sets itself apart in this ecosystem by focusing on a few core design principles:
- DS- and ML-first: The language for expressing data quality rules should be as close as possible to the tools and conventions that data science and ML practitioners use on a daily basis.
- Incremental adoption: Introducing data quality to a project shouldn’t require buying into an entirely new tech stack in order to get value from it.
- Frictionless customization: Creating custom validation rules should be as simple as defining a function: it shouldn’t require a whole lot of syntax or ceremony.
- Seamless integration: Support integrations with the rich ecosystem of existing Python tools for working with and testing data.
- Zero configuration: Get value from data validation without having to understand and configure a bunch of YAML/TOML files.
Over a series of subsequent posts, I’ll be unpacking how these design principles inform Pandera’s design by illustrating common data problems and how to address them by adopting tried-and-true data testing practices.
How will Pandera converge with Union.ai’s mission? In addition to supercharging Pandera’s existing roadmap with more development resources, over the next few months we’ll be further developing the integration between Flyte and Pandera by providing more visibility into the quality of data that passes through Flyte tasks and workflows.
We’re also committed to Pandera as a stand-alone tool that can be used with the broader ecosystem of data tools, and to that end, we’re planning to:
- Add support for other data containers and formats, like SQL, xarray, pyarrow, etc.
- Integrate with other tools in the data/ML ecosystem like ZenML, WhyLogs, DBT, etc.
- Expand the suite of built-in validation checks that ship with Pandera.
Finally, we’ll be exploring ways in which Union Cloud can offer out-of-the-box data and model observability by integrating with Pandera under the hood. We’re not exactly sure where things will go on this front, but based on some early conversations I’ve had with Pandera users, I’ve identified a few problems whose solutions would take data testing beyond validating snapshots of data at a particular time (more to be announced later 😉).