Niels Bantilan

Pandera 0.16: Going Beyond Pandas Data Validation

Summary

I’m super excited to announce the availability of Pandera 0.16! This release features a suite of improvements and bug fixes. The biggest advance: Pandera now supports Pyspark SQL DataFrames!

🐼 Beyond Pandas DataFrames

Before I get into this release’s highlights, I want to tell you how we got here: It’s been quite a journey.

🐣 Origins

I wrote the first commit to pandera at the end of 2018:

commit 387ccadd7afb2bdc18e2ae89203cac7ddae76c9b
Author: Niels Bantilan <niels.bantilan@gmail.com>
Date:   Wed Oct 31 22:18:34 2018 -0400

    Initial commit

At that time I was an ML engineer at a previous company, and I was working with Pandas DataFrames every day cleaning, exploring, and modeling data. In my spare time, I created Pandera to try to answer the question:

Can one create type annotations for dataframes that validate data at runtime?

The short answer? Yes.

Pandera started off as a lightweight, expressive and flexible data validation toolkit for Pandas DataFrames. It’s a no-nonsense Python library that doesn’t create pretty reports or ship with an interactive UI; it’s built for data practitioners by a data practitioner to help them build, debug and maintain data and machine learning pipelines:

import pandas as pd
import pandera as pa

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=pa.Check.str_startswith("value_")),
})

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

validated_df = schema(df)
print(validated_df)

With Pandera, you can focus on code-native schemas that you can import and run anywhere, and you don’t have to contend with yaml/json config files or set up a special project structure to get your data validators going.

I first shared Pandera with the broader community at SciPy 2020, where I gave a talk and wrote a paper about its core design and functionality. As of this writing, the project has accrued more than 2.5K stars and 12M downloads.

🐓 Evolution

Today, Pandera is still lightweight, expressive and flexible at its core, but it now provides richer functionality and serves a wider set of DataFrame libraries and tools in the Python ecosystem:

  • Expose a class-based API via `DataFrameModel`
  • Expose a Pandera-native type engine API to manage physical and logical data types
  • Validate other DataFrame types: Dask, Modin, Pyspark Pandas (formerly Koalas), Geopandas
  • Parallelize validation with Fugue
  • Synthesize data with schema strategies and Hypothesis
  • Reuse Pydantic models in your Pandera schemas
  • Serialize Pandera schemas as yaml or json

For example, the schema from earlier can be written as a `DataFrameModel`:

class Schema(pa.DataFrameModel):

    column1: int = pa.Field(le=10)
    column2: float = pa.Field(lt=-1.2)
    column3: str = pa.Field(str_startswith="value_")

Schema.validate(df)

`DataFrameModel` allows for a class-based schema-definition syntax that’s more familiar to Python folks who use `dataclass` and Pydantic `BaseModel`s.

Supporting the DataFrame types mentioned above was a fairly light lift, since many of those libraries somewhat follow the Pandas API. However, as time went on, it became clear to me and the community that a rewrite of Pandera’s internals was needed to decouple the schema specification itself from the validation logic, which up until this point relied completely on the Pandas API.

🦩 Revolution

After about half a year of grueling work, Pandera finally supports a “Bring Your Own Backend” extension model. As a Pandera contributor, you can:

  1. Define a schema specification for some DataFrame object, or any arbitrary Python object.
  2. Register a backend for that schema, which implements library-specific validation logic.
  3. Validate away!

At a high level, this is what the code would look like for implementing a schema specification and backend for a fictional DataFrame library called `sloth` 🦥.

import sloth as sl

from pandera.api.base.schema import BaseSchema
from pandera.backends.base import BaseSchemaBackend

class DataFrameSchema(BaseSchema):
    def __init__(self, **kwargs):
        # add properties that this dataframe would contain
        ...

class DataFrameSchemaBackend(BaseSchemaBackend):
    def validate(self, check_obj: sl.DataFrame, schema: DataFrameSchema, **kwargs):
        # implement custom validation logic
        ...

# register the backend (using the `sl` alias imported above)
DataFrameSchema.register_backend(sl.DataFrame, DataFrameSchemaBackend)
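
To make the "schema specification vs. backend" split a bit more tangible, here is a toy, standard-library-only sketch of the registry idea. It is not Pandera's actual internals: the `Schema`, `BaseBackend`, and `DictBackend` classes below are hypothetical names I made up for illustration, and plain `dict` objects stand in for a DataFrame type.

```python
from typing import Any, Dict, Type


class BaseBackend:
    """Hypothetical backend interface: holds the validation logic."""

    def validate(self, check_obj: Any, schema: "Schema") -> Any:
        raise NotImplementedError


class Schema:
    """Hypothetical schema: holds only the specification, no validation logic."""

    # maps a data container type to the backend class that can validate it
    _backends: Dict[type, Type[BaseBackend]] = {}

    def __init__(self, required_keys):
        self.required_keys = set(required_keys)

    @classmethod
    def register_backend(cls, data_type: type, backend: Type[BaseBackend]) -> None:
        cls._backends[data_type] = backend

    def validate(self, check_obj: Any) -> Any:
        # walk the MRO so subclasses of a registered type also resolve
        for klass in type(check_obj).__mro__:
            if klass in self._backends:
                return self._backends[klass]().validate(check_obj, self)
        raise TypeError(f"no backend registered for {type(check_obj).__name__}")


class DictBackend(BaseBackend):
    """Validation logic specific to plain dicts."""

    def validate(self, check_obj: dict, schema: Schema) -> dict:
        missing = schema.required_keys - set(check_obj)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        return check_obj


# register the backend, then validate
Schema.register_backend(dict, DictBackend)

schema = Schema(required_keys=["id", "price"])
validated = schema.validate({"id": 1, "price": 9.99})
```

The point of the design is that `Schema` never needs to know how a `dict` (or a `sloth` DataFrame) is inspected; it only looks up whichever backend was registered for the object's type.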

What better way to come full circle than to present this major development at SciPy 2023? If you’re curious to learn more about the development and organizational challenges that came up during this rewrite process, you can take a look at the accompanying paper, which is currently available in preview mode.

To prove out the extensibility of Pandera with the new schema specification and backend API, we collaborated with the QuantumBlack team at McKinsey to implement a schema and backend for Pyspark SQL … and we completed an MVP in a matter of a few months! So without further ado, let’s dive into the highlights of this release.

🎉 Highlights

⭐ Validate Pyspark SQL DataFrames

You can now write `DataFrameSchema`s and `DataFrameModel`s that will validate `pyspark.sql.DataFrame` objects:

import pandera.pyspark as pa
import pyspark.sql.types as T

from pandera.pyspark import DataFrameModel


class PanderaSchema(DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    product_name: T.StringType() = pa.Field(str_startswith="B")
    price: T.DecimalType(20, 5) = pa.Field()
    description: T.ArrayType(T.StringType()) = pa.Field()

Then you can validate data as usual:

from decimal import Decimal

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType(), False),
        T.StructField("product", T.StringType(), False),
        T.StructField("price", T.DecimalType(20, 5), False),
        T.StructField("description", T.ArrayType(T.StringType(), False), False),
    ],
)

df = spark.createDataFrame(
    [
        (5, "Bread", Decimal(44.4), ["description of product"]),
        (15, "Butter", Decimal(99.0), ["more details here"]),
    ],
    spark_schema
)

df.show()

"""
+---+-------+--------+--------------------+
| id|product|   price|         description|
+---+-------+--------+--------------------+
|  5|  Bread|44.40000|[description of p...|
| 15| Butter|99.00000| [more details here]|
+---+-------+--------+--------------------+
"""

This brings the power and flexibility of Pandera to Pyspark users who want to validate their DataFrames in production!

🎛️ Control Validation Granularity

Control the validation depth of your schemas at a global level with the `PANDERA_VALIDATION_DEPTH` environment variable. The three acceptable values are:

# The default value: perform both schema metadata checks and data value checks
export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA

# Only perform schema metadata checks, e.g. data types, column presence, etc.
export PANDERA_VALIDATION_DEPTH=SCHEMA_ONLY

# Only perform data value checks, e.g. greater_than > 0, custom checks, etc.
export PANDERA_VALIDATION_DEPTH=DATA_ONLY

Note: this feature is currently only supported in the `pyspark.sql` Pandera API.
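
If you'd rather set the depth from Python than from the shell, something like the sketch below should work. Two caveats: `set_validation_depth` is a hypothetical helper of my own (not part of Pandera), and I'm assuming the variable has to be in place before Pandera is first imported, so the helper should run at the very top of your entry point.

```python
import os

# the three values listed above
ALLOWED_DEPTHS = {"SCHEMA_AND_DATA", "SCHEMA_ONLY", "DATA_ONLY"}


def set_validation_depth(depth: str) -> None:
    """Hypothetical helper: guard against typos, then set the variable."""
    if depth not in ALLOWED_DEPTHS:
        raise ValueError(
            f"unknown depth {depth!r}; expected one of {sorted(ALLOWED_DEPTHS)}"
        )
    os.environ["PANDERA_VALIDATION_DEPTH"] = depth


set_validation_depth("SCHEMA_ONLY")

# Assumption: import Pandera only after the variable is set, e.g.
# import pandera.pyspark as pa
```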

💡 Switch Validation On and Off

In some cases you may want to disable all Pandera validation calls — for example, in certain production applications that require saving on compute resources. All you need to do is define the `PANDERA_VALIDATION_ENABLED` environment variable before running the application.

# By default validation is enabled
export PANDERA_VALIDATION_ENABLED=True

# Disable pandera validation
export PANDERA_VALIDATION_ENABLED=False

Note: this feature is currently only supported in the `pyspark.sql` Pandera API.
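
If you want validation off for a single run rather than for the whole shell session, one option is to hand the variable to a child process only. This is a minimal sketch using just the standard library; the one-line `child_code` is a stand-in for your real pipeline script and simply echoes the flag it sees.

```python
import os
import subprocess
import sys

# copy the current environment and override the flag for the child only
child_env = dict(os.environ, PANDERA_VALIDATION_ENABLED="False")

# stand-in for a real pipeline script: it just prints the flag it receives
child_code = "import os; print(os.environ['PANDERA_VALIDATION_ENABLED'])"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
flag_seen_by_child = result.stdout.strip()
```

The parent process's environment is left untouched, so anything else running in the same session keeps validating as usual.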

ℹ️ Add Metadata to Fields

You can now add arbitrary metadata at the dataframe- or field-level components of your schema, which gives you the ability to embed additional information about your schema. This is useful, for example, if you need to write custom logic to select subsets of your schema for different DataFrames that have overlapping or common fields:

class Schema(pa.DataFrameModel):

    column1: int = pa.Field(le=10, metadata={"use-case": "forecasting"})
    column2: float = pa.Field(lt=-1.2, metadata={"use-case": "forecasting"})
    column3: str = pa.Field(str_startswith="value_")

    class Config:
        metadata = {"category": "product-details"}

You can get the metadata easily with the `get_metadata` method:

import json

metadata = Schema.get_metadata()
print(json.dumps(metadata, indent=4))

"""
{
    "Schema": {
        "columns": {
            "column1": {
                "use-case": "forecasting"
            },
            "column2": {
                "use-case": "forecasting"
            },
            "column3": null
        },
        "dataframe": {
            "category": "product-details"
        }
    }
}
"""

Note: If you have any ideas for how to extend this functionality to make it more useful, please feel free to open up an issue.
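
The "custom logic to select subsets of your schema" mentioned above can be quite small. Here's a sketch that works on the metadata structure printed above, written out as a plain dict so it runs without Pandera installed. `columns_with` is a hypothetical helper name, not a Pandera function.

```python
from typing import Any, Dict, List

# the structure shown in the output above, as a Python dict
metadata: Dict[str, Any] = {
    "Schema": {
        "columns": {
            "column1": {"use-case": "forecasting"},
            "column2": {"use-case": "forecasting"},
            "column3": None,
        },
        "dataframe": {"category": "product-details"},
    }
}


def columns_with(
    metadata: Dict[str, Any], schema_name: str, key: str, value: Any
) -> List[str]:
    """Hypothetical helper: names of columns whose metadata has key == value."""
    columns = metadata[schema_name]["columns"]
    # columns without metadata map to None, so skip those first
    return [
        name for name, meta in columns.items() if meta and meta.get(key) == value
    ]


forecasting_columns = columns_with(metadata, "Schema", "use-case", "forecasting")
print(forecasting_columns)  # ['column1', 'column2']
```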

🏛️ Add Missing Columns

When loading raw data into a form that’s ready for data processing, it’s often useful to have guarantees that the columns specified in the schema are present, even if they’re missing from the raw data. This is where it’s useful to specify `add_missing_columns=True` in your schema definition.

When you call `schema.validate(data)`, the schema adds any missing columns to the dataframe, filling them with the default value if one is supplied at the column level, or with NaN if the column is nullable.
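
To spell out the order of that fill rule, here's a plain-Python sketch over column-oriented data. It is only an illustration of the rule as described, not Pandera's implementation: `ColumnSpec` and `add_missing_columns` are hypothetical names, and the final "raise an error" branch is my assumption about what should happen when a missing column has neither a default nor `nullable=True`.

```python
from typing import Any, Dict, List, NamedTuple

_NO_DEFAULT = object()  # sentinel: distinguishes "no default" from default=None


class ColumnSpec(NamedTuple):
    """Hypothetical stand-in for the parts of a column definition used here."""

    default: Any = _NO_DEFAULT
    nullable: bool = False


def add_missing_columns(
    data: Dict[str, List[Any]], specs: Dict[str, ColumnSpec]
) -> Dict[str, List[Any]]:
    """Return a copy of the data with absent columns filled in."""
    n_rows = len(next(iter(data.values()), []))
    filled = dict(data)
    for name, spec in specs.items():
        if name in filled:
            continue
        if spec.default is not _NO_DEFAULT:
            # 1. a column-level default wins
            filled[name] = [spec.default] * n_rows
        elif spec.nullable:
            # 2. otherwise fall back to NaN for nullable columns
            filled[name] = [float("nan")] * n_rows
        else:
            # 3. assumption: a missing column that can't be filled is an error
            raise ValueError(f"column {name!r} is missing and cannot be filled")
    return filled


specs = {
    "column1": ColumnSpec(),
    "column2": ColumnSpec(default=-2.0),
    "column3": ColumnSpec(nullable=True),
}
filled = add_missing_columns({"column1": [1, 2, 3]}, specs)
```

In Pandera itself you don't write any of this by hand; the same rule is expressed declaratively in the schema: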

class Schema(pa.DataFrameModel):

    column1: int = pa.Field(le=10)
    column2: float = pa.Field(lt=-1.2, default=-2.0)
    column3: str = pa.Field(str_startswith="value_", nullable=True)

    class Config:
        add_missing_columns = True


df = pd.DataFrame({"column1": [1, 2, 3]})
print(Schema.validate(df))

"""
   column1  column2 column3
0        1     -2.0     NaN
1        2     -2.0     NaN
2        3     -2.0     NaN
"""

🚮 Drop Invalid Rows

If you wish to use the validation step to remove invalid data, you can pass the `drop_invalid_rows=True` argument to the schema definition. On `schema.validate(..., lazy=True)`, if a data-level check fails, then the row that caused the failure will be removed from the dataframe when it is returned.

class Schema(pa.DataFrameModel):

    column1: int = pa.Field(le=10)
    column2: float = pa.Field(lt=-1.2)
    column3: str = pa.Field(str_startswith="value_")

    class Config:
        drop_invalid_rows = True


# data to validate
df = pd.DataFrame.from_records([
    {"column1": 1, "column2": -1.3, "column3": "value_0"},
    {"column1": 100, "column2": -1.3, "column3": "value_0"}, # drop this
    {"column1": 1, "column2": 10, "column3": "value_0"}, # and this
    {"column1": 1, "column2": 10, "column3": "foobar"}, # and also this
])

print(Schema.validate(df, lazy=True))

"""
   column1  column2  column3
0        1     -1.3  value_0
"""

Note: Make sure to specify `lazy=True` in `schema.validate` to enable this feature.
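
Conceptually, dropping invalid rows boils down to keeping only the rows that pass every data-level check. The plain-Python sketch below mirrors the three checks and four records from the example above; it is an illustration of the idea rather than how Pandera does it internally, and the `checks` mapping is a hypothetical stand-in for the schema's fields.

```python
# one predicate per column, mirroring le=10, lt=-1.2, str_startswith="value_"
checks = {
    "column1": lambda value: value <= 10,
    "column2": lambda value: value < -1.2,
    "column3": lambda value: value.startswith("value_"),
}

rows = [
    {"column1": 1, "column2": -1.3, "column3": "value_0"},
    {"column1": 100, "column2": -1.3, "column3": "value_0"},  # fails column1
    {"column1": 1, "column2": 10, "column3": "value_0"},  # fails column2
    {"column1": 1, "column2": 10, "column3": "foobar"},  # fails column2 and column3
]

# keep a row only if every column passes its check
kept = [
    row for row in rows if all(check(row[col]) for col, check in checks.items())
]
print(kept)  # [{'column1': 1, 'column2': -1.3, 'column3': 'value_0'}]
```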

📓 Full Changelog

This release shipped with many more improvements, bug fixes, and docs updates. To learn more about all of the enhancements and bug fixes in this release, check out the changelog here.

🛣️ What’s Next?

Hopefully this release gets you excited about Pandera! We’ve now opened the door to support the validation of any DataFrame library, and in fact, any Python object that you want (although at that point you should probably just use Pydantic 🙃).

What DataFrame library are we going for next? One word (of the ursine variety): Polars 🐻‍❄. There’s already a mini-roadmap that I’ll be converting into a proper roadmap over the next few weeks, but if you’re interested in contributing to this effort, head over to the issue and throw your hat into the contributor ring by posting a comment and saying hi!

Finally, if you’re new to Pandera, you can give it a try directly in your browser, courtesy of JupyterLite.

▶️ Try Pandera: https://pandera.readthedocs.io/en/latest/try_pandera.html
