Niels Bantilan

Pandera 0.19.0: Polars DataFrame Validation

The day is finally here! Pandera 0.19.0 ships support for Polars 🎉. I’m especially excited about this integration because, even though Pandera is still a Python project, it can now leverage the performance benefits of Rust.

This feature has been many years in the making, from the work of rewriting Pandera’s internals to decouple its strong dependency on the pandas API, to the PySpark integration effort that added support for a non-pandas-like dataframe library.

Without further ado, here’s an example of how to validate `polars.DataFrame` and `polars.LazyFrame` objects:

import pandera.polars as pa
import polars as pl


class Schema(pa.DataFrameModel):
    state: str
    city: str
    price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})


lf = pl.LazyFrame(
    {
        'state': ['FL','FL','CA','CA'],
        'city': ['Orlando', 'Miami', 'San Francisco', 'San Diego'],
        'price': [8, 12, 10, 16],
    }
)
Schema.validate(lf).collect()

You can also use the `check_types` decorator for functional validation:

from pandera.typing.polars import LazyFrame

@pa.check_types
def function(lf: LazyFrame[Schema]) -> LazyFrame[Schema]:
    return lf.filter(pl.col("state").eq("CA"))

function(lf).collect()

And, of course, if you want to use the object-based API, you can define the equivalent `DataFrameSchema`:

schema = pa.DataFrameSchema({
    "state": pa.Column(str),
    "city": pa.Column(str),
    "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
schema.validate(lf).collect()

Validating LazyFrames vs DataFrames

The main difference between validating `LazyFrame`s vs `DataFrame`s is that Pandera will only validate schema-level properties—e.g. the presence of columns and their data types—when validating `LazyFrame`s.

class ModelWithChecks(pa.DataFrameModel):
    a: int
    b: str = pa.Field(isin=[*"abc"])
    c: float = pa.Field(ge=0.0, le=1.0)

invalid_lf = pl.LazyFrame({
    "a": pl.Series(["1", "2", "3"], dtype=pl.Utf8),
    "b": ["d", "e", "f"],
    "c": [0.0, 1.1, -0.1],
})
ModelWithChecks.validate(invalid_lf, lazy=True)
"""
Traceback (most recent call last):
...
pandera.errors.SchemaErrors: {
    "SCHEMA": {
        "WRONG_DATATYPE": [
            {
                "schema": "ModelWithChecks",
                "column": "a",
                "check": "dtype('Int64')",
                "error": "expected column 'a' to have type Int64, got String"
            }
        ]
    }
}
"""

On the other hand, Pandera will examine both schema- and data-level properties when validating a `DataFrame`. For example, data-level properties would include any `Check`s that you specify in the schema definition, which require looking at the actual data values:

class ModelWithChecks(pa.DataFrameModel):
    a: int
    b: str = pa.Field(isin=[*"abc"])
    c: float = pa.Field(ge=0.0, le=1.0)

invalid_df = pl.DataFrame({
    "a": pl.Series(["1", "2", "3"], dtype=pl.Utf8),
    "b": ["d", "e", "f"],
    "c": [0.0, 1.1, -0.1],
})
ModelWithChecks.validate(invalid_df, lazy=True)

"""
Traceback (most recent call last):
...
pandera.errors.SchemaErrors: {
    "SCHEMA": {
        "WRONG_DATATYPE": [
            {
                "schema": "ModelWithChecks",
                "column": "a",
                "check": "dtype('Int64')",
                "error": "expected column 'a' to have type Int64, got String"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "ModelWithChecks",
                "column": "b",
                "check": "isin(['a', 'b', 'c'])",
                "error": "Column 'b' failed validator number 0: <Check isin: isin(['a', 'b', 'c'])> failure case examples: [{'b': 'd'}, {'b': 'e'}, {'b': 'f'}]"
            },
            ...
        ]
    }
}
"""

This behavior adheres to Pandera’s design philosophy of minimizing the surprise for users of the underlying dataframe library. If I have a `LazyFrame` method chain, I don’t want to break the chain of lazy operations and the optimizations that polars does under the hood:

# a schema matching this frame; the three-column `schema` above would reject it
schema = pa.DataFrameSchema({"a": pa.Column(int)})

df = (
    pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
    .cast({"a": pl.Int64})
    .pipe(schema.validate)  # this only validates schema-level properties
    .with_columns(b=pl.lit("a"))
    # do more lazy operations
    .collect()
)
print(df)

If you want to check the actual data values, the `collect` call that materializes the data needs to be explicit in the code:

df = (
    pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
    .cast({"a": pl.Int64})
    .collect()             # convert to pl.DataFrame
    .pipe(schema.validate) # this validates schema- and data-level properties
    .lazy()                # convert back to pl.LazyFrame
    .with_columns(b=pl.lit("a"))
    # do more lazy operations
    .collect()
)
print(df)

There is a way of overriding the `LazyFrame` validation behavior by exporting the environment variable `PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA`, which will then cause Pandera to validate both schema- and data-level properties.

You can read more about this integration in the docs. For a comprehensive list of all supported and unsupported features, check out the new supported features table in the documentation.

Wrapping up

This has been a highly requested feature in the Pandera community for quite some time now, and I’m happy that we’ve been able to deliver initial support for Polars with the help of the community. If you want to get involved in Pandera, you can join the Discord community. There’s also a dedicated #pandera-polars channel if you want to discuss ideas relating to the Polars integration.

I wanted to give special shoutouts to @AndriiG13 and @FilipAisot for their contributions on the built-in checks and polars datatypes, respectively, and to @evanrasmussen9, @baldwinj30, @obiii, @Filimoa, @philiporlando, @r-bar, @alkment, @jjfantini, and @robertdj for their early feedback and bug reports during the 0.19.0 beta. Check out the full changelog for 0.19.0 here.

What’s next for Pandera? Besides the never-ending quest to fix bugs and improve developer experience, we’ve already set our sights on the next big thing: Ibis support 🦩.
