Niels Bantilan

Pandera 0.18: Global and granular validation controls

Pandera 0.18 introduces two new configuration settings that control how validation happens: a global validation on/off switch that you can set through the `PANDERA_VALIDATION_ENABLED` environment variable, and granular control of schema and data validation that you can set through the `PANDERA_VALIDATION_DEPTH` environment variable. These settings were first introduced in version 0.16.0 but were only available in the pyspark validation engine. Release 0.18 ports these settings to the pandas validation engine.

Before now, a call to `schema.validate(dataframe)` would perform run-time validation on the `dataframe` based on the schema specification:

import pandera as pa
import pandas as pd

class MyData(pa.DataFrameModel):
    x: int = pa.Field(gt=0)
    y: float = pa.Field(ge=0.0, le=1.0)
    z: str = pa.Field(isin=[*"abc"])

data = ...
MyData.validate(data)  # 👈 validates at run-time

This run-time validation also applies to functions decorated with `pandera.check_types`:

from pandera.typing import DataFrame

@pa.check_types
def transform(data: DataFrame[MyData]) -> DataFrame[MyData]:  # 👈 validates at run-time
    ...  # perform transformations on the data

Global validation on/off switch

With `export PANDERA_VALIDATION_ENABLED=False`, you can turn off validation altogether with a simple switch, no code changes necessary! You might want to do this in production contexts where you don’t want to incur the additional run-time cost of validating data, which can be substantial with very large datasets. In these cases, you may have development and/or staging pipelines where you set `PANDERA_VALIDATION_ENABLED=True` to perform Pandera validation on realistic-looking data or samples of your real data, while you set `PANDERA_VALIDATION_ENABLED=False` in your production environment to shut off validation there.

While this does somewhat defeat the point of making sure your actual data is valid, one pattern that may make sense for you would be to perform validation on your production data at rest. In other words, run your data ingestion pipeline, which may produce a dataset that you persist in some blob store, and then have a separate workload that runs the validation procedure (perhaps as a scheduled job). This way, you don’t have to hold up the production pipelines, especially when you do find data errors.
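As a rough sketch of the development/staging/production split described in this section, a small launcher could pick the switch value per stage and hand it to the pipeline process. The stage names, `VALIDATION_BY_STAGE`, `pipeline_env`, and `run_pipeline` are all made up for illustration, and the child command below is only a stand-in that echoes the variable; in practice you would replace it with your own pipeline entry point.

```python
import os
import subprocess
import sys

# Hypothetical mapping from deployment stage to the switch value.
VALIDATION_BY_STAGE = {
    "development": "True",
    "staging": "True",
    "production": "False",
}


def pipeline_env(stage: str) -> dict:
    """Copy the current environment and set the switch for the given stage."""
    env = dict(os.environ)
    env["PANDERA_VALIDATION_ENABLED"] = VALIDATION_BY_STAGE[stage]
    return env


def run_pipeline(stage: str, command: list) -> subprocess.CompletedProcess:
    """Run the pipeline command as a child process with the stage's setting."""
    return subprocess.run(
        command,
        env=pipeline_env(stage),
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for a real pipeline: the child only reports the value it sees.
echo_switch = [
    sys.executable,
    "-c",
    "import os; print(os.environ['PANDERA_VALIDATION_ENABLED'])",
]
result = run_pipeline("production", echo_switch)
print(result.stdout.strip())
```

The pipeline code itself stays untouched; only the environment handed to the process changes between stages.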

Granular control of schema and data validation

An orthogonal approach to streamlined data validation is to only perform checks that can be done on the dataframe’s metadata and skip the checks that have to inspect the actual data values.

Pandera provides a way for you to do this through the `PANDERA_VALIDATION_DEPTH` configuration setting, which differentiates between schema-level validations and data-level validations.

Schema-level validations are checks on metadata:

  • Checking for column presence
  • Verifying column data types
  • Ensuring column ordering 

Data-level validations, as the name suggests, are checks that inspect actual data values, for example:

  • Checking that integer values of a column are positive numbers
  • Making sure that string values are drawn from a set, e.g. `{"Apple", "Orange", "Banana"}`
  • Checking that floating-point values are probabilities between 0.0 and 1.0
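To make the distinction concrete, here is a plain-pandas sketch of the two kinds of checks listed above. This is not how Pandera implements them; it only illustrates that schema-level questions can be answered from column labels and dtypes, while data-level questions require scanning the values. The example dataframe is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1, -2, 3],        # contains a non-positive value on purpose
        "y": [0.1, 0.5, 0.9],
        "z": ["a", "b", "c"],
    }
)

# Schema-level: answered from metadata alone (column labels and dtypes).
expected_columns = ["x", "y", "z"]
schema_ok = bool(
    list(df.columns) == expected_columns          # presence and ordering
    and pd.api.types.is_integer_dtype(df["x"])    # data types
    and pd.api.types.is_float_dtype(df["y"])
)

# Data-level: every check has to look at the values in a column.
data_ok = bool(
    (df["x"] > 0).all()
    and df["y"].between(0.0, 1.0).all()
    and df["z"].isin(["a", "b", "c"]).all()
)

print(schema_ok, data_ok)
```

Here the frame has the right shape and types but a bad value in `x`, so only the data-level checks fail. The metadata checks cost roughly the same regardless of row count, which is why skipping data-level checks can matter for large datasets.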

If we look at the schema below, we can see that it’s specified to perform all of the validations above:

class MyData(pa.DataFrameModel):
    x: int = pa.Field(gt=0)
    y: float = pa.Field(ge=0.0, le=1.0)
    z: str = pa.Field(isin=[*"abc"])

    class Config:
        ordered = True

With the `PANDERA_VALIDATION_DEPTH` environment variable, you can determine what kinds of validations to perform:

export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA  # schema- & data-level checks (default)
export PANDERA_VALIDATION_DEPTH=SCHEMA_ONLY  # only do schema-level checks
export PANDERA_VALIDATION_DEPTH=DATA_ONLY  # only do data-level checks
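If you would rather set the depth from Python than from the shell, something like the sketch below may work. `set_validation_depth` is a hypothetical helper, not part of Pandera, and it only accepts the three values listed above so that typos fail loudly. It also assumes Pandera reads the variable when it is first imported, which is worth confirming for the version you run.

```python
import os

# The three depth values listed in this post.
ALLOWED_DEPTHS = ("SCHEMA_AND_DATA", "SCHEMA_ONLY", "DATA_ONLY")


def set_validation_depth(depth: str) -> str:
    """Hypothetical helper: set PANDERA_VALIDATION_DEPTH, rejecting typos early."""
    normalized = depth.strip().upper()
    if normalized not in ALLOWED_DEPTHS:
        raise ValueError(
            f"Unknown validation depth {depth!r}; expected one of {ALLOWED_DEPTHS}"
        )
    os.environ["PANDERA_VALIDATION_DEPTH"] = normalized
    return normalized


# Assumption: the variable is read at import time, so set it before
# `import pandera` runs anywhere in the process.
set_validation_depth("schema_only")
print(os.environ["PANDERA_VALIDATION_DEPTH"])
```

Setting the variable in the shell or your deployment config remains the simpler route; a helper like this is mainly useful in notebooks or test setups where you control the process from Python.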

Wrapping up

The 0.18.* releases also deliver bug fixes, improved docs, and housekeeping changes; see the full changelogs here and here.

What’s next for Pandera? I’m happy to announce that the pandera-polars integration beta release 0.19.0b0 is now available for early testing! Just `pip install pandera[polars]==0.19.0b0` and check out the preview docs here. If you have any feedback, please feel free to join the #pandera-polars channel on our Discord.

Thanks for reading, and if you’re new to Pandera, you can try it out quickly ▶️ here.
