
feat: Add Pandera data validation plugin #631

Open
andreahlert wants to merge 1 commit into flyteorg:main from andreahlert:feat/pandera-plugin

@andreahlert
Contributor

Summary

Port the Pandera plugin from flytekit v1 to the Flyte v2 SDK, enabling automatic runtime validation of pandas DataFrames against Pandera schemas as data flows between tasks.

This brings Pandera support to the v2 SDK, following the same plugin architecture used by polars, wandb, and other existing plugins.

What's included

  • PanderaTransformer: A TypeTransformer that wraps the DataFrameTransformerEngine to add Pandera schema validation on both to_literal (output) and to_python_value (input)
  • ValidationConfig: Configurable error handling (raise or warn) via typing.Annotated
  • PandasReportRenderer: HTML report generation using great_tables for validation results (compatible with Flyte Decks)
  • Validation memo: Skips redundant re-validation during local execution where to_literal is immediately followed by to_python_value
  • Entry point registration: Registered via flyte.plugins.types entry point, loaded automatically by flyte.init()
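
To make the `typing.Annotated` configuration concrete, here is a minimal, stdlib-only sketch of how a transformer could pull a config object out of an annotation and apply raise or warn handling. It is not the plugin's code: `ValidationConfig` below is a hypothetical stand-in that models only the `on_error` field mentioned above, and `FakeFrame`, `split_annotation`, and `handle_failure` are illustrative names.

```python
import warnings
from dataclasses import dataclass
from typing import Annotated, Literal, get_args, get_origin


@dataclass(frozen=True)
class ValidationConfig:
    # Hypothetical stand-in; only the on_error option comes from the PR text.
    on_error: Literal["raise", "warn"] = "raise"


class FakeFrame:
    """Placeholder for DataFrame[UserSchema] so the sketch needs no pandas."""


def split_annotation(tp):
    """Return (base type, config); fall back to raise mode when none is given."""
    if get_origin(tp) is Annotated:
        base, *metadata = get_args(tp)
        for item in metadata:
            if isinstance(item, ValidationConfig):
                return base, item
        return base, ValidationConfig()
    return tp, ValidationConfig()


def handle_failure(error: Exception, config: ValidationConfig) -> None:
    """Re-raise in raise mode; downgrade to a warning in warn mode."""
    if config.on_error == "warn":
        warnings.warn(f"validation failed: {error}", RuntimeWarning)
    else:
        raise error
```

With this shape, a task signature such as `Annotated[DataFrame[UserSchema], ValidationConfig(on_error="warn")]` would carry the config alongside the type, and an unannotated `DataFrame[UserSchema]` would default to raising.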

Usage

import flyte
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

env = flyte.TaskEnvironment(name="my-env")

class UserSchema(pa.DataFrameModel):
    name: Series[str]
    age: Series[int] = pa.Field(ge=0, le=120)
    email: Series[str]

@env.task
async def generate_users() -> DataFrame[UserSchema]:
    return pd.DataFrame({
        "name": ["Alice", "Bob"],
        "age": [25, 30],
        "email": ["alice@example.com", "bob@example.com"],
    })

@env.task
async def process_users(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    df["age"] = df["age"] + 1
    return df

Test plan

  • Transformer registration with TypeEngine
  • Literal type generation for pandera DataFrame types
  • Schema extraction from DataFrameModel annotations
  • Validation with valid data (to_literal)
  • Validation failure with invalid data (raises SchemaErrors)
  • Warn mode with ValidationConfig(on_error="warn")
  • Roundtrip encode/decode with valid data
  • Validation memo prevents duplicate validation
  • Type assertion checks

.as_raw_html()
)

def to_html(
Contributor

There is no implicit to_html call in 2.0; you will have to invoke it explicitly. cc @wild-endeavor, also cc @cosmicBboy.

What is the right UX for this?

Contributor Author

Yeah, this was one of the parts I wasn't sure about. I ported the renderer from v1 but I knew the Decks integration doesn't exist the same way in v2, so I left it open for discussion.

Looking at the codebase, I see v2 has the Renderable protocol and TypeEngine.to_html() checks for it. I think the cleanest approach would be to override to_html() directly in PanderaTransformer - since the transformer already has access to the schema via _get_pandera_schema():

def to_html(self, python_val, expected_python_type):
    # Reuse the schema the transformer already resolves for validation.
    schema, config = self._get_pandera_schema(expected_python_type)
    renderer = PandasReportRenderer(title=f"Pandera Report: {schema.name}")
    return renderer.to_html(python_val, schema)

This way it works automatically through the Report system without users needing to do anything extra.

Another option would be making PandasReportRenderer implement the Renderable protocol so users could opt-in via Annotated[DataFrame[Schema], PandasReportRenderer()], but that feels like unnecessary friction for the common case.
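
For comparison, the opt-in variant could be sketched roughly like this. Everything in it is a stand-in: `Renderable` is modeled as a `typing.Protocol` with a single `to_html` method, which is an assumption about the shape of the v2 protocol, and `ReportRenderer` and `find_renderer` are hypothetical names rather than SDK or plugin APIs.

```python
import html
from typing import Annotated, Any, Protocol, get_args, get_origin, runtime_checkable


@runtime_checkable
class Renderable(Protocol):
    # Assumed protocol shape: one method that turns a value into HTML.
    def to_html(self, python_value: Any) -> str: ...


class ReportRenderer:
    """Toy renderer; the real one would build a great_tables report."""

    def __init__(self, title: str = "Validation Report") -> None:
        self.title = title

    def to_html(self, python_value: Any) -> str:
        body = html.escape(repr(python_value))
        return f"<h1>{html.escape(self.title)}</h1><pre>{body}</pre>"


def find_renderer(tp):
    """Return the first Renderable attached via Annotated metadata, else None."""
    if get_origin(tp) is Annotated:
        for item in get_args(tp)[1:]:
            if isinstance(item, Renderable):
                return item
    return None
```

The trade-off is visible in `find_renderer`: a plain annotation yields `None`, so users only get a report when they attach a renderer themselves.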

Open to suggestions on the right direction here.

@kumare3
Contributor

kumare3 commented Feb 8, 2026

What is the usage example for this? I don't think this works, especially for the HTML report.

Port the Pandera plugin from flytekit v1 to the Flyte v2 SDK,
enabling automatic runtime validation of pandas DataFrames
against Pandera schemas as data flows between tasks.

The plugin registers pandera.typing.DataFrame as a custom type
with the TypeEngine, wrapping the DataFrameTransformerEngine
to add schema validation on both serialization and deserialization.

Features:
- Automatic validation via pandera.typing.DataFrame type annotations
- Configurable error handling (raise or warn) via ValidationConfig
- HTML validation reports using great_tables for Flyte Decks
- Validation memo to skip redundant re-validation in local execution

Signed-off-by: André Ahlert <andre@aex.partners>

2 participants