feat: Add Pandera data validation plugin #631
andreahlert wants to merge 1 commit into flyteorg:main
    .as_raw_html()
    )

    def to_html(
There is no implicit to_html call in 2.0; you will have to invoke it explicitly. cc @wild-endeavor?
Also cc @cosmicBboy.
What is the right UX for this?
Yeah, this was one of the parts I wasn't sure about. I ported the renderer from v1 but I knew the Decks integration doesn't exist the same way in v2, so I left it open for discussion.
Looking at the codebase, I see v2 has the Renderable protocol and TypeEngine.to_html() checks for it. I think the cleanest approach would be to override to_html() directly in PanderaTransformer - since the transformer already has access to the schema via _get_pandera_schema():
    def to_html(self, python_val, expected_python_type):
        schema, config = self._get_pandera_schema(expected_python_type)
        renderer = PandasReportRenderer(title=f"Pandera Report: {schema.name}")
        return renderer.to_html(python_val, schema)

This way it works automatically through the Report system without users needing to do anything extra.
Another option would be making PandasReportRenderer implement the Renderable protocol so users could opt in via Annotated[DataFrame[Schema], PandasReportRenderer()], but that feels like unnecessary friction for the common case.
Open to suggestions on the right direction here.
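To compare the two options side by side, here is a minimal stdlib-only sketch of how the dispatch could look. Every name in it (`Renderable` as defined here, `FakeRenderer`, `FakeTransformer`, `render_report`) is a hypothetical stand-in and not the Flyte v2 API; the real Renderable protocol and TypeEngine.to_html() may well have different signatures.

```python
from typing import Annotated, Any, Protocol, get_args, get_origin, runtime_checkable


@runtime_checkable
class Renderable(Protocol):
    """Hypothetical stand-in for a renderer protocol (not the real Flyte definition)."""

    def to_html(self, python_value: Any) -> str: ...


class FakeRenderer:
    """Hypothetical opt-in renderer that a user attaches via Annotated."""

    def to_html(self, python_value: Any) -> str:
        return f"<p>opt-in report for {python_value!r}</p>"


class FakeTransformer:
    """Hypothetical transformer that overrides to_html directly."""

    def to_html(self, python_val: Any, expected_python_type: Any) -> str:
        return f"<p>default report for {python_val!r}</p>"


def render_report(transformer: FakeTransformer, python_val: Any, expected_type: Any) -> str:
    """Prefer a Renderable found in Annotated metadata, else use the transformer."""
    if get_origin(expected_type) is Annotated:
        # get_args returns (base_type, *metadata); skip the base type.
        for meta in get_args(expected_type)[1:]:
            if isinstance(meta, Renderable):
                return meta.to_html(python_val)
    # No opt-in renderer was attached, so fall back to the transformer override.
    return transformer.to_html(python_val, expected_type)
```

Under these assumptions, a plain type annotation gets a report with no extra user code, and an Annotated renderer only takes over when a user explicitly attaches one.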
What is the usage example for this? I don't think this works, especially for the HTML report.
Port the Pandera plugin from flytekit v1 to the Flyte v2 SDK, enabling automatic runtime validation of pandas DataFrames against Pandera schemas as data flows between tasks.

The plugin registers pandera.typing.DataFrame as a custom type with the TypeEngine, wrapping the DataFrameTransformerEngine to add schema validation on both serialization and deserialization.

Features:
- Automatic validation via pandera.typing.DataFrame type annotations
- Configurable error handling (raise or warn) via ValidationConfig
- HTML validation reports using great_tables for Flyte Decks
- Validation memo to skip redundant re-validation in local execution

Signed-off-by: André Ahlert <andre@aex.partners>
5c32e96 to 70add7b
Summary
Port the Pandera plugin from flytekit v1 to the Flyte v2 SDK, enabling automatic runtime validation of pandas DataFrames against Pandera schemas as data flows between tasks.
This brings Pandera support to the v2 SDK, following the same plugin architecture used by polars, wandb, and other existing plugins.
What's included
- PanderaTransformer: A TypeTransformer that wraps the DataFrameTransformerEngine to add Pandera schema validation on both to_literal (output) and to_python_value (input)
- ValidationConfig: Configurable error handling (raise or warn) via typing.Annotated
- PandasReportRenderer: HTML report generation using great_tables for validation results (compatible with Flyte Decks)
- Validation memo: skips redundant re-validation in local execution, when to_literal is immediately followed by to_python_value
- flyte.plugins.types entry point, loaded automatically by flyte.init()

Usage
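To make the ValidationConfig and validation-memo items above concrete, here is a minimal stdlib-only sketch of how such behavior could work. It is not the plugin's code: the `ValidationConfig` class below is a stand-in with an assumed `on_error` field, `get_config` and `MemoizingValidator` are hypothetical helpers, and keying the memo on `id()` is a simplification chosen for illustration.

```python
import warnings
from dataclasses import dataclass
from typing import Annotated, Any, Callable, Literal, get_args, get_origin


@dataclass(frozen=True)
class ValidationConfig:
    """Stand-in config (assumed shape): how a failed validation is reported."""

    on_error: Literal["raise", "warn"] = "raise"


def get_config(expected_type: Any) -> ValidationConfig:
    """Pull a ValidationConfig out of Annotated metadata, defaulting to 'raise'."""
    if get_origin(expected_type) is Annotated:
        for meta in get_args(expected_type)[1:]:
            if isinstance(meta, ValidationConfig):
                return meta
    return ValidationConfig()


class MemoizingValidator:
    """Hypothetical validator that skips objects it has already checked."""

    def __init__(self, check: Callable[[Any], None]) -> None:
        self._check = check  # raises ValueError when the value is invalid
        self._validated_ids: set[int] = set()  # simplification: memo keyed on id()
        self.calls = 0  # number of times the check actually ran

    def validate(self, value: Any, expected_type: Any) -> Any:
        if id(value) in self._validated_ids:
            return value  # same object was just checked, so skip re-validation
        self.calls += 1
        config = get_config(expected_type)
        try:
            self._check(value)
        except ValueError as exc:
            if config.on_error == "raise":
                raise
            # on_error == "warn": report the failure but let the value through
            warnings.warn(str(exc), RuntimeWarning)
        self._validated_ids.add(id(value))
        return value
```

In this sketch, calling `validate` twice on the same object runs the check once, which mirrors the case where to_literal is immediately followed by to_python_value. A value annotated with `Annotated[..., ValidationConfig(on_error="warn")]` produces a warning instead of an exception.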
Test plan