feat: Add Hugging Face Datasets plugin #629
Open
andreahlert wants to merge 6 commits into flyteorg:main from
be91055 to 17d010a
Contributor, Author
@pingsutw @cosmicBboy @kumare3 could you take a look? This ports the huggingface plugin from flytekit to v2, same approach as the polars plugin.
kumare3
reviewed
Feb 8, 2026
plugins/huggingface/src/flyteplugins/huggingface/df_transformer.py
Outdated, resolved
kumare3
reviewed
Feb 8, 2026
Add a new plugin that provides native support for HuggingFace datasets.Dataset as a Flyte DataFrame type, enabling seamless serialization/deserialization through Parquet format.

Features:
- DataFrameEncoder/Decoder for datasets.Dataset <-> Parquet
- Cloud storage support (S3, GCS, Azure) via fsspec storage options
- Anonymous S3 fallback for public datasets
- Column filtering on both encode and decode
- Auto-registration via flyte.plugins.types entry point

Signed-off-by: André Ahlert <andre@aex.partners>
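The "column filtering on both encode and decode" item can be illustrated with a short sketch. It assumes only that the dataset object exposes `column_names` and `remove_columns`, as `datasets.Dataset` does. The helper names `columns_to_remove` and `filter_columns` are hypothetical and are not taken from this PR.

```python
from typing import List, Optional, Sequence


def columns_to_remove(all_columns: Sequence[str], wanted: Optional[Sequence[str]]) -> List[str]:
    """Return the columns to drop so that only `wanted` remain (hypothetical helper).

    `wanted=None` means "no filter", so nothing is removed.
    """
    if wanted is None:
        return []
    # Fail early on typos instead of silently returning fewer columns.
    missing = [c for c in wanted if c not in all_columns]
    if missing:
        raise KeyError(f"requested columns not in dataset: {missing}")
    keep = set(wanted)
    # Preserve the dataset's own column order in the result.
    return [c for c in all_columns if c not in keep]


def filter_columns(ds, wanted: Optional[Sequence[str]] = None):
    """Apply the filter to any object with `column_names` and `remove_columns`."""
    to_remove = columns_to_remove(ds.column_names, wanted)
    return ds.remove_columns(to_remove) if to_remove else ds
```

On a real `datasets.Dataset`, `filter_columns(ds, ["text"])` would return a new dataset containing only the `text` column, since `remove_columns` does not modify the dataset in place.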
b5f2178 to da8a0a2
kumare3
reviewed
Mar 3, 2026
plugins/huggingface/src/flyteplugins/huggingface/df_transformer.py
Outdated, resolved
kumare3
reviewed
Mar 3, 2026
plugins/huggingface/src/flyteplugins/huggingface/df_transformer.py
Outdated, resolved
… infra
- Use storage.get_configured_fsspec_kwargs() instead of get_storage() (fix review)
- Add [tool.uv.sources] flyte editable for dev (match Anthropic/OpenAI)
- Conftest: use LocalDB._get_db_path and reset _conn (match Polars after main)
- Tests: patch flyte.storage._storage.get_storage; run.outputs()[0]; skip empty dataset to avoid CI flakiness

Signed-off-by: André Ahlert <andre@aex.partners>
…, DataFrame Signed-off-by: André Ahlert <andre@aex.partners>
c5c1f84 to c73f62f
Contributor, Author
Hi Ketan! With 2.0 out I’ve rebased and addressed several of your comments: using the public storage API instead of get_storage, and public flyte.io imports where possible.
kumare3
reviewed
Mar 6, 2026
datasets = lazy_module("datasets")


def get_hf_storage_options(protocol: typing.Optional[str], anonymous: bool = False) -> typing.Dict[str, typing.Any]:
Contributor
why is this function needed?
kumare3
reviewed
Mar 6, 2026
df = df.remove_columns(to_remove)

filesystem = storage.get_underlying_filesystem(path=path)
storage_options = get_hf_storage_options(protocol=filesystem.protocol)
Contributor
I don't see any HF storage options; I think you can drop this entirely.
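For readers following this thread, here is a minimal sketch of what a protocol-based storage-options helper with an anonymous S3 fallback might look like. It is not the PR's implementation: `sketch_storage_options` and `read_with_anonymous_fallback` are hypothetical names, the reader is an injected callable so the sketch stays library-agnostic, and `PermissionError` stands in for whatever credentials error the real backend raises. The two facts it relies on are that fsspec filesystems expose `protocol` as either a string or a tuple of aliases, and that s3fs accepts `anon=True` for unsigned requests.

```python
from typing import Any, Callable, Dict, Tuple, Type, Union

# fsspec exposes `protocol` as a str or a tuple of aliases, e.g. ("s3", "s3a").
ProtocolSpec = Union[str, Tuple[str, ...], None]


def sketch_storage_options(protocol: ProtocolSpec, anonymous: bool = False) -> Dict[str, Any]:
    """Hypothetical helper: build fsspec storage options for a protocol."""
    names = (protocol,) if isinstance(protocol, str) else tuple(protocol or ())
    if anonymous and "s3" in names:
        return {"anon": True}  # s3fs option for unsigned (anonymous) requests
    return {}


def read_with_anonymous_fallback(
    read: Callable[[str, Dict[str, Any]], Any],
    path: str,
    protocol: ProtocolSpec,
    credentials_error: Type[BaseException] = PermissionError,  # assumption: stand-in error type
) -> Any:
    """Try a normal read first; on a credentials error, retry anonymously if possible."""
    try:
        return read(path, sketch_storage_options(protocol))
    except credentials_error:
        options = sketch_storage_options(protocol, anonymous=True)
        if not options:
            raise  # no anonymous mode known for this protocol
        return read(path, options)
```

If the configured fsspec kwargs already carry everything the reader needs, a helper like this adds nothing beyond the `anon` retry, which is the trade-off the review comments are pointing at.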
Summary
Port of the flytekit-huggingface plugin to flyte-sdk v2, enabling native support for datasets.Dataset as a Flyte DataFrame type.
- datasets.Dataset with Parquet serialization
- flyte.plugins.types entry point

Follows the same patterns as the existing Polars plugin.
Usage Example
Test plan
flyte.TaskEnvironment