feat: Add Hugging Face Datasets plugin #629

Open
andreahlert wants to merge 6 commits into flyteorg:main from andreahlert:feat/huggingface-plugin

andreahlert commented Feb 7, 2026

Summary

Port of the flytekit-huggingface plugin to flyte-sdk v2, enabling native support for datasets.Dataset as a Flyte DataFrame type.

  • DataFrameEncoder/Decoder for datasets.Dataset with Parquet serialization
  • Cloud storage support (S3, GCS, Azure) via fsspec-compatible storage options
  • Anonymous S3 fallback for public datasets (mirrors Polars plugin pattern)
  • Column filtering on both encode and decode via type annotations
  • Auto-registration via flyte.plugins.types entry point

Follows the same patterns as the existing Polars plugin.

Usage Example

import flyte
import datasets

env = flyte.TaskEnvironment("hf-example")

@env.task
async def create_dataset() -> datasets.Dataset:
    return datasets.Dataset.from_dict({
        "text": ["hello", "world"],
        "label": [0, 1],
    })

@env.task
async def process(ds: datasets.Dataset) -> int:
    return len(ds)
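
The column filtering mentioned in the summary can be illustrated with a small, library-free sketch. The helper `columns_to_remove` is hypothetical and is not the plugin's code; it only shows the kind of set logic that could precede a call to `Dataset.remove_columns`, assuming that an empty request means "keep every column".

```python
from typing import Iterable, List


def columns_to_remove(present: Iterable[str], requested: Iterable[str]) -> List[str]:
    """Hypothetical helper: columns in `present` that were not requested.

    The original column order is preserved. An empty `requested` is treated
    as "no filtering" (an assumption of this sketch), so nothing is removed.
    """
    wanted = set(requested)
    if not wanted:
        return []
    return [name for name in present if name not in wanted]


# Using the column names from the usage example above.
to_remove = columns_to_remove(["text", "label"], ["text"])
print(to_remove)  # ['label']
```

With a real dataset, `present` would be `ds.column_names` and the result could be passed to `ds.remove_columns(...)` when it is non-empty.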

Test plan

  • Type recognition tests (Dataset, with columns, with format annotations)
  • Handler registration and property tests
  • Encode/decode roundtrip tests
  • DataFrame wrapper integration test
  • Raw task I/O test via flyte.TaskEnvironment
  • Column subsetting on decode
  • Various data types (int, float, str, bool)
  • Empty dataset roundtrip
  • Storage options tests (S3, GCS, Azure, anonymous, unknown protocol)
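
To make the storage-options cases in the test plan concrete, here is a hedged sketch of a protocol-to-options mapping and the kind of assertions such tests might contain. `sketch_storage_options` is a hypothetical stand-in; the body of the PR's `get_hf_storage_options` is not shown on this page, so the keys below are assumptions based on common fsspec backends (`anon` for s3fs, `token="anon"` for gcsfs), and Azure credentials are assumed to come from the environment.

```python
from typing import Any, Dict, Optional, Sequence, Union


def sketch_storage_options(
    protocol: Optional[Union[str, Sequence[str]]],
    anonymous: bool = False,
) -> Dict[str, Any]:
    """Hypothetical mapping from a filesystem protocol to fsspec-style options."""
    # fsspec filesystems may report `protocol` as a tuple such as ("s3", "s3a").
    if isinstance(protocol, (tuple, list)):
        protocol = protocol[0] if protocol else None

    if protocol in ("s3", "s3a"):
        # s3fs: anon=True sends unsigned requests, useful for public buckets.
        return {"anon": True} if anonymous else {}
    if protocol in ("gs", "gcs"):
        # gcsfs: token="anon" requests anonymous access.
        return {"token": "anon"} if anonymous else {}
    if protocol in ("abfs", "abfss", "az"):
        # Assumption: Azure credentials are picked up from the environment.
        return {}
    # Unknown or missing protocol: pass no extra options.
    return {}


# Checks mirroring the cases listed in the test plan.
assert sketch_storage_options("s3") == {}
assert sketch_storage_options("s3", anonymous=True) == {"anon": True}
assert sketch_storage_options(("gs", "gcs"), anonymous=True) == {"token": "anon"}
assert sketch_storage_options("abfs") == {}
assert sketch_storage_options("unknown-protocol") == {}
```

Returning an empty dict for unknown protocols keeps the caller's code path uniform, since the options can always be forwarded as keyword arguments.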

andreahlert force-pushed the feat/huggingface-plugin branch from be91055 to 17d010a on February 7, 2026 19:37

andreahlert commented

@pingsutw @cosmicBboy @kumare3 could you take a look? This ports the huggingface plugin from flytekit to v2, same approach as the polars plugin.

Add a new plugin that provides native support for HuggingFace
datasets.Dataset as a Flyte DataFrame type, enabling seamless
serialization/deserialization through Parquet format.

Features:
- DataFrameEncoder/Decoder for datasets.Dataset <-> Parquet
- Cloud storage support (S3, GCS, Azure) via fsspec storage options
- Anonymous S3 fallback for public datasets
- Column filtering on both encode and decode
- Auto-registration via flyte.plugins.types entry point

Signed-off-by: André Ahlert <andre@aex.partners>
andreahlert force-pushed the feat/huggingface-plugin branch from b5f2178 to da8a0a2 on February 8, 2026 07:02
… infra

- Use storage.get_configured_fsspec_kwargs() instead of get_storage() (fix review)
- Add [tool.uv.sources] flyte editable for dev (match Anthropic/OpenAI)
- Conftest: use LocalDB._get_db_path and reset _conn (match Polars after main)
- Tests: patch flyte.storage._storage.get_storage; run.outputs()[0]; skip empty dataset to avoid CI flakiness

Signed-off-by: André Ahlert <andre@aex.partners>
…, DataFrame

Signed-off-by: André Ahlert <andre@aex.partners>
andreahlert force-pushed the feat/huggingface-plugin branch from c5c1f84 to c73f62f on March 6, 2026 09:53

andreahlert commented

Hi Ketan! With 2.0 out I’ve rebased and addressed several of your comments: using the public storage API instead of get_storage, and public flyte.io imports where possible.

andreahlert requested a review from kumare3 on March 6, 2026 09:57

datasets = lazy_module("datasets")


def get_hf_storage_options(protocol: typing.Optional[str], anonymous: bool = False) -> typing.Dict[str, typing.Any]:
Contributor

Why is this function needed?

df = df.remove_columns(to_remove)

filesystem = storage.get_underlying_filesystem(path=path)
storage_options = get_hf_storage_options(protocol=filesystem.protocol)
Contributor

I don't see any HF storage options; I think you can drop this entirely.

2 participants