
chore(deps): update dependency datasets to v4.7.0 #82

Open
red-hat-konflux-kflux-prd-rh02[bot] wants to merge 1 commit into main from
konflux/mintmaker/main/datasets-4.x

Conversation


@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot commented Feb 7, 2026

This PR contains the following updates:

| Package  | Change             |
|----------|--------------------|
| datasets | ==4.4.1 -> ==4.7.0 |

Release Notes

huggingface/datasets (datasets)

v4.7.0

Compare Source

Datasets Features

  • Add Json() type by @lhoestq in #8027
    • JSON Lines files that contain arbitrary JSON objects, such as tool-calling datasets, are now supported. When a field or subfield contains mixed types (e.g. a mix of str/int/float/dict/list, or dictionaries with arbitrary keys), the Json() type stores data that would normally not be supported in Arrow/Parquet
    • Use the Json() type in Features() for any dataset; it is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
    • Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()

Examples:

You can use on_mixed_types="use_json" or specify features= with a Json() type:

>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, 'foo', {'subfield': 'bar'}]

This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:

>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None

>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK

Another example with tool-calling data and the on_mixed_types="use_json" argument (useful to avoid specifying features= manually):

>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{'room': 'living room', 'state': 'on'}

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.6.1...4.7.0

v4.6.1

Compare Source

Bug fix

Full Changelog: huggingface/datasets@4.6.0...4.6.1

v4.6.0

Compare Source

Dataset Features

  • Support Image, Video and Audio types in Lance datasets

    >>> from datasets import load_dataset
    >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
    >>> ds.features
    {'video_blob': Video(),
     'video_path': Value('string'),
     'caption': Value('string'),
     'aesthetic_score': Value('float64'),
     'motion_score': Value('float64'),
     'temporal_consistency_score': Value('float64'),
     'camera_motion': Value('string'),
     'frame': Value('int64'),
     'fps': Value('float64'),
     'seconds': Value('float64'),
     'embedding': List(Value('float32'), length=1024)}
  • Push to hub now supports Video types

    >>> from datasets import Dataset, Video
    >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
    >>> ds = ds.cast_column("video", Video())
    >>> ds.push_to_hub("username/my-video-dataset")
  • Write image/audio/video blobs as-is in Parquet (PLAIN encoding) in push_to_hub() by @lhoestq in #7976

    • This enables cross-format Xet deduplication for image/audio/video, e.g. deduplicating videos between Lance, WebDataset, Parquet files and plain video files, which makes downloads from and uploads to Hugging Face faster
    • For example, if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload is much faster since the videos don't need to be reuploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
    • See more info here: https://huggingface.co/docs/hub/en/xet/deduplication
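The chunk-reuse idea behind this deduplication can be sketched with a toy content-addressed store. This is an illustration only, not Xet's actual algorithm: real Xet uses content-defined chunking rather than the fixed-size chunks below, and the "formats" here are fake placeholders.

```python
import hashlib

CHUNK = 1 << 16  # toy fixed-size 64 KiB chunks; real Xet uses content-defined chunking


def store(blob: bytes, chunk_store: dict) -> list:
    """Split a blob into chunks, store each under its hash, and return the recipe."""
    recipe = []
    for i in range(0, len(blob), CHUNK):
        chunk = blob[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # stored once; later blobs reuse it
        recipe.append(digest)
    return recipe


# The same "video" payload wrapped in two different container formats:
# only the small format-specific footer adds new chunks to the store.
payload = b"".join(i.to_bytes(4, "big") for i in range(CHUNK))  # 256 KiB, 4 distinct chunks
lance_blob = payload                        # fake "Lance" file: bare payload
parquet_blob = payload + b"PARQUET-FOOTER"  # fake "Parquet" file: payload + footer

chunk_store = {}
lance_recipe = store(lance_blob, chunk_store)
parquet_recipe = store(parquet_blob, chunk_store)

# 4 + 5 recipe entries, but only 5 chunks stored: the payload chunks are shared.
print(len(lance_recipe), len(parquet_recipe), len(chunk_store))  # 4 5 5
```

Uploading the second file therefore only transfers the footer chunk, which is why converting a dataset between formats on the Hub can skip re-uploading the media bytes.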


  • Add IterableDataset.reshard() by @​lhoestq in #​7992

    Reshard the dataset if possible, i.e. split the current shards further into more shards.
    This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
    Equality may happen if no shard can be split further.

    The resharding mechanism depends on the dataset file format:

    • Parquet: shard per row group instead of per file
    • Other: not implemented yet (contributions are welcome!)
    >>> from datasets import load_dataset
    >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
    >>> ds
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 4
    })
    >>> ds.reshard()
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 3600
    })
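The Parquet strategy above (one shard per row group instead of per file) can be sketched with a toy model. The shard and row-group representations here are illustrative placeholders, not the datasets internals:

```python
# Toy model of resharding: each shard is a "file" holding several row groups;
# resharding promotes every row group to a shard of its own, so the result has
# num_shards >= previous_num_shards (equality when each file has one row group).
def reshard(shards):
    return [[row_group] for shard in shards for row_group in shard]


# 2 "Parquet files" with 2 row groups each -> 4 shards after resharding
files = [["rg0", "rg1"], ["rg2", "rg3"]]
resharded = reshard(files)
print(len(files), "->", len(resharded))  # 2 -> 4
```

More shards means finer-grained parallelism, e.g. more workers can stream the dataset concurrently without any shard being split mid-file.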

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.5.0...4.6.0

v4.5.0

Compare Source

Dataset Features

  • Add lance format support by @​eddyxu in #​7913

    • Support for both Lance dataset (including metadata / manifests) and standalone .lance files
    • e.g. with lance-format/fineweb-edu
    from datasets import load_dataset
    
    ds = load_dataset("lance-format/fineweb-edu", streaming=True)
    for example in ds["train"]:
        ...

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.4.2...4.5.0

v4.4.2

Compare Source

Bug fixes

Minor additions

New Contributors

Full Changelog: huggingface/datasets@4.4.1...4.4.2


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

To execute skipped test pipelines write comment /ok-to-test.


Documentation

Find out how to configure dependency updates in MintMaker documentation or see all available configuration options in Renovate documentation.

@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot force-pushed the konflux/mintmaker/main/datasets-4.x branch from 4fc4035 to 3065cb0 Compare February 10, 2026 12:04
@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot force-pushed the konflux/mintmaker/main/datasets-4.x branch from 3065cb0 to c17f397 Compare February 25, 2026 12:05
@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot changed the title chore(deps): update dependency datasets to v4.5.0 chore(deps): update dependency datasets to v4.6.0 Feb 25, 2026
@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot force-pushed the konflux/mintmaker/main/datasets-4.x branch from c17f397 to c445402 Compare February 28, 2026 04:05
@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot changed the title chore(deps): update dependency datasets to v4.6.0 chore(deps): update dependency datasets to v4.6.1 Feb 28, 2026
Signed-off-by: red-hat-konflux-kflux-prd-rh02 <190377777+red-hat-konflux-kflux-prd-rh02[bot]@users.noreply.github.com>
@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot force-pushed the konflux/mintmaker/main/datasets-4.x branch from c445402 to d2de819 Compare March 9, 2026 20:51
@red-hat-konflux-kflux-prd-rh02 red-hat-konflux-kflux-prd-rh02 bot changed the title chore(deps): update dependency datasets to v4.6.1 chore(deps): update dependency datasets to v4.7.0 Mar 9, 2026
