Hello,
I am currently implementing a pipeline with DDP and wids. My dataloaders look like the following:
chunk_size = math.ceil(dataset_length / int(os.environ["WORLD_SIZE"]))
dataset = wids.ShardListDataset(
    wids_map["shardlist"],
    cache_dir=cache_dir,
    keep=True,
).add_transform(preprocess)
loader = torch.utils.data.DataLoader(
    dataset,
    num_workers=num_workers,
    batch_size=batch_size,
    collate_fn=identify_fn,
    pin_memory=True,
    sampler=wids.DistributedChunkedSampler(dataset, chunksize=chunk_size, shuffle=True)
    if "train"
    else None,
)
While everything seems to be working correctly, I am seeing warnings about the cache miss rate, e.g. Warning: ShardListDataset has a cache miss rate of 9901.0%%. I haven't found any documentation on this and was wondering what it signifies for ShardListDataset, given that the data is already cached locally on disk and cache_dir simply points there. I'm not sure how a cache miss could occur, and yet training proceeds through every iteration and epoch without any apparent performance impact.