[BUG] python datasets.py --download-all: the script's filename conflicts with the datasets library #19

@CauchyJenson

🐛 Bug description

Running python datasets.py --download-all in a terminal does not download all of the datasets.

📋 Steps to reproduce

# 4. Download the experiment data (optional; some experiments do not need it)
cd modules/common
python datasets.py --download-all
cd ../..

Error output after running this step:

1️⃣ TinyShakespeare
✅ Loaded TinyShakespeare from cache: G:\DL\minimind-notes\modules\common\data\tinyshakespeare.txt
   Size: 1.06 MB

2️⃣ TinyStories (10MB subset)
❌ Missing the datasets library, please install it: pip install datasets
   ⚠️ Skipping TinyStories: cannot import name 'load_dataset' from 'datasets' (G:\DL\minimind-notes\modules\common\datasets.py)

3️⃣ TinyStories (50MB subset)
❌ Missing the datasets library, please install it: pip install datasets
   ⚠️ Skipping TinyStories 50MB: cannot import name 'load_dataset' from 'datasets' (G:\DL\minimind-notes\modules\common\datasets.py)

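The script's filename conflicts with the datasets library. The import error points at modules\common\datasets.py, so from datasets import load_dataset is resolving to the script itself rather than the installed library, and the last two datasets cannot be downloaded.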
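The conflict described above can be checked with a small helper. This is a minimal sketch, not code from the repository: the function name find_shadowing_file is hypothetical, and it only looks for a same-named .py file or package directory next to the script, which is the situation this report describes.

```python
from pathlib import Path
from typing import Optional, Union


def find_shadowing_file(module_name: str, script_dir: Union[str, Path]) -> Optional[Path]:
    """Return a local file in script_dir that would shadow module_name, or None.

    When a script is run as `python some_script.py`, its directory is placed
    at the front of sys.path, so a local `<module_name>.py` or
    `<module_name>/__init__.py` is found before an installed library.
    """
    base = Path(script_dir)
    candidates = (
        base / f"{module_name}.py",            # single-file module
        base / module_name / "__init__.py",    # package directory
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

For the layout in this report, find_shadowing_file("datasets", "modules/common") would be expected to return the path to datasets.py, which matches the file named in the import error.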

🔧 Environment

  • OS: Windows 11
  • Python version: 3.10.16
  • PyTorch version: 2.5.1+cu121

📝 Relevant code

If relevant, provide the code snippet that causes the problem:

def _get_tinystories(size_mb: float, cache: bool = True) -> List[str]:
    """
    Get a TinyStories subset

    Note: requires the datasets library
        pip install datasets
    """

    cache_file = DATA_DIR / f'tinystories_{size_mb}mb.json'

    # Check the cache
    if cache and cache_file.exists():
        print(f"✅ Loaded TinyStories from cache: {cache_file}")
        with open(cache_file, 'r', encoding='utf-8') as f:
            return json.load(f)

    # Download
    try:
        from datasets import load_dataset

        print(f"📥 Downloading TinyStories (target size: {size_mb} MB)")

        # Load the dataset
        dataset = load_dataset('roneneldan/TinyStories', split='train', streaming=True)

        # Keep loading examples until the target size is reached
        texts = []
        current_size = 0
        target_size = size_mb * 1024 * 1024  # convert to bytes

        for example in dataset:
            text = example['text']
            texts.append(text)
            current_size += len(text.encode('utf-8'))

            if current_size >= target_size:
                break

        # Save to the cache
        if cache:
            with open(cache_file, 'w', encoding='utf-8') as f:
                json.dump(texts, f, ensure_ascii=False)
            print(f"✅ Cached to: {cache_file}")

        print(f"✅ TinyStories loaded: {len(texts):,} stories, {current_size / 1024 / 1024:.2f} MB")
        return texts

    except ImportError:
        print("❌ Missing the datasets library, please install it: pip install datasets")
        raise
    except Exception as e:
        print(f"❌ Loading failed: {e}")
        raise
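Renaming the script is probably the cleaner fix. If the filename has to stay, one possible workaround is to import the library while the script's own directory is temporarily hidden from sys.path. The sketch below illustrates that idea under stated assumptions: import_skipping_dir is a hypothetical helper, not part of the repository or of the datasets library, and it has not been tested against this project.

```python
import importlib
import sys
from pathlib import Path
from typing import Union


def import_skipping_dir(module_name: str, shadow_dir: Union[str, Path]):
    """Import module_name while ignoring sys.path entries that point at shadow_dir."""
    shadow = Path(shadow_dir).resolve()
    original_path = list(sys.path)
    # Forget an already-imported local copy so the import system searches again.
    cached = sys.modules.pop(module_name, None)
    try:
        # An empty sys.path entry means the current working directory.
        sys.path[:] = [p for p in original_path if Path(p or ".").resolve() != shadow]
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    except ImportError:
        # Put the previously cached module back so the caller's state is unchanged.
        if cached is not None:
            sys.modules[module_name] = cached
        raise
    finally:
        # Always restore the original search path.
        sys.path[:] = original_path
```

Inside _get_tinystories, the import line could then be replaced with something like load_dataset = import_skipping_dir("datasets", Path(__file__).parent).load_dataset. With the shadowing file out of the way, a remaining ImportError would more likely mean the library really is not installed, which is the case the existing error message assumes.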
