
Conversation

@Deep-unlearning Deep-unlearning commented Jul 14, 2025

Summary

  • Updated Common Voice dataset from v13 to v17 (latest available version)
  • Updated language count from 108 to 124 languages in Common Voice 17
  • Updated Whisper model reference from whisper-large-v2 to whisper-large-v3
  • NEW: Added comprehensive coverage of modern ASR architectures beyond Whisper
  • NEW: Added Moonshine ASR (edge-optimized, 5x faster for short audio)
  • NEW: Added Kyutai STT (real-time streaming capabilities)

Changes Made

Dataset and Model Updates

  • Dataset Version: Common Voice 13 → Common Voice 17
  • Language Support: 108 → 124 languages
  • Model Reference: whisper-large-v2 → whisper-large-v3
  • URLs Updated: all common_voice_13_0 references → common_voice_17_0 (a usage sketch follows below)
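
For illustration, here is a minimal sketch of the updated dataset/model pair (not taken verbatim from the diff; Common Voice 17 is gated, so you must accept its terms on the Hub and log in first):

```python
# Hypothetical usage of the updated resources: Common Voice 17 + whisper-large-v3.
from datasets import load_dataset
from transformers import pipeline

# Stream the Dhivehi test split so the full dataset is never downloaded.
# common_voice_17_0 is gated: accept the terms on the Hub and run
# `huggingface-cli login` before this works.
common_voice = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "dv",
    split="test",
    streaming=True,
)

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Assumes the audio column decodes to a dict with "array" and "sampling_rate".
audio = next(iter(common_voice))["audio"]
print(asr({"raw": audio["array"], "sampling_rate": audio["sampling_rate"]})["text"])
```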

New ASR Architecture Coverage

  • Moonshine ASR: Edge computing focus, 5x faster processing for short audio
  • Kyutai STT: Real-time streaming with ultra-low latency (0.5-2.5s)
  • Architecture Comparison: Detailed comparison table with performance metrics
  • Code Examples: Working examples for all three model types (a Moonshine sketch follows this list)
  • Model Selection Guide: When to choose each architecture
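
As a flavor of the kind of example added (a minimal sketch, not necessarily the exact snippet in the chapter), Moonshine runs through the standard ASR pipeline in recent transformers releases:

```python
from transformers import pipeline

# Moonshine Base: 61 M parameters, English-only, tuned for edge devices.
moonshine = pipeline(
    "automatic-speech-recognition",
    model="UsefulSensors/moonshine-base",
)

# Moonshine encodes variable-length audio instead of padding everything to
# 30 s as Whisper does — this is where the ~5x speedup on short clips comes from.
print(moonshine("short_clip.wav")["text"])  # "short_clip.wav" is a placeholder path
```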

Files Modified

  • chapters/en/chapter5/asr_models.mdx - Added modern ASR section, comparison table, code examples
  • chapters/en/chapter5/choosing_dataset.mdx - Added model-specific dataset recommendations
  • chapters/en/chapter5/evaluation.mdx - Updated dataset references
  • chapters/en/chapter5/fine-tuning.mdx - Updated training examples
  • chapters/en/_toctree.yml - Minor formatting fix

Key Features Added

Architecture Comparison Table

| Feature       | Whisper             | Moonshine               | Kyutai STT             |
|---------------|---------------------|-------------------------|------------------------|
| Processing    | Fixed 30s chunks    | Variable-length         | Streaming              |
| Best Use Case | General-purpose ASR | Edge/Mobile devices     | Real-time applications |
| Speed         | Baseline            | 5x faster (short audio) | Ultra-low latency      |
| Languages     | 96+ languages       | English only            | English (+French)      |
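
The "fixed 30s chunks" row is visible directly in the transformers API: for long-form audio, the pipeline chunks the input to fit Whisper's receptive field (a minimal sketch; the file path is a placeholder):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # Whisper's fixed 30 s receptive field
    batch_size=8,       # transcribe several chunks in parallel
)

# return_timestamps=True lets the pipeline stitch chunk boundaries together.
result = asr("long_recording.wav", return_timestamps=True)
print(result["text"])
```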

Model Selection Guidelines

  • Whisper: Multilingual support, high accuracy, translation capabilities
  • Moonshine: Edge deployment, memory efficiency, fast processing
  • Kyutai STT: Real-time streaming, low latency, robust audio handling (sketch below)
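
For Kyutai STT, here is a hedged sketch of batch (non-streaming) inference, assuming a transformers release that ships the Kyutai STT classes; the class and checkpoint names follow the model card's transformers-compatible variant and may change:

```python
import numpy as np
from transformers import (
    KyutaiSpeechToTextForConditionalGeneration,
    KyutaiSpeechToTextProcessor,
)

# Transformers-compatible checkpoint name (assumption based on the model card).
model_id = "kyutai/stt-2.6b-en-trfs"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id)

# Kyutai STT expects 24 kHz mono audio; one second of silence stands in for
# a real waveform here.
waveform = np.zeros(24_000, dtype=np.float32)

inputs = processor(waveform)
tokens = model.generate(**inputs)
print(processor.batch_decode(tokens, skip_special_tokens=True)[0])
```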

Test Plan

  • Verified Common Voice 17 dataset is available on Hugging Face Hub
  • Confirmed Dhivehi language is supported in Common Voice 17
  • Checked that all URLs and references are valid
  • Ensured code examples maintain compatibility
  • Verified Moonshine and Kyutai models are available on Hugging Face Hub
  • Tested code examples for syntax and API compatibility

- Update Common Voice dataset from v13 to v17 (latest available)
- Update language count from 108 to 124 languages in Common Voice 17
- Update all dataset URLs and references throughout chapter5 files
- Update Whisper model reference from whisper-large-v2 to whisper-large-v3
- Update training examples and code snippets to use latest dataset version
- Maintain educational content structure while using current resources

Files updated:
- chapters/en/chapter5/choosing_dataset.mdx
- chapters/en/chapter5/evaluation.mdx
- chapters/en/chapter5/fine-tuning.mdx
- chapters/en/chapter5/asr_models.mdx
- chapters/en/_toctree.yml (minor formatting fix)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- Add detailed section on Moonshine ASR: edge-optimized, 5x faster for short audio
- Add detailed section on Kyutai STT: real-time streaming capabilities
- Include architecture comparison table with performance characteristics
- Add code examples for using Moonshine and Kyutai models
- Update model selection table with new ASR alternatives
- Add model-specific dataset recommendations in choosing_dataset.mdx
- Provide guidance on when to choose each model architecture
- Update summary to reflect expanded ASR landscape

This addresses the Whisper-centric nature of Chapter 5 by providing comprehensive
coverage of modern ASR alternatives with different optimization focuses.

@ebezzam ebezzam left a comment


@Deep-unlearning thanks for the updates! In short, I think it could also be good to mention Parakeet and Voxtral

| Size   | Parameters | VRAM / GB | Rel. Speed | English-only | Multilingual |
|--------|------------|-----------|------------|--------------|--------------|
| small  | 244 M      | 2.3       | 6          | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
| medium | 769 M      | 4.2       | 2          | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
| large  | 1550 M     | 7.5       | 1          | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large  | 1550 M     | 7.5       | 1          | x | [✓](https://huggingface.co/openai/whisper-large-v3) |

How about adding Turbo as well? https://huggingface.co/openai/whisper-large-v3-turbo

It should be faster, as it has fewer decoder layers.
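
For reference, Turbo is a drop-in swap in the pipeline call (a minimal sketch; the file path is a placeholder):

```python
from transformers import pipeline

# whisper-large-v3-turbo keeps the large-v3 encoder but prunes the decoder
# to 4 layers (vs 32), so decoding is substantially faster.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
)
print(asr("sample.wav")["text"])
```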

| Model           | Parameters | VRAM / GB | Key Feature               | Languages       | Checkpoint |
|-----------------|------------|-----------|---------------------------|-----------------|------------|
| Moonshine Tiny  | 27 M       | 0.5       | 5x faster for short audio | English         | [✓](https://huggingface.co/UsefulSensors/moonshine-tiny) |
| Moonshine Base  | 61 M       | 1.0       | Edge-optimized            | English         | [✓](https://huggingface.co/UsefulSensors/moonshine-base) |
| Kyutai STT 1B   | 1000 M     | 3.0       | Real-time streaming       | English, French | [✓](https://huggingface.co/kyutai/stt-1b-en_fr) |
| Kyutai STT 2.6B | 2600 M     | 6.0       | Low-latency streaming     | English         | [✓](https://huggingface.co/kyutai/stt-2.6b-en) |

How about adding others supported by transformers:

NOTE: the last two are audio LLMs with built-in audio understanding

We could also mention the ASR leaderboard so people have a convenient comparison.

### When to Choose Each Model:

**Choose Whisper when:**
- You need multilingual support (96+ languages)

Equivalent, but I think most docs I see say 99+?


While Whisper has been a game-changer for speech recognition, the field continues to evolve with new architectures designed to address specific limitations and use cases. Let's explore two notable recent developments: **Moonshine** and **Kyutai STT**, which offer different approaches to improving upon Whisper's capabilities.

### Moonshine: Efficient Edge Computing ASR

How about adding such sections for Parakeet and Voxtral? (They are both quite popular.)

- **Diverse domains** benefit from Whisper's robust pre-training
- **Punctuation and casing** are recommended as Whisper handles them well

### For Moonshine Models

Similar to the above: such sections for Voxtral and Parakeet?
