Update Chapter 5 ASR content with latest datasets and models #217
base: main
Conversation
- Update Common Voice dataset from v13 to v17 (latest available)
- Update language count from 108 to 124 languages in Common Voice 17
- Update all dataset URLs and references throughout chapter5 files
- Update Whisper model reference from whisper-large-v2 to whisper-large-v3
- Update training examples and code snippets to use latest dataset version
- Maintain educational content structure while using current resources

Files updated:
- chapters/en/chapter5/choosing_dataset.mdx
- chapters/en/chapter5/evaluation.mdx
- chapters/en/chapter5/fine-tuning.mdx
- chapters/en/chapter5/asr_models.mdx
- chapters/en/_toctree.yml (minor formatting fix)
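The v13 → v17 rename described above is a one-line change at call sites. A minimal sketch, assuming the `datasets` library and the gated `mozilla-foundation/common_voice_17_0` repo; the `load_common_voice` helper and the Dhivehi (`"dv"`) usage example are ours, not from the PR:

```python
# Hypothetical helper around datasets.load_dataset for the renamed dataset.
COMMON_VOICE_ID = "mozilla-foundation/common_voice_17_0"  # previously common_voice_13_0


def load_common_voice(language: str, split: str = "train+validation"):
    """Load one language of Common Voice 17 (gated: run `huggingface-cli login` first)."""
    from datasets import load_dataset  # deferred import: heavy optional dependency

    return load_dataset(COMMON_VOICE_ID, language, split=split)


# Usage (downloads audio on first call):
# dhivehi = load_common_voice("dv")
```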
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
- Add detailed section on Moonshine ASR: edge-optimized, 5x faster for short audio
- Add detailed section on Kyutai STT: real-time streaming capabilities
- Include architecture comparison table with performance characteristics
- Add code examples for using Moonshine and Kyutai models
- Update model selection table with new ASR alternatives
- Add model-specific dataset recommendations in choosing_dataset.mdx
- Provide guidance on when to choose each model architecture
- Update summary to reflect expanded ASR landscape

This addresses the Whisper-centric nature of Chapter 5 by providing comprehensive coverage of modern ASR alternatives with different optimization focuses.
ebezzam
left a comment
@Deep-unlearning thanks for the updates! In short, I think it could also be good to mention Parakeet and Voxtral
| small | 244 M | 2.3 | 6 | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | 4.2 | 2 | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
How about adding Turbo as well? https://huggingface.co/openai/whisper-large-v3-turbo
It should be faster, as it has fewer decoder layers.
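The Turbo checkpoint suggested here is a drop-in replacement at the API level: it keeps large-v3's encoder but prunes the decoder from 32 layers to 4, per its model card. A minimal sketch with the transformers ASR pipeline; the `build_whisper_pipeline` helper name is ours, not from the chapter:

```python
def build_whisper_pipeline(model_id: str = "openai/whisper-large-v3-turbo"):
    """Build an automatic-speech-recognition pipeline for a Whisper checkpoint."""
    from transformers import pipeline  # deferred import: heavy optional dependency

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # Whisper models operate on 30-second windows
    )


# Usage (downloads the checkpoint on first run):
# asr = build_whisper_pipeline()
# print(asr("speech.wav")["text"])
```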
| Moonshine Tiny | 27 M | 0.5 | 5x faster for short audio | English | [✓](https://huggingface.co/UsefulSensors/moonshine-tiny) |
| Moonshine Base | 61 M | 1.0 | Edge-optimized | English | [✓](https://huggingface.co/UsefulSensors/moonshine-base) |
| Kyutai STT 1B | 1000 M | 3.0 | Real-time streaming | English, French | [✓](https://huggingface.co/kyutai/stt-1b-en_fr) |
| Kyutai STT 2.6B | 2600 M | 6.0 | Low-latency streaming | English | [✓](https://huggingface.co/kyutai/stt-2.6b-en) |
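The Moonshine rows above can be exercised through the same generic ASR pipeline, assuming a transformers version recent enough to include Moonshine support; the `build_moonshine_pipeline` helper name is ours, not from the PR:

```python
def build_moonshine_pipeline(model_id: str = "UsefulSensors/moonshine-tiny"):
    """Build an ASR pipeline for a Moonshine checkpoint."""
    from transformers import pipeline  # deferred import: heavy optional dependency

    # Unlike Whisper, Moonshine does not pad inputs to fixed 30 s windows,
    # which is where its speedup on short clips comes from.
    return pipeline("automatic-speech-recognition", model=model_id)


# Usage:
# asr = build_moonshine_pipeline()
# print(asr("short_clip.wav")["text"])
```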
How about adding others supported by transformers:
- Parakeet (judging by some LinkedIn reactions, it seems this is quite popular!): https://huggingface.co/nvidia/parakeet-ctc-1.1b
- Voxtral: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
- Audio Flamingo: https://huggingface.co/nvidia/audio-flamingo-3-hf

NOTE: the last two are audio LLMs with built-in audio understanding.

We could also mention the ASR leaderboard so people have a convenient comparison.
### When to Choose Each Model:

**Choose Whisper when:**
- You need multilingual support (96+ languages)
equivalent but I think most docs I see say 99+?
While Whisper has been a game-changer for speech recognition, the field continues to evolve with new architectures designed to address specific limitations and use cases. Let's explore two notable recent developments: **Moonshine** and **Kyutai STT**, which offer different approaches to improving upon Whisper's capabilities.

### Moonshine: Efficient Edge Computing ASR
how about adding such sections for Parakeet and Voxtral? (as they are both quite popular)
- **Diverse domains** benefit from Whisper's robust pre-training
- **Punctuation and casing** are recommended as Whisper handles them well

### For Moonshine Models
Similar to the above: such sections for Voxtral and Parakeet?
Summary
Changes Made
Dataset and Model Updates
- common_voice_13_0 → common_voice_17_0

New ASR Architecture Coverage

Files Modified
- chapters/en/chapter5/asr_models.mdx - Added modern ASR section, comparison table, code examples
- chapters/en/chapter5/choosing_dataset.mdx - Added model-specific dataset recommendations
- chapters/en/chapter5/evaluation.mdx - Updated dataset references
- chapters/en/chapter5/fine-tuning.mdx - Updated training examples
- chapters/en/_toctree.yml - Minor formatting fix

Key Features Added
Architecture Comparison Table
Model Selection Guidelines
Test Plan