Update Chapter 5 ASR content with latest datasets and models #217

@@ -162,7 +162,18 @@ Based on this information, you can select a checkpoint that is best suited to yo

| base | 74 M | 1.5 | 16 | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
| small | 244 M | 2.3 | 6 | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | 4.2 | 2 | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
- | large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
+ | large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v3) |

### Alternative ASR Models

In addition to Whisper, several other modern ASR models are available with different optimization focuses:

| Model | Parameters | VRAM / GB | Key Feature | Languages | Link |
|-------|------------|-----------|-------------|-----------|------|
| Moonshine Tiny | 27 M | 0.5 | 5x faster for short audio | English | [✓](https://huggingface.co/UsefulSensors/moonshine-tiny) |
| Moonshine Base | 61 M | 1.0 | Edge-optimized | English | [✓](https://huggingface.co/UsefulSensors/moonshine-base) |
| Kyutai STT 1B | 1000 M | 3.0 | Real-time streaming | English, French | [✓](https://huggingface.co/kyutai/stt-1b-en_fr) |
| Kyutai STT 2.6B | 2600 M | 6.0 | Low-latency streaming | English | [✓](https://huggingface.co/kyutai/stt-2.6b-en) |
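
If you just want to try one of these checkpoints, it loads through the same `pipeline()` API we used for Whisper. The sketch below is a minimal example, assuming the Moonshine checkpoints are supported by the `automatic-speech-recognition` pipeline in your installed version of 🤗 Transformers; the path `example.wav` is only a placeholder for a local recording:

```python
from transformers import pipeline

# Load the smallest checkpoint from the table above through the familiar pipeline API
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")

# Sanity-check the parameter count against the table (roughly 27M)
num_params = sum(p.numel() for p in asr.model.parameters())
print(f"Parameters: {num_params / 1e6:.0f}M")

# Transcribe a local file (placeholder path, replace with your own audio)
print(asr("example.wav")["text"])
```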

> **Review comment:** How about adding others supported by transformers:
> NOTE: the last two are audio LLM models that have built-in audio understanding; we could also mention the ASR leaderboard so people have a convenient comparison.

Let's load the [Whisper Base](https://huggingface.co/openai/whisper-base) checkpoint, which is of comparable size to the
Wav2Vec2 checkpoint we used previously. Preempting our move to multilingual speech recognition, we'll load the multilingual

@@ -380,20 +391,175 @@ pipe(

And voila! We have our predicted text as well as corresponding timestamps.

## Modern ASR Architectures: Beyond Whisper

While Whisper has been a game-changer for speech recognition, the field continues to evolve with new architectures designed to address specific limitations and use cases. Let's explore two notable recent developments: **Moonshine** and **Kyutai STT**, which offer different approaches to improving upon Whisper's capabilities.

### Moonshine: Efficient Edge Computing ASR

> **Review comment:** How about adding such sections for Parakeet and Voxtral? (They are both quite popular.)

[Moonshine](https://huggingface.co/UsefulSensors/moonshine-base) is a family of speech recognition models developed by Useful Sensors specifically for **edge computing** and **real-time applications**. Released in October 2024, it represents a significant advancement in efficient ASR.

#### Key Architecture Differences from Whisper

**1. Variable-Length Processing:**
- **Whisper**: Processes all audio in fixed 30-second chunks
- **Moonshine**: Processes audio in variable-length segments, making it **5x faster** for shorter audio clips (see the sketch below)

**2. Model Size and Efficiency:**
- **Moonshine Tiny**: 27M parameters (~190MB)
- **Moonshine Base**: 61M parameters (~400MB)
- **Whisper Small**: 244M parameters (~2.3GB)

**3. Training Data:**
- **Moonshine**: 200,000 hours of audio data
- **Whisper**: 680,000 hours of audio data
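
To make the variable-length point concrete, the following sketch compares what the two processors hand to their models for the same short clip. It assumes Whisper's feature extractor pads the log-mel spectrogram to a full 30 seconds (3000 frames) and that the Moonshine processor keeps the waveform at its natural length under the `input_values` key; exact tensor names may differ slightly between library versions:

```python
import numpy as np
from transformers import AutoProcessor

# A 5-second, 16 kHz clip (zeros are enough to inspect the shapes)
audio = np.zeros(5 * 16000, dtype=np.float32)

whisper_processor = AutoProcessor.from_pretrained("openai/whisper-base")
moonshine_processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-base")

# Whisper pads every input to 30 seconds of log-mel frames
whisper_inputs = whisper_processor(audio, sampling_rate=16000, return_tensors="pt")
print(whisper_inputs.input_features.shape)  # (1, 80, 3000) regardless of clip length

# Moonshine keeps the input at its natural length, so shorter clips mean less compute
moonshine_inputs = moonshine_processor(audio, sampling_rate=16000, return_tensors="pt")
print(moonshine_inputs.input_values.shape)  # roughly (1, 80000) for a 5-second clip
```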

Let's see Moonshine in action:

```python
import torch
from transformers import AutoProcessor, MoonshineForConditionalGeneration
from datasets import load_dataset

# Load the processor and model
processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-base")
model = MoonshineForConditionalGeneration.from_pretrained("UsefulSensors/moonshine-base")

# Load sample audio
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]["audio"]

# Process the audio
inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
)

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_length=256)

# Decode the result
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Moonshine: {transcription}")
```

**Performance Characteristics:**
- **Speed**: 5x faster than Whisper for short audio clips
- **Accuracy**: Comparable to Whisper on English ASR tasks
- **Memory**: Significantly lower memory footprint
- **Language Support**: English-only (currently)

### Kyutai STT: Streaming ASR with Real-Time Capabilities

[Kyutai STT](https://huggingface.co/kyutai/stt-2.6b-en) represents a different approach to ASR, focusing on **streaming capabilities** and **real-time transcription**. Developed by Kyutai Labs, it's based on the **Delayed Streams Modeling (DSM)** framework.

#### Key Architecture Differences from Whisper

**1. Streaming Architecture:**
- **Whisper**: Offline processing, requires the complete audio upfront
- **Kyutai STT**: Streaming processing, transcribes audio as it arrives

**2. Audio Tokenization:**
- **Whisper**: Log-mel spectrograms
- **Kyutai STT**: Audio tokenized using the **Mimi codec** at 12.5 Hz (one audio token every 80 ms)

**3. Model Variants:**
- **kyutai/stt-1b-en_fr**: 1B parameters, English/French, 0.5s delay
- **kyutai/stt-2.6b-en**: 2.6B parameters, English-only, 2.5s delay

**4. Training Scale:**
- **Kyutai STT**: 2.5 million hours of public audio
- **Whisper**: 680,000 hours of labeled audio

Let's try Kyutai STT (requires transformers >= 4.53.0):

```python
import torch
from transformers import AutoProcessor, KyutaiSpeechToTextForConditionalGeneration
from datasets import load_dataset

# Load the processor and model
# (the model class is exposed as KyutaiSpeechToTextForConditionalGeneration
#  in transformers >= 4.53)
processor = AutoProcessor.from_pretrained("kyutai/stt-2.6b-en")
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained("kyutai/stt-2.6b-en")

# Load sample audio
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]["audio"]

# Process the audio
inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
)

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_length=256)

# Decode the result
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Kyutai STT: {transcription}")
```

**Performance Characteristics:**
- **Latency**: Ultra-low latency (0.5-2.5s depending on the model)
- **Robustness**: Handles noisy conditions well
- **Audio Length**: Can process up to 2 hours of audio
- **Punctuation**: Includes capitalization and punctuation

### Architecture Comparison Summary

| Feature | Whisper | Moonshine | Kyutai STT |
|---------|---------|-----------|------------|
| **Processing** | Fixed 30s chunks | Variable-length | Streaming |
| **Best Use Case** | General-purpose ASR | Edge/mobile devices | Real-time applications |
| **Model Size** | 39M - 1.5B params | 27M - 61M params | 1B - 2.6B params |
| **Speed** | Baseline | 5x faster (short audio) | Ultra-low latency |
| **Languages** | 96+ languages | English only | English (+French) |
| **Punctuation** | Yes | Yes | Yes |
| **Memory Usage** | High | Low | Medium |
| **Training Data** | 680k hours | 200k hours | 2.5M hours |
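
The memory rows above can be sanity-checked with a back-of-the-envelope calculation: storing the weights alone in half precision costs roughly two bytes per parameter. This is only a rough estimate (and not the methodology behind the tables' figures), since runtime memory also includes activations, the decoding cache, and framework overhead:

```python
# Rough weight-only memory estimate: parameters x 2 bytes (fp16/bf16), converted to GB
checkpoints = {
    "openai/whisper-large-v3": 1_550_000_000,
    "UsefulSensors/moonshine-base": 61_000_000,
    "kyutai/stt-2.6b-en": 2_600_000_000,
}

for name, params in checkpoints.items():
    weight_gb = params * 2 / 1024**3
    print(f"{name}: ~{weight_gb:.1f} GB of weights in half precision")

# whisper-large-v3: ~2.9 GB, moonshine-base: ~0.1 GB, stt-2.6b-en: ~4.8 GB
```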

### When to Choose Each Model

**Choose Whisper when:**
- You need multilingual support (96+ languages)

> **Review comment:** Equivalent, but I think most docs I see say 99+?

- Accuracy is more important than speed
- You're working with diverse audio domains
- You need translation capabilities

**Choose Moonshine when:**
- You're deploying on edge devices or mobile
- You need fast processing for short audio clips
- Memory efficiency is crucial
- You're working with English-only content

**Choose Kyutai STT when:**
- You need real-time transcription
- Low latency is critical
- You're building streaming applications
- You need robust handling of long audio files

The choice between these models depends on your specific use case, computational constraints, and performance requirements. Each represents a different optimization point in the trade-off between accuracy, speed, memory usage, and feature set.

## Summary

Whisper is a strong pre-trained model for speech recognition and translation. Compared to Wav2Vec2, it has higher
transcription accuracy, with outputs that contain punctuation and casing. It can be used to transcribe speech in English
as well as 96 other languages, both on short audio segments and longer ones through _chunking_. These attributes make it
a viable model for many speech recognition and translation tasks without the need for fine-tuning. The `pipeline()` method
provides an easy way of running inference in one-line API calls with control over the generated predictions.

While the Whisper model performs extremely well on many high-resource languages, it has lower transcription and translation
accuracy on low-resource languages, i.e. those with less readily available training data. There is also varying performance
across different accents and dialects of certain languages, including lower accuracy for speakers of different genders,
races, ages or other demographic criteria (_c.f._ [Whisper paper](https://arxiv.org/pdf/2212.04356.pdf)).

To boost the performance on low-resource languages, accents or dialects, we can take the pre-trained Whisper model and
train it on a small corpus of appropriately selected data, in a process called _fine-tuning_. We'll show that with
as little as ten hours of additional data, we can improve the performance of the Whisper model by over 100% on a low-resource
language. In the next section, we'll cover the process behind selecting a dataset for fine-tuning.

The landscape of automatic speech recognition has expanded significantly beyond the groundbreaking Whisper model. While Whisper remains a strong pre-trained model for speech recognition and translation with support for 96+ languages, we now have specialized alternatives that excel in specific use cases.

**Whisper** excels at general-purpose ASR with multilingual support, high accuracy, and translation capabilities. However, it requires complete audio input and has higher computational requirements.

**Moonshine** represents the next generation of efficient ASR, optimized for edge computing and real-time applications. With 5x faster processing for short audio clips and significantly lower memory usage, it's ideal for mobile and embedded applications, though currently limited to English.

**Kyutai STT** pushes the boundaries of real-time ASR with streaming capabilities and ultra-low latency. Its ability to transcribe audio as it arrives makes it perfect for live applications, though it's currently limited to English and French.

Each model represents different optimization trade-offs:
- **Whisper**: Accuracy and multilingual support
- **Moonshine**: Efficiency and edge deployment
- **Kyutai STT**: Real-time processing and streaming

The choice depends on your specific requirements: language support, computational constraints, latency requirements, and deployment environment. All three models support punctuation and casing, and are available through the 🤗 Transformers library with `pipeline()` support for easy inference.
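
As a closing sketch of that last point, the same one-line `pipeline()` call can be reused across checkpoints, so switching models is mostly a matter of changing the identifier. This assumes each checkpoint is wired into the `automatic-speech-recognition` pipeline in your installed version of 🤗 Transformers (the Kyutai models need at least 4.53.0):

```python
from datasets import load_dataset
from transformers import pipeline

# Reuse the LibriSpeech dummy sample from earlier in the chapter
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]["audio"]

# Swap checkpoints without changing any other code
for checkpoint in [
    "openai/whisper-base",
    "UsefulSensors/moonshine-base",
    "kyutai/stt-2.6b-en",
]:
    asr = pipeline("automatic-speech-recognition", model=checkpoint)
    result = asr({"raw": sample["array"], "sampling_rate": sample["sampling_rate"]})
    print(checkpoint, "->", result["text"])
```

Passing the sampling rate explicitly lets the pipeline resample the clip for models that expect a different input rate.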

For applications requiring fine-tuning, the same principles apply across all models. In the next section, we'll explore dataset selection strategies that can be adapted for any of these ASR architectures.

> **Review comment:** How about adding Turbo as well? https://huggingface.co/openai/whisper-large-v3-turbo should be faster as it has fewer decoder layers.