diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml
index 837b155..967b602 100644
--- a/chapters/en/_toctree.yml
+++ b/chapters/en/_toctree.yml
@@ -85,7 +85,7 @@
     title: Hands-on exercise
   - local: chapter5/supplemental_reading
     title: Supplemental reading and resources
-#
+
 - title: Unit 6. From text to speech
   sections:
   - local: chapter6/introduction
diff --git a/chapters/en/chapter5/asr_models.mdx b/chapters/en/chapter5/asr_models.mdx
index 44e261b..8cc676b 100644
--- a/chapters/en/chapter5/asr_models.mdx
+++ b/chapters/en/chapter5/asr_models.mdx
@@ -162,7 +162,18 @@ Based on this information, you can select a checkpoint that is best suited to yo
 | base | 74 M | 1.5 | 16 | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
 | small | 244 M | 2.3 | 6 | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
 | medium | 769 M | 4.2 | 2 | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
-| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
+| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
+
+### Alternative ASR Models
+
+In addition to Whisper, several other modern ASR models are available with different optimization focuses:
+
+| Model | Parameters | VRAM / GB | Key Feature | Languages | Link |
+|-------|------------|-----------|-------------|-----------|------|
+| Moonshine Tiny | 27 M | 0.5 | 5x faster for short audio | English | [✓](https://huggingface.co/UsefulSensors/moonshine-tiny) |
+| Moonshine Base | 61 M | 1.0 | Edge-optimized | English | [✓](https://huggingface.co/UsefulSensors/moonshine-base) |
+| Kyutai STT 1B | 1000 M | 3.0 | Real-time streaming | English, French | [✓](https://huggingface.co/kyutai/stt-1b-en_fr) |
+| Kyutai STT 2.6B | 2600 M | 6.0 | Low-latency streaming | English | [✓](https://huggingface.co/kyutai/stt-2.6b-en) |
 
 Let's load the [Whisper Base](https://huggingface.co/openai/whisper-base) checkpoint, which is of comparable size to the
 Wav2Vec2 checkpoint we used previously. Preempting our move to multilingual speech recognition, we'll load the multilingual
@@ -380,20 +391,175 @@ pipe(
 And voila! We have our predicted text as well as corresponding timestamps.
 
+## Modern ASR Architectures: Beyond Whisper
+
+While Whisper has been a game-changer for speech recognition, the field continues to evolve with new architectures designed to address specific limitations and use cases. Let's explore two notable recent developments: **Moonshine** and **Kyutai STT**, which offer different approaches to improving upon Whisper's capabilities.
+
+### Moonshine: Efficient Edge Computing ASR
+
+[Moonshine](https://huggingface.co/UsefulSensors/moonshine-base) is a family of speech recognition models developed by Useful Sensors specifically for **edge computing** and **real-time applications**. Released in October 2024, it represents a significant advancement in efficient ASR.
+
+#### Key Architecture Differences from Whisper:
+
+**1. Variable-Length Processing:**
+- **Whisper**: Processes all audio in fixed 30-second chunks
+- **Moonshine**: Processes audio in variable-length segments, making it **5x faster** for shorter audio clips
+
+**2. Model Size and Efficiency:**
+- **Moonshine Tiny**: 27M parameters (~190MB)
+- **Moonshine Base**: 61M parameters (~400MB)
+- **Whisper Small**: 244M parameters (~2.3GB)
+
+**3. Training Data:**
+- **Moonshine**: 200,000 hours of audio data
+- **Whisper**: 680,000 hours of audio data
+
+Let's see Moonshine in action:
+
+```python
+import torch
+from transformers import AutoProcessor, MoonshineForConditionalGeneration
+from datasets import load_dataset
+
+# Load the processor and model
+processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-base")
+model = MoonshineForConditionalGeneration.from_pretrained("UsefulSensors/moonshine-base")
+
+# Load sample audio
+dataset = load_dataset(
+    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
+)
+sample = dataset[0]["audio"]
+
+# Process the audio
+inputs = processor(
+    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
+)
+
+# Generate transcription
+with torch.no_grad():
+    generated_ids = model.generate(**inputs, max_length=256)
+
+# Decode the result
+transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(f"Moonshine: {transcription}")
+```
+
+**Performance Characteristics:**
+- **Speed**: 5x faster than Whisper for short audio clips
+- **Accuracy**: Comparable to Whisper on English ASR tasks
+- **Memory**: Significantly lower memory footprint
+- **Language Support**: English-only (currently)
+
+### Kyutai STT: Streaming ASR with Real-Time Capabilities
+
+[Kyutai STT](https://huggingface.co/kyutai/stt-2.6b-en) represents a different approach to ASR, focusing on **streaming capabilities** and **real-time transcription**. Developed by Kyutai Labs, it's based on the **Delayed Streams Modeling (DSM)** framework.
+
+#### Key Architecture Differences from Whisper:
+
+**1. Streaming Architecture:**
+- **Whisper**: Offline processing, requires complete audio
+- **Kyutai STT**: Streaming processing, transcribes audio as it arrives
+
+**2. Audio Tokenization:**
+- **Whisper**: Log-mel spectrograms
+- **Kyutai STT**: Audio tokenized using **Mimi codec** at 12.5 Hz
+
+**3. Model Variants:**
+- **kyutai/stt-1b-en_fr**: 1B parameters, English/French, 0.5s delay
+- **kyutai/stt-2.6b-en**: 2.6B parameters, English-only, 2.5s delay
+
+**4. Training Scale:**
+- **Kyutai STT**: 2.5 million hours of public audio
+- **Whisper**: 680,000 hours of labeled audio
+
+Let's try Kyutai STT (requires transformers >= 4.53.0):
+
+```python
+import torch
+from transformers import AutoProcessor, KyutaiSpeechToTextForConditionalGeneration
+from datasets import load_dataset
+
+# Load the processor and model
+# (the transformers-compatible weights are published under the "-trfs" suffix)
+processor = AutoProcessor.from_pretrained("kyutai/stt-2.6b-en-trfs")
+model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained("kyutai/stt-2.6b-en-trfs")
+
+# Load sample audio
+dataset = load_dataset(
+    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
+)
+sample = dataset[0]["audio"]
+
+# Process the audio
+inputs = processor(
+    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
+)
+
+# Generate transcription
+with torch.no_grad():
+    generated_ids = model.generate(**inputs, max_length=256)
+
+# Decode the result
+transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(f"Kyutai STT: {transcription}")
+```
+
+**Performance Characteristics:**
+- **Latency**: Ultra-low latency (0.5-2.5s depending on model)
+- **Robustness**: Handles noisy conditions well
+- **Audio Length**: Can process up to 2 hours of audio
+- **Punctuation**: Includes capitalization and punctuation
+
+### Architecture Comparison Summary
+
+| Feature | Whisper | Moonshine | Kyutai STT |
+|---------|---------|-----------|------------|
+| **Processing** | Fixed 30s chunks | Variable-length | Streaming |
+| **Best Use Case** | General-purpose ASR | Edge/Mobile devices | Real-time applications |
+| **Model Size** | 39M - 1.5B params | 27M - 61M params | 1B - 2.6B params |
+| **Speed** | Baseline | 5x faster (short audio) | Ultra-low latency |
+| **Languages** | 96+ languages | English only | English (+French) |
+| **Punctuation** | Yes | Yes | Yes |
+| **Memory Usage** | High | Low | Medium |
+| **Training Data** | 680k hours | 200k hours | 2.5M hours |
+
+### When to Choose Each Model:
+
+**Choose Whisper when:**
+- You need multilingual support (96+ languages)
+- Accuracy is more important than speed
+- You're working with diverse audio domains
+- You need translation capabilities
+
+**Choose Moonshine when:**
+- You're deploying on edge devices or mobile
+- You need fast processing for short audio clips
+- Memory efficiency is crucial
+- You're working with English-only content
+
+**Choose Kyutai STT when:**
+- You need real-time transcription
+- Low latency is critical
+- You're building streaming applications
+- You need robust handling of long audio files
+
+The choice between these models depends on your specific use case, computational constraints, and performance requirements. Each represents a different optimization point in the trade-off between accuracy, speed, memory usage, and feature set.
+
 ## Summary
 
-Whisper is a strong pre-trained model for speech recognition and translation. Compared to Wav2Vec2, it has higher
-transcription accuracy, with outputs that contain punctuation and casing. It can be used to transcribe speech in English
-as well as 96 other languages, both on short audio segments and longer ones through _chunking_. These attributes make it
-a viable model for many speech recognition and translation tasks without the need for fine-tuning. The `pipeline()` method
-provides an easy way of running inference in one-line API calls with control over the generated predictions.
-
-While the Whisper model performs extremely well on many high-resource languages, it has lower transcription and translation
-accuracy on low-resource languages, i.e. those with less readily available training data. There is also varying performance
-across different accents and dialects of certain languages, including lower accuracy for speakers of different genders,
-races, ages or other demographic criteria (_c.f._ [Whisper paper](https://arxiv.org/pdf/2212.04356.pdf)).
-
-To boost the performance on low-resource languages, accents or dialects, we can take the pre-trained Whisper model and
-train it on a small corpus of appropriately selected data, in a process called _fine-tuning_. We'll show that with
-as little as ten hours of additional data, we can improve the performance of the Whisper model by over 100% on a low-resource
-language. In the next section, we'll cover the process behind selecting a dataset for fine-tuning.
+The landscape of automatic speech recognition has expanded significantly beyond the groundbreaking Whisper model. While Whisper remains a strong pre-trained model for speech recognition and translation with support for 96+ languages, we now have specialized alternatives that excel in specific use cases.
+
+**Whisper** excels at general-purpose ASR with multilingual support, high accuracy, and translation capabilities. However, it requires complete audio input and has higher computational requirements.
+
+**Moonshine** represents the next generation of efficient ASR, optimized for edge computing and real-time applications. With 5x faster processing for short audio clips and a significantly lower memory footprint, it's ideal for mobile and embedded applications, though it's currently limited to English.
+
+**Kyutai STT** pushes the boundaries of real-time ASR with streaming capabilities and ultra-low latency. Its ability to transcribe audio as it arrives makes it well suited to live applications, though it's currently limited to English and French.
+
+Each model represents different optimization trade-offs:
+- **Whisper**: Accuracy and multilingual support
+- **Moonshine**: Efficiency and edge deployment
+- **Kyutai STT**: Real-time processing and streaming
+
+The choice depends on your specific requirements: language support, computational constraints, latency requirements, and deployment environment. All three models support punctuation and casing, and are available through the 🤗 Transformers library with `pipeline()` support for easy inference (see the short example below).
+
+For applications requiring fine-tuning, the same principles apply across all models. In the next section, we'll explore dataset selection strategies that can be adapted for any of these ASR architectures.
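+
+As a quick illustration of the `pipeline()` support mentioned above, here is a minimal sketch using Moonshine. It mirrors the Whisper `pipeline()` usage from earlier in this section; we're assuming the checkpoint below is accepted by the `automatic-speech-recognition` pipeline in your installed version of 🤗 Transformers, and other checkpoints from this section can be substituted where the pipeline supports them:
+
+```python
+from transformers import pipeline
+from datasets import load_dataset
+
+# Build an ASR pipeline around Moonshine Base (English-only)
+asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-base")
+
+# Reuse the same dummy LibriSpeech sample as in the examples above
+dataset = load_dataset(
+    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
+)
+sample = dataset[0]["audio"]
+
+# Pass a copy, since the pipeline pops keys from the audio dictionary
+print(asr(sample.copy())["text"])
+```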
diff --git a/chapters/en/chapter5/choosing_dataset.mdx b/chapters/en/chapter5/choosing_dataset.mdx
index 36e4927..7cb574b 100644
--- a/chapters/en/chapter5/choosing_dataset.mdx
+++ b/chapters/en/chapter5/choosing_dataset.mdx
@@ -56,7 +56,7 @@ Here is a summary of the most popular English speech recognition datasets on the
 | Dataset | Train Hours | Domain | Speaking Style | Casing | Punctuation | License | Recommended Use |
 |-----------------------------------------------------------------------------------------|-------------|-----------------------------|-----------------------|--------|-------------|-----------------|----------------------------------|
 | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | 960 | Audiobook | Narrated | ❌ | ❌ | CC-BY-4.0 | Academic benchmarks |
-| [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) | 3000 | Wikipedia | Narrated | ✅ | ✅ | CC0-1.0 | Non-native speakers |
+| [Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 3000 | Wikipedia | Narrated | ✅ | ✅ | CC0-1.0 | Non-native speakers |
 | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 540 | European Parliament | Oratory | ❌ | ✅ | CC0 | Non-native speakers |
 | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | 450 | TED talks | Oratory | ❌ | ❌ | CC-BY-NC-ND 3.0 | Technical topics |
 | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | 10000 | Audiobook, podcast, YouTube | Narrated, spontaneous | ❌ | ✅ | apache-2.0 | Robustness over multiple domains |
@@ -71,7 +71,7 @@ for each dataset, and replace it with the number of languages per dataset:
 | Dataset | Languages | Domain | Speaking Style | Casing | Punctuation | License | Recommended Usage |
 |-----------------------------------------------------------------------------------------------|-----------|---------------------------------------|----------------|--------|-------------|-----------|-------------------------|
 | [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) | 6 | Audiobooks | Narrated | ❌ | ❌ | CC-BY-4.0 | Academic benchmarks |
-| [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | 108 | Wikipedia text & crowd-sourced speech | Narrated | ✅ | ✅ | CC0-1.0 | Diverse speaker set |
+| [Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 124 | Wikipedia text & crowd-sourced speech | Narrated | ✅ | ✅ | CC0-1.0 | Diverse speaker set |
 | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 15 | European Parliament recordings | Spontaneous | ❌ | ✅ | CC0 | European languages |
 | [FLEURS](https://huggingface.co/datasets/google/fleurs) | 101 | European Parliament recordings | Spontaneous | ❌ | ❌ | CC-BY-4.0 | Multilingual evaluation |
@@ -85,26 +85,26 @@ efforts - the audio community is inclusive and wide-ranging, and others will app
 
 Alright! Now that we've gone through all the criterion for selecting an ASR dataset, let's pick one for the purpose of
 this tutorial. We know that Whisper already does a pretty good job at transcribing data in high-resource languages (such
 as English and Spanish), so we'll focus ourselves on low-resource multilingual transcription. We want to retain Whisper's
 ability to predict punctuation and casing,
-so it seems from the second table that Common Voice 13 is a great candidate dataset!
+so it seems from the second table that Common Voice 17 is a great candidate dataset!
 
-## Common Voice 13
+## Common Voice 17
 
-Common Voice 13 is a crowd-sourced dataset where speakers record text from Wikipedia in various languages. It forms part of
+Common Voice 17 is a crowd-sourced dataset where speakers record text from Wikipedia in various languages. It forms part of
 the Common Voice series, a collection of Common Voice datasets released by Mozilla Foundation. At the time of writing,
-Common Voice 13 is the latest edition of the dataset, with the most languages and hours per language out of any release to date.
+Common Voice 17 is the latest edition of the dataset, with the most languages and hours per language out of any release to date.
 
-We can get the full list of languages for the Common Voice 13 dataset by checking-out the dataset page on the Hub:
-[mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).
+We can get the full list of languages for the Common Voice 17 dataset by checking out the dataset page on the Hub:
+[mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0).
 The first time you view this page, you'll be asked to accept the terms of use. After that, you'll be given full access to the dataset.
 
 Once we've provided authentication to use the dataset, we'll be presented with the dataset preview. The dataset preview
 shows us the first 100 samples of the dataset for each language. What's more, it's loaded up with audio samples ready for
 us to listen to in real time. For this Unit, we'll select [_Dhivehi_](https://en.wikipedia.org/wiki/Maldivian_language)
 (or _Maldivian_), an Indo-Aryan language spoken in the South Asian island country of the Maldives. While we're selecting
-Dhivehi for this tutorial, the steps covered here apply to any one of the 108 languages in the Common Voice 13 dataset, and
+Dhivehi for this tutorial, the steps covered here apply to any one of the 124 languages in the Common Voice 17 dataset, and
 more generally to any one of the 180+ audio datasets on the Hugging Face Hub, so there's no restriction on language or dialect.
 
-We can select the Dhivehi subset of Common Voice 13 by setting the subset to `dv` using the dropdown menu (`dv` being the language
+We can select the Dhivehi subset of Common Voice 17 by setting the subset to `dv` using the dropdown menu (`dv` being the language
 identifier code for Dhivehi):
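+
+For reference, once you've accepted the terms of use and logged in with `huggingface-cli login`, the same Dhivehi subset can also be loaded programmatically. This is only a minimal sketch; depending on your version of 🤗 Datasets, you may additionally need to pass `trust_remote_code=True`:
+
+```python
+from datasets import load_dataset
+
+# Load the Dhivehi ("dv") subset of Common Voice 17
+common_voice_dv = load_dataset(
+    "mozilla-foundation/common_voice_17_0", "dv", split="train"
+)
+print(common_voice_dv)
+```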