diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml index 837b155..967b602 100644 --- a/chapters/en/_toctree.yml +++ b/chapters/en/_toctree.yml @@ -85,7 +85,7 @@ title: Hands-on exercise - local: chapter5/supplemental_reading title: Supplemental reading and resources -# + - title: Unit 6. From text to speech sections: - local: chapter6/introduction diff --git a/chapters/en/chapter5/asr_models.mdx b/chapters/en/chapter5/asr_models.mdx index 44e261b..8cc676b 100644 --- a/chapters/en/chapter5/asr_models.mdx +++ b/chapters/en/chapter5/asr_models.mdx @@ -162,7 +162,18 @@ Based on this information, you can select a checkpoint that is best suited to yo | base | 74 M | 1.5 | 16 | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) | | small | 244 M | 2.3 | 6 | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) | | medium | 769 M | 4.2 | 2 | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) | -| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v2) | +| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v3) | + +### Alternative ASR Models + +In addition to Whisper, several other modern ASR models are available with different optimization focuses: + +| Model | Parameters | VRAM / GB | Key Feature | Languages | Link | +|-------|------------|-----------|-------------|-----------|------| +| Moonshine Tiny | 27 M | 0.5 | 5x faster for short audio | English | [✓](https://huggingface.co/UsefulSensors/moonshine-tiny) | +| Moonshine Base | 61 M | 1.0 | Edge-optimized | English | [✓](https://huggingface.co/UsefulSensors/moonshine-base) | +| Kyutai STT 1B | 1000 M | 3.0 | Real-time streaming | English, French | [✓](https://huggingface.co/kyutai/stt-1b-en_fr) | +| Kyutai STT 2.6B | 2600 M | 6.0 | Low-latency streaming | English | [✓](https://huggingface.co/kyutai/stt-2.6b-en) | Let's load the [Whisper Base](https://huggingface.co/openai/whisper-base) checkpoint, which is of comparable size to the Wav2Vec2 checkpoint we used previously. Preempting our move to multilingual speech recognition, we'll load the multilingual @@ -380,20 +391,175 @@ pipe( And voila! We have our predicted text as well as corresponding timestamps. +## Modern ASR Architectures: Beyond Whisper + +While Whisper has been a game-changer for speech recognition, the field continues to evolve with new architectures designed to address specific limitations and use cases. Let's explore two notable recent developments: **Moonshine** and **Kyutai STT**, which offer different approaches to improving upon Whisper's capabilities. + +### Moonshine: Efficient Edge Computing ASR + +[Moonshine](https://huggingface.co/UsefulSensors/moonshine-base) is a family of speech recognition models developed by Useful Sensors specifically for **edge computing** and **real-time applications**. Released in October 2024, it represents a significant advancement in efficient ASR. + +#### Key Architecture Differences from Whisper: + +**1. Variable-Length Processing:** +- **Whisper**: Processes all audio in fixed 30-second chunks +- **Moonshine**: Processes audio in variable-length segments, making it **5x faster** for shorter audio clips + +**2. Model Size and Efficiency:** +- **Moonshine Tiny**: 27M parameters (~190MB) +- **Moonshine Base**: 61M parameters (~400MB) +- **Whisper Small**: 244M parameters (~2.3GB) + +**3. 
Training Data:** +- **Moonshine**: 200,000 hours of audio data +- **Whisper**: 680,000 hours of audio data + +Let's see Moonshine in action: + +```python +import torch +from transformers import AutoProcessor, MoonshineForConditionalGeneration +from datasets import load_dataset + +# Load the processor and model +processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-base") +model = MoonshineForConditionalGeneration.from_pretrained("UsefulSensors/moonshine-base") + +# Load sample audio +dataset = load_dataset( + "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation" +) +sample = dataset[0]["audio"] + +# Process the audio +inputs = processor( + sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt" +) + +# Generate transcription +with torch.no_grad(): + generated_ids = model.generate(**inputs, max_length=256) + +# Decode the result +transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] +print(f"Moonshine: {transcription}") +``` + +**Performance Characteristics:** +- **Speed**: 5x faster than Whisper for short audio clips +- **Accuracy**: Comparable to Whisper on English ASR tasks +- **Memory**: Significantly lower memory footprint +- **Language Support**: English-only (currently) + +### Kyutai STT: Streaming ASR with Real-Time Capabilities + +[Kyutai STT](https://huggingface.co/kyutai/stt-2.6b-en) represents a different approach to ASR, focusing on **streaming capabilities** and **real-time transcription**. Developed by Kyutai Labs, it's based on the **Delayed Streams Modeling (DSM)** framework. + +#### Key Architecture Differences from Whisper: + +**1. Streaming Architecture:** +- **Whisper**: Offline processing, requires complete audio +- **Kyutai STT**: Streaming processing, transcribes audio as it arrives + +**2. Audio Tokenization:** +- **Whisper**: Log-mel spectrograms +- **Kyutai STT**: Audio tokenized using **Mimi codec** at 12.5 Hz + +**3. Model Variants:** +- **kyutai/stt-1b-en_fr**: 1B parameters, English/French, 0.5s delay +- **kyutai/stt-2.6b-en**: 2.6B parameters, English-only, 2.5s delay + +**4. 
Training Scale:** +- **Kyutai STT**: 2.5 million hours of public audio +- **Whisper**: 680,000 hours of labeled audio + +Let's try Kyutai STT (requires transformers >= 4.53.0): + +```python +import torch +from transformers import AutoProcessor, KyutaiSTTForConditionalGeneration +from datasets import load_dataset + +# Load the processor and model +processor = AutoProcessor.from_pretrained("kyutai/stt-2.6b-en") +model = KyutaiSTTForConditionalGeneration.from_pretrained("kyutai/stt-2.6b-en") + +# Load sample audio +dataset = load_dataset( + "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation" +) +sample = dataset[0]["audio"] + +# Process the audio +inputs = processor( + sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt" +) + +# Generate transcription +with torch.no_grad(): + generated_ids = model.generate(**inputs, max_length=256) + +# Decode the result +transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] +print(f"Kyutai STT: {transcription}") +``` + +**Performance Characteristics:** +- **Latency**: Ultra-low latency (0.5-2.5s depending on model) +- **Robustness**: Handles noisy conditions well +- **Audio Length**: Can process up to 2 hours of audio +- **Punctuation**: Includes capitalization and punctuation + +### Architecture Comparison Summary + +| Feature | Whisper | Moonshine | Kyutai STT | +|---------|---------|-----------|------------| +| **Processing** | Fixed 30s chunks | Variable-length | Streaming | +| **Best Use Case** | General-purpose ASR | Edge/Mobile devices | Real-time applications | +| **Model Size** | 39M - 1.5B params | 27M - 61M params | 1B - 2.6B params | +| **Speed** | Baseline | 5x faster (short audio) | Ultra-low latency | +| **Languages** | 96+ languages | English only | English (+French) | +| **Punctuation** | Yes | Yes | Yes | +| **Memory Usage** | High | Low | Medium | +| **Training Data** | 680k hours | 200k hours | 2.5M hours | + +### When to Choose Each Model: + +**Choose Whisper when:** +- You need multilingual support (96+ languages) +- Accuracy is more important than speed +- You're working with diverse audio domains +- You need translation capabilities + +**Choose Moonshine when:** +- You're deploying on edge devices or mobile +- You need fast processing for short audio clips +- Memory efficiency is crucial +- You're working with English-only content + +**Choose Kyutai STT when:** +- You need real-time transcription +- Low latency is critical +- You're building streaming applications +- You need robust handling of long audio files + +The choice between these models depends on your specific use case, computational constraints, and performance requirements. Each represents a different optimization point in the trade-off between accuracy, speed, memory usage, and feature set. + ## Summary -Whisper is a strong pre-trained model for speech recognition and translation. Compared to Wav2Vec2, it has higher -transcription accuracy, with outputs that contain punctuation and casing. It can be used to transcribe speech in English -as well as 96 other languages, both on short audio segments and longer ones through _chunking_. These attributes make it -a viable model for many speech recognition and translation tasks without the need for fine-tuning. The `pipeline()` method -provides an easy way of running inference in one-line API calls with control over the generated predictions. 
- -While the Whisper model performs extremely well on many high-resource languages, it has lower transcription and translation -accuracy on low-resource languages, i.e. those with less readily available training data. There is also varying performance -across different accents and dialects of certain languages, including lower accuracy for speakers of different genders, -races, ages or other demographic criteria (_c.f._ [Whisper paper](https://arxiv.org/pdf/2212.04356.pdf)). - -To boost the performance on low-resource languages, accents or dialects, we can take the pre-trained Whisper model and -train it on a small corpus of appropriately selected data, in a process called _fine-tuning_. We'll show that with -as little as ten hours of additional data, we can improve the performance of the Whisper model by over 100% on a low-resource -language. In the next section, we'll cover the process behind selecting a dataset for fine-tuning. +The landscape of automatic speech recognition has expanded significantly beyond the groundbreaking Whisper model. While Whisper remains a strong pre-trained model for speech recognition and translation with support for 96+ languages, we now have specialized alternatives that excel in specific use cases. + +**Whisper** excels at general-purpose ASR with multilingual support, high accuracy, and translation capabilities. However, it requires complete audio input and has higher computational requirements. + +**Moonshine** represents the next generation of efficient ASR, optimized for edge computing and real-time applications. With 5x faster processing for short audio clips and significantly lower memory usage, it's ideal for mobile and embedded applications, though currently limited to English. + +**Kyutai STT** pushes the boundaries of real-time ASR with streaming capabilities and ultra-low latency. Its ability to transcribe audio as it arrives makes it perfect for live applications, though it's currently limited to English and French. + +Each model represents different optimization trade-offs: +- **Whisper**: Accuracy and multilingual support +- **Moonshine**: Efficiency and edge deployment +- **Kyutai STT**: Real-time processing and streaming + +The choice depends on your specific requirements: language support, computational constraints, latency requirements, and deployment environment. All three models support punctuation and casing, and are available through the 🤗 Transformers library with `pipeline()` support for easy inference. + +For applications requiring fine-tuning, the same principles apply across all models. In the next section, we'll explore dataset selection strategies that can be adapted for any of these ASR architectures. 
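To make the `pipeline()` point above concrete, here is a minimal sketch that transcribes the same LibriSpeech sample with Whisper and Moonshine, using the `openai/whisper-base` and `UsefulSensors/moonshine-base` checkpoints from the tables above. It assumes a recent 🤗 Transformers release in which Moonshine is wired into the automatic-speech-recognition pipeline; treat it as a starting point for your own comparison rather than a benchmark.

```python
from datasets import load_dataset
from transformers import pipeline

# A short English sample to compare the two models on
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]["audio"]

# Whisper baseline
whisper_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-base")
print("Whisper:", whisper_pipe(sample.copy())["text"])

# Moonshine (assumes your installed 🤗 Transformers version includes Moonshine support)
moonshine_pipe = pipeline(
    "automatic-speech-recognition", model="UsefulSensors/moonshine-base"
)
print("Moonshine:", moonshine_pipe(sample.copy())["text"])
```

Running both checkpoints through the same `pipeline()` call makes it easy to compare transcriptions and wall-clock latency on your own audio before committing to one architecture.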
diff --git a/chapters/en/chapter5/choosing_dataset.mdx b/chapters/en/chapter5/choosing_dataset.mdx index 36e4927..7cb574b 100644 --- a/chapters/en/chapter5/choosing_dataset.mdx +++ b/chapters/en/chapter5/choosing_dataset.mdx @@ -56,7 +56,7 @@ Here is a summary of the most popular English speech recognition datasets on the | Dataset | Train Hours | Domain | Speaking Style | Casing | Punctuation | License | Recommended Use | |-----------------------------------------------------------------------------------------|-------------|-----------------------------|-----------------------|--------|-------------|-----------------|----------------------------------| | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | 960 | Audiobook | Narrated | ❌ | ❌ | CC-BY-4.0 | Academic benchmarks | -| [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) | 3000 | Wikipedia | Narrated | ✅ | ✅ | CC0-1.0 | Non-native speakers | +| [Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 3000 | Wikipedia | Narrated | ✅ | ✅ | CC0-1.0 | Non-native speakers | | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 540 | European Parliament | Oratory | ❌ | ✅ | CC0 | Non-native speakers | | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | 450 | TED talks | Oratory | ❌ | ❌ | CC-BY-NC-ND 3.0 | Technical topics | | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | 10000 | Audiobook, podcast, YouTube | Narrated, spontaneous | ❌ | ✅ | apache-2.0 | Robustness over multiple domains | @@ -71,7 +71,7 @@ for each dataset, and replace it with the number of languages per dataset: | Dataset | Languages | Domain | Speaking Style | Casing | Punctuation | License | Recommended Usage | |-----------------------------------------------------------------------------------------------|-----------|---------------------------------------|----------------|--------|-------------|-----------|-------------------------| | [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) | 6 | Audiobooks | Narrated | ❌ | ❌ | CC-BY-4.0 | Academic benchmarks | -| [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | 108 | Wikipedia text & crowd-sourced speech | Narrated | ✅ | ✅ | CC0-1.0 | Diverse speaker set | +| [Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 124 | Wikipedia text & crowd-sourced speech | Narrated | ✅ | ✅ | CC0-1.0 | Diverse speaker set | | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 15 | European Parliament recordings | Spontaneous | ❌ | ✅ | CC0 | European languages | | [FLEURS](https://huggingface.co/datasets/google/fleurs) | 101 | European Parliament recordings | Spontaneous | ❌ | ❌ | CC-BY-4.0 | Multilingual evaluation | @@ -85,26 +85,26 @@ efforts - the audio community is inclusive and wide-ranging, and others will app Alright! Now that we've gone through all the criterion for selecting an ASR dataset, let's pick one for the purpose of this tutorial. We know that Whisper already does a pretty good job at transcribing data in high-resource languages (such as English and Spanish), so we'll focus ourselves on low-resource multilingual transcription. We want to retain Whisper's ability to predict punctuation and casing, -so it seems from the second table that Common Voice 13 is a great candidate dataset! 
+so it seems from the second table that Common Voice 17 is a great candidate dataset! -## Common Voice 13 +## Common Voice 17 -Common Voice 13 is a crowd-sourced dataset where speakers record text from Wikipedia in various languages. It forms part of +Common Voice 17 is a crowd-sourced dataset where speakers record text from Wikipedia in various languages. It forms part of the Common Voice series, a collection of Common Voice datasets released by Mozilla Foundation. At the time of writing, -Common Voice 13 is the latest edition of the dataset, with the most languages and hours per language out of any release to date. +Common Voice 17 is the latest edition of the dataset, with the most languages and hours per language out of any release to date. -We can get the full list of languages for the Common Voice 13 dataset by checking-out the dataset page on the Hub: -[mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0). +We can get the full list of languages for the Common Voice 17 dataset by checking-out the dataset page on the Hub: +[mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0). The first time you view this page, you'll be asked to accept the terms of use. After that, you'll be given full access to the dataset. Once we've provided authentication to use the dataset, we'll be presented with the dataset preview. The dataset preview shows us the first 100 samples of the dataset for each language. What's more, it's loaded up with audio samples ready for us to listen to in real time. For this Unit, we'll select [_Dhivehi_](https://en.wikipedia.org/wiki/Maldivian_language) (or _Maldivian_), an Indo-Aryan language spoken in the South Asian island country of the Maldives. While we're selecting -Dhivehi for this tutorial, the steps covered here apply to any one of the 108 languages in the Common Voice 13 dataset, and +Dhivehi for this tutorial, the steps covered here apply to any one of the 124 languages in the Common Voice 17 dataset, and more generally to any one of the 180+ audio datasets on the Hugging Face Hub, so there's no restriction on language or dialect. -We can select the Dhivehi subset of Common Voice 13 by setting the subset to `dv` using the dropdown menu (`dv` being the language +We can select the Dhivehi subset of Common Voice 17 by setting the subset to `dv` using the dropdown menu (`dv` being the language identifier code for Dhivehi):
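If you'd rather sanity-check the subset from code than from the dataset preview, you can stream a few Dhivehi samples with 🤗 Datasets. The sketch below assumes you have already accepted the dataset's terms of use and are logged in to the Hugging Face Hub; `streaming=True` is optional, but it avoids downloading the full subset just to peek at it:

```python
from datasets import load_dataset

# Stream the Dhivehi ("dv") subset so we can inspect a few samples without
# downloading it in full. The dataset is gated, so make sure you've accepted
# the terms of use and run `huggingface-cli login` first.
cv_17_dv = load_dataset(
    "mozilla-foundation/common_voice_17_0", "dv", split="train", streaming=True
)

sample = next(iter(cv_17_dv))
print(sample["sentence"])
print(sample["audio"]["sampling_rate"])
```

Each sample exposes the transcription under `sentence` and the waveform plus sampling rate under `audio`, which is the format we'll rely on in the rest of this Unit.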
@@ -121,6 +121,37 @@ dataset on the Hub, scroll through the samples and listen to the audio for the d it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can start using it. +## Model-Specific Dataset Considerations + +When selecting a dataset for fine-tuning, it's important to consider the specific requirements and characteristics of your chosen ASR model: + +### For Whisper Models +- **Multilingual datasets** work best due to Whisper's multilingual training +- **Longer audio segments** (up to 30 seconds) are processed efficiently +- **Diverse domains** benefit from Whisper's robust pre-training +- **Punctuation and casing** are recommended as Whisper handles them well + +### For Moonshine Models +- **English-only datasets** are required (currently) +- **Shorter audio segments** (under 10 seconds) leverage Moonshine's efficiency advantages +- **Clean, high-quality audio** maximizes the model's edge computing benefits +- **Domain-specific data** can be particularly effective due to the model's focused training + +### For Kyutai STT Models +- **English and French datasets** are supported +- **Streaming-compatible data** with natural speech patterns work best +- **Longer audio files** (up to 2 hours) are handled efficiently +- **Noisy or challenging audio** conditions are well-supported + +### Dataset Selection Strategy + +1. **Match your model's language support** to your target languages +2. **Consider audio length characteristics** for optimal performance +3. **Evaluate computational constraints** when choosing between models +4. **Test on representative samples** from your target domain + +The Common Voice 17 dataset we selected works well for fine-tuning Whisper in any of its 124 languages, including Dhivehi. Moonshine and Kyutai STT, by contrast, are limited to its English (and, for Kyutai STT, French) subsets, so your choice of model architecture also constrains which subsets you can use. + Now, I personally don't speak Dhivehi, and expect the vast majority of readers not to either! To know if our fine-tuned model is any good, we'll need a rigorous way of _evaluating_ it on unseen data and measuring its transcription accuracy. We'll cover exactly this in the next section! diff --git a/chapters/en/chapter5/evaluation.mdx b/chapters/en/chapter5/evaluation.mdx index 6f9b50b..a289165 100644 --- a/chapters/en/chapter5/evaluation.mdx +++ b/chapters/en/chapter5/evaluation.mdx @@ -227,7 +227,7 @@ orthographic text and evaluating on normalised text to get the best of both worl Alright! We've covered three topics so far in this Unit: pre-trained models, dataset selection and evaluation. Let's have some fun and put them together in one end-to-end example 🚀 We're going to set ourselves up for the next -section on fine-tuning by evaluating the pre-trained Whisper model on the Common Voice 13 Dhivehi test set. We'll use +section on fine-tuning by evaluating the pre-trained Whisper model on the Common Voice 17 Dhivehi test set. We'll use the WER number we get as a _baseline_ for our fine-tuning run, or a target number that we'll try and beat 🥊 First, we'll load the pre-trained Whisper model using the `pipeline()` class. This process will be extremely familiar by now! @@ -253,8 +253,8 @@ pipe = pipeline( ) ``` -Next, we'll load the Dhivehi test split of Common Voice 13. You'll remember from the previous section that the Common -Voice 13 is *gated*, meaning we had to agree to the dataset terms of use before gaining access to the dataset. We can +Next, we'll load the Dhivehi test split of Common Voice 17. 
You'll remember from the previous section that the Common +Voice 17 is *gated*, meaning we had to agree to the dataset terms of use before gaining access to the dataset. We can now link our Hugging Face account to our notebook, so that we have access to the dataset from the machine we're currently using. @@ -275,13 +275,13 @@ preparing it automatically on your notebook: from datasets import load_dataset common_voice_test = load_dataset( - "mozilla-foundation/common_voice_13_0", "dv", split="test" + "mozilla-foundation/common_voice_17_0", "dv", split="test" ) ``` If you face an authentication issue when loading the dataset, ensure that you have accepted the dataset's terms of use - on the Hugging Face Hub through the following link: https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0 + on the Hugging Face Hub through the following link: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 Evaluating over an entire dataset can be done in much the same way as over a single example - all we have to do is **loop** diff --git a/chapters/en/chapter5/fine-tuning.mdx b/chapters/en/chapter5/fine-tuning.mdx index a95e442..6bda1e2 100644 --- a/chapters/en/chapter5/fine-tuning.mdx +++ b/chapters/en/chapter5/fine-tuning.mdx @@ -1,6 +1,6 @@ # Fine-tuning the ASR model -In this section, we'll cover a step-by-step guide on fine-tuning Whisper for speech recognition on the Common Voice 13 +In this section, we'll cover a step-by-step guide on fine-tuning Whisper for speech recognition on the Common Voice 17 dataset. We'll use the 'small' version of the model and a relatively lightweight dataset, enabling you to run fine-tuning fairly quickly on any 16GB+ GPU with low disk space requirements, such as the 16GB T4 GPU provided in the Google Colab free tier. @@ -41,12 +41,12 @@ Your token has been saved to /root/.huggingface/token ## Load Dataset -[Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) contains approximately ten +[Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) contains approximately ten hours of labelled Dhivehi data, three of which is held-out test data. This is extremely little data for fine-tuning, so we'll be relying on leveraging the extensive multilingual ASR knowledge acquired by Whisper during pre-training for the low-resource Dhivehi language. -Using 🤗 Datasets, downloading and preparing data is extremely simple. We can download and prepare the Common Voice 13 +Using 🤗 Datasets, downloading and preparing data is extremely simple. We can download and prepare the Common Voice 17 splits in just one line of code. Since Dhivehi is very low-resource, we'll combine the `train` and `validation` splits to give approximately seven hours of training data. We'll use the three hours of `test` data as our held-out test set: @@ -56,10 +56,10 @@ from datasets import load_dataset, DatasetDict common_voice = DatasetDict() common_voice["train"] = load_dataset( - "mozilla-foundation/common_voice_13_0", "dv", split="train+validation" + "mozilla-foundation/common_voice_17_0", "dv", split="train+validation" ) common_voice["test"] = load_dataset( - "mozilla-foundation/common_voice_13_0", "dv", split="test" + "mozilla-foundation/common_voice_17_0", "dv", split="test" ) print(common_voice) @@ -81,7 +81,7 @@ DatasetDict({ You can change the language identifier from `"dv"` to a language identifier of your choice. 
To see all possible languages - in Common Voice 13, check out the dataset card on the Hugging Face Hub: https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0 + in Common Voice 17, check out the dataset card on the Hugging Face Hub: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 Most ASR datasets only provide input audio samples (`audio`) and the corresponding transcribed text (`sentence`). @@ -514,8 +514,8 @@ model name accordingly: ```python kwargs = { - "dataset_tags": "mozilla-foundation/common_voice_13_0", - "dataset": "Common Voice 13", # a 'pretty' name for the training dataset + "dataset_tags": "mozilla-foundation/common_voice_17_0", + "dataset": "Common Voice 17", # a 'pretty' name for the training dataset "language": "dv", "model_name": "Whisper Small Dv - Sanchit Gandhi", # a 'pretty' name for your model "finetuned_from": "openai/whisper-small", @@ -532,7 +532,7 @@ trainer.push_to_hub(**kwargs) This will save the training logs and model weights under `"your-username/the-name-you-picked"`. For this example, check out the upload at `sanchit-gandhi/whisper-small-dv`. -While the fine-tuned model yields satisfactory results on the Common Voice 13 Dhivehi test data, it is by no means optimal. +While the fine-tuned model yields satisfactory results on the Common Voice 17 Dhivehi test data, it is by no means optimal. The purpose of this guide is to demonstrate how to fine-tune an ASR model using the 🤗 Trainer for multilingual speech recognition. @@ -559,7 +559,7 @@ pipe = pipeline("automatic-speech-recognition", model="sanchit-gandhi/whisper-sm ## Conclusion In this section, we covered a step-by-step guide on fine-tuning the Whisper model for speech recognition 🤗 Datasets, -Transformers and the Hugging Face Hub. We first loaded the Dhivehi subset of the Common Voice 13 dataset and pre-processed +Transformers and the Hugging Face Hub. We first loaded the Dhivehi subset of the Common Voice 17 dataset and pre-processed it by computing log-mel spectrograms and tokenising the text. We then defined a data collator, evaluation metric and training arguments, before using the 🤗 Trainer to train and evaluate our model. We finished by uploading the fine-tuned model to the Hugging Face Hub, and showcased how to share and use it with the `pipeline()` class.
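As a final sanity check after pushing your model, you can score a handful of held-out test samples end-to-end with the same `pipeline()` class and WER metric used earlier in this Unit. The sketch below reuses the `sanchit-gandhi/whisper-small-dv` checkpoint from the example above (substitute your own `"your-username/the-name-you-picked"` repository) and streams just eight test samples to keep the check quick; for a trustworthy number, evaluate on the full test split as in the evaluation section.

```python
import evaluate
from datasets import Audio, load_dataset
from transformers import pipeline

# Load the fine-tuned checkpoint (swap in your own repo name here)
pipe = pipeline(
    "automatic-speech-recognition", model="sanchit-gandhi/whisper-small-dv"
)
wer_metric = evaluate.load("wer")

# Stream a few held-out test samples rather than downloading the full split
test_set = load_dataset(
    "mozilla-foundation/common_voice_17_0", "dv", split="test", streaming=True
)
test_set = test_set.cast_column("audio", Audio(sampling_rate=16000))

predictions, references = [], []
for sample in test_set.take(8):
    predictions.append(pipe(sample["audio"].copy())["text"])
    references.append(sample["sentence"])

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER on 8 samples: {wer:.2f}%")
```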