Merged
34 changes: 18 additions & 16 deletions README.md
@@ -5,6 +5,7 @@ We present LFM2-Audio-1.5B, [Liquid AI](https://www.liquid.ai/)'s first end-to-e
LFM2-Audio supports two generation modes, interleaved and sequential, to maximize performance and quality across different tasks. Interleaved generation outputs text and audio tokens in a fixed interleaved pattern. This approach minimizes time to first audio output and the number of tokens generated, making it ideal for naturally flowing, real-time speech-to-speech interactions on resource-constrained devices. Sequential generation mode, in which the model decides when to switch modalities via special tokens, is suitable for non-conversational tasks such as speech-to-text (ASR) or text-to-speech (TTS).

### Updates
- [Finetuning](#finetuning) is now supported in both interleaved and sequential generation modes. Version 1.2.0 introduces data preparation tools and a lightweight trainer, enabling users to fine-tune models for a broad range of tasks, from ASR and TTS to function calling and end-to-end speech-to-speech chat.
- [LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) is released! This model is based on the stronger LFM2.5-1.2B base and comes with a lightning-fast LFM2-based audio detokenizer, stronger ASR, and better TTS voices. To use the new detokenizer, call `processor.decode`; see the examples below for more details. For the improved TTS voices, see the [TTS](#tts) section.

## Installation
@@ -15,6 +16,11 @@ pip install "liquid-audio [demo]" # optional, to install demo dependencies
pip install flash-attn --no-build-isolation # optional, to use flash attention 2. Will fall back to torch SDPA if not installed
```

For installation on AMD ROCm, specify the matching `torch` version and package index, for example:
```bash
pip install liquid-audio torch==2.12.0+rocm7.2 --extra-index-url https://download.pytorch.org/whl/rocm7.2
```

## Usage
Generation is handled by two generation modes, interleaved and sequential, accessible from the methods `LFM2AudioModel.generate_interleaved` and `LFM2AudioModel.generate_sequential` respectively. Both are generators that yield `torch.Tensor`s. Text tokens are represented by tensors with 1 entry, and audio tokens are tensors with 8 entries, corresponding to 8 [Mimi](https://huggingface.co/docs/transformers/en/model_doc/mimi) codebooks.
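
As a rough illustration of the token convention described above, the sketch below routes a mixed stream into text and audio tokens by counting entries. The helper names `entry_count` and `split_modalities` are hypothetical and not part of `liquid_audio`; plain Python lists stand in for the yielded tensors so the snippet runs without a model.

```python
# Hypothetical helpers, not part of liquid_audio: route tokens by entry count.
NUM_CODEBOOKS = 8  # audio tokens carry one entry per Mimi codebook


def entry_count(token):
    """Return the number of entries in a token.

    Uses .numel() when the object provides it (as torch.Tensor does),
    otherwise falls back to len() so plain lists work as stand-ins.
    """
    numel = getattr(token, "numel", None)
    return numel() if callable(numel) else len(token)


def split_modalities(tokens):
    """Separate a mixed token stream into (text_tokens, audio_tokens)."""
    text, audio = [], []
    for token in tokens:
        n = entry_count(token)
        if n == 1:
            text.append(token)
        elif n == NUM_CODEBOOKS:
            audio.append(token)
        else:
            raise ValueError(f"unexpected token with {n} entries")
    return text, audio


# Stand-in stream: two text tokens interleaved with two audio frames.
stream = [[11], [0] * NUM_CODEBOOKS, [12], [1] * NUM_CODEBOOKS]
text, audio = split_modalities(stream)
print(len(text), len(audio))
```

The same loop should apply to the output of either generation method, since both are described as yielding tensors of these two sizes.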

Expand Down Expand Up @@ -60,7 +66,7 @@ https://github.com/user-attachments/assets/d0d054b2-6d1d-49fb-94df-4aa0b6641990

```python
import torch
-import torchaudio
+import soundfile as sf
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
@@ -77,7 +83,8 @@ chat.add_text("Respond with interleaved text and audio.")
chat.end_turn()

chat.new_turn("user")
-wav, sampling_rate = torchaudio.load("assets/question.wav")
+wav, sampling_rate = sf.read("assets/question.wav", dtype="float32")
+wav = torch.from_numpy(wav).unsqueeze(0)
chat.add_audio(wav, sampling_rate)
chat.end_turn()

@@ -102,7 +109,7 @@ for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperatur
# Mimi returns audio at 24kHz
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
waveform = processor.decode(audio_codes)
-torchaudio.save("answer1.wav", waveform.cpu(), 24_000)
+sf.write("answer1.wav", waveform.cpu()[0], 24_000)

# Append newly generated tokens to chat history
chat.append(
Expand Down Expand Up @@ -132,7 +139,7 @@ for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperatur
# Detokenize second turn audio, removing the last "end-of-audio" codes
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
waveform = processor.decode(audio_codes)
-torchaudio.save("answer2.wav", waveform.cpu(), 24_000)
+sf.write("answer2.wav", waveform.cpu()[0], 24_000)
```
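
The detokenization step in the example stacks the collected audio frames and drops the final end-of-audio frame. A minimal sketch of that shape handling follows, using NumPy as a stand-in for `torch` (an assumption made so the snippet runs without a model); `frames_to_codes` is a hypothetical helper, and `np.stack(..., axis=1)` plays the role of `torch.stack(..., 1)`.

```python
import numpy as np

NUM_CODEBOOKS = 8  # one entry per Mimi codebook


def frames_to_codes(audio_frames):
    """Stack per-step frames of shape (8,) into a (1, 8, T) array.

    The last frame is treated as the end-of-audio marker and dropped,
    mirroring audio_out[:-1] in the example above.
    """
    kept = audio_frames[:-1]
    if not kept:
        raise ValueError("need at least one frame besides the end-of-audio frame")
    stacked = np.stack(kept, axis=1)  # (8, T)
    return stacked[np.newaxis, ...]   # (1, 8, T): add a batch dimension


# Four dummy frames; the fourth stands in for the end-of-audio codes.
frames = [np.full(NUM_CODEBOOKS, step, dtype=np.int64) for step in range(4)]
codes = frames_to_codes(frames)
print(codes.shape)  # (1, 8, 3)
```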


@@ -151,7 +158,7 @@ https://github.com/user-attachments/assets/b3cc017f-363d-49f3-8e7d-f6db9556900e

```python
import torch
-import torchaudio
+import soundfile as sf
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
@@ -168,7 +175,8 @@ chat.add_text("Perform ASR.")
chat.end_turn()

chat.new_turn("user")
-wav, sampling_rate = torchaudio.load("assets/asr.wav")
+wav, sampling_rate = sf.read("assets/asr.wav", dtype="float32")
+wav = torch.from_numpy(wav).unsqueeze(0)
chat.add_audio(wav, sampling_rate)
chat.end_turn()

Expand Down Expand Up @@ -207,7 +215,7 @@ https://github.com/user-attachments/assets/8d57c184-b92e-4e1a-983b-d1f9d16d0d92

```python
import torch
-import torchaudio
+import soundfile as sf
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
Expand Down Expand Up @@ -238,7 +246,7 @@ for t in model.generate_sequential(**chat, max_new_tokens=512, audio_temperature
# Detokenize audio
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
waveform = processor.decode(audio_codes)
-torchaudio.save("tts.wav", waveform.cpu(), 24_000)
+sf.write("tts.wav", waveform.cpu()[0], 24_000)
```

## Finetuning
@@ -249,12 +257,6 @@ To finetune on your own data, make use of the `ChatMessage` interface. This requ
2. use the [`LFM2AudioChatMapper`](src/liquid_audio/data/mapper.py) to create a preprocessed dataset
3. train a model from the preprocessed dataset with `LFM2DataLoader`

First, install project dependencies:

```bash
uv sync
```

### Preprocess

Before training, convert your dataset into our preprocessed training format.
@@ -274,7 +276,7 @@ See [examples/preprocess_jenny_tts.py](examples/preprocess_jenny_tts.py) for an
Run preprocessing with:

```bash
-python -m examples.preprocess_jenny_tts
+python examples/preprocess_jenny_tts.py
```

This writes a preprocessed dataset to `data/jenny_tts/train`.
@@ -287,7 +289,7 @@ For example, to finetune a model on the [Jenny TTS Dataset](https://huggingface.
using the preprocessed dataset from before, run:

```bash
-python -m examples.train
+python examples/train.py
```


1 change: 0 additions & 1 deletion pyproject.toml
@@ -17,7 +17,6 @@ dependencies = [
"sentencepiece>=0.2.1",
"torch>=2.8.0",
"torchaudio>=2.8.0",
-"torchcodec>=0.9.1",
"transformers>=4.55.4",
]
keywords = ["Liquid AI", "LFM", "LFM2", "Audio", "Speech-to-Speech"]
4 changes: 3 additions & 1 deletion src/liquid_audio/data/mapper.py
@@ -2,6 +2,7 @@

import io

+import soundfile
import torch
import torchaudio

Expand Down Expand Up @@ -233,7 +234,8 @@ def _encode_audio_out(self, *, wav: torch.Tensor, sampling_rate: int) -> torch.T
@staticmethod
def _load_audio_bytes(audio: bytes) -> tuple[torch.Tensor, int]:
with io.BytesIO(audio) as stream:
wav, sampling_rate = torchaudio.load(stream)
data, sampling_rate = soundfile.read(stream, dtype="float32", always_2d=True)
wav = torch.from_numpy(data.T.copy())
if wav.shape[0] > 1:
wav = wav.mean(dim=0, keepdim=True)
return wav, sampling_rate
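
The layout change behind this hunk can be sketched on its own: `soundfile.read` returns audio as `(frames, channels)` (or a 1-D array for mono when `always_2d` is not set), whereas the rest of the code works with `(channels, frames)`. The snippet below is an illustration only, using NumPy arrays in place of file I/O and `torch`; `to_channels_first` and `downmix_to_mono` are hypothetical names, not functions from the repository.

```python
import numpy as np


def to_channels_first(data: np.ndarray) -> np.ndarray:
    """Convert soundfile-style audio into a (channels, frames) array.

    Accepts a 1-D mono array of shape (frames,) or a 2-D array of
    shape (frames, channels).
    """
    if data.ndim == 1:
        return data[np.newaxis, :]           # (frames,) -> (1, frames)
    if data.ndim == 2:
        return np.ascontiguousarray(data.T)  # (frames, channels) -> (channels, frames)
    raise ValueError(f"expected 1-D or 2-D audio, got {data.ndim} dimensions")


def downmix_to_mono(wav: np.ndarray) -> np.ndarray:
    """Average all channels into one, keeping the channel dimension."""
    if wav.shape[0] > 1:
        return wav.mean(axis=0, keepdims=True)
    return wav


# Three stereo frames laid out the way soundfile would return them.
stereo = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]], dtype=np.float32)
mono = downmix_to_mono(to_channels_first(stereo))
print(mono.shape)  # (1, 3)
```

Note that the README snippets call `sf.read` without `always_2d=True` and then `unsqueeze(0)`, which yields `(1, frames)` for mono files; a multi-channel file would need the transpose shown here.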
18 changes: 0 additions & 18 deletions uv.lock

Some generated files are not rendered by default.