Skip to content

Conversation

@kevinschweikert
Copy link

The xav library is a wrapper around the libav libs, from which FFmpeg is built. The nice thing is, that this compiles these libav functions to a NIF, so that you don't need FFmpeg installed on your computer.

There are two prerequisites until this could be merged:

  1. the xav library could precompile the NIF so we get a real benefit from this change. Otherwise, we just make it harder for the user to install the development packages (Install instructions from xav) instead of the normal FFmpeg binary
  2. xav needs to update nx so we can merge. I've already opened a PR version bump nx, exla and bumblebee elixir-webrtc/xav#19. To test this branch, i temporarily set override: true on the nx dependency

Comment on lines 181 to 183
|> Stream.chunk_every(1000)
|> Stream.map(&Nx.Batch.concatenate/1)
|> Stream.map(fn batch -> Nx.Defn.jit_apply(&Function.identity/1, [batch]) end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function just needs to return a stream of chunks, so we don't need to do this concatenation.

Suggested change
|> Stream.chunk_every(1000)
|> Stream.map(&Nx.Batch.concatenate/1)
|> Stream.map(fn batch -> Nx.Defn.jit_apply(&Function.identity/1, [batch]) end)

Do you know what determines the length of each chunk, could that be configurable perhaps?

Copy link
Author

@kevinschweikert kevinschweikert Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not completely sure but i'm guessing that a chunk/frame in a video context is the audio duration of one frame. When i remove the Stream.chunk_every/2 pipeline i get a massive performance degradation and the processing does not finish in a reasonable time. I imagine that it's far more efficient to read a chunk of frames first, before sending it to the Whisper serving.
I just reused the implementation of the Elixir WebRTC team from https://github.com/elixir-webrtc/xav/blob/master/README.md

{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(whisper, featurizer, tokenizer, generation_config,
    defn_options: [compiler: EXLA]
  )

# Read a couple of frames.
# See https://hexdocs.pm/bumblebee/Bumblebee.Audio.WhisperFeaturizer.html for default sampling rate.
frames =
    Xav.Reader.stream!("sample.mp3", read: :audio, out_format: :f32, out_channels: 1, out_sample_rate: 16_000)
    |> Stream.take(200)
    |> Enum.map(fn frame -> Xav.Reader.to_nx(frame) end)

batch = Nx.Batch.concatenate(frames)
batch = Nx.Defn.jit_apply(&Function.identity/1, [batch])
Nx.Serving.run(serving, batch) 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The serving transforms the stream to accumulate smaller chunks, but there is a place where we need to append to a list and that may be the reason why it's inefficient with tiny chunks.

However, either way, I think it's wasteful to convert every frame to a tensor just to concatenate later. With the current ffmpeg code we get a single binary for the whole 30s and create a tensor from that. So ideally we want to replicate this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this suffice?

path
|> Xav.Reader.stream!(
  read: :audio,
  out_format: :f32,
  out_channels: 1,
  out_sample_rate: sampling_rate
)
|> Stream.chunk_every(1000)
|> Stream.map(fn frames ->
  [frame | _] = frames
  binary = Enum.reduce(frames, <<>>, fn frame, acc -> acc <> frame.data end)

  Nx.with_default_backend(Nx.BinaryBackend, fn -> Nx.from_binary(binary, frame.format) end)
end)

The 1000 chunks is currently arbitrary because we don't know the frame_size of the used codec. This could be added to xav i think

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: The information is already there. Xav.Frame has a samples field from which we could calculate the 30s chunks which would be calculated by

round(sampling_rate / frame.samples * 30)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling the frame binaries directly is a good call. I think we can transform the stream, so that we can accumulate the binary, instead of waiting for 1000 binaries and joining then. I was thinking we can determine the number of samples from binary size, but frame.samples is even more convenient.

So it would be something like this:

chunk_samples = sampling_rate * 30

path
|> Xav.Reader.stream!(
  read: :audio,
  out_format: :f32,
  out_channels: 1,
  out_sample_rate: sampling_rate
)
|> Stream.transform(
  fn -> {<<>>, 0} end,
  fn frame, {buffer, samples} ->
    buffer = buffer <> frame.data
    samples = samples + frame.samples

    if samples >= chunk_samples do
      chunk = Nx.from_binary(buffer, :f32, backend: Nx.BinaryBackend)
      {[chunk], {<<>>, 0}}
    else
      {[], {buffer, samples}}
    end
  end,
  fn {buffer, _samples} ->
    chunk = Nx.from_binary(buffer, :f32, backend: Nx.BinaryBackend)
    {[chunk], {<<>>, 0}}
  end,
  fn _ -> :ok end
)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! That is a great solution! Thanks!

@jonatanklosko
Copy link
Member

@kevinschweikert thanks for the PR! Dropping the ffmpeg dependency would be great, but yeah, I agree that we need xav to be precompiled for this to be beneficial.

Copy link
Contributor

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep ffmpeg but also have xav as an optional dependency.

@jonatanklosko
Copy link
Member

Sounds good to me!

Co-authored-by: Jonatan Kłosko <[email protected]>
@kevinschweikert
Copy link
Author

Awesome idea! I will bring back the ffmpeg implementation when xav is not available. Thanks for the feedback and suggestions!

@jonatanklosko
Copy link
Member

I've just realised that it's not just about precompilation, the main blocker is that xav still requires ffmpeg to be installed, so at the moment there is no benefit really (perhaps a tiny bit less overhead, because we don't need System.cmd, but I think that's fine). So I think what we need is xav having the necessary ffmpeg C code vendored, and then ideally precompilation.

@mickel8
Copy link

mickel8 commented Oct 23, 2024

Precompiling FFmpeg is not an easy task as it has so many other dependencies. I agree that until Xav solves this problem, it's better for Bumblebee to stay with the current solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants