6 changes: 6 additions & 0 deletions authors/justus_august.md
@@ -0,0 +1,6 @@
Author: Justus August Title: Software Engineer Description: Justus August is a
software engineer focused on practical AI workflows, developer tooling, and
reproducible automation. He writes guides that turn integration details into
clear, testable steps for builders working with modern cloud and AI services.
Author Image: Author LinkedIn: Author Twitter: Company Name: Company
Description: Company Logo Dark: Company Logo White:
22 changes: 22 additions & 0 deletions definitions/20260525_definition_deepinfra_speech_recognition.md
@@ -0,0 +1,22 @@
---
title: 'DeepInfra Speech Recognition'
description: 'DeepInfra Speech Recognition runs Whisper-style audio transcription through hosted inference endpoints.'
date: 2026-05-25
author: 'Justus August'
---

# DeepInfra Speech Recognition

## Definition

DeepInfra Speech Recognition is the use of DeepInfra-hosted automatic speech
recognition models, including Whisper variants, to transcribe uploaded audio
files into text through an API endpoint.

## Context and Usage

Developers use DeepInfra Speech Recognition when they want hosted transcription
without operating speech models or GPU infrastructure themselves. In a Daytona
workspace, it can be combined with command-line tools such as Sapat so media
preparation, API calls, validation, and transcript review happen in a repeatable
development environment.
346 changes: 346 additions & 0 deletions guides/20260525_run_deepinfra_transcription_with_sapat_in_daytona.md
@@ -0,0 +1,346 @@
---
title: 'Run DeepInfra Transcription with Sapat'
description:
'Use Daytona, Sapat, and DeepInfra-hosted Whisper to turn audio or video files
into reproducible transcripts.'
date: 2026-05-25
author: 'Justus August'
tags: ['daytona', 'sapat', 'deepinfra', 'speech-to-text']
---

# Run DeepInfra Transcription with Sapat

## Introduction

Transcription scripts usually start as a one-off command on a laptop. That works
until the file is large, the machine is missing `ffmpeg`, the API key is only on
one developer's shell, or the next teammate needs the same result and cannot
recreate the setup. A [Daytona workspace](../definitions/20240819_definition_daytona workspace.md)
turns that fragile local setup into a repeatable sandbox with the same commands,
dependencies, and validation steps every time.

This guide shows how to run [Sapat](https://github.com/nkkko/sapat) with a
DeepInfra-backed speech recognition provider. Sapat is a Python command-line
tool that extracts audio from media files and writes transcripts next to the
source file. DeepInfra hosts Whisper speech recognition models behind a simple
HTTP API, so it is a good fit when you want hosted transcription without running
GPU workloads locally.

The DeepInfra provider used here lives in a companion contribution to Sapat:
[nibzard/sapat#49](https://github.com/nibzard/sapat/pull/49). Until that pull
request is merged, use the branch in the commands below. After it lands, replace
the fork URL with the upstream Sapat repository and keep the same
`--api deepinfra` workflow.

## TL;DR

- Create a Daytona sandbox so the transcription workflow runs in a clean,
repeatable environment.
- Clone the Sapat branch that adds `--api deepinfra`.
- Install Sapat and `ffmpeg` inside the sandbox.
- Pass `DEEPINFRA_TOKEN` as an environment variable, not as committed code.
- Run `sapat media.mp4 --api deepinfra` and verify the generated transcript.

## Prerequisites

You need four things before starting:

- A Daytona account with the [Daytona CLI](https://www.daytona.io/docs/tools/cli/)
installed and authenticated.
- A DeepInfra account and API token. DeepInfra's speech recognition tutorial
lists Whisper models such as `openai/whisper-large`, `openai/whisper-medium`,
and `openai/whisper-small`.
- A media file for testing, such as `demo.mp4`, `demo.mp3`, or `demo.wav`.
- Basic comfort with shell commands inside a sandbox.

DeepInfra's native API accepts multipart audio uploads at model-specific
inference endpoints. For the default model used in this guide, the endpoint is
`https://api.deepinfra.com/v1/inference/openai/whisper-large`.
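
To make the request shape concrete, here is a minimal sketch of calling that endpoint directly with `curl`, outside Sapat. The helper names are hypothetical. The lowercase `bearer` header and the `audio` form field follow DeepInfra's published curl example, so check them against the current documentation before relying on them.

```bash
# Build the native inference URL for a model (hypothetical helper).
deepinfra_endpoint() {
  printf 'https://api.deepinfra.com/v1/inference/%s\n' "${1:-openai/whisper-large}"
}

# Upload one audio file as multipart form data and print the raw response.
# Usage: deepinfra_transcribe FILE [MODEL]
# Nothing is sent until you call the function yourself.
deepinfra_transcribe() {
  curl --fail --silent --show-error \
    -X POST \
    -H "Authorization: bearer ${DEEPINFRA_TOKEN:?set DEEPINFRA_TOKEN first}" \
    -F "audio=@$1" \
    "$(deepinfra_endpoint "${2:-}")"
}
```

Running `deepinfra_transcribe demo.mp3 > response.json` is a quick way to confirm that the token and model work before involving Sapat at all.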

## How the Workflow Fits Together

![DeepInfra Sapat transcription flow](assets/20260525_deepinfra_sapat_transcription_flow.svg)

The flow is intentionally small. Daytona provides the disposable, repeatable
runtime. Sapat handles the media file, uses `ffmpeg` when it needs to extract
audio, and sends the audio to DeepInfra. DeepInfra runs Whisper and returns
structured text. Sapat writes the transcript to a local `.txt` file so it can be
reviewed, committed, or passed into a downstream notes workflow.

This split keeps the provider logic small. Sapat does not need to know how to
operate GPUs, and Daytona does not need to know anything about speech
recognition. Each tool does one job, which makes the workflow easier to debug.

## Step 1: Create a Daytona Sandbox

Start with a named sandbox. The `--auto-stop 30` flag stops the sandbox after 30
minutes of inactivity so it does not keep running after the work is done, and
`--class small` is enough because the model inference runs on DeepInfra rather
than inside the sandbox.

```bash
daytona create --name deepinfra-sapat --class small --auto-stop 30
```

If your Daytona CLI uses the newer namespaced command form, this is equivalent:

```bash
daytona sandbox create --name deepinfra-sapat --class small --auto-stop 30
```

Now clone the Sapat branch that includes the DeepInfra provider:

```bash
daytona exec deepinfra-sapat -- bash -lc \
"git clone https://github.com/justusaugust/sapat.git && \
cd sapat && \
git checkout codex/deepinfra-provider"
```

Open an interactive shell when you want to inspect files or copy a sample media
file into the workspace:

```bash
daytona ssh deepinfra-sapat
```

## Step 2: Install Sapat and ffmpeg

Inside the sandbox, install the system packages and Python environment used by
Sapat. The provider itself uses the project's existing Python dependency stack,
so there is no extra SDK to install for DeepInfra.

```bash
cd sapat

sudo apt-get update
sudo apt-get install -y ffmpeg python3-venv

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
```

Confirm that the command-line app sees the new provider:

```bash
sapat --help | grep deepinfra
```

You should see `deepinfra` in the list of valid values for `--api`. If it is not
listed, check that you are on the `codex/deepinfra-provider` branch and that
`pip install -e .` completed successfully.
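
If you prefer one check over several manual ones, a small preflight helper can report which tools are still missing from the sandbox. This is a sketch; `missing_tools` is a hypothetical name, and the tool list in the usage comment is only what this guide relies on.

```bash
# Print each required tool that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example, run inside the activated virtual environment:
#   missing_tools git ffmpeg python3 sapat
```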

## Step 3: Configure DeepInfra Credentials

Do not commit API tokens to the repository, issue comments, screenshots, or pull
requests. Keep the token in the shell session or pass it through Daytona's
environment support when creating a sandbox.

For an interactive shell, read the token without echoing it:

```bash
read -rsp "DeepInfra token: " DEEPINFRA_TOKEN
echo
export DEEPINFRA_TOKEN
```
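
A common mistake is setting the variable without exporting it, in which case Sapat never sees it. The sketch below checks both conditions without printing the token itself; `require_deepinfra_token` is a hypothetical helper, not part of Sapat.

```bash
# Fail fast if the token is absent or not exported.
# Only the token length is printed, never the value.
require_deepinfra_token() {
  if [ -z "${DEEPINFRA_TOKEN:-}" ]; then
    echo "DEEPINFRA_TOKEN is not set in this shell" >&2
    return 1
  fi
  # A child process only sees the variable if it was exported.
  if ! sh -c '[ -n "${DEEPINFRA_TOKEN:-}" ]'; then
    echo "DEEPINFRA_TOKEN is set but not exported" >&2
    return 1
  fi
  printf 'DEEPINFRA_TOKEN is exported (%s characters)\n' "${#DEEPINFRA_TOKEN}"
}
```

Calling `require_deepinfra_token && sapat demo.mp4 --api deepinfra` keeps a missing token from turning into a confusing provider error.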

The provider uses these environment variables:

| Variable | Required | Purpose |
| --- | --- | --- |
| `DEEPINFRA_TOKEN` | Yes | Bearer token for DeepInfra requests. |
| `DEEPINFRA_MODEL` | No | Speech model, defaulting to `openai/whisper-large`. |
| `DEEPINFRA_API_ENDPOINT` | No | Full endpoint override for custom routing. |
| `DEEPINFRA_TIMEOUT` | No | Request timeout in seconds. |
| `DEEPINFRA_MODEL_NAME_CHAT` | No | Enables DeepInfra-backed transcript correction. |
| `DEEPINFRA_OPENAI_ENDPOINT` | No | Chat correction base URL, defaulting to DeepInfra's OpenAI-compatible endpoint. |

For most runs, the token is enough. To pin the model explicitly instead of
relying on the default, set `DEEPINFRA_MODEL`:

```bash
export DEEPINFRA_MODEL=openai/whisper-large
```

Use `openai/whisper-medium` or `openai/whisper-small` when you prefer faster,
lighter processing over the highest available accuracy.
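
The precedence implied by the table above can be sketched in a few lines of shell. This mirrors the documented behavior as described here (a full endpoint override wins, otherwise the URL is derived from the model) and is an assumption about the provider's logic, not a copy of it. The function name is hypothetical.

```bash
# Print the URL that a request would target, given the current environment.
resolve_deepinfra_endpoint() {
  model="${DEEPINFRA_MODEL:-openai/whisper-large}"
  printf '%s\n' "${DEEPINFRA_API_ENDPOINT:-https://api.deepinfra.com/v1/inference/$model}"
}
```

Printing the resolved URL before a run is a cheap way to catch a stale `DEEPINFRA_API_ENDPOINT` left over from an earlier experiment.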

## Step 4: Add a Test Media File

Sapat expects an audio or video file in the workspace. If you are testing with a
video, keep it small at first so you can verify the complete workflow quickly.
Copy a local file into the sandbox, or download a public test clip from a source
you are allowed to use.

For example, from your local machine:

```bash
daytona cp ./demo.mp4 deepinfra-sapat:/home/daytona/sapat/demo.mp4
```

If your CLI does not provide `daytona cp`, open the sandbox with `daytona ssh`
and place the file through your editor or Git provider workflow. The important
part is that the file ends up in the Sapat project directory.

## Step 5: Run Transcription with DeepInfra

Run Sapat with the new provider:

```bash
sapat demo.mp4 --api deepinfra --language en --quality M
```

Sapat extracts audio when necessary, sends a supported audio file to DeepInfra,
and writes the transcript next to the input file. A successful run should leave
you with a text output such as `demo.txt`.

Use a prompt when the audio contains product names, names of people, acronyms, or
domain-specific vocabulary:

```bash
sapat demo.mp4 \
--api deepinfra \
--language en \
--quality H \
--prompt "Product names: Daytona, DeepInfra, Sapat"
```

Use `--quality H` for more careful local preparation before the API call. Use
`--quality M` for ordinary review workflows where you want a good balance of
speed and output quality.
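
Because Sapat writes each transcript next to its input, a directory of media can be processed incrementally by skipping files that already have one. The sketch below assumes the transcript is named after the input with a `.txt` extension, as in the `demo.txt` example; `pending_media` is a hypothetical helper.

```bash
# List media files in a directory that have no sibling .txt transcript yet.
# Usage: pending_media [DIR]
pending_media() {
  for media in "${1:-.}"/*.mp4 "${1:-.}"/*.mp3 "${1:-.}"/*.wav; do
    [ -e "$media" ] || continue            # skip patterns that matched nothing
    [ -e "${media%.*}.txt" ] || printf '%s\n' "$media"
  done
}

# Example: transcribe only the files that are still pending.
#   pending_media . | while IFS= read -r f; do
#     sapat "$f" --api deepinfra --language en --quality M
#   done
```

Re-running the loop after a failure only retries the files that did not produce a transcript, which avoids paying for the same audio twice.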

## Step 6: Optional Transcript Correction

The companion provider also supports Sapat's correction pass through DeepInfra's
OpenAI-compatible chat endpoint. This is separate from speech recognition. The
speech model produces the transcript first, and a chat model can then clean up
obvious formatting or punctuation issues.

Set a DeepInfra chat model name before using `--correct`:

```bash
export DEEPINFRA_MODEL_NAME_CHAT=deepseek-ai/DeepSeek-V3

sapat demo.mp4 \
--api deepinfra \
--language en \
--quality H \
--correct
```

Keep correction conservative. It is useful for punctuation, casing, and obvious
formatting. It should not be treated as a source of truth for unclear speech,
technical names, or legally sensitive transcripts.
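
One way to keep the correction pass honest is to compare its output with the uncorrected transcript. The sketch below assumes you saved a copy of the raw transcript (for example `demo.raw.txt`) before running with `--correct`, and that the corrected text ends up in `demo.txt`; both file names and the helper name are assumptions for illustration.

```bash
# Print the words that differ in the corrected transcript, one per line.
# Usage: correction_changes RAW_FILE CORRECTED_FILE
correction_changes() {
  raw_words=$(mktemp) && fixed_words=$(mktemp) || return 1
  # Compare word by word, because transcripts are often a few very long lines.
  tr -s '[:space:]' '\n' < "$1" > "$raw_words"
  tr -s '[:space:]' '\n' < "$2" > "$fixed_words"
  diff "$raw_words" "$fixed_words" | sed -n 's/^> //p'
  rm -f "$raw_words" "$fixed_words"
}
```

If `correction_changes demo.raw.txt demo.txt` shows mostly punctuation and casing, the pass behaved conservatively. New or replaced content words are the ones to verify against the audio.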

## Step 7: Validate the Setup

Run the provider's mocked tests so you know the local integration is wired
correctly before spending API credits on longer files:

```bash
python -m unittest discover -s tests -v
python -m compileall src tests
```

These tests do not call DeepInfra. They verify that the provider requires a
token, builds the expected multipart request, respects endpoint overrides,
surfaces API errors, rejects unsupported input types, and appears in Sapat's CLI
routing.

Then validate a real short file:

```bash
sapat demo.mp3 --api deepinfra --language en
sed -n '1,40p' demo.txt
```

Read the first section of the transcript and check it against the audio. For
longer media, also spot-check the middle and final minute. Hosted speech models
can be very good, but they still need human review when the source audio has
overlapping speakers, background music, or uncommon vocabulary.
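
The spot check described above can be made repeatable with a small helper that prints the word count and short samples from the start, middle, and end of a transcript. It is a sketch for manual review, not an accuracy measure, and `spot_check` is a hypothetical name.

```bash
# Print the word count plus samples from three positions in a transcript.
# Usage: spot_check FILE [WORDS_PER_SAMPLE]
spot_check() {
  [ -s "$1" ] || { echo "transcript missing or empty: $1" >&2; return 1; }
  tr -s '[:space:]' '\n' < "$1" | awk -v n="${2:-12}" '
    function slice(from, count,    i, out) {
      out = ""
      for (i = from; i < from + count && i <= c; i++)
        out = out (out == "" ? "" : " ") w[i]
      return out
    }
    NF { w[++c] = $0 }              # one word per input line
    END {
      mid  = int((c - n) / 2) + 1; if (mid  < 1) mid  = 1
      last = c - n + 1;            if (last < 1) last = 1
      print "words: "  (c + 0)
      print "start: "  slice(1, n)
      print "middle: " slice(mid, n)
      print "end: "    slice(last, n)
    }'
}
```

An unexpectedly low word count for a long recording is usually the first sign that audio extraction or the upload was truncated.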

## Troubleshooting

**`DEEPINFRA_TOKEN environment variable is required`**

The provider did not receive credentials. Set and export `DEEPINFRA_TOKEN` in
the same shell session where you run Sapat, or create the Daytona sandbox with
the environment variable already defined.

**`DeepInfra transcription failed: 401`**

DeepInfra rejected the token, usually because it is invalid, expired, or copied
incorrectly. Create a fresh token in DeepInfra and avoid adding quotes or spaces
around the value.

**`DeepInfra transcription failed: 404`**

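The resolved URL does not match an existing inference route. Check
`DEEPINFRA_MODEL` for typos and unset `DEEPINFRA_API_ENDPOINT` if you no longer
need an override. The default model is `openai/whisper-large`, which maps to
DeepInfra's native inference endpoint.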

**`Unsupported audio file format`**

The provider uploads `.mp3` and `.wav` files. If your source is a video, let
Sapat extract the audio. If you already have a different audio type, convert it
with `ffmpeg`:

```bash
ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
```
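
To apply that conversion only when it is needed, the extension check can be wrapped in a helper. This sketch matches the lowercase `.mp3` and `.wav` extensions named above. Whether the provider also accepts other spellings is not covered here, and both function names are hypothetical.

```bash
# Succeed when the file already has an extension the provider uploads.
is_uploadable() {
  case "$1" in
    *.mp3|*.wav) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the path of an uploadable file, converting to 16 kHz mono WAV if needed.
# -n refuses to overwrite an existing output file.
to_wav_if_needed() {
  if is_uploadable "$1"; then
    printf '%s\n' "$1"
  else
    ffmpeg -nostdin -loglevel error -n -i "$1" -ar 16000 -ac 1 "${1%.*}.wav" &&
      printf '%s\n' "${1%.*}.wav"
  fi
}
```

For example, `sapat "$(to_wav_if_needed input.m4a)" --api deepinfra` converts the file once and passes the resulting `input.wav` to Sapat.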

**`ffmpeg: command not found`**

Install it inside the sandbox:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

**Correction fails while transcription succeeds**

Set `DEEPINFRA_MODEL_NAME_CHAT` before using `--correct`, or run without the
correction step. Speech recognition and transcript correction use different
model endpoints.

## Cleanup

When the transcript is saved and copied out, stop or delete the sandbox so it no
longer consumes resources:

```bash
daytona stop deepinfra-sapat
```

Use deletion when you no longer need the files inside the sandbox:

```bash
daytona delete deepinfra-sapat
```

If your CLI uses the namespaced command form:

```bash
daytona sandbox stop deepinfra-sapat
daytona sandbox delete deepinfra-sapat
```

## Conclusion

With Daytona, Sapat, and DeepInfra, you get a compact transcription workflow
that is easy to recreate: a sandbox for the runtime, Sapat for media handling,
and DeepInfra for hosted Whisper inference. The result is much easier to share
with teammates than a local-only script because every important step is captured
in shell commands and environment variables.

The same shape works well for content production, podcast notes, product demos,
research interviews, and internal meeting archives. Start with short files,
validate the transcript, then scale to longer media once the provider and token
configuration are proven.

## References

- [Sapat repository](https://github.com/nkkko/sapat)
- [DeepInfra Whisper speech recognition tutorial](https://docs.deepinfra.com/tutorials/whisper)
- [DeepInfra Native API documentation](https://docs.deepinfra.com/apis/deepinfra-native)
- [Daytona CLI documentation](https://www.daytona.io/docs/tools/cli/)
- [Companion DeepInfra provider pull request](https://github.com/nibzard/sapat/pull/49)