daytonaio · justusaugust · May 25, 2026
diff --git a/authors/justus_august.md b/authors/justus_august.md
@@ -0,0 +1,6 @@
+Author: Justus August Title: Software Engineer Description: Justus August is a
+software engineer focused on practical AI workflows, developer tooling, and
+reproducible automation. He writes guides that turn integration details into
+clear, testable steps for builders working with modern cloud and AI services.
+Author Image: Author LinkedIn: Author Twitter: Company Name: Company
+Description: Company Logo Dark: Company Logo White:
diff --git a/definitions/20260525_definition_deepinfra_speech_recognition.md b/definitions/20260525_definition_deepinfra_speech_recognition.md
@@ -0,0 +1,22 @@
+---
+title: 'DeepInfra Speech Recognition'
+description: 'DeepInfra Speech Recognition runs Whisper-style audio transcription through hosted inference endpoints.'
+date: 2026-05-25
+author: 'Justus August'
+---
+
+# DeepInfra Speech Recognition
+
+## Definition
+
+DeepInfra Speech Recognition is the use of DeepInfra-hosted automatic speech
+recognition models, including Whisper variants, to transcribe uploaded audio
+files into text through an API endpoint.
+
+## Context and Usage
+
+Developers use DeepInfra Speech Recognition when they want hosted transcription
+without operating speech models or GPU infrastructure themselves. In a Daytona
+workspace, it can be combined with command-line tools such as Sapat so media
+preparation, API calls, validation, and transcript review happen in a repeatable
+development environment.
diff --git a/guides/20260525_run_deepinfra_transcription_with_sapat_in_daytona.md b/guides/20260525_run_deepinfra_transcription_with_sapat_in_daytona.md
@@ -0,0 +1,346 @@
+---
+title: 'Run DeepInfra Transcription with Sapat'
+description:
+  'Use Daytona, Sapat, and DeepInfra-hosted Whisper to turn audio or video files
+  into reproducible transcripts.'
+date: 2026-05-25
+author: 'Justus August'
+tags: ['daytona', 'sapat', 'deepinfra', 'speech-to-text']
+---
+
+# Run DeepInfra Transcription with Sapat
+
+## Introduction
+
+Transcription scripts usually start as a one-off command on a laptop. That works
+until the file is large, the machine is missing `ffmpeg`, the API key is only on
+one developer's shell, or the next teammate needs the same result and cannot
+recreate the setup. A [Daytona workspace](../definitions/20240819_definition_daytona%20workspace.md)
+turns that fragile local setup into a repeatable sandbox with the same commands,
+dependencies, and validation steps every time.
+
+This guide shows how to run [Sapat](https://github.com/nkkko/sapat) with a
+DeepInfra-backed speech recognition provider. Sapat is a Python command-line
+tool that extracts audio from media files and writes transcripts next to the
+source file. DeepInfra hosts Whisper speech recognition models behind a simple
+HTTP API, so it is a good fit when you want hosted transcription without running
+GPU workloads locally.
+
+The DeepInfra provider used here lives in a companion contribution to Sapat:
+[nibzard/sapat#49](https://github.com/nibzard/sapat/pull/49). Until that pull
+request is merged, use the branch in the commands below. After it lands, replace
+the fork URL with the upstream Sapat repository and keep the same
+`--api deepinfra` workflow.
+
+## TL;DR
+
+- Create a Daytona sandbox so the transcription workflow runs in a clean,
+  repeatable environment.
+- Clone the Sapat branch that adds `--api deepinfra`.
+- Install Sapat and `ffmpeg` inside the sandbox.
+- Pass `DEEPINFRA_TOKEN` as an environment variable, not as committed code.
+- Run `sapat media.mp4 --api deepinfra` and verify the generated transcript.
+
+## Prerequisites
+
+You need four things before starting:
+
+- A Daytona account with the [Daytona CLI](https://www.daytona.io/docs/tools/cli/)
+  installed and authenticated.
+- A DeepInfra account and API token. DeepInfra's speech recognition tutorial
+  lists Whisper models such as `openai/whisper-large`, `openai/whisper-medium`,
+  and `openai/whisper-small`.
+- A media file for testing, such as `demo.mp4`, `demo.mp3`, or `demo.wav`.
+- Basic comfort with shell commands inside a sandbox.
+
+DeepInfra's native API accepts multipart audio uploads at model-specific
+inference endpoints. For the default model used in this guide, the endpoint is
+`https://api.deepinfra.com/v1/inference/openai/whisper-large`.
+
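To make the endpoint shape concrete, here is a minimal sketch of the same request made directly with `curl`. The helper names are hypothetical, and the multipart field name `audio` is an assumption taken from DeepInfra's published curl examples rather than from the Sapat provider, so check it against the current API documentation before relying on it.

```bash
# Build the native inference URL for a model (default matches this guide).
deepinfra_endpoint() {
  printf 'https://api.deepinfra.com/v1/inference/%s\n' "${1:-openai/whisper-large}"
}

# Upload one audio file as multipart form data and print the raw JSON reply.
# Assumption: the form field is named "audio".
deepinfra_transcribe() {
  if [ -z "${DEEPINFRA_TOKEN:-}" ]; then
    echo "DEEPINFRA_TOKEN is not set" >&2
    return 1
  fi
  curl --fail --silent --show-error \
    -X POST \
    -H "Authorization: bearer ${DEEPINFRA_TOKEN}" \
    -F "audio=@$1" \
    "$(deepinfra_endpoint "${2:-}")"
}
```

Calling `deepinfra_transcribe demo.mp3` would post the file to the default model, and a second argument such as `openai/whisper-small` selects another model. Sapat wraps this kind of request for you, so the sketch is mainly useful for debugging a token or endpoint in isolation.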
+## How the Workflow Fits Together
+
+![DeepInfra Sapat transcription flow](assets/20260525_deepinfra_sapat_transcription_flow.svg)
+
+The flow is intentionally small. Daytona provides the disposable, repeatable
+runtime. Sapat handles the media file, uses `ffmpeg` when it needs to extract
+audio, and sends the audio to DeepInfra. DeepInfra runs Whisper and returns
+structured text. Sapat writes the transcript to a local `.txt` file so it can be
+reviewed, committed, or passed into a downstream notes workflow.
+
+This split keeps the provider logic small. Sapat does not need to know how to
+operate GPUs, and Daytona does not need to know anything about speech
+recognition. Each tool does one job, which makes the workflow easier to debug.
+
+## Step 1: Create a Daytona Sandbox
+
+Start with a named sandbox. The `--auto-stop 30` setting stops the sandbox after
+30 minutes of inactivity so it does not keep running once the work is done, and
+`--class small` is enough because the model inference runs on DeepInfra rather
+than inside the sandbox.
+
+```bash
+daytona create --name deepinfra-sapat --class small --auto-stop 30
+```
+
+If your Daytona CLI uses the newer namespaced command form, this is equivalent:
+
+```bash
+daytona sandbox create --name deepinfra-sapat --class small --auto-stop 30
+```
+
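If you script this step, a small wrapper can pick whichever of the two command forms the installed CLI accepts. This is a sketch under one stated assumption: a CLI that has the namespaced form exits with status 0 for `daytona sandbox --help`, and one that lacks it exits non-zero. The function name `create_sandbox` is hypothetical.

```bash
# Create a sandbox with whichever command form the installed CLI accepts.
# Assumption: `daytona sandbox --help` succeeds only when the namespaced
# form exists in this CLI version.
create_sandbox() (
  name=$1
  if daytona sandbox --help >/dev/null 2>&1; then
    daytona sandbox create --name "$name" --class small --auto-stop 30
  else
    daytona create --name "$name" --class small --auto-stop 30
  fi
)
```

Run it as `create_sandbox deepinfra-sapat`. If the detection assumption does not hold for your CLI version, run the matching command from above by hand instead.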
+Now clone the Sapat branch that includes the DeepInfra provider:
+
+```bash
+daytona exec deepinfra-sapat -- bash -lc \
+  "git clone https://github.com/justusaugust/sapat.git && \
+  cd sapat && \
+  git checkout codex/deepinfra-provider"
+```
+
+Open an interactive shell when you want to inspect files or copy a sample media
+file into the workspace:
+
+```bash
+daytona ssh deepinfra-sapat
+```
+
+## Step 2: Install Sapat and ffmpeg
+
+Inside the sandbox, install the system packages and Python environment used by
+Sapat. The provider itself uses the project's existing Python dependency stack,
+so there is no extra SDK to install for DeepInfra.
+
+```bash
+cd sapat
+
+sudo apt-get update
+sudo apt-get install -y ffmpeg python3-venv
+
+python3 -m venv .venv
+source .venv/bin/activate
+pip install --upgrade pip
+pip install -e .
+```
+
+Confirm that the command-line app sees the new provider:
+
+```bash
+sapat --help | grep deepinfra
+```
+
+You should see `deepinfra` in the list of valid values for `--api`. If it is not
+listed, check that you are on the `codex/deepinfra-provider` branch and that
+`pip install -e .` completed successfully.
+
+## Step 3: Configure DeepInfra Credentials
+
+Do not commit API tokens to the repository, issue comments, screenshots, or pull
+requests. Keep the token in the shell session or pass it through Daytona's
+environment support when creating a sandbox.
+
+For an interactive shell, read the token without echoing it:
+
+```bash
+read -rsp "DeepInfra token: " DEEPINFRA_TOKEN
+echo
+export DEEPINFRA_TOKEN
+```
+
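Before running anything that spends API credits, it can help to confirm that the token actually reached the current shell. The guard below is a minimal sketch with a hypothetical function name. It reports only the token length, so the value never lands in logs or scrollback.

```bash
# Fail early if the token is absent, without ever printing the token itself.
require_deepinfra_token() {
  if [ -z "${DEEPINFRA_TOKEN:-}" ]; then
    echo "DEEPINFRA_TOKEN is not set in this shell" >&2
    return 1
  fi
  # Report only the length, never the value.
  echo "DEEPINFRA_TOKEN is set (${#DEEPINFRA_TOKEN} characters)"
}
```

A typical use is `require_deepinfra_token && sapat demo.mp4 --api deepinfra`, which skips the Sapat call when the token is missing.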
+The provider uses these environment variables:
+
+| Variable | Required | Purpose |
+| --- | --- | --- |
+| `DEEPINFRA_TOKEN` | Yes | Bearer token for DeepInfra requests. |
+| `DEEPINFRA_MODEL` | No | Speech model, defaulting to `openai/whisper-large`. |
+| `DEEPINFRA_API_ENDPOINT` | No | Full endpoint override for custom routing. |
+| `DEEPINFRA_TIMEOUT` | No | Request timeout in seconds. |
+| `DEEPINFRA_MODEL_NAME_CHAT` | No | Enables DeepInfra-backed transcript correction. |
+| `DEEPINFRA_OPENAI_ENDPOINT` | No | Chat correction base URL, defaulting to DeepInfra's OpenAI-compatible endpoint. |
+
+For most runs, the token is enough. To pin the speech model explicitly instead of
+relying on the default, export it as well:
+
+```bash
+export DEEPINFRA_MODEL=openai/whisper-large
+```
+
+Use `openai/whisper-medium` or `openai/whisper-small` when you prefer faster,
+lighter processing over the highest available accuracy.
+
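To see which settings a run would pick up, you can print the effective configuration with the token masked. This sketch mirrors the table above. The precedence it shows, where `DEEPINFRA_API_ENDPOINT` wins over the URL derived from `DEEPINFRA_MODEL`, is an assumption about the provider based on the "full endpoint override" description, and `deepinfra_show_config` is a hypothetical helper.

```bash
# Print the effective DeepInfra settings for this shell, masking the token.
# Assumption: an endpoint override takes precedence over the model-derived URL.
deepinfra_show_config() (
  model=${DEEPINFRA_MODEL:-openai/whisper-large}
  endpoint=${DEEPINFRA_API_ENDPOINT:-https://api.deepinfra.com/v1/inference/$model}
  if [ -n "${DEEPINFRA_TOKEN:-}" ]; then token=set; else token=missing; fi
  printf 'model=%s\n' "$model"
  printf 'endpoint=%s\n' "$endpoint"
  printf 'timeout=%s\n' "${DEEPINFRA_TIMEOUT:-provider default}"
  printf 'token=%s\n' "$token"
)
```

Running `deepinfra_show_config` before a long job makes a stale override or a forgotten model switch easy to spot.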
+## Step 4: Add a Test Media File
+
+Sapat expects an audio or video file in the workspace. If you are testing with a
+video, keep it small at first so you can verify the complete workflow quickly.
+Copy a local file into the sandbox, or download a public test clip from a source
+you are allowed to use.
+
+For example, from your local machine:
+
+```bash
+daytona cp ./demo.mp4 deepinfra-sapat:/home/daytona/sapat/demo.mp4
+```
+
+If your CLI does not provide `daytona cp`, open the sandbox with `daytona ssh`
+and place the file through your editor or Git provider workflow. The important
+part is that the file ends up in the Sapat project directory.
+
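Once the file is in place, a quick sanity check can catch an empty or unexpectedly large copy before you run Sapat. The sketch below uses a hypothetical helper, and the 25 MiB default ceiling is an arbitrary first-run value chosen for this example, not a DeepInfra or Sapat limit.

```bash
# Check that a media file exists, is non-empty, and stays under a size ceiling.
check_media() (
  file=$1
  max_bytes=${2:-26214400}  # 25 MiB, an arbitrary ceiling for a first test run
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 1
  fi
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -eq 0 ]; then
    echo "empty: $file" >&2
    return 1
  fi
  if [ "$size" -gt "$max_bytes" ]; then
    echo "too large for a first run: $file ($size bytes)" >&2
    return 1
  fi
  echo "ok: $file ($size bytes)"
)
```

For example, `check_media demo.mp4` uses the default ceiling, while `check_media demo.mp4 5000000` tightens it to about 5 MB.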
+## Step 5: Run Transcription with DeepInfra
+
+Run Sapat with the new provider:
+
+```bash
+sapat demo.mp4 --api deepinfra --language en --quality M
+```
+
+Sapat extracts audio when necessary, sends a supported audio file to DeepInfra,
+and writes the transcript next to the input file. A successful run should leave
+you with a text output such as `demo.txt`.
+
+Use a prompt when the audio contains product names, names of people, acronyms, or
+domain-specific vocabulary:
+
+```bash
+sapat demo.mp4 \
+  --api deepinfra \
+  --language en \
+  --quality H \
+  --prompt "Product names: Daytona, DeepInfra, Sapat"
+```
+
+Use `--quality H` for more careful local preparation before the API call. Use
+`--quality M` for ordinary review workflows where you want a good balance of
+speed and output quality.
+
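When the run is part of a script, it is worth confirming that a non-empty transcript actually appeared. The wrapper below is a sketch: it assumes the transcript takes the media file's name with a `.txt` extension, as described above, and `SAPAT_BIN` is a hypothetical override introduced here only so the wrapper can be dry-run without Sapat installed.

```bash
# Run Sapat on one file, then confirm a non-empty transcript sits next to it.
# SAPAT_BIN is a hypothetical override for dry runs; it is not a Sapat setting.
transcribe_and_check() (
  media=$1
  transcript="${media%.*}.txt"
  "${SAPAT_BIN:-sapat}" "$media" --api deepinfra --language en --quality M || return 1
  if [ ! -s "$transcript" ]; then
    echo "no transcript found at $transcript" >&2
    return 1
  fi
  echo "$transcript"
)
```

`transcribe_and_check demo.mp4` prints the transcript path on success, which makes it easy to pass the result to a later step.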
+## Step 6: Optional Transcript Correction
+
+The companion provider also supports Sapat's correction pass through DeepInfra's
+OpenAI-compatible chat endpoint. This is separate from speech recognition. The
+speech model produces the transcript first, and a chat model can then clean up
+obvious formatting or punctuation issues.
+
+Set a DeepInfra chat model name before using `--correct`:
+
+```bash
+export DEEPINFRA_MODEL_NAME_CHAT=deepseek-ai/DeepSeek-V3
+
+sapat demo.mp4 \
+  --api deepinfra \
+  --language en \
+  --quality H \
+  --correct
+```
+
+Keep correction conservative. It is useful for punctuation, casing, and obvious
+formatting. It should not be treated as a source of truth for unclear speech,
+technical names, or legally sensitive transcripts.
+
+## Step 7: Validate the Setup
+
+Run the provider's mocked tests so you know the local integration is wired
+correctly before spending API credits on longer files:
+
+```bash
+python -m unittest discover -s tests -v
+python -m compileall src tests
+```
+
+These tests do not call DeepInfra. They verify that the provider requires a
+token, builds the expected multipart request, respects endpoint overrides,
+surfaces API errors, rejects unsupported input types, and appears in Sapat's CLI
+routing.
+
+Then validate a real short file:
+
+```bash
+sapat demo.mp3 --api deepinfra --language en
+sed -n '1,40p' demo.txt
+```
+
+Read the first section of the transcript and check it against the audio. For
+longer media, also spot-check the middle and final minute. Hosted speech models
+can be very good, but they still need human review when the source audio has
+overlapping speakers, background music, or uncommon vocabulary.
+
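The spot check described above can be scripted so that the start, middle, and end of a transcript print in one pass. This sketch assumes the transcript contains line breaks, and the helper name `spot_check` is hypothetical. A transcript written as one long line would need a character-based variant instead.

```bash
# Print the first, middle, and last few lines of a transcript for review.
spot_check() (
  file=$1
  window=${2:-5}
  total=$(wc -l < "$file" | tr -d ' ')
  mid=$(( total / 2 ))
  [ "$mid" -ge 1 ] || mid=1  # sed line addresses start at 1
  echo "== start =="
  sed -n "1,${window}p" "$file"
  echo "== middle =="
  sed -n "${mid},$(( mid + window - 1 ))p" "$file"
  echo "== end =="
  tail -n "$window" "$file"
)
```

`spot_check demo.txt 10` prints three ten-line windows. On short transcripts the windows overlap, which is harmless for a manual review.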
+## Troubleshooting
+
+**`DEEPINFRA_TOKEN environment variable is required`**
+
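+The provider did not receive credentials. Set and export `DEEPINFRA_TOKEN` in the
+same shell session where you run Sapat, or create the Daytona sandbox with the
+environment variable already defined. Running `export DEEPINFRA_TOKEN` alone only
+helps if the variable already holds a value in that shell.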
+
+**`DeepInfra transcription failed: 401`**
+
+DeepInfra rejected the token. It is invalid, expired, or was copied incorrectly.
+Create a fresh token in DeepInfra and avoid adding quotes or spaces around the
+value.
+
+**`DeepInfra transcription failed: 404`**
+
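+A 404 usually means the model name is misspelled or the endpoint override points
+at a path that does not exist. Check `DEEPINFRA_MODEL` and
+`DEEPINFRA_API_ENDPOINT`. The default model is `openai/whisper-large`, which maps
+to `https://api.deepinfra.com/v1/inference/openai/whisper-large`.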
+
+**`Unsupported audio file format`**
+
+The provider uploads `.mp3` and `.wav` files. If your source is a video, let
+Sapat extract the audio. If you already have a different audio type, convert it
+with `ffmpeg`:
+
+```bash
+ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
+```
+
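If you convert files often, the same rule can be wrapped in a helper that leaves `.mp3` and `.wav` inputs alone and converts everything else. This is a sketch with a hypothetical function name. It matches lowercase extensions only and reuses the 16 kHz mono settings from the command above.

```bash
# Print a path the provider can upload: the input itself for .mp3 or .wav,
# otherwise a 16 kHz mono .wav produced with ffmpeg next to the input.
to_uploadable() (
  input=$1
  case "$input" in
    *.mp3|*.wav)
      echo "$input"
      ;;
    *)
      output="${input%.*}.wav"
      ffmpeg -nostdin -loglevel error -y -i "$input" -ar 16000 -ac 1 "$output" || return 1
      echo "$output"
      ;;
  esac
)
```

For example, `sapat "$(to_uploadable interview.m4a)" --api deepinfra` converts the file first and then transcribes the resulting `.wav`.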
+**`ffmpeg: command not found`**
+
+Install it inside the sandbox:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg
+```
+
+**Correction fails while transcription succeeds**
+
+Set `DEEPINFRA_MODEL_NAME_CHAT` before using `--correct`, or run without the
+correction step. Speech recognition and transcript correction use different
+model endpoints.
+
+## Cleanup
+
+When the transcript is saved and copied out, stop or delete the sandbox so it no
+longer consumes resources:
+
+```bash
+daytona stop deepinfra-sapat
+```
+
+Use deletion when you no longer need the files inside the sandbox:
+
+```bash
+daytona delete deepinfra-sapat
+```
+
+If your CLI uses the namespaced command form:
+
+```bash
+daytona sandbox stop deepinfra-sapat
+daytona sandbox delete deepinfra-sapat
+```
+
+## Conclusion
+
+With Daytona, Sapat, and DeepInfra, you get a compact transcription workflow
+that is easy to recreate: a sandbox for the runtime, Sapat for media handling,
+and DeepInfra for hosted Whisper inference. The result is much easier to share
+with teammates than a local-only script because every important step is captured
+in shell commands and environment variables.
+
+The same shape works well for content production, podcast notes, product demos,
+research interviews, and internal meeting archives. Start with short files,
+validate the transcript, then scale to longer media once the provider and token
+configuration are proven.
+
+## References
+
+- [Sapat repository](https://github.com/nkkko/sapat)
+- [DeepInfra Whisper speech recognition tutorial](https://docs.deepinfra.com/tutorials/whisper)
+- [DeepInfra Native API documentation](https://docs.deepinfra.com/apis/deepinfra-native)
+- [Daytona CLI documentation](https://www.daytona.io/docs/tools/cli/)
+- [Companion DeepInfra provider pull request](https://github.com/nibzard/sapat/pull/49)