Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions authors/abridegan-bit.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
Author: abridegan-bit Title: Open Source Contributor Description: abridegan-bit
works on practical developer tooling, automation, and documentation workflows
that make open source projects easier to test, review, and reproduce across
clean environments. Author Image:
![abridegan-bit](https://avatars.githubusercontent.com/u/279021788?v=4) Author
LinkedIn: Author Twitter: Company Name: Independent Company Description:
Independent open source contributor focused on developer experience and
workflow automation.
23 changes: 23 additions & 0 deletions definitions/20260526_definition_stt_ai.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
---
title: 'STT.ai'
description: 'A speech-to-text API for transcribing audio and video files with REST endpoints, diarization, and multiple output formats.'
date: 2026-05-26
author: 'abridegan-bit'
---

# STT.ai

## Definition

STT.ai is a speech-to-text service that exposes REST and WebSocket APIs for
transcribing audio and video files. Its REST API accepts multipart file uploads
and can return transcript text, JSON segments, subtitles, speaker labels, and
word-level timing information depending on the request options.

## Context and Usage

Developers can use STT.ai when they need a transcription API that works with
common media formats and can be called from standard HTTP clients. In a Daytona
workspace, it can be wired into command-line tools such as Sapat so audio
processing, API configuration, and transcript validation happen in a repeatable
development environment.
Original file line number Diff line number Diff line change
@@ -0,0 +1,275 @@
---
title: "Run STT.ai transcription with Sapat in Daytona"
description: "Build a reproducible Daytona workspace for Sapat, add the STT.ai provider, and transcribe video or audio files with one CLI."
date: 2026-05-26
author: "abridegan-bit"
tags: ["daytona", "speech-to-text", "python"]
---

# Run STT.ai transcription with Sapat in Daytona

# Introduction

Audio and video transcripts are easiest to trust when the workflow is
repeatable. A local laptop can hide package versions, missing FFmpeg binaries,
and one-off environment variables. A
[Daytona workspace](../definitions/20240819_definition_daytona%20workspace.md)
gives the transcription job a clean development environment that can be
recreated when a teammate, reviewer, or future you needs to run it again.

This guide shows how to run [Sapat](https://github.com/nkkko/sapat), a small
Python CLI for converting video files to MP3 and sending them to a
speech-to-text provider, with a new STT.ai provider implementation. STT.ai is
useful for this workflow because its REST API accepts multipart audio or video
uploads, supports common media formats, and can return JSON, plain text, SRT,
or VTT output from a single `POST /v1/transcribe` endpoint.

The companion provider implementation is available in [nibzard/sapat#51](https://github.com/nibzard/sapat/pull/51). It adds `--api sttai`, documents the `STTAI_*` environment variables, and includes mocked tests so the request payload can be reviewed without a live API key.

![STT.ai Sapat workflow](assets/20260526_stt_ai_sapat_daytona_workflow.svg)

## Why this setup works well

Transcription jobs are deceptively stateful. The input format, conversion
quality, model choice, speaker settings, API key, and output format all affect
the final transcript. Keeping those choices inside a Daytona workspace turns
the job into a reproducible recipe instead of a memory exercise. Reviewers can
see the provider code, run the same tests, and decide whether a transcript
issue comes from media quality, local conversion, or the remote transcription
model.

That separation also makes it safer to experiment. You can start with the
anonymous STT.ai path for a tiny sample, move to a real API key when you need
larger files, and keep every secret outside the repository. If another provider
does better on an accent or noisy recording, Sapat still gives you the same CLI
surface for comparison.

## TL;DR

- **Use Daytona for repeatability**: Open Sapat in a clean workspace instead of relying on local machine state.
- **Use the STT.ai provider**: Select it with `--api sttai` and configure it with `STTAI_API_KEY`, `STTAI_MODEL`, and related environment variables.
- **Keep secrets out of Git**: Store API keys in `.env` or Daytona workspace environment settings, never in commits.
- **Validate before processing real media**: Confirm the CLI sees `sttai`, run the mocked tests, then transcribe a short sample file.

## Prerequisites

You will need:

- A GitHub account and access to fork or clone the Sapat repository.
- Daytona installed locally. Follow the [Daytona installation guide](https://www.daytona.io/docs/installation/installation/) if you do not already have the CLI.
- Docker or another runtime supported by your Daytona setup.
- An STT.ai API key for regular use. STT.ai documents limited anonymous transcription, but a key is better for repeatable work.
- A short test video or audio file, preferably under one minute while you are validating the workflow.

The commands below assume a Unix-like shell inside the workspace. The same project structure works on any host that Daytona supports because the important dependencies live inside the workspace.

## Step 1: Open Sapat in Daytona

Create a Daytona workspace from your fork or from the upstream repository:

```bash
daytona create https://github.com/nkkko/sapat --code
```

When the workspace opens, create a branch for the STT.ai provider work if you are applying the companion PR locally:

```bash
git checkout -b sttai-provider
```

If you want to test the submitted provider directly before it is merged, fetch the PR branch:

```bash
git fetch origin pull/51/head:sttai-provider
git checkout sttai-provider
```

This keeps the experiment isolated. If you later decide to compare the STT.ai path with OpenAI, Groq, or Azure, you can switch branches without rebuilding your host machine.

## Step 2: Install the project dependencies

Sapat is a Python package. Create a virtual environment inside the workspace and install the project in editable mode:

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
```

Confirm that the CLI is available:

```bash
sapat --help
```

The `--api` option should include `sttai`:

```text
--api [openai|groq|azure|sttai]
```

If you do not see `sttai`, check that you are on the branch that includes the provider implementation.

## Step 3: Install FFmpeg in the workspace

Sapat converts media to MP3 before transcription. The conversion step depends on FFmpeg:

```bash
ffmpeg -version
```

If the command is missing in your workspace, install it through your workspace image or package manager. On Debian or Ubuntu-based images:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

For a long-lived team workflow, put the FFmpeg installation in your dev container or workspace setup so every contributor gets the same runtime automatically.

## Step 4: Configure STT.ai

Create a local `.env` file. Keep it untracked:

```bash
cat > .env <<'EOF'
STTAI_API_KEY=replace-with-your-key
STTAI_MODEL=large-v3-turbo
STTAI_API_ENDPOINT=https://api.stt.ai/v1/transcribe
STTAI_DIARIZE=true
STTAI_SPEAKERS=0
STTAI_RESPONSE_FORMAT=json
EOF
```

The provider reads these values:

| Variable | Purpose | Default |
| --- | --- | --- |
| `STTAI_API_KEY` | Optional bearer token for STT.ai | Empty |
| `STTAI_MODEL` | Transcription model sent to the API | `large-v3-turbo` |
| `STTAI_API_ENDPOINT` | STT.ai transcription endpoint | `https://api.stt.ai/v1/transcribe` |
| `STTAI_DIARIZE` | Enables speaker diarization | `true` |
| `STTAI_SPEAKERS` | Expected speaker count, where `0` means auto | `0` |
| `STTAI_RESPONSE_FORMAT` | API output format | `json` |
| `STTAI_MAX_FILE_SIZE_MB` | Local file-size guard before upload | `500` |

STT.ai documents `json`, `txt`, `srt`, and `vtt` response formats. Keep `json` while testing because Sapat can write the returned `text` field directly to the output `.txt` file.

## Step 5: Run the mocked validation

Before using an API key, run the unit tests for the provider:

```bash
python -m unittest discover -s tests -v
```

The STT.ai tests check that Sapat:

- posts the media file as multipart form data;
- sends model, language, diarization, speaker, and response-format options;
- includes the bearer token only when `STTAI_API_KEY` is present;
- rejects missing input files before making a network request.

These tests do not call STT.ai. They are meant to make code review deterministic and safe.

## Step 6: Transcribe a short sample

Place a short video or audio file in the workspace, then run:

```bash
sapat ./samples/demo.mp4 --api sttai --language auto --quality M
```

Sapat will:

1. Convert `demo.mp4` to `demo.mp3` with FFmpeg.
2. Upload `demo.mp3` to STT.ai through `POST /v1/transcribe`.
3. Save the returned transcript to `demo.txt`.
4. Remove the temporary MP3 file.

For a single-speaker English clip, you can make the request more specific:

```bash
STTAI_DIARIZE=false STTAI_SPEAKERS=1 sapat ./samples/demo.mp4 --api sttai --language en --quality H
```

For a folder of videos:

```bash
sapat ./recordings --api sttai --language auto --quality M
```

Sapat currently scans directories for `.mp4` files. If your source files are already audio, run them one by one or extend the directory scan in a separate patch.

## Step 7: Review the output

Open the generated `.txt` file:

```bash
sed -n '1,120p' ./samples/demo.txt
```

Check for:

- speaker labels if diarization is enabled;
- obvious language-detection mistakes;
- domain-specific words that may need a follow-up correction pass;
- timestamps or subtitle output if you decide to switch `STTAI_RESPONSE_FORMAT` to `srt` or `vtt`.

If you need subtitle files, set the response format before running Sapat:

```bash
STTAI_RESPONSE_FORMAT=srt sapat ./samples/demo.mp4 --api sttai --language auto
```

In that mode, the provider returns text instead of JSON. Sapat will still save it to the `.txt` path, so rename the output to `.srt` after the run or add a future improvement that chooses the extension from the response format.

## Troubleshooting

**Problem:** `sapat --help` does not show `sttai`.

**Solution:** Check out the branch that contains the STT.ai provider and reinstall the package in editable mode:

```bash
git checkout sttai-provider
python -m pip install -e .
```

**Problem:** The command fails before uploading.

**Solution:** Confirm the file exists, the extension is supported, and FFmpeg can read it:

```bash
ffmpeg -i ./samples/demo.mp4 -f null -
```

**Problem:** STT.ai returns `401 Unauthorized`.

**Solution:** Verify `STTAI_API_KEY` in `.env` and make sure the workspace shell has loaded it. If you use Daytona-managed environment variables, restart the workspace shell after changing them.

**Problem:** STT.ai returns `402 Payment Required` or `429 Too Many Requests`.

**Solution:** Use a shorter sample while testing, check your account credits, or wait for the rate-limit window to reset. For CI and PR review, rely on the mocked unit tests instead of live transcription.

**Problem:** The transcript is empty or in the wrong language.

**Solution:** Try `--language auto` first, then specify an ISO language code such as `en`, `es`, or `fr`. Also test with a higher-quality MP3 conversion:

```bash
sapat ./samples/demo.mp4 --api sttai --language auto --quality H
```

## Conclusion

You now have a reproducible Daytona workflow for running Sapat with STT.ai. The workspace holds the Python environment, FFmpeg dependency, provider code, and validation commands in one place, while secrets stay in `.env` or Daytona environment settings.

The main benefit is reviewability. A teammate can fetch the same Sapat branch, run the mocked tests, inspect the exact `POST /v1/transcribe` request shape, and then try a short real sample with their own STT.ai key. That makes the transcription pipeline easier to debug than a one-off local script.

## References

- [Sapat repository](https://github.com/nkkko/sapat)
- [Companion STT.ai provider PR](https://github.com/nibzard/sapat/pull/51)
- [STT.ai API documentation](https://stt.ai/api/)
- [Daytona installation guide](https://www.daytona.io/docs/installation/installation/)
36 changes: 36 additions & 0 deletions guides/assets/20260526_stt_ai_sapat_daytona_workflow.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.