7 changes: 7 additions & 0 deletions authors/codeaustral-oss.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
Author: CodeAustral OSS
Title: OSS Engineering Studio
Description: CodeAustral OSS builds small, reviewable open-source contributions with a focus on maintainer-friendly pull requests, reproducible verification, and practical developer tooling for modern engineering teams.
Author Image:
Author LinkedIn:
Author Twitter:
Company Name: CodeAustral LLC
Company Description: Maintainer-friendly software engineering, developer tooling, and applied OSS workflows.
@@ -0,0 +1,25 @@
---
title: 'Mixed-Language Transcription Workflow'
description: 'A repeatable process for transcribing recordings that contain more than one spoken language.'
date: 2026-05-20
author: 'CodeAustral OSS'
---

# Mixed-Language Transcription Workflow

## Definition

A mixed-language transcription workflow is a repeatable process for preparing,
transcribing, reviewing, and storing audio or video recordings that contain more
than one spoken language. It combines clean input handling, provider selection,
language-aware prompts, transcript review, and handoff artifacts so the final
text can be trusted by engineering, support, research, or content teams.

## Context and Usage

Mixed-language workflows are useful when product demos, interviews, customer
calls, incident reviews, or community recordings switch between languages. A
good workflow keeps API keys out of source control, runs the same commands in a
reproducible development environment, records which provider and settings were
used, and includes a human review pass for names, acronyms, timestamps, and
domain-specific terms.
313 changes: 313 additions & 0 deletions guides/20260520_guide_soniox_sapat_mixed_language_transcription.md
@@ -0,0 +1,313 @@
---
title: "Run Soniox transcription with Sapat in Daytona"
description: "Build a reproducible Daytona workspace for mixed-language transcription using Sapat and Soniox Speech-to-Text."
date: 2026-05-20
author: "CodeAustral OSS"
tags: ["daytona", "python", "transcription", "soniox"]
---

# Run Soniox transcription with Sapat in Daytona

## Introduction

Transcription gets messy when a recording is not a neat single-language demo.
Customer calls, conference hallway interviews, product walkthroughs, and
community recordings often move between English and another language, mention
product names quickly, and include acronyms that a generic transcript can
distort. A useful workflow should make the environment reproducible, keep API
keys private, and leave a short trail showing how the transcript was produced.

This guide shows how to run Sapat, a small Python video transcription tool, in a
[Daytona workspace](../definitions/20240819_definition_daytona%20workspace.md)
with a Soniox Speech-to-Text provider. The companion Sapat implementation adds a
`--api soniox` option that uses the official Soniox Python SDK, sends the MP3
created by Sapat to an asynchronous Soniox transcription job, waits for the
result, writes the text file, and cleans up the remote transcription by default.

![Soniox and Sapat transcription workflow](assets/20260520_soniox_sapat_mixed_language_workflow.svg)

## TL;DR

- Use Daytona to create a clean Python workspace for Sapat.
- Configure Soniox through environment variables, not committed files.
- Run Sapat with `--api soniox` after the provider branch is installed.
- Keep a simple run manifest so transcripts are auditable later.
- Review mixed-language transcripts for names, acronyms, and speaker context.

## Prerequisites

You will need:

- Daytona installed and connected to your preferred IDE.
- Python 3.10 or later in the workspace.
- `ffmpeg`, because Sapat converts video files to MP3 before transcription.
- A Soniox account and project API key.
- A short `.mp4`, `.m4a`, `.wav`, or `.mp3` recording you are allowed to
process.
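
The local prerequisites above can be checked before any recording is touched. The following is a minimal sketch, not part of Sapat: the function name `check_prerequisites` is hypothetical, and it only verifies the Python version and that `ffmpeg` is on `PATH`. It cannot confirm the Daytona setup or the Soniox account.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


# Print one line per problem; prints nothing when everything is in place.
for problem in check_prerequisites():
    print(f"missing: {problem}")
```

Running this inside the workspace before Step 1 gives a quick signal about whether the MP3 conversion step is likely to work.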

Do not commit recordings, transcripts with private information, or `.env` files.
Use a throwaway sample recording when you are testing the flow for the first
time.

## Step 1: Create the Daytona workspace

Start from the Sapat repository. If the Soniox provider has already been merged
upstream, create the workspace from the main repository:

```bash
daytona create https://github.com/nkkko/sapat --code
```

While the provider pull request is under review, you can test the same workflow
from the companion branch:

```bash
daytona create https://github.com/codeaustral-oss/sapat --code
# Run the next command inside the new workspace, from the repository root:
git switch codeaustral/soniox-provider
```

Inside the workspace, install Sapat in editable mode:

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -e .
```

Confirm the CLI exposes the Soniox provider:

```bash
sapat --help
```

The API option should include `soniox` alongside `openai`, `groq`, and `azure`.

## Step 2: Configure Soniox without leaking secrets

Create a local `.env` file. This file should stay outside version control.

```bash
cat > .env <<'EOF'
SONIOX_API_KEY=replace-with-your-project-key
SONIOX_MODEL=stt-async-v4
SONIOX_DESTROY_AFTER_TRANSCRIPTION=true
EOF
```

`SONIOX_API_KEY` authenticates requests to Soniox. `SONIOX_MODEL` selects the
async speech-to-text model used for recorded files.
`SONIOX_DESTROY_AFTER_TRANSCRIPTION` keeps the workflow tidy: when it is `true`,
the remote transcription job and uploaded file are deleted after Sapat has
pulled the transcript text.

If you are sharing the repository with a team, commit only a `.env.example` with
empty values. Never paste API keys into issues, pull requests, screenshots, or
articles.
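
A quick way to confirm the file is usable without ever printing the key is to validate it by name only. This is a minimal sketch under stated assumptions: it treats `.env` as plain `KEY=VALUE` lines, the helper names (`parse_env`, `env_problems`, `check_env_file`) are hypothetical, and it says nothing about how Sapat itself loads the file.

```python
from pathlib import Path

REQUIRED_KEYS = ("SONIOX_API_KEY",)
PLACEHOLDER = "replace-with-your-project-key"


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def env_problems(values):
    """Report missing or placeholder keys without echoing any secret value."""
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif value == PLACEHOLDER:
            problems.append(f"{key} still holds the placeholder value")
    return problems


def check_env_file(path=".env"):
    """Validate a .env file on disk; returns a list of problems."""
    env_path = Path(path)
    if not env_path.exists():
        return [f"{path} not found"]
    return env_problems(parse_env(env_path.read_text(encoding="utf-8")))
```

Calling `check_env_file()` from the workspace root returns an empty list when the key is present. The messages mention only key names, so the output is safe to paste into an issue.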

## Step 3: Prepare a mixed-language sample

Place your test recording under a local folder such as `samples/`:

```bash
mkdir -p samples runs
cp ~/Downloads/customer-demo.mp4 samples/customer-demo.mp4
```

For a realistic mixed-language test, choose a recording that includes at least
one language switch and a few domain terms. For example, a developer might say:
"The webhook retry failed twice, entonces revisamos el payload, and then the
queue recovered." This kind of sentence is a good stress test because it mixes
language, product vocabulary, and operational context.

Before uploading any recording to a transcription provider, confirm that you
have permission to process it and that your data handling matches your team's
policies.

## Why use Soniox for this workflow?

Sapat already supports OpenAI, Groq, and Azure OpenAI. Soniox is useful when you
want a transcription provider that is designed around speech-to-text workflows,
including asynchronous file transcription and automatic handling for recordings
that may contain more than one language. In a team workflow, that means the
developer running the transcript job can keep the operational steps simple:
prepare the file, submit it, wait for the result, save the transcript, and clean
up the remote job.

The provider choice should still be deliberate. Use Soniox when the recording is
speech-heavy, when language switching matters, or when the transcript will be
reviewed and reused later. Use another provider when your team already has
approved infrastructure, data residency requirements, or billing controls tied
to that provider. The value of this guide is not that every recording must go
through Soniox; it is that the workflow remains portable and auditable because
Sapat keeps the command surface consistent.

## Step 4: Run Sapat with Soniox

Run the transcription command:

```bash
sapat samples/customer-demo.mp4 \
  --quality M \
  --language en \
  --prompt "Product demo with English and Spanish technical vocabulary" \
  --temperature 0.3 \
  --api soniox

Sapat will convert the video to MP3 with `ffmpeg`, submit the audio to Soniox,
wait for the async job to finish, write a `.txt` file next to the input, and
remove the temporary MP3. With the cleanup flag enabled, the Soniox provider
also destroys the remote job after retrieving the transcript.

The result should appear as:

```text
samples/customer-demo.txt
```
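
If the run is scripted, the output location can be derived rather than typed by hand. The sketch below assumes the naming shown above (same folder and stem as the input, `.txt` suffix). Both helper names are hypothetical and are not part of Sapat.

```python
from pathlib import Path


def expected_transcript_path(input_path):
    """Same folder and stem as the input, with a .txt suffix."""
    return Path(input_path).with_suffix(".txt")


def transcript_ready(input_path):
    """True when the transcript exists and contains non-whitespace text."""
    transcript = expected_transcript_path(input_path)
    if not transcript.is_file():
        return False
    return bool(transcript.read_text(encoding="utf-8").strip())
```

A wrapper script could call `transcript_ready("samples/customer-demo.mp4")` after the command returns and stop early if the draft is missing or empty.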

Open the file and check the first pass. Do not expect any transcription provider
to know every internal product name or speaker nickname. The goal of the first
pass is to create a useful draft that can be reviewed quickly.

## Step 5: Save a run manifest

For repeatable work, keep a small run manifest alongside the transcript. This is
especially useful when several people review recordings or when the transcript
feeds a downstream search, support, or documentation workflow.

```bash
cat > runs/customer-demo-20260520.md <<'EOF'
# customer-demo transcription run

- Input file: samples/customer-demo.mp4
- Output file: samples/customer-demo.txt
- Workspace: Daytona
- Tool: Sapat
- Provider: Soniox
- Model: stt-async-v4
- Quality: M
- Language hint: en
- Prompt: Product demo with English and Spanish technical vocabulary
- Review status: needs human review
EOF
```

The manifest should not include API keys, private customer names, payout
details, or local machine paths that are not useful to another reviewer.
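
When several recordings are processed, the manifest can be generated instead of hand-written. This is one possible sketch: `build_manifest` and `SECRET_HINTS` are hypothetical names, and the label check is a crude guard against writing credentials into the file, not a substitute for review.

```python
SECRET_HINTS = ("key", "token", "secret", "password")


def build_manifest(title, fields):
    """Render a Markdown manifest from an ordered mapping of label -> value."""
    lines = [f"# {title}", ""]
    for label, value in fields.items():
        # Refuse labels that suggest a credential is about to be written out.
        if any(hint in label.lower() for hint in SECRET_HINTS):
            raise ValueError(f"refusing to record sensitive field: {label}")
        lines.append(f"- {label}: {value}")
    return "\n".join(lines) + "\n"


manifest = build_manifest(
    "customer-demo transcription run",
    {
        "Input file": "samples/customer-demo.mp4",
        "Output file": "samples/customer-demo.txt",
        "Provider": "Soniox",
        "Model": "stt-async-v4",
        "Quality": "M",
        "Language hint": "en",
        "Review status": "needs human review",
    },
)
print(manifest)
```

Writing the returned string to `runs/customer-demo-20260520.md` produces the same bullet layout as the heredoc above, so generated and hand-written manifests stay comparable.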

## Step 6: Review the transcript

A mixed-language transcript is not finished when the API returns text. Review it
with a short checklist:

- Product names, acronyms, and company-specific terms are spelled correctly.
- Language switches are preserved instead of flattened into one language.
- The transcript does not expose private information that should be redacted.
- Action items, decisions, and dates are readable without replaying the audio.
- Any uncertain words are marked for a second listener instead of guessed.

If the transcript is going into documentation or a customer-facing artifact,
create a cleaned copy rather than editing the raw transcript in place. Keep the
raw output, the reviewed transcript, and the manifest separate.
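
The "marked for a second listener" item is easier to act on if the markers can be listed mechanically. The sketch below assumes a team convention of square-bracket markers such as `[inaudible]` or `[unclear: ...]`. That convention and the name `find_uncertain` are assumptions for illustration, not something Sapat or Soniox produces.

```python
import re

# Assumed convention: uncertain spans are wrapped in square brackets,
# for example [inaudible] or [unclear: payload?].
MARKER = re.compile(r"\[(inaudible|unclear)[^\]]*\]", re.IGNORECASE)


def find_uncertain(text):
    """Return (line_number, marker_text) pairs for every flagged span."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in MARKER.finditer(line):
            hits.append((number, match.group(0)))
    return hits


sample = (
    "The webhook retry failed twice,\n"
    "entonces revisamos el [inaudible], and\n"
    "then the [unclear: queue?] recovered.\n"
)
for line_number, marker in find_uncertain(sample):
    print(f"line {line_number}: {marker}")
```

The second listener can work from this list instead of rereading the whole transcript. An empty result means no markers remain, not that the transcript is correct.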

## Step 7: Package the reviewed output

Once the transcript is reviewed, create a small folder for the final artifacts.
This keeps raw text, edited text, and notes from being mixed together.

```bash
mkdir -p handoff/customer-demo
cp samples/customer-demo.txt handoff/customer-demo/raw-transcript.txt
cp runs/customer-demo-20260520.md handoff/customer-demo/run-manifest.md
touch handoff/customer-demo/review-notes.md
```

Use `review-notes.md` for anything a future reader needs to know:

```markdown
# review notes

- Speaker names normalized: "Gaby" -> "Gabriela".
- Product term checked: "webhook replay" is correct.
- Spanish section starts around the customer escalation discussion.
- Two uncertain words remain marked with `[inaudible]`.
```

If the transcript will feed a retrieval system, documentation draft, or support
handoff, create a second edited transcript rather than changing the raw output:

```bash
cp handoff/customer-demo/raw-transcript.txt \
  handoff/customer-demo/reviewed-transcript.txt
```

That separation matters. Raw transcripts help you debug provider behavior,
reviewed transcripts help teammates read the content, and manifests explain how
to reproduce or compare future runs.

## Step 8: Keep the workspace clean

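After the handoff is complete, remove temporary files that should not remain in
the repository. Skip the `rm` line if your source recording is itself an `.mp3`
stored in `samples/`, because the glob would delete it too: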

```bash
rm -f samples/*.mp3
git status --short
```

You should see only the files you intentionally created for the guide or for
your private handoff. If `git status` shows `.env`, raw recordings, transcripts
with private data, or local cache files, update `.gitignore` or move those files
outside the repository before committing anything.
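
The `git status` check can be partly automated. The following sketch flags paths in `git status --short` output that match the risky patterns named in this guide. The names `risky_paths` and `current_status` and the suffix list are assumptions; transcripts with private data cannot be detected by file name, so this supplements the manual check rather than replacing it.

```python
import subprocess
from pathlib import PurePosixPath

RISKY_NAMES = {".env"}
RISKY_SUFFIXES = {".mp3", ".mp4", ".m4a", ".wav"}


def risky_paths(status_output):
    """Return paths from `git status --short` output that look unsafe to commit."""
    flagged = []
    for line in status_output.splitlines():
        if len(line) < 4:
            continue
        # Short format: two status characters, a space, then the path.
        path = line[3:].strip()
        # Renames appear as "old -> new"; only the new path matters.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        pure = PurePosixPath(path.strip('"'))
        if pure.name in RISKY_NAMES or pure.suffix.lower() in RISKY_SUFFIXES:
            flagged.append(str(pure))
    return flagged


def current_status():
    """Run git in the current directory, listing untracked files one by one."""
    result = subprocess.run(
        ["git", "status", "--short", "--untracked-files=all"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

`risky_paths(current_status())` returns an empty list when nothing matches. The `--untracked-files=all` option matters because plain `git status --short` collapses an untracked folder to a single entry such as `samples/`, which would hide the files inside it.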

## Troubleshooting

**Problem:** `sapat --help` does not show `soniox`.

**Solution:** Confirm you are on the provider branch or a version of Sapat that
includes the Soniox implementation. Reinstall in editable mode with
`python -m pip install -e .`.

**Problem:** Soniox authentication fails.

**Solution:** Check that `.env` exists in the workspace root and that
`SONIOX_API_KEY` is set for the active shell. Do not paste the key into logs or
GitHub comments when asking for help.

**Problem:** The transcript misses product vocabulary.

**Solution:** Use a more specific `--prompt`, keep the recording quality as high
as practical, and add a human review pass for product names and acronyms.

**Problem:** The recording is too noisy.

**Solution:** Try `--quality H` so the MP3 conversion keeps more audio detail.
If the original is poor, run a short sample first instead of spending time on a
large file.

**Problem:** A teammate cannot reproduce your transcript.

**Solution:** Compare the run manifest first. The usual differences are model
name, prompt wording, input file, or whether the reviewer edited the raw output.
If the input file is private and cannot be shared, keep a short synthetic sample
that exercises the same language-switching pattern.
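
Comparing two manifests by eye gets tedious once they grow. As a rough sketch, assuming both files use the `- Label: value` bullet layout from Step 5, the hypothetical helpers below report only the fields that differ.

```python
def parse_manifest(text):
    """Read '- Label: value' bullet lines into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("- ") and ": " in line:
            label, _, value = line[2:].partition(": ")
            fields[label.strip()] = value.strip()
    return fields


def manifest_diff(mine, theirs):
    """Return {label: (my_value, their_value)} for every field that differs."""
    a, b = parse_manifest(mine), parse_manifest(theirs)
    return {
        label: (a.get(label), b.get(label))
        for label in sorted(a.keys() | b.keys())
        if a.get(label) != b.get(label)
    }


mine = "# run\n\n- Model: stt-async-v4\n- Quality: M\n- Language hint: en\n"
theirs = "# run\n\n- Model: stt-async-v4\n- Quality: H\n"
for label, (left, right) in manifest_diff(mine, theirs).items():
    print(f"{label}: {left!r} vs {right!r}")
```

A field that is missing on one side shows up as `None`, which often points to a manifest written before a setting was being recorded.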

## Conclusion

With Daytona, Sapat, and Soniox, you can keep transcription work reproducible
without turning it into a heavyweight application. The important pieces are the
same every time: a clean workspace, private API-key handling, a provider command
that can be repeated, and a review artifact that explains how the transcript was
created.

This pattern is especially useful for a
[mixed-language transcription workflow](../definitions/20260520_definition_mixed_language_transcription_workflow.md),
where the transcript is only valuable if it preserves both the spoken content
and the engineering context around it.

## References

- [Sapat repository](https://github.com/nkkko/sapat)
- [Companion Soniox provider PR](https://github.com/nibzard/sapat/pull/22)
- [Soniox Speech-to-Text get started](https://soniox.com/docs/stt/get-started)
- [Soniox Python SDK async transcription](https://soniox.com/docs/sdk/python-SDK/stt/async-transcription)
- [Daytona documentation](https://www.daytona.io/docs)
36 changes: 36 additions & 0 deletions guides/assets/20260520_soniox_sapat_mixed_language_workflow.svg