From 2d97f827915c487c0ea19c8d132a8a0f8487054d Mon Sep 17 00:00:00 2001
From: CodeAustral OSS
Date: Wed, 20 May 2026 14:51:11 -0300
Subject: [PATCH] Add Soniox Sapat transcription guide

Signed-off-by: CodeAustral OSS
---
 authors/codeaustral-oss.md | 7 +
 ...n_mixed_language_transcription_workflow.md | 25 ++
 ...niox_sapat_mixed_language_transcription.md | 313 ++++++++++++++++++
 ...0_soniox_sapat_mixed_language_workflow.svg | 36 ++
 4 files changed, 381 insertions(+)
 create mode 100644 authors/codeaustral-oss.md
 create mode 100644 definitions/20260520_definition_mixed_language_transcription_workflow.md
 create mode 100644 guides/20260520_guide_soniox_sapat_mixed_language_transcription.md
 create mode 100644 guides/assets/20260520_soniox_sapat_mixed_language_workflow.svg

diff --git a/authors/codeaustral-oss.md b/authors/codeaustral-oss.md
new file mode 100644
index 00000000..4f6bc883
--- /dev/null
+++ b/authors/codeaustral-oss.md
@@ -0,0 +1,7 @@
+Author: CodeAustral OSS Title: OSS Engineering Studio Description: CodeAustral
+OSS builds small, reviewable open-source contributions with a focus on
+maintainer-friendly pull requests, reproducible verification, and practical
+developer tooling for modern engineering teams. Author Image: Author LinkedIn:
+Author Twitter: Company Name: CodeAustral LLC Company Description:
+Maintainer-friendly software engineering, developer tooling, and applied OSS
+workflows.
diff --git a/definitions/20260520_definition_mixed_language_transcription_workflow.md b/definitions/20260520_definition_mixed_language_transcription_workflow.md
new file mode 100644
index 00000000..67891ef4
--- /dev/null
+++ b/definitions/20260520_definition_mixed_language_transcription_workflow.md
@@ -0,0 +1,25 @@
+---
+title: 'Mixed-Language Transcription Workflow'
+description: 'A repeatable process for transcribing recordings that contain more than one spoken language.'
+date: 2026-05-20
+author: 'CodeAustral OSS'
+---
+
+# Mixed-Language Transcription Workflow
+
+## Definition
+
+A mixed-language transcription workflow is a repeatable process for preparing,
+transcribing, reviewing, and storing audio or video recordings that contain more
+than one spoken language. It combines clean input handling, provider selection,
+language-aware prompts, transcript review, and handoff artifacts so the final
+text can be trusted by engineering, support, research, or content teams.
+
+## Context and Usage
+
+Mixed-language workflows are useful when product demos, interviews, customer
+calls, incident reviews, or community recordings switch between languages. A
+good workflow keeps API keys out of source control, runs the same commands in a
+reproducible development environment, records which provider and settings were
+used, and includes a human review pass for names, acronyms, timestamps, and
+domain-specific terms.
diff --git a/guides/20260520_guide_soniox_sapat_mixed_language_transcription.md b/guides/20260520_guide_soniox_sapat_mixed_language_transcription.md
new file mode 100644
index 00000000..613b339b
--- /dev/null
+++ b/guides/20260520_guide_soniox_sapat_mixed_language_transcription.md
@@ -0,0 +1,313 @@
+---
+title: "Run Soniox transcription with Sapat in Daytona"
+description: "Build a reproducible Daytona workspace for mixed-language transcription using Sapat and Soniox Speech-to-Text."
+date: 2026-05-20
+author: "CodeAustral OSS"
+tags: ["daytona", "python", "transcription", "soniox"]
+---
+
+# Run Soniox transcription with Sapat in Daytona
+
+# Introduction
+
+Transcription gets messy when a recording is not a neat single-language demo.
+Customer calls, conference hallway interviews, product walkthroughs, and
+community recordings often move between English and another language, mention
+product names quickly, and include acronyms that a generic transcript can
+distort.
A useful workflow should make the environment reproducible, keep API
+keys private, and leave a short trail showing how the transcript was produced.
+
+This guide shows how to run Sapat, a small Python video transcription tool, in a
+[Daytona workspace](../definitions/20240819_definition_daytona%20workspace.md)
+with a Soniox Speech-to-Text provider. The companion Sapat implementation adds a
+`--api soniox` option that uses the official Soniox Python SDK, sends the MP3
+created by Sapat to an asynchronous Soniox transcription job, waits for the
+result, writes the text file, and cleans up the remote transcription by default.
+
+![Soniox and Sapat transcription workflow](assets/20260520_soniox_sapat_mixed_language_workflow.svg)
+
+## TL;DR
+
+- Use Daytona to create a clean Python workspace for Sapat.
+- Configure Soniox through environment variables, not committed files.
+- Run Sapat with `--api soniox` after the provider branch is installed.
+- Keep a simple run manifest so transcripts are auditable later.
+- Review mixed-language transcripts for names, acronyms, and speaker context.
+
+## Prerequisites
+
+You will need:
+
+- Daytona installed and connected to your preferred IDE.
+- Python 3.10 or later in the workspace.
+- `ffmpeg`, because Sapat converts video files to MP3 before transcription.
+- A Soniox account and project API key.
+- A short `.mp4`, `.m4a`, `.wav`, or `.mp3` recording you are allowed to
+  process.
+
+Do not commit recordings, transcripts with private information, or `.env` files.
+Use a throwaway sample recording when you are testing the flow for the first
+time.
+
+## Step 1: Create the Daytona workspace
+
+Start from the Sapat repository.
If the Soniox provider has already been merged
+upstream, create the workspace from the main repository:
+
+```bash
+daytona create https://github.com/nkkko/sapat --code
+```
+
+While the provider pull request is under review, you can test the same workflow
+from the companion branch:
+
+```bash
+daytona create https://github.com/codeaustral-oss/sapat --code
+git switch codeaustral/soniox-provider
+```
+
+Inside the workspace, install Sapat in editable mode:
+
+```bash
+python -m venv .venv
+source .venv/bin/activate
+python -m pip install -e .
+```
+
+Confirm the CLI exposes the Soniox provider:
+
+```bash
+sapat --help
+```
+
+The API option should include `soniox` alongside `openai`, `groq`, and `azure`.
+
+## Step 2: Configure Soniox without leaking secrets
+
+Create a local `.env` file. This file should stay outside version control.
+
+```bash
+cat > .env <<'EOF'
+SONIOX_API_KEY=replace-with-your-project-key
+SONIOX_MODEL=stt-async-v4
+SONIOX_DESTROY_AFTER_TRANSCRIPTION=true
+EOF
+```
+
+`SONIOX_API_KEY` authenticates requests to Soniox. `SONIOX_MODEL` selects the
+async speech-to-text model used for recorded files. The cleanup flag keeps the
+workflow tidy by deleting the remote transcription job and uploaded file after
+Sapat has pulled the transcript text.
+
+If you are sharing the repository with a team, commit only a `.env.example` with
+empty values. Never paste API keys into issues, pull requests, screenshots, or
+articles.
+
+## Step 3: Prepare a mixed-language sample
+
+Place your test recording under a local folder such as `samples/`:
+
+```bash
+mkdir -p samples runs
+cp ~/Downloads/customer-demo.mp4 samples/customer-demo.mp4
+```
+
+For a realistic mixed-language test, choose a recording that includes at least
+one language switch and a few domain terms. For example, a developer might say:
+"The webhook retry failed twice, entonces revisamos el payload, and then the
+queue recovered."
This kind of sentence is a good stress test because it mixes
+language, product vocabulary, and operational context.
+
+Before uploading any recording to a transcription provider, confirm that you
+have permission to process it and that your data handling matches your team's
+policies.
+
+## Why use Soniox for this workflow?
+
+Sapat already supports OpenAI, Groq, and Azure OpenAI. Soniox is useful when you
+want a transcription provider that is designed around speech-to-text workflows,
+including asynchronous file transcription and automatic handling for recordings
+that may contain more than one language. In a team workflow, that means the
+developer running the transcript job can keep the operational steps simple:
+prepare the file, submit it, wait for the result, save the transcript, and clean
+up the remote job.
+
+The provider choice should still be deliberate. Use Soniox when the recording is
+speech-heavy, when language switching matters, or when the transcript will be
+reviewed and reused later. Use another provider when your team already has
+approved infrastructure, data residency requirements, or billing controls tied
+to that provider. The value of this guide is not that every recording must go
+through Soniox; it is that the workflow remains portable and auditable because
+Sapat keeps the command surface consistent.
+
+## Step 4: Run Sapat with Soniox
+
+Run the transcription command:
+
+```bash
+sapat samples/customer-demo.mp4 \
+  --quality M \
+  --language en \
+  --prompt "Product demo with English and Spanish technical vocabulary" \
+  --temperature 0.3 \
+  --api soniox
+```
+
+Sapat will convert the video to MP3 with `ffmpeg`, submit the audio to Soniox,
+wait for the async job to finish, write a `.txt` file next to the input, and
+remove the temporary MP3. With the cleanup flag enabled, the Soniox provider
+also destroys the remote job after retrieving the transcript.
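
The order of operations in Step 4 can be sketched in Python. This is a minimal illustration under stated assumptions, not Sapat's actual implementation: `expected_transcript_path`, `run_pipeline`, and the injected `convert` and `transcribe` callables are hypothetical names, and no real Soniox SDK calls are shown.

```python
from pathlib import Path
from typing import Callable


def expected_transcript_path(input_path: str) -> Path:
    """Return the .txt path that sits next to the input recording."""
    return Path(input_path).with_suffix(".txt")


def run_pipeline(
    input_path: str,
    convert: Callable[[Path], Path],
    transcribe: Callable[[Path], str],
) -> Path:
    """Convert, transcribe, write the transcript, then remove the temporary MP3.

    `convert` and `transcribe` are hypothetical stand-ins for the ffmpeg step
    and the provider call. Injecting them keeps the sketch testable offline.
    """
    source = Path(input_path)
    mp3_path = convert(source)
    try:
        text = transcribe(mp3_path)
        output = expected_transcript_path(input_path)
        output.write_text(text, encoding="utf-8")
        return output
    finally:
        # Clean up the temporary MP3 even if transcription fails, but never
        # delete the source when the input was already an MP3.
        if mp3_path != source:
            mp3_path.unlink(missing_ok=True)
```

Putting the cleanup in a `finally` block is one way to make sure a failed provider call does not leave converted audio behind in `samples/`.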
+
+The result should appear as:
+
+```text
+samples/customer-demo.txt
+```
+
+Open the file and check the first pass. Do not expect any transcription provider
+to know every internal product name or speaker nickname. The goal of the first
+pass is to create a useful draft that can be reviewed quickly.
+
+## Step 5: Save a run manifest
+
+For repeatable work, keep a small run manifest alongside the transcript. This is
+especially useful when several people review recordings or when the transcript
+feeds a downstream search, support, or documentation workflow.
+
+```bash
+cat > runs/customer-demo-20260520.md <<'EOF'
+# customer-demo transcription run
+
+- Input file: samples/customer-demo.mp4
+- Output file: samples/customer-demo.txt
+- Workspace: Daytona
+- Tool: Sapat
+- Provider: Soniox
+- Model: stt-async-v4
+- Quality: M
+- Language hint: en
+- Prompt: Product demo with English and Spanish technical vocabulary
+- Review status: needs human review
+EOF
+```
+
+The manifest should not include API keys, private customer names, payout
+details, or local machine paths that are not useful to another reviewer.
+
+## Step 6: Review the transcript
+
+A mixed-language transcript is not finished when the API returns text. Review it
+with a short checklist:
+
+- Product names, acronyms, and company-specific terms are spelled correctly.
+- Language switches are preserved instead of flattened into one language.
+- The transcript does not expose private information that should be redacted.
+- Action items, decisions, and dates are readable without replaying the audio.
+- Any uncertain words are marked for a second listener instead of guessed.
+
+If the transcript is going into documentation or a customer-facing artifact,
+create a cleaned copy rather than editing the raw transcript in place. Keep the
+raw output, the reviewed transcript, and the manifest separate.
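
One item on the review checklist, marking uncertain words for a second listener, can be pre-screened with a few lines of Python. The sketch below is an illustration only: the marker list and the `find_uncertain_lines` name are assumptions, not a Sapat feature, and it does not replace the human review pass.

```python
import re

# Hypothetical markers a reviewer might use for uncertain passages.
UNCERTAIN_MARKERS = ("[inaudible]", "[unclear]", "[?]")


def find_uncertain_lines(transcript: str, markers=UNCERTAIN_MARKERS):
    """Return (line_number, line) pairs that contain any uncertainty marker."""
    pattern = re.compile(
        "|".join(re.escape(marker) for marker in markers), re.IGNORECASE
    )
    return [
        (number, line.strip())
        for number, line in enumerate(transcript.splitlines(), start=1)
        if pattern.search(line)
    ]
```

A reviewer could run this over the reviewed copy and record the flagged line numbers in the handoff notes, leaving the raw transcript untouched.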
+
+## Step 7: Package the reviewed output
+
+Once the transcript is reviewed, create a small folder for the final artifacts.
+This keeps raw text, edited text, and notes from being mixed together.
+
+```bash
+mkdir -p handoff/customer-demo
+cp samples/customer-demo.txt handoff/customer-demo/raw-transcript.txt
+cp runs/customer-demo-20260520.md handoff/customer-demo/run-manifest.md
+touch handoff/customer-demo/review-notes.md
+```
+
+Use `review-notes.md` for anything a future reader needs to know:
+
+```markdown
+# review notes
+
+- Speaker names normalized: "Gaby" -> "Gabriela".
+- Product term checked: "webhook replay" is correct.
+- Spanish section starts around the customer escalation discussion.
+- Two uncertain words remain marked with `[inaudible]`.
+```
+
+If the transcript will feed a retrieval system, documentation draft, or support
+handoff, create a second edited transcript rather than changing the raw output:
+
+```bash
+cp handoff/customer-demo/raw-transcript.txt \
+  handoff/customer-demo/reviewed-transcript.txt
+```
+
+That separation matters. Raw transcripts help you debug provider behavior,
+reviewed transcripts help teammates read the content, and manifests explain how
+to reproduce or compare future runs.
+
+## Step 8: Keep the workspace clean
+
+After the handoff is complete, remove temporary files that should not remain in
+the repository:
+
+```bash
+rm -f samples/*.mp3
+git status --short
+```
+
+This command deletes every MP3 in `samples/`. If your source recording is itself
+an `.mp3`, move it out of that folder first.
+
+You should see only the files you intentionally created for the guide or for
+your private handoff. If `git status` shows `.env`, raw recordings, transcripts
+with private data, or local cache files, update `.gitignore` or move those files
+outside the repository before committing anything.
+
+## Troubleshooting
+
+**Problem:** `sapat --help` does not show `soniox`.
+
+**Solution:** Confirm you are on the provider branch or a version of Sapat that
+includes the Soniox implementation. Reinstall in editable mode with
+`python -m pip install -e .`.
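
The check for this first problem can also be scripted. The sketch below assumes only that `sapat --help` prints the available API names somewhere in its output, as Step 1 describes. The function names are hypothetical, and the exact help format is not assumed.

```python
import re
import shutil
import subprocess


def help_lists_provider(help_text: str, provider: str = "soniox") -> bool:
    """Return True if the provider name appears as a whole word in help text."""
    return re.search(rf"\b{re.escape(provider)}\b", help_text, re.IGNORECASE) is not None


def installed_sapat_supports(provider: str = "soniox") -> bool:
    """Run `sapat --help` and look for the provider name.

    Returns False when the `sapat` command is not on PATH, which usually means
    the virtual environment is not active or the install step was skipped.
    """
    if shutil.which("sapat") is None:
        return False
    result = subprocess.run(
        ["sapat", "--help"], capture_output=True, text=True, check=False
    )
    return help_lists_provider(result.stdout + result.stderr, provider)
```

If `installed_sapat_supports()` returns `False` inside the workspace, the likely fixes are the ones listed in the solution: switch to a branch that includes the provider and reinstall.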
+
+**Problem:** Soniox authentication fails.
+
+**Solution:** Check that `.env` exists in the workspace root and that
+`SONIOX_API_KEY` is set for the active shell. Do not paste the key into logs or
+GitHub comments when asking for help.
+
+**Problem:** The transcript misses product vocabulary.
+
+**Solution:** Use a more specific `--prompt`, keep the recording quality as high
+as practical, and add a human review pass for product names and acronyms.
+
+**Problem:** The recording is too noisy.
+
+**Solution:** Try `--quality H` so the MP3 conversion keeps more audio detail.
+If the original is poor, run a short sample first instead of spending time on a
+large file.
+
+**Problem:** A teammate cannot reproduce your transcript.
+
+**Solution:** Compare the run manifest first. The usual differences are model
+name, prompt wording, input file, or whether the reviewer edited the raw output.
+If the input file is private and cannot be shared, keep a short synthetic sample
+that exercises the same language-switching pattern.
+
+## Conclusion
+
+With Daytona, Sapat, and Soniox, you can keep transcription work reproducible
+without turning it into a heavyweight application. The important pieces are the
+same every time: a clean workspace, private API-key handling, a provider command
+that can be repeated, and a review artifact that explains how the transcript was
+created.
+
+This pattern is especially useful for a
+[mixed-language transcription workflow](../definitions/20260520_definition_mixed_language_transcription_workflow.md),
+where the transcript is only valuable if it preserves both the spoken content
+and the engineering context around it.
+
+## References
+
+- [Sapat repository](https://github.com/nkkko/sapat)
+- [Companion Soniox provider PR](https://github.com/nibzard/sapat/pull/22)
+- [Soniox Speech-to-Text get started](https://soniox.com/docs/stt/get-started)
+- [Soniox Python SDK async transcription](https://soniox.com/docs/sdk/python-SDK/stt/async-transcription)
+- [Daytona documentation](https://www.daytona.io/docs)
diff --git a/guides/assets/20260520_soniox_sapat_mixed_language_workflow.svg b/guides/assets/20260520_soniox_sapat_mixed_language_workflow.svg
new file mode 100644
index 00000000..535b1607
--- /dev/null
+++ b/guides/assets/20260520_soniox_sapat_mixed_language_workflow.svg
@@ -0,0 +1,36 @@
+
+  Soniox and Sapat transcription workflow in Daytona
+  A workflow diagram showing a Daytona workspace, Sapat conversion, Soniox transcription, transcript review, and handoff artifacts.
+
+
+  Mixed-language transcription in Daytona
+  A reproducible workflow for turning multilingual recordings into reviewable text artifacts.
+
+
+  Daytona
+  Workspace with
+  Python, ffmpeg, Sapat
+
+
+  Sapat
+  Convert video to MP3
+  and call provider
+
+
+  Soniox
+  Async STT job
+  mixed-language audio
+
+
+  Review
+  Names, acronyms,
+  handoff notes
+
+
+  Output: transcript.txt + run manifest + review notes, without committed API keys or private recordings.
+
+
+
+
+
+