diff --git a/authors/a1local.md b/authors/a1local.md
new file mode 100644
index 00000000..0cfa9ea9
--- /dev/null
+++ b/authors/a1local.md
@@ -0,0 +1,8 @@
+Author: A1 Local Title: AI-assisted open-source contributor Description:
+A1 Local contributes practical developer documentation and small product
+improvements with a focus on source-checked workflows, reproducible validation,
+and clear handoffs for maintainers. Author Image:
+![A1 Local](https://github.com/a1local.png) Author LinkedIn: Author Twitter:
+Company Name: A1 Local Company Description: Practical web, automation, and
+technical documentation work for small teams. Company Logo Dark: Company Logo
+White:
diff --git a/definitions/20260520_definition_glossary_aware_transcription.md b/definitions/20260520_definition_glossary_aware_transcription.md
new file mode 100644
index 00000000..c7f1cc5f
--- /dev/null
+++ b/definitions/20260520_definition_glossary_aware_transcription.md
@@ -0,0 +1,21 @@
+---
+title: "Glossary-Aware Transcription"
+description:
+  "Glossary-aware transcription uses a controlled term list to guide speech-to-text
+  output and review domain-specific words."
+date: 2026-05-20
+author: "A1 Local"
+tags: ["transcription", "speech-to-text", "ai"]
+---
+
+# Glossary-Aware Transcription
+
+Glossary-aware transcription is a speech-to-text workflow that gives the
+transcription system a controlled list of names, acronyms, product terms, and
+domain phrases before or during review. The glossary helps reduce common errors
+around words that sound similar, proper nouns, and technical vocabulary.
+
+For AI engineering teams, glossary-aware transcription is useful because
+transcripts often become source material for summaries, support notes, release
+updates, search indexes, and retrieval systems. Correcting key terms early keeps
+downstream tools from repeating the same mistake.
diff --git a/guides/20260520_sapat_glossary_correction_daytona.md b/guides/20260520_sapat_glossary_correction_daytona.md
new file mode 100644
index 00000000..4c79714d
--- /dev/null
+++ b/guides/20260520_sapat_glossary_correction_daytona.md
@@ -0,0 +1,516 @@
+---
+title: "Build Glossary-Aware Transcription with Sapat"
+description:
+  "Use Sapat in a Daytona workspace to transcribe recordings, preserve domain
+  terms, and review transcripts before AI handoff."
+date: 2026-05-20
+author: "A1 Local"
+tags: ["daytona", "sapat", "transcription", "workflow"]
+---
+
+# Build Glossary-Aware Transcription with Sapat
+
+# Introduction
+
+AI transcription is usually treated as a single command: upload a recording,
+wait for a transcript, and move on. That works for simple audio, but it breaks
+down when a recording contains product names, customer names, acronyms, roadmap
+labels, feature flags, model names, or words from a specific industry. Those
+terms are often the exact reason an AI engineer wants the transcript in the
+first place.
+
+Sapat is a small command-line transcription tool that converts video files to
+MP3 with FFmpeg, sends the audio to a supported provider, and writes a `.txt`
+file beside the source video. The current Sapat CLI supports OpenAI, Groq Cloud,
+and Azure OpenAI through the `--api` option. It also exposes useful controls
+such as `--language`, `--prompt`, `--temperature`, `--quality`, and `--correct`.
+
+This guide shows how to run Sapat inside a [Daytona workspace](../definitions/20240819_definition_daytona%20workspace.md)
+as a repeatable, [glossary-aware transcription](../definitions/20260520_definition_glossary_aware_transcription.md)
+workflow. You will create a workspace, configure one provider, build a small
+term glossary, run Sapat with a prompt that protects domain vocabulary, and
+review the output before it becomes input for summaries, support notes, release
+updates, or a retrieval pipeline.
+
+The goal is not just "get text from audio." The goal is to produce a transcript
+that another engineer or [LLM](../definitions/20241219_definition_llm.md) can
+trust.
+
+## TL;DR
+
+- Use Daytona to keep Sapat, FFmpeg, provider credentials, and review notes in a
+  reproducible workspace.
+- Put product names, speaker names, acronyms, and uncommon terms into a
+  glossary before running transcription.
+- Pass the glossary summary through Sapat's `--prompt` option and keep
+  `--temperature` low for stable output.
+- Review the transcript against the glossary before handing it to another AI
+  workflow.
+- Store the source metadata, raw transcript, corrected transcript, and review
+  notes as separate artifacts.
+
+## Prerequisites
+
+You will need:
+
+- Daytona installed and connected to your Git provider.
+- Python 3.6 or later in the workspace.
+- FFmpeg available in the workspace.
+- API credentials for OpenAI, Groq Cloud, or Azure OpenAI.
+- One or more `.mp4` recordings that you have permission to process.
+
+This guide uses placeholder credentials. Keep real [environment variables](../definitions/20241126_definition_environment_variables.md)
+in `.env` and out of Git.
+
+## Workflow Overview
+
+![Glossary-aware Sapat transcription workflow](assets/20260520_sapat_glossary_correction_daytona_img1.svg)
+
+The workflow has four artifacts:
+
+- **Recording metadata**: what the file contains, who can see it, and what needs
+  review.
+- **Glossary**: names, acronyms, product terms, and expected spelling.
+- **Sapat transcript**: the `.txt` output written beside each video.
+- **Handoff notes**: the cleaned transcript status and remaining review items.
+
+Separating these files makes the process easier to audit. It also helps when
+you need to rerun transcription with a better prompt without losing your first
+pass.
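
As a rough illustration of why separate artifacts are easy to audit, the sketch below reports which of the four files exist for one recording. The function name `artifact_status` is hypothetical, and the paths assume the `media/` layout and file names that this guide sets up in Step 4 and later steps; adjust both if your workspace differs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which of the four artifacts exist for one recording.
# Assumes the media/ layout used later in this guide (review, glossary, raw, handoff).
# $1 is the recording name without ".mp4"; $2 is the media root (default: media).
artifact_status() {
  local name="$1"
  local root="${2:-media}"
  local path state
  for path in \
    "$root/review/$name-metadata.md" \
    "$root/glossary/$name-glossary.md" \
    "$root/raw/$name.txt" \
    "$root/handoff/$name-handoff.md"
  do
    # Mark each artifact as present or absent without reading its contents.
    if [ -f "$path" ]; then state="ok"; else state="missing"; fi
    printf '%-8s %s\n' "$state" "$path"
  done
}
```

Running `artifact_status customer-research-call-01` before a handoff gives a quick view of what still needs to be written.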
+
+## Step 1: Create a Daytona Workspace
+
+Create a workspace from the Sapat repository:
+
+```bash
+daytona create https://github.com/nkkko/sapat --code
+```
+
+Inside the workspace, inspect the project:
+
+```bash
+ls
+find src/sapat -maxdepth 3 -type f
+```
+
+The important files are:
+
+- `README.md`, which lists provider credentials and CLI examples.
+- `src/sapat/script.py`, which defines the Click command and supported flags.
+- `src/sapat/transcription/base.py`, which converts video to MP3 and writes the
+  final `.txt` file.
+- `src/sapat/transcription/openai.py`, `groq.py`, and `azure.py`, which contain
+  the provider implementations.
+
+Sapat processes a single file when `input_path` is a file. If `input_path` is a
+directory, it loops over `.mp4` files in that directory.
+
+## Step 2: Install Sapat in the Workspace
+
+You can run Sapat directly from the source tree, but installing the package into
+a virtual environment gives you a cleaner command-line workflow.
+
+```bash
+python -m venv .venv
+source .venv/bin/activate
+python -m pip install --upgrade pip
+python -m pip install build
+python -m build
+python -m pip install dist/sapat-0.1.2-py3-none-any.whl
+```
+
+Confirm FFmpeg is available:
+
+```bash
+ffmpeg -version
+```
+
+If FFmpeg is missing, install it in the environment your Daytona workspace uses.
+For a Debian or Ubuntu based workspace image, that usually means:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg
+```
+
+## Step 3: Configure One Provider
+
+Create a local `.env` file. Start with one provider instead of adding every key
+you own.
+
+For OpenAI:
+
+```bash
+OPENAI_API_KEY=your_openai_api_key_here
+OPENAI_MODEL=whisper-1
+OPENAI_API_ENDPOINT=https://api.openai.com/v1/audio/transcriptions
+OPENAI_MODEL_NAME_CHAT=gpt-4o
+```
+
+For Groq Cloud:
+
+```bash
+GROQCLOUD_API_KEY=your_groq_api_key_here
+GROQCLOUD_MODEL=whisper-large-v3-turbo
+GROQCLOUD_API_ENDPOINT=https://api.groq.com/openai/v1/audio/transcriptions
+GROQCLOUD_MODEL_NAME_CHAT=llama3-8b-8192
+```
+
+For Azure OpenAI:
+
+```bash
+AZURE_OPENAI_API_KEY=your_azure_api_key_here
+AZURE_OPENAI_ENDPOINT=https://DEPLOYMENTENDPOINTNAME.openai.azure.com
+AZURE_OPENAI_DEPLOYMENT_NAME_WHISPER=whisper
+AZURE_OPENAI_API_VERSION_WHISPER=2024-06-01
+AZURE_OPENAI_DEPLOYMENT_NAME_CHAT=gpt-4o
+AZURE_OPENAI_API_VERSION_CHAT=2023-03-15-preview
+```
+
+Add `.env` to `.gitignore` if your working copy does not already ignore it:
+
+```bash
+printf '\n.env\n' >> .gitignore
+```
+
+## Step 4: Prepare Recordings and Metadata
+
+Create a predictable folder structure:
+
+```bash
+mkdir -p media/raw media/review media/handoff media/glossary
+```
+
+Put recordings in `media/raw`. Use names that describe the session without
+leaking private data:
+
+```text
+media/raw/
+  customer-research-call-01.mp4
+  roadmap-demo-voiceover.mp4
+```
+
+Before you run transcription, write a short metadata note:
+
+```markdown
+# Recording Metadata
+
+Source file: customer-research-call-01.mp4
+Primary language: English
+Speaker count: 3
+Provider: OpenAI
+Allowed use: internal summary and support insight extraction
+
+Review priority:
+- Customer names
+- Product names
+- Plan names
+- Numbers, dates, and pricing
+- Sentences marked unclear
+
+Privacy notes:
+- Do not publish raw audio.
+- Remove personal phone numbers from public excerpts.
+```
+
+Save it as:
+
+```text
+media/review/customer-research-call-01-metadata.md
+```
+
+This file is boring on purpose. It gives reviewers enough context to judge the
+transcript without replaying the entire recording.
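
If you process many recordings, a small stub generator can keep metadata notes consistent. The following is a minimal sketch, not part of Sapat: the function name `new_metadata_stub` and the `TODO` placeholders are assumptions, and the headings simply mirror the metadata note shown in this step. It refuses to overwrite a note that already exists.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a metadata stub for one recording.
# $1 is the path to the .mp4 file; $2 is the review folder (default: media/review).
# Returns non-zero and leaves the file untouched if a note already exists.
new_metadata_stub() {
  local video="$1"
  local review_dir="${2:-media/review}"
  local name note
  name=$(basename "$video" .mp4)
  note="$review_dir/$name-metadata.md"
  if [ -e "$note" ]; then
    echo "metadata note already exists: $note" >&2
    return 1
  fi
  mkdir -p "$review_dir"
  # Headings follow the metadata note in this step; TODO values are filled in by hand.
  cat > "$note" <<EOF
# Recording Metadata

Source file: $name.mp4
Primary language: TODO
Speaker count: TODO
Provider: TODO
Allowed use: TODO

Review priority:
- TODO

Privacy notes:
- TODO
EOF
}
```

For example, `new_metadata_stub media/raw/roadmap-demo-voiceover.mp4` would create `media/review/roadmap-demo-voiceover-metadata.md` for a reviewer to complete.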
+
+## Step 5: Build a Glossary Prompt
+
+Create a glossary file with the terms that the provider may mishear:
+
+```markdown
+# Transcript Glossary
+
+Product and company terms:
+- Daytona
+- Sapat
+- Dev Container
+- Workspace
+- OpenAI
+- Groq Cloud
+- Azure OpenAI
+
+People and teams:
+- Platform Engineering
+- Developer Experience
+- Support Operations
+
+Acronyms:
+- CDE: cloud development environment
+- QA: quality assurance
+- RAG: retrieval-augmented generation
+
+Style notes:
+- Keep product names in title case.
+- Keep acronyms uppercase.
+- Do not expand acronyms unless the speaker does.
+```
+
+Save it as:
+
+```text
+media/glossary/customer-research-call-01-glossary.md
+```
+
+Now turn that glossary into a short transcription prompt. The prompt should be
+brief enough to help the model without becoming another document to parse.
+
+```bash
+cat > media/glossary/customer-research-call-01-prompt.txt <<'EOF'
+This recording discusses Daytona, Sapat, Dev Containers, OpenAI, Groq Cloud,
+Azure OpenAI, CDEs, QA, and RAG. Preserve product names and acronyms exactly.
+Use normal punctuation. Do not invent speaker names.
+EOF
+```
+
+## Step 6: Run a First Transcription Pass
+
+Run Sapat with a low temperature and the glossary prompt.
+
+```bash
+sapat media/raw/customer-research-call-01.mp4 \
+  --api openai \
+  --language en \
+  --quality H \
+  --temperature 0 \
+  --prompt "$(cat media/glossary/customer-research-call-01-prompt.txt)"
+```
+
+Sapat will:
+
+1. Convert the `.mp4` file to a temporary `.mp3`.
+2. Send that MP3 to the selected provider.
+3. Write a same-name `.txt` file beside the source video.
+4. Delete the temporary MP3 file.
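
The four steps above can be approximated in plain shell to make the data flow concrete. This is an illustrative sketch only, not Sapat's implementation: the function name `transcribe_sketch`, the FFmpeg quality setting, and the choice of form fields are assumptions, and the endpoint and model fall back to the OpenAI values from Step 3. It expects `OPENAI_API_KEY` to be exported in the shell.

```bash
#!/usr/bin/env bash
# Rough shell equivalent of one transcription pass (illustrative, not Sapat's code).
# $1 is the .mp4 path; $2 is the prompt file. Requires ffmpeg and curl on PATH.
transcribe_sketch() {
  local video="$1"
  local prompt_file="$2"
  local mp3="${video%.mp4}.mp3"
  local txt="${video%.mp4}.txt"
  local status=0

  # 1. Convert the video to a temporary MP3 (-vn drops the video stream).
  #    The -q:a value is an assumed quality setting, not Sapat's.
  ffmpeg -loglevel error -y -i "$video" -vn -codec:a libmp3lame -q:a 2 "$mp3" || return 1

  # 2 and 3. Send the MP3 to the provider and write the text beside the video.
  #    "prompt=<file" tells curl to send the file's contents as a text field.
  curl -sS --fail "${OPENAI_API_ENDPOINT:-https://api.openai.com/v1/audio/transcriptions}" \
    -H "Authorization: Bearer ${OPENAI_API_KEY:-}" \
    -F "file=@${mp3}" \
    -F "model=${OPENAI_MODEL:-whisper-1}" \
    -F "language=en" \
    -F "temperature=0" \
    -F "response_format=text" \
    -F "prompt=<${prompt_file}" > "$txt" || status=$?

  # 4. Delete the temporary MP3; drop the partial transcript if the request failed.
  rm -f "$mp3"
  [ "$status" -eq 0 ] || rm -f "$txt"
  return "$status"
}
```

Seeing the steps laid out this way also shows where the glossary prompt enters the request, which is why a short, exact prompt matters more than a long one.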
+
+Your output should look like this:
+
+```text
+media/raw/customer-research-call-01.txt
+```
+
+Copy the raw transcript into the review folder before making edits:
+
+```bash
+cp media/raw/customer-research-call-01.txt \
+  media/review/customer-research-call-01-raw.txt
+```
+
+## Step 7: Review Terms Before Correcting Style
+
+Do not start by rewriting the transcript. Start by checking whether the core
+terms survived the transcription pass.
+
+```bash
+grep -n -E 'Daytona|Sapat|Groq|Azure|OpenAI|CDE|RAG|QA' \
+  media/review/customer-research-call-01-raw.txt
+```
+
+Create a review note:
+
+```markdown
+# Transcript Review
+
+File: customer-research-call-01-raw.txt
+
+Glossary matches:
+- Daytona: OK
+- Sapat: OK
+- Groq Cloud: check two occurrences
+- CDE: one occurrence transcribed as "CD"
+
+Numbers and dates:
+- Pricing statement at 12:40 needs manual check.
+- Launch date at 18:05 needs manual check.
+
+Unclear sections:
+- 09:30 to 09:48: speaker overlap.
+- 22:10 to 22:25: background noise.
+```
+
+Save it as:
+
+```text
+media/review/customer-research-call-01-review.md
+```
+
+This step is where glossary-aware transcription pays off. If one important term
+is wrong, you can rerun with a more specific prompt instead of editing every
+downstream artifact later.
+
+## Step 8: Rerun with a Tighter Prompt When Needed
+
+If the transcript misses a term, append a small correction hint to the prompt
+file so the rerun picks it up:
+
+```text
+The speaker says "CDE", meaning cloud development environment. Do not write
+"CD" or "city" when the context is developer workspaces.
+```
+
+Then rerun Sapat:
+
+```bash
+sapat media/raw/customer-research-call-01.mp4 \
+  --api openai \
+  --language en \
+  --quality H \
+  --temperature 0 \
+  --prompt "$(cat media/glossary/customer-research-call-01-prompt.txt)"
+```
+
+Copy the second pass separately:
+
+```bash
+cp media/raw/customer-research-call-01.txt \
+  media/review/customer-research-call-01-pass-2.txt
+```
+
+Compare the two passes:
+
+```bash
+diff -u media/review/customer-research-call-01-raw.txt \
+  media/review/customer-research-call-01-pass-2.txt | less
+```
+
+Keep the diff. It is useful evidence when you need to explain why a second
+transcription pass was necessary.
+
+## Step 9: Use the Correction Pass Carefully
+
+Sapat includes a `--correct` flag that asks the configured provider to run a
+chat correction pass after transcription. Use it when your selected provider is
+configured for both transcription and chat completion.
+
+```bash
+sapat media/raw/customer-research-call-01.mp4 \
+  --api groq \
+  --language en \
+  --quality H \
+  --temperature 0 \
+  --correct \
+  --prompt "$(cat media/glossary/customer-research-call-01-prompt.txt)"
+```
+
+Treat the corrected output as a new draft, not as the final source of truth.
+Correction models are useful for punctuation and spelling, but they can also
+smooth over uncertainty. Keep the raw pass, corrected pass, and review notes
+together.
+
+```bash
+cp media/raw/customer-research-call-01.txt \
+  media/review/customer-research-call-01-corrected.txt
+```
+
+## Step 10: Package a Handoff File
+
+Create one final handoff file for the next workflow. This could feed a summary
+prompt, a support insights report, a RAG ingestion job, or a release note draft.
+
+```markdown
+# Transcript Handoff
+
+Source: customer-research-call-01.mp4
+Final transcript: customer-research-call-01-corrected.txt
+Reviewer: your-name
+Date: 2026-05-20
+
+Status:
+- Glossary terms reviewed.
+- Numbers and dates reviewed.
+- Private contact details removed from public excerpts.
+- Two unclear sections left marked for human review.
+
+Do not use for:
+- Legal record.
+- Public quotation without speaker approval.
+- Training data without consent.
+
+Ready for:
+- Internal summary.
+- Support theme extraction.
+- Product feedback clustering.
+```
+
+Save it as:
+
+```text
+media/handoff/customer-research-call-01-handoff.md
+```
+
+## Common Issues and Troubleshooting
+
+**Problem:** Sapat says the input path is invalid.
+
+**Solution:** Confirm the file exists inside the Daytona workspace, not only on
+your local machine. Run `ls media/raw` before the Sapat command.
+
+**Problem:** FFmpeg conversion fails.
+
+**Solution:** Run `ffmpeg -version`. If FFmpeg is missing, install it in the
+workspace image. If FFmpeg exists, test with a short sample file before
+processing long recordings.
+
+**Problem:** The provider rejects the upload because of size.
+
+**Solution:** Split the source recording into smaller files or run with lower
+audio quality. Sapat's OpenAI and Groq provider classes validate a 25 MB maximum
+audio file size after MP3 conversion.
+
+**Problem:** The transcript keeps misspelling one product name.
+
+**Solution:** Put the exact spelling in the glossary and the `--prompt` text.
+If the term is an acronym, include both the acronym and what it means.
+
+**Problem:** The corrected transcript reads too polished.
+
+**Solution:** Compare the corrected pass against the raw pass. Use correction
+for punctuation and spelling, but keep unclear speech marked instead of guessing
+what a speaker meant.
+
+**Problem:** Directory processing skips a file.
+
+**Solution:** Sapat's directory loop processes files ending in `.mp4`. Convert
+or rename other source formats before running a directory batch.
+
+## Confirmation Checklist
+
+Before handing off the transcript, confirm:
+
+- The source recording has metadata and privacy notes.
+- The glossary includes all known names, products, and acronyms.
+- The Sapat command records provider, language, quality, and prompt choices.
+- The raw transcript is preserved.
+- The corrected transcript is reviewed against the glossary.
+- Remaining unclear sections are marked instead of invented.
+- The handoff file states what the transcript can and cannot be used for.
+
+## Conclusion
+
+Sapat is intentionally small, which makes it a good fit for repeatable
+transcription workflows in Daytona. By adding a glossary before transcription
+and a review loop after transcription, you turn a raw speech-to-text output into
+a safer engineering artifact. The extra files are simple: metadata, glossary,
+raw transcript, corrected transcript, and handoff notes.
+
+That structure matters when transcripts become inputs for AI summaries,
+customer research, support workflows, or retrieval systems. The earlier you
+protect product names, acronyms, numbers, and privacy constraints, the less
+cleanup you need downstream.
+
+## References
+
+- [Sapat repository](https://github.com/nkkko/sapat)
+- [Sapat README](https://github.com/nkkko/sapat/blob/main/README.md)
+- [Daytona](https://www.daytona.io/)
+- [OpenAI audio transcription API](https://platform.openai.com/docs/guides/speech-to-text)
+- [Groq audio transcription docs](https://console.groq.com/docs/speech-to-text)
+- [Azure OpenAI audio documentation](https://learn.microsoft.com/azure/ai-services/openai/)
diff --git a/guides/assets/20260520_sapat_glossary_correction_daytona_img1.svg b/guides/assets/20260520_sapat_glossary_correction_daytona_img1.svg
new file mode 100644
index 00000000..9a1901c7
--- /dev/null
+++ b/guides/assets/20260520_sapat_glossary_correction_daytona_img1.svg
@@ -0,0 +1,34 @@
+
+  Glossary-aware Sapat transcription workflow in Daytona
+  A four-step workflow from recording and glossary to Sapat transcription, transcript review, and downstream handoff.
+  [SVG shapes and connectors not recoverable; text labels only]
+  Recordings: MP4 files, metadata, privacy notes
+  Glossary: Names, acronyms, product terms
+  Sapat Run: Provider, prompt, quality flags
+  Handoff: Clean transcript and notes
+  Review loop: Check terms, rerun with a tighter prompt when needed, then package transcript evidence for downstream AI work.