diff --git a/articles/20260523_run_localai_transcription_with_sapat_in_daytona.md b/articles/20260523_run_localai_transcription_with_sapat_in_daytona.md
new file mode 100644
index 00000000..6fbe072a
--- /dev/null
+++ b/articles/20260523_run_localai_transcription_with_sapat_in_daytona.md
@@ -0,0 +1,326 @@
+---
+title: 'Run LocalAI Transcription With Sapat'
+description: 'Build a reproducible Daytona workspace for private Sapat video transcription backed by a self-hosted LocalAI endpoint.'
+date: 2026-05-23
+author: 'JJ Lin'
+tags: ['Daytona', 'Sapat', 'Speech-to-Text', 'LocalAI']
+---
+
+# Run LocalAI Transcription With Sapat
+
+## Introduction
+
+AI teams often collect short product demos, debugging recordings, user research
+clips, and internal walkthroughs long before those recordings become useful
+written artifacts. A transcript makes the material searchable, easier to
+summarize, and safer to hand to downstream agents. The challenge is turning that
+workflow into something another engineer can reproduce without guessing which
+machine had `ffmpeg`, which shell had credentials, or which speech-to-text
+provider was used.
+
+[Sapat](https://github.com/nkkko/sapat) is a compact Python CLI for this job. It
+accepts an `.mp4` file or a folder of `.mp4` files, converts each video to MP3
+with `ffmpeg`, sends the audio to a selected transcription provider, and writes
+a `.txt` transcript next to the source file. This guide adds a LocalAI path to
+that workflow and runs it inside a [Daytona](https://www.daytona.io/docs/)
+workspace so setup, configuration, and validation are explicit. The result is a
+[self-hosted speech-to-text](/definitions/20260523_definition_self_hosted_speech_to_text.md)
+workflow that stays reproducible without depending on one developer's laptop.
+
+The companion implementation is available in
+[nibzard/sapat#43](https://github.com/nibzard/sapat/pull/43). It adds
+`--api localai`, documents the `LOCALAI_*` environment variables, and includes
+mocked tests for CLI routing and request construction.
+
+![LocalAI transcription workflow in Daytona](assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg)
+
+## TL;DR
+
+- Use Daytona to open a clean Sapat workspace instead of relying on a hand-tuned
+  local machine.
+- Run LocalAI where the workspace can reach it and expose its OpenAI-compatible
+  transcription endpoint.
+- Configure `LOCALAI_BASE_URL`, `LOCALAI_MODEL`, and optional
+  `LOCALAI_API_KEY` outside source control.
+- Run `sapat <video>.mp4 --api localai` to convert the video, send the MP3 to
+  LocalAI, and save a transcript.
+- Review the transcript before sharing it with another model, teammate, or
+  public issue.
+
+## What The LocalAI Provider Adds
+
+LocalAI is a self-hosted AI runtime that offers OpenAI-compatible APIs for
+several model types, including audio-to-text. Its
+[audio-to-text documentation](https://localai.io/features/audio-to-text/)
+describes a `POST /v1/audio/transcriptions` endpoint that accepts multipart
+form data with an audio `file` and a `model` value. That shape is a natural fit
+for Sapat because Sapat already prepares an MP3 file before calling a provider.
+
+The LocalAI Sapat provider keeps the flow intentionally simple:
+
+```text
+video.mp4 -> ffmpeg MP3 conversion -> LocalAI transcription request -> video.txt
+```
+
+It reads the following environment variables:
+
+Variable | Purpose | Default
+--- | --- | ---
+`LOCALAI_BASE_URL` | Base URL for the LocalAI server, such as `http://localhost:8080` | Required unless `LOCALAI_API_ENDPOINT` is set
+`LOCALAI_API_ENDPOINT` | Full transcription endpoint override | Built from `LOCALAI_BASE_URL`
+`LOCALAI_MODEL` | Audio-to-text model name sent in the request | `whisper-1`
+`LOCALAI_API_KEY` | Optional bearer token if the LocalAI server requires auth | Not set
+
+The provider does not require a cloud transcription account. That makes it
+useful when a team wants to keep internal recordings inside a self-hosted
+boundary, test transcription quality against local models, or run the same
+workflow in a controlled development environment.
+
+## Prerequisites
+
+Before you start, make sure you have:
+
+- A Daytona account and CLI that can create a sandbox.
+- A LocalAI server with an audio-to-text model installed.
+- A short `.mp4` recording with spoken audio.
+- `ffmpeg` available in the Sapat workspace.
+- Enough disk space for Sapat to create a temporary MP3 next to the video.
+
+The guide uses the companion Sapat branch until the provider is merged
+upstream. After merge, create the workspace from `nkkko/sapat` directly and skip
+the branch checkout step.
+
+## Start Or Reach A LocalAI Server
+
+Run LocalAI wherever the Daytona workspace can reach it. For a quick local test,
+that may be a LocalAI process on the same machine or inside the same development
+network. For a team setup, it may be an internal server with access control in
+front of it.
+
+After the server is running and a Whisper-compatible model is installed, verify
+the transcription endpoint with a small audio file:
+
+```bash
+curl "$LOCALAI_BASE_URL/v1/audio/transcriptions" \
+  -H "Content-Type: multipart/form-data" \
+  -F file="@sample.wav" \
+  -F model="whisper-1"
+```
+
+If the server requires authentication, include the bearer token:
+
+```bash
+curl "$LOCALAI_BASE_URL/v1/audio/transcriptions" \
+  -H "Authorization: Bearer $LOCALAI_API_KEY" \
+  -H "Content-Type: multipart/form-data" \
+  -F file="@sample.wav" \
+  -F model="whisper-1"
+```
+
+Do this endpoint check before opening Sapat. It separates LocalAI setup problems
+from Sapat workflow problems and makes troubleshooting much faster.
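The endpoint check above depends on the `LOCALAI_*` variables resolving to the URL, model, and headers you expect. The following is a minimal Python sketch of that resolution, assuming the rules stated in the environment variable table: a full `LOCALAI_API_ENDPOINT` overrides `LOCALAI_BASE_URL`, the model defaults to `whisper-1`, and a bearer token is sent only when `LOCALAI_API_KEY` is non-empty. The helper name `resolve_localai_request` is hypothetical and is not taken from the companion PR; treat it as an illustration rather than Sapat's actual implementation.

```python
import os
from typing import Mapping, Optional

# Path described in the LocalAI audio-to-text documentation.
TRANSCRIPTION_PATH = "/v1/audio/transcriptions"


def resolve_localai_request(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: turn LOCALAI_* variables into request settings."""
    env = os.environ if env is None else env

    # Assumption: a full endpoint override wins over the base URL.
    endpoint = env.get("LOCALAI_API_ENDPOINT", "").strip()
    if not endpoint:
        base_url = env.get("LOCALAI_BASE_URL", "").strip()
        if not base_url:
            raise ValueError("LOCALAI_BASE_URL or LOCALAI_API_ENDPOINT must be set")
        # Drop a trailing slash so the joined URL has exactly one separator.
        endpoint = base_url.rstrip("/") + TRANSCRIPTION_PATH

    # Fall back to the default model name listed in the table.
    model = env.get("LOCALAI_MODEL", "").strip() or "whisper-1"

    # An empty LOCALAI_API_KEY (as in a blank .env entry) means no auth header.
    headers = {}
    api_key = env.get("LOCALAI_API_KEY", "").strip()
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    return {"endpoint": endpoint, "model": model, "headers": headers}


# Example with an explicit mapping instead of the real environment.
example = resolve_localai_request({"LOCALAI_BASE_URL": "http://localhost:8080/"})
print(example["endpoint"])
```

Passing a mapping instead of reading `os.environ` directly keeps the sketch easy to exercise without a running LocalAI server, which mirrors the mocked-test approach the companion PR describes.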
+
+## Create The Daytona Workspace
+
+Create a Daytona sandbox from the fork that contains the LocalAI provider:
+
+```bash
+daytona create https://github.com/JJ-Lin/sapat --name sapat-localai
+```
+
+Open a terminal in the workspace, then check out the provider branch:
+
+```bash
+git fetch origin feature/localai-transcription-provider
+git checkout feature/localai-transcription-provider
+```
+
+Install Sapat in editable mode:
+
+```bash
+python -m pip install -e .
+```
+
+Sapat uses `ffmpeg` to convert videos to MP3 before transcription. Confirm that
+it is available:
+
+```bash
+ffmpeg -version
+```
+
+If `ffmpeg` is missing, install it through the package manager available in your
+Daytona environment or bake it into the workspace image. The important part is
+to make this dependency visible in the workspace rather than leaving it as an
+undocumented local-machine assumption.
+
+## Configure LocalAI Without Committing Secrets
+
+Create a `.env` file in the workspace root:
+
+```bash
+LOCALAI_BASE_URL=http://localhost:8080
+LOCALAI_MODEL=whisper-1
+LOCALAI_API_KEY=
+```
+
+If the LocalAI server is not reachable through a simple base URL, set the full
+endpoint instead:
+
+```bash
+LOCALAI_API_ENDPOINT=https://localai.example.com/v1/audio/transcriptions
+LOCALAI_MODEL=whisper-1
+LOCALAI_API_KEY=replace_if_required
+```
+
+Do not commit `.env`. A safe `.env.example` can show the required variable names
+without exposing a server URL or token:
+
+```bash
+LOCALAI_BASE_URL=
+LOCALAI_API_ENDPOINT=
+LOCALAI_MODEL=whisper-1
+LOCALAI_API_KEY=
+```
+
+This is where Daytona helps. The workspace setup stays reproducible, while
+secrets and internal endpoints remain outside the repository.
+
+## Run A First Transcription
+
+Copy a short test video into the workspace. Start with a clip under a few
+minutes so you can test the full loop quickly.
+
+```bash
+sapat demo.mp4 --api localai --quality M --language en
+```
+
+Sapat will:
+
+1. Convert `demo.mp4` to `demo.mp3`.
+2. Send the generated MP3 to LocalAI.
+3. Save the returned transcript as `demo.txt`.
+4. Remove the temporary MP3 file after processing.
+
+For clearer audio at the cost of a larger temporary file, use the high-quality
+conversion option:
+
+```bash
+sapat demo.mp4 --api localai --quality H --language en
+```
+
+For product names, acronyms, or internal terms, pass a prompt:
+
+```bash
+sapat demo.mp4 \
+  --api localai \
+  --language en \
+  --prompt "Product names: Daytona, Sapat, LocalAI"
+```
+
+The prompt gives the transcription model vocabulary hints. It is especially
+useful for developer tools, repository names, customer names, and abbreviations
+that are easy to misspell.
+
+## Process A Folder Of Recordings
+
+Sapat can process every `.mp4` file in a directory. This is useful for a batch
+of product demos, design review clips, or field recordings from the same
+project.
+
+```bash
+mkdir recordings
+```
+
+Copy the videos into that folder, then run:
+
+```bash
+sapat recordings --api localai --quality M --language en --prompt "Product names: Daytona, Sapat, LocalAI"
+```
+
+Sapat writes one `.txt` transcript for each `.mp4` file. Rename the videos before
+you run the batch. A transcript named `workspace_setup_walkthrough.txt` is much
+easier to reuse than `screen-recording-7.txt`.
+
+For larger folders, start with two or three representative recordings. Review
+the outputs, adjust the model or prompt, then process the rest. That short
+feedback loop saves time when the first model choice is not strong enough for
+your audio.
+
+## Validate The Transcript
+
+Do not hand a raw transcript straight to another agent. Review it first:
+
+Check | What To Look For
+--- | ---
+Completeness | The transcript covers the full recording, not only the first segment.
+Names | Product, speaker, company, and repository names match the prompt vocabulary.
+Numbers | Dates, version numbers, ports, amounts, and IDs are accurate.
+Private data | Secrets, customer names, or sensitive details are removed before sharing.
+Next-step readiness | The text is clear enough for summarization, issue filing, or documentation.
+
+Open the transcript in the terminal:
+
+```bash
+sed -n '1,160p' demo.txt
+```
+
+If the transcript stops early, check LocalAI logs, model limits, and temporary
+file size. If names are wrong, rerun with a more specific prompt. If the audio
+is noisy, try a cleaner source recording before tuning the provider.
+
+## Compare Local And Cloud Providers
+
+One reason to use Sapat is that it gives the same CLI shape to multiple
+providers. After the LocalAI run works, compare it with another configured
+provider on the same sample:
+
+```bash
+sapat demo.mp4 --api localai --quality M --language en
+mv demo.txt demo.localai.txt
+
+sapat demo.mp4 --api openai --quality M --language en
+mv demo.txt demo.openai.txt
+
+diff -u demo.localai.txt demo.openai.txt
+```
+
+The goal is not to declare a universal winner from one clip. The goal is to
+measure the tradeoffs that matter for your team: privacy boundary, latency,
+cost, model availability, punctuation, code terms, and behavior on noisy audio.
+Daytona keeps that comparison repeatable because both runs happen in the same
+workspace with the same input file and command options.
+
+## Troubleshooting
+
+Problem | Fix
+--- | ---
+`LOCALAI_BASE_URL or LOCALAI_API_ENDPOINT must be set` | Add one of those values to `.env`, then open a new shell or rerun the command.
+Connection refused | Confirm LocalAI is running and reachable from the Daytona workspace, not only from your laptop.
+401 or 403 response | Set `LOCALAI_API_KEY` if the server requires a bearer token.
+Unsupported audio file format | Let Sapat process the original `.mp4`; the provider receives the generated MP3.
+Empty or poor transcript | Check the LocalAI model, language, prompt, and source audio quality.
+`ffmpeg` not found | Install `ffmpeg` in the workspace image or through the workspace package manager.
+
+When debugging, keep the failing input small. A ten-second clip with known
+speech is enough to verify the endpoint, request shape, and transcript writing
+path.
+
+## Conclusion
+
+Sapat plus LocalAI gives AI engineers a private, reproducible transcription
+workflow: Daytona supplies the clean workspace, Sapat supplies the video-to-text
+CLI, and LocalAI supplies a self-hosted audio-to-text endpoint. The result is a
+simple loop that can turn recordings into transcripts without committing
+credentials or depending on a hidden local setup.
+
+Once the first transcript is reliable, use the same workspace for provider
+comparisons, batch processing, and downstream summarization. Keep the input
+videos, prompts, model names, and validation notes together so another engineer
+can reproduce the result later.
+
+## References
+
+- [LocalAI Audio to Text](https://localai.io/features/audio-to-text/)
+- [Daytona Documentation](https://www.daytona.io/docs/)
+- [Daytona CLI Reference](https://www.daytona.io/docs/en/tools/cli/)
+- [Sapat LocalAI provider PR](https://github.com/nibzard/sapat/pull/43)
diff --git a/articles/assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg b/articles/assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg
new file mode 100644
index 00000000..02247765
--- /dev/null
+++ b/articles/assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg
@@ -0,0 +1,32 @@
+
+  LocalAI transcription workflow in Daytona
+  A Daytona workspace runs Sapat, converts video to MP3, sends audio to LocalAI, and writes a transcript.
+
+
+  Private Video Transcription In A Daytona Workspace
+  Sapat prepares audio, LocalAI transcribes it, and the transcript stays in your workflow.
+
+
+  MP4 Video
+  Demo or meeting
+
+  Sapat CLI
+  ffmpeg to MP3
+  --api localai
+
+  LocalAI
+  /v1/audio/transcriptions
+  Self-hosted model
+
+  TXT
+  Validated transcript
+
+
+
+
+
+
+
+  Configure with LOCALAI_BASE_URL, LOCALAI_MODEL, and optional LOCALAI_API_KEY
+
+
diff --git a/authors/jj_lin.md b/authors/jj_lin.md
new file mode 100644
index 00000000..06ef856c
--- /dev/null
+++ b/authors/jj_lin.md
@@ -0,0 +1,10 @@
+Author: JJ Lin
+Title: AI Engineering Contributor
+Description: JJ Lin works on practical AI engineering workflows, developer tooling, and reproducible automation. He focuses on turning small command-line tools into reliable, testable workflows that can run in clean development environments.
+Author Image:
+Author LinkedIn:
+Author Twitter:
+Company Name: Independent
+Company Description: Independent software and AI engineering work.
+Company Logo Dark:
+Company Logo White:
diff --git a/definitions/20260523_definition_self_hosted_speech_to_text.md b/definitions/20260523_definition_self_hosted_speech_to_text.md
new file mode 100644
index 00000000..0dc67b90
--- /dev/null
+++ b/definitions/20260523_definition_self_hosted_speech_to_text.md
@@ -0,0 +1,36 @@
+---
+title: "Self-Hosted Speech-to-Text"
+description: "A speech transcription workflow that runs on infrastructure controlled by the user or organization instead of a managed cloud API."
+---
+
+# Self-Hosted Speech-to-Text
+
+## Definition
+
+Self-hosted speech-to-text is the practice of converting spoken audio into
+written text using a model and service deployed on infrastructure controlled by
+the user, team, or organization. Instead of sending recordings to a managed
+cloud transcription API, the audio is processed by an internally operated
+runtime such as a local server, private cloud service, or secure development
+environment.
+
+## Context and Usage
+
+Teams choose self-hosted speech-to-text when they need more control over data
+flow, model selection, latency, or cost. It is common in workflows that handle
+internal meetings, product demos, customer recordings, field notes, research
+interviews, or regulated data.
+
+In a development workflow, self-hosted transcription is often paired with
+environment variables, reproducible workspaces, and validation steps. The model
+endpoint can stay private, while developers still use a consistent CLI or API to
+generate transcripts.
+
+Common considerations include:
+
+- **Privacy boundary**: Audio stays inside infrastructure the team controls.
+- **Model operations**: The team is responsible for installing, updating, and
+  monitoring the transcription model.
+- **Resource planning**: Larger models may need more CPU, GPU, memory, or disk.
+- **Quality checks**: Transcripts still need review for names, numbers, and
+  sensitive details before reuse.