From efa85b9bba747a3d0e2351c524a5ae8bb07bf86c Mon Sep 17 00:00:00 2001
From: aclanot
Date: Wed, 20 May 2026 18:52:20 +0300
Subject: [PATCH] Add Cloudflare Sapat transcription guide

Signed-off-by: aclanot
---
 authors/aclanot.md                            |   9 +
 ...260520_definition_edge_ai_transcription.md |  20 ++
 ..._cloudflare_ai_transcription_with_sapat.md | 247 ++++++++++++++++++
 ...0520_cloudflare_sapat_daytona_workflow.svg |  50 ++++
 4 files changed, 326 insertions(+)
 create mode 100644 authors/aclanot.md
 create mode 100644 definitions/20260520_definition_edge_ai_transcription.md
 create mode 100644 guides/20260520_cloudflare_ai_transcription_with_sapat.md
 create mode 100644 guides/assets/20260520_cloudflare_sapat_daytona_workflow.svg

diff --git a/authors/aclanot.md b/authors/aclanot.md
new file mode 100644
index 00000000..0a8b0c29
--- /dev/null
+++ b/authors/aclanot.md
@@ -0,0 +1,9 @@
+Author: Aclanot
+Title: Open-source Contributor
+Description: Aclanot builds and documents AI-assisted developer workflows, focusing on reproducible environments, practical automation, and secure handling of API-backed tools for engineering teams.
+Author Image: ![Aclanot](https://avatars.githubusercontent.com/u/274915357?v=4)
+Author LinkedIn:
+Company Name: Independent
+Company Description: Independent technical contributor focused on AI developer workflows and automation.
+Company Logo Dark:
+Company Logo White:
diff --git a/definitions/20260520_definition_edge_ai_transcription.md b/definitions/20260520_definition_edge_ai_transcription.md
new file mode 100644
index 00000000..f78a153d
--- /dev/null
+++ b/definitions/20260520_definition_edge_ai_transcription.md
@@ -0,0 +1,20 @@
+---
+title: "Edge AI Transcription"
+description: "A transcription workflow that sends audio to AI models running close to users or applications at the network edge."
+date: 2026-05-20
+author: "Aclanot"
+---
+
+# Edge AI Transcription
+
+## Definition
+
+Edge AI transcription is the process of converting speech to text with AI models that run close to the user, application, or data source rather than only in a centralized application backend.
+
+## Context and Usage
+
+Engineering teams use edge AI transcription when they want a speech-to-text workflow that is easy to call from distributed applications, reproducible development environments, or regional intake pipelines.
+
+In practice, a tool prepares audio, sends it to an edge-capable AI provider, receives transcript text, and stores the result for review or downstream automation.
+
+This pattern is useful for demo recordings, support calls, research interviews, accessibility drafts, and transcript archives. Teams still need to manage credentials carefully, verify transcript quality, and decide which recordings are appropriate for third-party processing.
diff --git a/guides/20260520_cloudflare_ai_transcription_with_sapat.md b/guides/20260520_cloudflare_ai_transcription_with_sapat.md
new file mode 100644
index 00000000..c776cc6c
--- /dev/null
+++ b/guides/20260520_cloudflare_ai_transcription_with_sapat.md
@@ -0,0 +1,247 @@
+---
+title: "Cloudflare AI Transcription with Sapat"
+description: "Run Sapat with Cloudflare Workers AI inside Daytona to create a reproducible transcription workflow for engineering teams."
+date: 2026-05-20
+author: "Aclanot"
+tags: ["daytona", "sapat", "cloudflare", "transcription", "workers-ai"]
+---
+
+# Cloudflare AI Transcription with Sapat
+
+## Introduction
+
+AI transcription work usually starts as a quick local script and then becomes a repeatability problem. One teammate has `ffmpeg` installed, another has a different Python version, and someone runs a one-off command against a production API key.
+
+For engineering teams, that is a weak foundation for anything downstream: bug triage notes, customer-call summaries, release demos, accessibility captions, or retrieval pipelines.
+
+This guide shows how to run Sapat, a small Python video transcription tool, inside a Daytona workspace with a Cloudflare Workers AI provider. The workflow keeps the dev environment consistent and stores credentials through [environment variables](../definitions/20241126_definition_environment_variables.md).
+
+It also gives AI engineers a practical way to smoke-test [edge AI transcription](../definitions/20260520_definition_edge_ai_transcription.md) before they process a larger batch of recordings.
+
+The Cloudflare provider used in this guide is implemented in the companion Sapat pull request: [nibzard/sapat#21](https://github.com/nibzard/sapat/pull/21). While that PR is under review, the commands below use the contributor branch. After it is merged, use the upstream Sapat repository directly.
+
+![Cloudflare-backed Sapat workflow in Daytona](assets/20260520_cloudflare_sapat_daytona_workflow.svg)
+
+## TL;DR
+
+- Use Daytona to create a clean Python workspace for Sapat instead of depending on local machine state.
+- Configure `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` as environment variables, never in source code.
+- Run `sapat --api cloudflare` to send Sapat's converted MP3 audio to Cloudflare Workers AI Whisper.
+- Validate the transcript with a short smoke clip before running a larger directory of recordings.
+- Keep the first pass focused on transcription; run correction or summarization as a separate review step.
+
+## Materials Checklist
+
+| Item | Why you need it |
+| --- | --- |
+| Daytona installed and configured | Creates a reproducible workspace for the Sapat project |
+| Docker or a Daytona target | Runs the workspace in a clean development environment |
+| Cloudflare account with Workers AI access | Provides the transcription API used by the Cloudflare provider |
+| Workers AI API token | Authenticates calls to the Cloudflare REST API |
+| Cloudflare account ID | Selects the account that runs the Workers AI model |
+| One short `.mp4`, `.wav`, `.flac`, or `.mp3` sample | Lets you verify the workflow before processing real batches |
+
+## Step 1: Create a Daytona Workspace
+
+Start from a clean workspace so the transcription workflow is not tied to one laptop. If the Cloudflare provider PR is still under review, create the workspace from the fork that contains the provider branch:
+
+```bash
+daytona create https://github.com/aclanot/sapat --code
+```
+
+Open the workspace terminal and switch to the Cloudflare provider branch:
+
+```bash
+git fetch origin codex/cloudflare-workers-ai-provider
+git checkout codex/cloudflare-workers-ai-provider
+```
+
+After the companion PR is merged, use the upstream repository instead:
+
+```bash
+daytona create https://github.com/nibzard/sapat --code
+```
+
+This gives every engineer the same project checkout, the same dependency metadata, and the same command surface. It also makes the workspace easy to discard after a transcription run that handled sensitive recordings.
+
+## Step 2: Install Runtime Tools
+
+Sapat converts video files to MP3 before sending them to a provider, so the workspace needs `ffmpeg`. Install it in the workspace terminal:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg
+```
+
+Then install Sapat in editable mode:
+
+```bash
+python -m pip install -e .
+```
+
+Confirm that the CLI exposes the Cloudflare provider:
+
+```bash
+sapat --help
+```
+
+The `--api` option should include `cloudflare` alongside the existing providers. If you are using the branch from the companion PR, the help output should include this provider choice:
+
+```text
+--api [openai|groq|azure|cloudflare]
+```
+
+## Step 3: Configure Cloudflare Credentials
+
+Cloudflare's Workers AI REST API needs an account ID and an API token. Create a Workers AI API token from the Cloudflare dashboard, copy the account ID, and expose both values to the Daytona workspace.
+
+For a one-off smoke test, export the values in the workspace shell:
+
+```bash
+export CLOUDFLARE_ACCOUNT_ID="your-account-id"
+export CLOUDFLARE_API_TOKEN="your-workers-ai-token"
+export CLOUDFLARE_WHISPER_MODEL="@cf/openai/whisper"
+```
+
+For repeated workspace use, store them with Daytona's environment management:
+
+```bash
+daytona env set CLOUDFLARE_ACCOUNT_ID=your-account-id CLOUDFLARE_API_TOKEN=your-workers-ai-token
+```
+
+Do not commit `.env` files with real credentials. Sapat's `.env.example` documents the variable names, but the actual token should stay in Daytona, your shell, or your team's secrets manager.
+
+## Step 4: Prepare a Short Smoke Clip
+
+Before processing a full customer call or a directory of demo recordings, use a short clip with known content. A thirty-second sample is enough to verify credential access, audio conversion, provider routing, and transcript writing.
+
+Create a simple workspace layout:
+
+```bash
+mkdir -p recordings transcripts
+cp ~/Downloads/demo-call.mp4 recordings/demo-call.mp4
+```
+
+Sapat writes the transcript next to the input file with the same base name and a `.txt` extension. Keeping recordings in their own directory makes the outputs easy to review:
+
+```text
+recordings/
+  demo-call.mp4
+  demo-call.txt
+```
+
+If your input is already an `.mp3`, Sapat can use it directly.
If it is a video file, Sapat uses `ffmpeg` to create a temporary MP3, sends that audio to the provider, writes the transcript, and removes the temporary MP3 after the run.
+
+## Step 5: Run Sapat with Cloudflare Workers AI
+
+Run the smoke clip through the Cloudflare provider:
+
+```bash
+sapat recordings/demo-call.mp4 --api cloudflare --quality M
+```
+
+The provider sends the converted MP3 bytes to Cloudflare's model execution endpoint:
+
+```text
+https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/openai/whisper
+```
+
+Cloudflare's Whisper model accepts binary audio through the REST API and returns a JSON result that includes transcript text. The Sapat provider reads that response and writes the text field to `recordings/demo-call.txt`.
+
+Review the first transcript:
+
+```bash
+sed -n '1,80p' recordings/demo-call.txt
+```
+
+Check for the operational details that usually decide whether a transcript is usable:
+
+- Product names, acronyms, and speaker names are recognizable.
+- The first and last spoken sections are present.
+- The transcript does not show a repeated failure phrase or an empty output.
+- Important timestamps can be recovered from the original recording if needed.
+
+Cloudflare Workers AI Whisper can also return fields such as word-level data and VTT text from the model response. The current Sapat CLI writes the plain transcript text, which is the simplest artifact for engineering handoff.
+
+## Step 6: Process a Small Batch
+
+After the smoke test passes, process a directory of `.mp4` files:
+
+```bash
+sapat recordings --api cloudflare --quality M
+```
+
+Use a small batch first. For example, start with three recordings, inspect the outputs, then process the rest. That keeps provider credentials, quota limits, and recording quality issues visible before a long run.
+
+A lightweight review loop works well:
+
+```bash
+for transcript in recordings/*.txt; do
+  echo "===== $transcript ====="
+  sed -n '1,40p' "$transcript"
+done
+```
+
+If a transcript needs domain correction, do that as a second pass. The Cloudflare provider in the companion Sapat PR is intentionally focused on transcription; it does not add a Cloudflare chat correction path.
+
+Keeping transcription and correction separate makes the workflow easier to audit. It also lets teams decide whether correction should use OpenAI, Groq, a local model, or a manual review checklist.
+
+## Step 7: Package the Output for Engineering Use
+
+Raw transcripts become more useful when they are packaged with enough context for another engineer to trust them. For each recording, keep a short metadata note:
+
+```markdown
+# demo-call
+
+- Source file: recordings/demo-call.mp4
+- Provider: cloudflare
+- Model: @cf/openai/whisper
+- Quality setting: M
+- Reviewed by:
+- Known issues: product name "AcmeDB" appears once as "Acme DB"
+```
+
+That note turns a transcript into a handoff artifact. A teammate can see which [API](../definitions/20241212_definition_api.md) produced the text, which model was used, and which parts still need human review.
+
+This matters when transcripts feed a search index, a support escalation, or a release-note draft.
+
+## Common Issues and Troubleshooting
+
+**Problem:** `CLOUDFLARE_API_TOKEN is required for Cloudflare transcription.`
+
+**Solution:** Export the token in the workspace shell or set it with Daytona environment variables. Confirm the variable is visible with `printenv CLOUDFLARE_API_TOKEN`, but do not paste the token into logs, PR comments, or issue threads.
+
+**Problem:** Cloudflare returns an authorization error.
+
+**Solution:** Check that the token has Workers AI permissions and belongs to the same account ID used in `CLOUDFLARE_ACCOUNT_ID`.
If your organization uses multiple Cloudflare accounts, it is easy to copy the right token and the wrong account ID.
+
+**Problem:** The transcript file exists but is empty.
+
+**Solution:** Start with a shorter input and run `ffprobe recordings/demo-call.mp4` to confirm the file has an audio stream. Silent videos, corrupted uploads, or unsupported audio streams can produce empty or low-quality transcripts.
+
+**Problem:** `ffmpeg` is missing.
+
+**Solution:** Install it in the Daytona workspace with `sudo apt-get install -y ffmpeg`. If the workspace image is locked down, add `ffmpeg` to the dev container or image configuration so future workspaces have it preinstalled.
+
+**Problem:** The transcript has the wrong domain terms.
+
+**Solution:** Treat provider output as the first pass. Add a review step for product names, acronyms, and speaker labels. If you need automated correction, run a second pass with an approved model.
+
+## Conclusion
+
+Running Sapat with Cloudflare Workers AI inside Daytona gives teams a reproducible transcription workflow instead of a one-off local command. Daytona keeps the workspace consistent, and Sapat handles file conversion and provider routing.
+
+Cloudflare Workers AI provides a simple REST-backed Whisper model for the transcription pass.
+
+The important habit is to keep the workflow boring: isolate credentials, start with a smoke clip, verify the transcript before a batch run, and package the output with enough metadata for another engineer to reproduce the result.
+
+That is the difference between "we got a transcript" and "we can trust this transcript in an engineering workflow."
+
+## References
+
+- [Sapat repository](https://github.com/nibzard/sapat)
+- [Cloudflare Workers AI Whisper model](https://developers.cloudflare.com/workers-ai/models/whisper/)
+- [Cloudflare Workers AI REST API guide](https://developers.cloudflare.com/workers-ai/get-started/rest-api/)
+- [Cloudflare Execute AI Model API reference](https://developers.cloudflare.com/api/operations/workers-ai-post-run-model)
+- [Daytona documentation](https://www.daytona.io/docs/)
+- [Companion Sapat Cloudflare provider PR](https://github.com/nibzard/sapat/pull/21)
diff --git a/guides/assets/20260520_cloudflare_sapat_daytona_workflow.svg b/guides/assets/20260520_cloudflare_sapat_daytona_workflow.svg
new file mode 100644
index 00000000..045bfc01
--- /dev/null
+++ b/guides/assets/20260520_cloudflare_sapat_daytona_workflow.svg
@@ -0,0 +1,50 @@
+Cloudflare-backed Sapat workflow in Daytona
+A workflow diagram showing recordings moving through a Daytona workspace, Sapat, Cloudflare Workers AI Whisper, and transcript review.
+
+Reproducible AI Transcription Flow
+Run Sapat in Daytona, route audio to Cloudflare Workers AI, and review transcript artifacts before downstream use.
+
+Recordings
+MP4, WAV, FLAC,
+or existing MP3
+
+Daytona
+Clean Python
+workspace
+
+Sapat
+ffmpeg conversion
+--api cloudflare
+
+Workers AI
+@cf/openai/whisper
+REST transcription
+
+Reviewed transcript package
+Text output, provider metadata, known issues, and handoff notes
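
As a supplement to Step 5 of the guide in this patch, the request it describes (MP3 bytes sent to Cloudflare's model execution endpoint, authenticated with an account ID and API token) can be sketched in shell. This is a minimal illustration of the request shape, not Sapat's actual implementation. The helper names `cf_whisper_url` and `cf_whisper_transcribe` are hypothetical, and the bearer-token header and raw-body upload are assumptions based on the Cloudflare REST API guide listed in the references.

```bash
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical helpers for illustration.
# They are not part of Sapat or of any Cloudflare tooling.

# Build the model execution URL from an account ID and an optional model name.
# The default model matches the CLOUDFLARE_WHISPER_MODEL value used in the guide.
cf_whisper_url() {
  local account_id="$1"
  local model="${2:-@cf/openai/whisper}"
  printf 'https://api.cloudflare.com/client/v4/accounts/%s/ai/run/%s\n' \
    "$account_id" "$model"
}

# POST the raw bytes of an MP3 file and print the JSON response.
# Assumes bearer-token authentication; fails early if credentials are unset.
cf_whisper_transcribe() {
  local mp3_path="$1"
  local url
  url="$(cf_whisper_url \
    "${CLOUDFLARE_ACCOUNT_ID:?set CLOUDFLARE_ACCOUNT_ID}" \
    "${CLOUDFLARE_WHISPER_MODEL:-@cf/openai/whisper}")"
  curl --silent --show-error --fail \
    -X POST \
    -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN:?set CLOUDFLARE_API_TOKEN}" \
    --data-binary "@${mp3_path}" \
    "$url"
}
```

With the credentials from Step 3 exported and an MP3 at a hypothetical path such as `recordings/demo-call.mp3`, running `cf_whisper_transcribe recordings/demo-call.mp3` would print the raw JSON response. Calling the endpoint directly like this is one way to tell a credential or account problem apart from a Sapat or `ffmpeg` problem.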