9 changes: 9 additions & 0 deletions authors/aclanot.md
@@ -0,0 +1,9 @@
Author: Aclanot
Title: Open-source Contributor
Description: Aclanot builds and documents AI-assisted developer workflows, focusing on reproducible environments, practical automation, and secure handling of API-backed tools for engineering teams.
Author Image: ![Aclanot](https://avatars.githubusercontent.com/u/274915357?v=4)
Author LinkedIn:
Company Name: Independent
Company Description: Independent technical contributor focused on AI developer workflows and automation.
Company Logo Dark:
Company Logo White:
20 changes: 20 additions & 0 deletions definitions/20260520_definition_edge_ai_transcription.md
@@ -0,0 +1,20 @@
---
title: "Edge AI Transcription"
description: "A transcription workflow that sends audio to AI models running close to users or applications at the network edge."
date: 2026-05-20
author: "Aclanot"
---

# Edge AI Transcription

## Definition

Edge AI transcription is the process of converting speech to text with AI models that run close to the user, application, or data source rather than only in a centralized application backend.

## Context and Usage

Engineering teams use edge AI transcription when they want a speech-to-text workflow that is easy to call from distributed applications, reproducible development environments, or regional intake pipelines.

In practice, a tool prepares audio, sends it to an edge-capable AI provider, receives transcript text, and stores the result for review or downstream automation.

This pattern is useful for demo recordings, support calls, research interviews, accessibility drafts, and transcript archives. Teams still need to manage credentials carefully, verify transcript quality, and decide which recordings are appropriate for third-party processing.
247 changes: 247 additions & 0 deletions guides/20260520_cloudflare_ai_transcription_with_sapat.md
@@ -0,0 +1,247 @@
---
title: "Cloudflare AI Transcription with Sapat"
description: "Run Sapat with Cloudflare Workers AI inside Daytona to create a reproducible transcription workflow for engineering teams."
date: 2026-05-20
author: "Aclanot"
tags: ["daytona", "sapat", "cloudflare", "transcription", "workers-ai"]
---

# Cloudflare AI Transcription with Sapat

## Introduction

AI transcription work usually starts as a quick local script and then becomes a repeatability problem. One teammate has `ffmpeg` installed, another has a different Python version, and someone runs a one-off command against a production API key.

For engineering teams, that is a weak foundation for anything downstream: bug triage notes, customer-call summaries, release demos, accessibility captions, or retrieval pipelines.

This guide shows how to run Sapat, a small Python video transcription tool, inside a Daytona workspace with a Cloudflare Workers AI provider. The workflow keeps the dev environment consistent and stores credentials through [environment variables](../definitions/20241126_definition_environment_variables.md).

It also gives AI engineers a practical way to smoke-test [edge AI transcription](../definitions/20260520_definition_edge_ai_transcription.md) before they process a larger batch of recordings.

The Cloudflare provider used in this guide is implemented in the companion Sapat pull request: [nibzard/sapat#21](https://github.com/nibzard/sapat/pull/21). While that PR is under review, the commands below use the contributor branch. After it is merged, use the upstream Sapat repository directly.

![Cloudflare-backed Sapat workflow in Daytona](assets/20260520_cloudflare_sapat_daytona_workflow.svg)

## TL;DR

- Use Daytona to create a clean Python workspace for Sapat instead of depending on local machine state.
- Configure `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` as environment variables, never in source code.
- Run `sapat <recording> --api cloudflare` to send Sapat's converted MP3 audio to Cloudflare Workers AI Whisper.
- Validate the transcript with a short smoke clip before running a larger directory of recordings.
- Keep the first pass focused on transcription; run correction or summarization as a separate review step.

## Materials Checklist

| Item | Why you need it |
| --- | --- |
| Daytona installed and configured | Creates a reproducible workspace for the Sapat project |
| Docker or a Daytona target | Runs the workspace in a clean development environment |
| Cloudflare account with Workers AI access | Provides the transcription API used by the Cloudflare provider |
| Workers AI API token | Authenticates calls to the Cloudflare REST API |
| Cloudflare account ID | Selects the account that runs the Workers AI model |
| One short `.mp4`, `.wav`, `.flac`, or `.mp3` sample | Lets you verify the workflow before processing real batches |

## Step 1: Create a Daytona Workspace

Start from a clean workspace so the transcription workflow is not tied to one laptop. If the Cloudflare provider PR is still under review, create the workspace from the fork that contains the provider branch:

```bash
daytona create https://github.com/aclanot/sapat --code
```

Open the workspace terminal and switch to the Cloudflare provider branch:

```bash
git fetch origin codex/cloudflare-workers-ai-provider
git checkout codex/cloudflare-workers-ai-provider
```

After the companion PR is merged, use the upstream repository instead:

```bash
daytona create https://github.com/nibzard/sapat --code
```

This gives every engineer the same project checkout, the same dependency metadata, and the same command surface. It also makes the workspace easy to discard after a transcription run that handled sensitive recordings.

## Step 2: Install Runtime Tools

Sapat converts video files to MP3 before sending them to a provider, so the workspace needs `ffmpeg`. Install it in the workspace terminal:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

Then install Sapat in editable mode:

```bash
python -m pip install -e .
```

Confirm that the CLI exposes the Cloudflare provider:

```bash
sapat --help
```

The `--api` option should include `cloudflare` alongside the existing providers. If you are using the branch from the companion PR, the help output should include this provider choice:

```text
--api [openai|groq|azure|cloudflare]
```
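
A small preflight can confirm that both command-line tools are on the `PATH` before any audio is processed. The sketch below is illustrative and not part of Sapat; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("ffmpeg", "sapat"), which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    # `which` is injectable so the check can be exercised without real binaries.
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("ffmpeg and sapat are both available.")
```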

## Step 3: Configure Cloudflare Credentials

Cloudflare's Workers AI REST API needs an account ID and an API token. Create a Workers AI API token from the Cloudflare dashboard, copy the account ID, and expose both values to the Daytona workspace.

For a one-off smoke test, export the values in the workspace shell:

```bash
export CLOUDFLARE_ACCOUNT_ID="your-account-id"
export CLOUDFLARE_API_TOKEN="your-workers-ai-token"
export CLOUDFLARE_WHISPER_MODEL="@cf/openai/whisper"
```

For repeated workspace use, store them with Daytona's environment management:

```bash
daytona env set CLOUDFLARE_ACCOUNT_ID=your-account-id CLOUDFLARE_API_TOKEN=your-workers-ai-token
```

Do not commit `.env` files with real credentials. Sapat's `.env.example` documents the variable names, but the actual token should stay in Daytona, your shell, or your team's secrets manager.
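
To confirm the credentials are present without echoing them, a check along these lines may help. It is a minimal sketch, assuming only the two variable names used above; `missing_credentials` is a hypothetical helper, not something Sapat provides.

```python
import os

REQUIRED_VARS = ("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")


def missing_credentials(env=None):
    """Return required variable names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    # Report variable names only; the values themselves are never printed.
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("Cloudflare credentials are set.")
```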

## Step 4: Prepare a Short Smoke Clip

Before processing a full customer call or a directory of demo recordings, use a short clip with known content. A thirty-second sample is enough to verify credential access, audio conversion, provider routing, and transcript writing.

Create a simple workspace layout:

```bash
mkdir -p recordings
cp ~/Downloads/demo-call.mp4 recordings/demo-call.mp4
```

Sapat writes the transcript next to the input file with the same base name and a `.txt` extension. Keeping recordings in their own directory makes the outputs easy to review:

```text
recordings/
  demo-call.mp4
  demo-call.txt
```

If your input is already an `.mp3`, Sapat can use it directly. If it is a video file, Sapat uses `ffmpeg` to create a temporary MP3, sends that audio to the provider, writes the transcript, and removes the temporary MP3 after the run.
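
The naming rule and the conversion step can be sketched as follows. This is an approximation under stated assumptions: the `ffmpeg` flags are a common way to extract an MP3 audio track and are not taken from Sapat's source, and both function names are hypothetical.

```python
from pathlib import Path


def transcript_path(recording):
    """Transcript sits next to the input: same base name, .txt extension."""
    return Path(recording).with_suffix(".txt")


def ffmpeg_mp3_command(recording, mp3_path):
    """Build an ffmpeg command that extracts the audio track as MP3.

    The flags are illustrative; Sapat's own conversion settings may differ.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the temporary output if it exists
        "-i", str(recording),      # input recording
        "-vn",                     # drop the video stream
        "-codec:a", "libmp3lame",  # encode audio as MP3
        str(mp3_path),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` in a workspace where `ffmpeg` is installed.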

## Step 5: Run Sapat with Cloudflare Workers AI

Run the smoke clip through the Cloudflare provider:

```bash
sapat recordings/demo-call.mp4 --api cloudflare --quality M
```

The provider sends the converted MP3 bytes to Cloudflare's model execution endpoint:

```text
https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/openai/whisper
```

Cloudflare's Whisper model accepts binary audio through the REST API and returns a JSON result that includes transcript text. The Sapat provider reads that response and writes the text field to `recordings/demo-call.txt`.
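
For readers who want to see the shape of that exchange, here is a minimal standard-library sketch. It is not the provider code from the companion PR. It assumes bearer-token authentication, a raw binary request body sent as `application/octet-stream`, and a response envelope of the form `{"success": ..., "result": {"text": ...}}`; the function names are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_run_url(account_id, model="@cf/openai/whisper"):
    """Return the Workers AI model execution URL for an account."""
    return f"{API_BASE}/accounts/{account_id}/ai/run/{model}"


def extract_text(payload):
    """Pull the transcript out of a decoded Workers AI response.

    Assumes the envelope {"success": ..., "result": {"text": ...}}.
    """
    if not payload.get("success", False):
        raise RuntimeError(f"Cloudflare reported errors: {payload.get('errors')}")
    return payload["result"]["text"]


def transcribe(mp3_bytes, account_id, token, model="@cf/openai/whisper"):
    """POST raw MP3 bytes and return the transcript text (needs network access)."""
    request = urllib.request.Request(
        build_run_url(account_id, model),
        data=mp3_bytes,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_text(json.load(response))
```

Keeping URL construction and response parsing in separate functions lets both be checked without a network call or a real token.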

Review the first transcript:

```bash
sed -n '1,80p' recordings/demo-call.txt
```

Check for the operational details that usually decide whether a transcript is usable:

- Product names, acronyms, and speaker names are recognizable.
- The first and last spoken sections are present.
- The transcript does not show a repeated failure phrase or an empty output.
- Important timestamps can be recovered from the original recording if needed.
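
Part of this checklist can be automated as a first filter before human review. The sketch below is one possible approach, not a Sapat feature: `transcript_problems`, the `expected_terms` argument, and the `max_repeats` threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def transcript_problems(text, expected_terms=(), max_repeats=5):
    """Return a list of human-readable issues found in a transcript."""
    stripped = text.strip()
    if not stripped:
        return ["transcript is empty"]

    problems = []

    # Flag expected product names or acronyms that never appear.
    lowered = stripped.lower()
    for term in expected_terms:
        if term.lower() not in lowered:
            problems.append(f"expected term not found: {term}")

    # Flag one sentence repeated more than `max_repeats` times.
    sentences = [
        s.strip().lower()
        for s in stripped.replace("\n", " ").split(".")
        if s.strip()
    ]
    if sentences:
        sentence, count = Counter(sentences).most_common(1)[0]
        if count > max_repeats:
            problems.append(f"sentence repeated {count} times: {sentence!r}")
    return problems
```

An empty result only means that none of these mechanical checks fired; it does not replace reading the transcript.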

The Cloudflare Workers AI Whisper response can also include fields such as word-level data and VTT text. The current Sapat CLI writes only the plain transcript text, which is the simplest artifact for engineering handoff.

## Step 6: Process a Small Batch

After the smoke test passes, process a directory of `.mp4` files:

```bash
sapat recordings --api cloudflare --quality M
```

Use a small batch first. For example, start with three recordings, inspect the outputs, then process the rest. That surfaces credential, quota, and recording-quality problems before a long run.
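
One way to pick that first small batch is to list recordings that do not yet have a transcript beside them. This is a sketch under the naming rule described earlier (same base name, `.txt` extension); `pending_recordings` and its defaults are hypothetical.

```python
from pathlib import Path


def pending_recordings(directory, limit=3, suffix=".mp4"):
    """Return up to `limit` recordings that do not yet have a .txt transcript."""
    recordings = sorted(Path(directory).glob(f"*{suffix}"))
    pending = [p for p in recordings if not p.with_suffix(".txt").exists()]
    return pending[:limit]
```

Each returned path can then be passed to `sapat <path> --api cloudflare` individually, and the function re-run to pick up the next few files.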

A lightweight review loop works well:

```bash
for transcript in recordings/*.txt; do
  echo "===== $transcript ====="
  sed -n '1,40p' "$transcript"
done
```

If a transcript needs domain correction, do that as a second pass. The Cloudflare provider in the companion Sapat PR is intentionally focused on transcription; it does not add a Cloudflare chat correction path.

Keeping transcription and correction separate makes the workflow easier to audit. It also lets teams decide whether correction should use OpenAI, Groq, a local model, or a manual review checklist.

## Step 7: Package the Output for Engineering Use

Raw transcripts become more useful when they are packaged with enough context for another engineer to trust them. For each recording, keep a short metadata note:

```markdown
# demo-call

- Source file: recordings/demo-call.mp4
- Provider: cloudflare
- Model: @cf/openai/whisper
- Quality setting: M
- Reviewed by: <name>
- Known issues: product name "AcmeDB" appears once as "Acme DB"
```

That note turns a transcript into a handoff artifact. A teammate can see which [API](../definitions/20241212_definition_api.md) produced the text, which model was used, and which parts still need human review.
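
If the note is written for every recording, generating it keeps the fields consistent. The helper below is a hypothetical sketch that mirrors the fields shown above; it is not part of Sapat.

```python
from pathlib import Path


def render_metadata_note(source, provider, model, quality, reviewer, known_issues=()):
    """Render a Markdown handoff note for one transcript."""
    issues = "; ".join(known_issues) if known_issues else "none recorded"
    lines = [
        f"# {Path(source).stem}",  # title is the recording's base name
        "",
        f"- Source file: {source}",
        f"- Provider: {provider}",
        f"- Model: {model}",
        f"- Quality setting: {quality}",
        f"- Reviewed by: {reviewer}",
        f"- Known issues: {issues}",
    ]
    return "\n".join(lines) + "\n"
```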

This matters when transcripts feed a search index, a support escalation, or a release-note draft.

## Common Issues and Troubleshooting

**Problem:** `CLOUDFLARE_API_TOKEN is required for Cloudflare transcription.`

**Solution:** Export the token in the workspace shell or set it with Daytona environment variables. To confirm the variable is visible without printing its value, run `test -n "$CLOUDFLARE_API_TOKEN" && echo set`. Do not paste the token into logs, PR comments, or issue threads.

**Problem:** Cloudflare returns an authorization error.

**Solution:** Check that the token has Workers AI permissions and belongs to the same account ID used in `CLOUDFLARE_ACCOUNT_ID`. If your organization uses multiple Cloudflare accounts, it is easy to copy the right token and the wrong account ID.

**Problem:** The transcript file exists but is empty.

**Solution:** Start with a shorter input and run `ffprobe recordings/demo-call.mp4` to confirm the file has an audio stream. Silent videos, corrupted uploads, or unsupported audio streams can produce empty or low-quality transcripts.
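
The audio-stream check can also be scripted. The sketch below assumes `ffprobe` is installed alongside `ffmpeg`; the function names are hypothetical, and command construction is kept separate from output parsing so the parsing can be tested without a media file.

```python
import subprocess


def ffprobe_audio_command(recording):
    """Build an ffprobe command that prints one codec name per audio stream."""
    return [
        "ffprobe",
        "-v", "error",                         # suppress banner and info logs
        "-select_streams", "a",                # audio streams only
        "-show_entries", "stream=codec_name",
        "-of", "csv=p=0",                      # bare values, one per line
        str(recording),
    ]


def audio_codecs(ffprobe_output):
    """Parse the command's stdout into a list of audio codec names."""
    return [line.strip() for line in ffprobe_output.splitlines() if line.strip()]


def has_audio_stream(recording):
    """Run ffprobe (must be installed) and report whether audio is present."""
    result = subprocess.run(
        ffprobe_audio_command(recording),
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(audio_codecs(result.stdout))
```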

**Problem:** `ffmpeg` is missing.

**Solution:** Install it in the Daytona workspace with `sudo apt-get install -y ffmpeg`. If the workspace image is locked down, add `ffmpeg` to the dev container or image configuration so future workspaces have it preinstalled.

**Problem:** The transcript has the wrong domain terms.

**Solution:** Treat provider output as the first pass. Add a review step for product names, acronyms, and speaker labels. If you need automated correction, run a second pass with an approved model.

## Conclusion

Running Sapat with Cloudflare Workers AI inside Daytona gives teams a reproducible transcription workflow instead of a one-off local command. Daytona keeps the workspace consistent, and Sapat handles file conversion and provider routing.

Cloudflare Workers AI provides a simple REST-backed Whisper model for the transcription pass.

The important habit is to keep the workflow boring: isolate credentials, start with a smoke clip, verify the transcript before a batch run, and package the output with enough metadata for another engineer to reproduce the result.

That is the difference between "we got a transcript" and "we can trust this transcript in an engineering workflow."

## References

- [Sapat repository](https://github.com/nibzard/sapat)
- [Cloudflare Workers AI Whisper model](https://developers.cloudflare.com/workers-ai/models/whisper/)
- [Cloudflare Workers AI REST API guide](https://developers.cloudflare.com/workers-ai/get-started/rest-api/)
- [Cloudflare Execute AI Model API reference](https://developers.cloudflare.com/api/operations/workers-ai-post-run-model)
- [Daytona documentation](https://www.daytona.io/docs/)
- [Companion Sapat Cloudflare provider PR](https://github.com/nibzard/sapat/pull/21)
50 changes: 50 additions & 0 deletions guides/assets/20260520_cloudflare_sapat_daytona_workflow.svg