Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions authors/ness.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
Author: Ness Title: Open-source contributor Description: Ness contributes
workflow-focused developer documentation and small open-source tooling examples,
with an interest in reproducible environments, AI-assisted engineering, and
practical command line workflows. Author Image:
[https://avatars.githubusercontent.com/u/52550279?v=4] Author LinkedIn: Author
Twitter: Company Name: Independent Company Description: Independent contributor
21 changes: 21 additions & 0 deletions definitions/20260520_definition_synchronous_speech_recognition.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
---
title: 'Synchronous Speech Recognition'
description: 'A speech recognition request pattern where one API call sends audio and waits for the transcript response.'
date: 2026-05-20
author: 'Ness'
---

# Synchronous Speech Recognition

## Definition

Synchronous speech recognition is a speech-to-text request pattern where a
client sends audio to a recognition service and waits for the completed
transcript response before the request finishes.

## Context and Usage

Synchronous speech recognition is commonly used for short recordings, smoke
tests, batch scripts, and simple command line tools. It is easier to integrate
than live streaming or asynchronous job polling, but it is best suited to audio
that can be processed within the provider's request limits.
287 changes: 287 additions & 0 deletions guides/20260520_run_ibm_watson_transcription_with_sapat_in_daytona.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,287 @@
---
title: "Run IBM Watson Transcription with Sapat"
description:
"Build a reproducible Daytona workflow for transcribing video files with Sapat
and IBM Watson Speech to Text."
date: 2026-05-20
author: "Ness"
tags: ["ai", "transcription", "daytona", "python"]
---

# Run IBM Watson Transcription with Sapat

# Introduction

A transcription workflow is easiest to trust when every run is boring in the
same way. The same command should convert the source file, call the chosen
provider, write the transcript, and leave enough notes that another engineer can
repeat the job. That is hard to maintain when audio tools, Python dependencies,
API keys, and generated text files are scattered across a developer laptop.

[Sapat](https://github.com/nkkko/sapat) gives this workflow a small command line
shape. It accepts video files, converts them to MP3 with `ffmpeg`, sends the
audio to a transcription provider, and saves a `.txt` transcript beside the
source file. Running Sapat inside a [Daytona workspace](https://www.daytona.io/)
keeps the toolchain and secrets in an isolated development environment. Adding
IBM Watson Speech to Text gives teams another provider option for
[synchronous speech recognition](../definitions/20260520_definition_synchronous_speech_recognition.md),
where one HTTP request sends the audio and returns recognition results after the
service finishes processing the file.

This guide walks through a Daytona-based Sapat workflow backed by IBM Watson
Speech to Text. It is written around a companion Sapat implementation that adds
`--api ibm`, environment-based IBM configuration, and mocked tests for the
recognition request and transcript extraction path.

## TL;DR

- Use Daytona to run Sapat in a reproducible workspace rather than configuring
Python, `ffmpeg`, and credentials on your host machine.
- Add `IBM_WATSON_STT_API_KEY`, `IBM_WATSON_STT_URL`, and optional model
settings in `.env`.
- Run `sapat <file-or-directory> --api ibm` to convert MP4 files to MP3 and send
the generated audio to IBM Watson Speech to Text.
- Review the generated `.txt` file before sharing it downstream.
- Use mocked tests to verify the provider wiring without sending audio to IBM.

## Workflow Overview

![IBM Watson and Sapat workflow](./assets/20260520_run_ibm_watson_transcription_with_sapat_in_daytona_img1.svg)

Sapat is useful because it keeps the local file workflow simple:

- A source `.mp4` file is the input.
- A temporary `.mp3` file is generated with `ffmpeg`.
- The selected provider receives the audio.
- A `.txt` transcript is written next to the source file.
- The temporary MP3 is removed.

The IBM Watson Speech to Text HTTP API fits this flow because its basic
`POST /v1/recognize` request accepts an audio stream and returns a JSON response
with transcript alternatives. The provider can send the MP3 with basic
authentication, pass the configured IBM model as a query parameter, read
`results[].alternatives[0].transcript`, and return a single transcript string to
Sapat's existing file writer.

## Prerequisites

Before starting, make sure you have:

- Daytona installed and connected to your Git provider.
- An IBM Cloud Speech to Text service instance.
- The service API key and service URL from IBM Cloud.
- A short MP4 file for the first smoke test.
- Basic comfort with Python commands and `.env` files.

Start with short audio. A one minute product demo, lecture excerpt, or support
call sample is enough to validate the pipeline. Once the workspace and provider
configuration are proven, you can run longer recordings or directories of files.

## Step 1: Create a Daytona Workspace

Create a workspace from the Sapat repository:

```bash
daytona create https://github.com/nkkko/sapat --code
```

Open the workspace in your editor. If you are testing the companion IBM Watson
branch, fetch it inside the workspace:

```bash
git remote add enes https://github.com/EnesBrt/sapat.git
git fetch enes codex/ibm-watson-transcription-provider
git switch -c ibm-watson-transcription enes/codex/ibm-watson-transcription-provider
```

The companion implementation is available at
[nibzard/sapat#27](https://github.com/nibzard/sapat/pull/27).

## Step 2: Install Sapat and ffmpeg

Install Sapat in editable mode from the project root:

```bash
python -m pip install --upgrade pip
python -m pip install -e .
```

Then confirm `ffmpeg` is available:

```bash
ffmpeg -version
```

If `ffmpeg` is missing and your Daytona workspace uses a Debian or Ubuntu based
image, install it with:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

The important part is that this setup lives in the workspace. Your host machine
does not need another Python environment, another global command line tool, or a
set of long-lived media-processing dependencies.

## Step 3: Configure IBM Watson

Create a `.env` file in the Sapat project root:

```bash
IBM_WATSON_STT_API_KEY=your_ibm_watson_api_key_here
IBM_WATSON_STT_URL=https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your_instance_id
IBM_WATSON_STT_MODEL=en-US_BroadbandModel
IBM_WATSON_STT_SMART_FORMATTING=true
```

The first two variables are required. `IBM_WATSON_STT_MODEL` defaults to
`en-US_BroadbandModel`, which matches IBM's basic recognition examples for
English broadband audio. `IBM_WATSON_STT_SMART_FORMATTING` is passed through as
a query parameter so the service can format dates, times, numbers, and similar
entities when the selected model supports it.

Keep `.env` out of commits. The API key belongs in the workspace environment or
secret manager, never in guide text, PR bodies, transcripts, or screenshots.

## Step 4: Run a Smoke Test

Run Sapat against one short file first:

```bash
sapat ./samples/customer-demo.mp4 --api ibm --language en --quality M
```

The IBM provider currently uses the configured Watson model rather than deriving
the model from `--language`. Keep `--language` in the command when it helps you
document the intent of the run, but choose the actual recognition model through
`IBM_WATSON_STT_MODEL`.

During the run, Sapat will:

1. Convert `customer-demo.mp4` to `customer-demo.mp3`.
2. Send the MP3 bytes to `{IBM_WATSON_STT_URL}/v1/recognize`.
3. Authenticate with basic auth using `apikey` and your API key.
4. Pass the configured model and smart-formatting setting as query parameters.
5. Extract the top transcript alternatives from IBM's JSON response.
6. Write `customer-demo.txt`.
7. Delete the temporary MP3 file.

For a folder of recordings, point Sapat at the folder:

```bash
sapat ./recordings --api ibm --quality M
```

Sapat processes `.mp4` files in the directory and writes one `.txt` file for
each input file.

## Step 5: Review and Package the Transcript

Open the generated transcript before it leaves the workspace. A good review pass
is simple:

- Confirm the transcript covers the full recording.
- Check product names, acronyms, and customer names.
- Remove private or irrelevant sections before sharing.
- Add a short run note beside the transcript.

Example run note:

```text
source: customer-demo.mp4
provider: ibm-watson-speech-to-text
model: en-US_BroadbandModel
quality: M
review status: names checked, roadmap details redacted
```

This note turns the transcript from a loose text file into a reviewable artifact.
When a teammate asks where the text came from, the answer is in the workspace
next to the output.

## Step 6: Validate Without Calling IBM

The companion provider includes a mocked unit test:

```bash
PYTHONPATH=src python -m unittest tests.test_ibm_watson
```

The test verifies that the provider:

- Sends audio to `/v1/recognize`.
- Uses `("apikey", IBM_WATSON_STT_API_KEY)` for basic auth.
- Sets `Content-Type` based on the generated audio file extension.
- Passes the Watson model and smart-formatting values as query parameters.
- Extracts transcript text from IBM's `results` array.

Run this test in CI or before opening provider changes. It catches wiring
mistakes without sending private audio to a third-party API.

## Common Issues and Troubleshooting

**Problem:** `IBM_WATSON_STT_API_KEY is required for IBM Watson transcription.`

**Solution:** Confirm that `.env` exists in the Sapat project root and that the
workspace terminal is running Sapat from that root.

**Problem:** `IBM_WATSON_STT_URL is required for IBM Watson transcription.`

**Solution:** Copy the service URL from the IBM Cloud Speech to Text instance,
including the `/instances/...` path, and store it in `.env`.

**Problem:** IBM returns an authentication error.

**Solution:** Check that the API key belongs to the same service instance as the
service URL. IBM's examples use basic auth with username `apikey` and the API
key as the password.

**Problem:** The transcript quality is poor.

**Solution:** Try `--quality H`, verify the source audio track, and choose an IBM
model that matches the language and audio type. For noisy sources, test a short
excerpt before running the full folder.

**Problem:** The workflow works locally but not in automation.

**Solution:** Keep the same variable names, but inject them through your CI or
workspace secret manager. Run the mocked test on every change and reserve real
provider calls for intentional smoke-test jobs.

## Why Daytona Helps

Transcription pipelines are easy to underestimate. They touch local media files,
provider credentials, generated artifacts, and sometimes sensitive customer or
internal conversation data. Daytona gives the workflow a clean boundary: create
the workspace, install the dependencies, run Sapat, review the transcript, keep
the useful artifacts, and remove the environment when the job is complete.

That boundary makes the process easier to hand to another engineer. Instead of
"install these tools and set these variables somewhere on your machine," the
workflow becomes "open this workspace, check `.env`, run this command, review
this output." It is a small operational difference, but it is the difference
between an ad hoc transcript and a repeatable engineering intake process.

## Conclusion

IBM Watson Speech to Text gives Sapat another practical provider for teams that
already use IBM Cloud or want a synchronous HTTP transcription path. Daytona
keeps the surrounding workflow contained: Python dependencies, `ffmpeg`, media
files, API credentials, generated transcripts, and run notes all live in the
workspace.

The habit is what matters: start with a short smoke test, configure the provider
through `.env`, review the generated text before sharing it, and keep the source
file and run note together. With that routine in place, Sapat can turn product
demos, customer calls, interviews, lectures, and QA recordings into text without
turning every transcription request into a fresh setup project.

## References

- [Sapat repository](https://github.com/nkkko/sapat)
- [IBM Speech to Text getting started](https://cloud.ibm.com/docs/speech-to-text?topic=speech-to-text-gettingStarted)
- [IBM Speech to Text synchronous HTTP interface](https://cloud.ibm.com/docs/speech-to-text?topic=speech-to-text-http)
- [IBM Speech to Text API reference](https://cloud.ibm.com/apidocs/speech-to-text)
- [Companion IBM Watson provider PR](https://github.com/nibzard/sapat/pull/27)
- [Daytona](https://www.daytona.io/)
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.