6 changes: 6 additions & 0 deletions authors/dr_alex_mitre.md
@@ -0,0 +1,6 @@
Author: Dr Alex Mitre Title: Software Developer Description: Dr Alex Mitre
builds web, Swift, JavaScript, and Python applications, with experience in
public-sector software, education technology, and reproducible AI workflows.
Author Image: [https://avatars.githubusercontent.com/u/30060514?v=4] Author
LinkedIn: Author Twitter: Company Name: Independent Company Description:
Independent software development and AI workflow automation.
25 changes: 25 additions & 0 deletions definitions/20260524_definition_baidu_speech_recognition.md
@@ -0,0 +1,25 @@
---
title: 'Baidu Speech Recognition'
description:
'A Baidu Cloud speech-to-text service for converting short audio clips into
text through REST APIs and SDKs.'
date: 2026-05-24
author: 'Dr Alex Mitre'
---

# Baidu Speech Recognition

## Definition

Baidu Speech Recognition is a Baidu Cloud speech-to-text service that converts
spoken audio into text. Its short speech recognition API accepts complete audio
clips, uses a language model selected by `dev_pid`, and returns recognized text
in a JSON response.

## Context and Usage

In developer workflows, Baidu Speech Recognition can be used as a transcription
backend for short demo clips, command recordings, meeting snippets, and
language-specific speech-to-text tests. A reproducible setup stores credentials
in environment variables, normalizes audio before submission, and keeps the
generated transcript with the source clip and validation notes.
186 changes: 186 additions & 0 deletions guides/20260524_baidu_speech_sapat_daytona.md
@@ -0,0 +1,186 @@
---
title: 'Run Baidu Speech Recognition with Sapat'
description:
'Use Daytona and Sapat to transcribe short clips with Baidu Speech
Recognition from a reproducible workspace.'
date: 2026-05-24
author: 'Dr Alex Mitre'
tags: ['ai', 'speech-to-text', 'daytona']
---

# Run Baidu Speech Recognition with Sapat

## Introduction

Sapat is a small command-line transcription tool that converts video files to
MP3, sends the audio to a speech-to-text provider, and writes a sidecar text
file. That makes it useful for demos, meeting clips, QA recordings, and short
research notes where the transcript should be reproducible from one command.

This guide shows how to run Sapat with a Baidu Speech Recognition provider from
a Daytona workspace. The workflow keeps provider secrets in environment
variables, uses Sapat's existing file and directory processing model, and
shows how to validate the transcript before you use it downstream.

## TL;DR

- Use a Daytona workspace so the ffmpeg and Python setup is repeatable.
- Install the Sapat branch that adds `--api baidu` support.
- Store `BAIDU_API_KEY` and `BAIDU_SECRET_KEY` in `.env`, not in Git.
- Use short audio clips because Baidu's short speech API is designed for files
up to 60 seconds.
- Verify the generated `.txt` output before using it in summaries, tickets, or
customer-facing notes.

## Step 1: Create a Daytona workspace

Create a workspace from the Sapat repository:

```bash
daytona create https://github.com/nkkko/sapat --code
```

Open the workspace terminal and confirm that Python and ffmpeg are available:

```bash
python --version
ffmpeg -version
```

Sapat uses ffmpeg to extract audio from the input video before it calls the
speech-to-text provider. Keeping that dependency inside the workspace makes the
same command easier to rerun later.
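
If you want the same check in a script (for example, as a first step in a batch job), the sketch below reports which tools are missing from `PATH`. The function name `missing_tools` is illustrative and is not part of Sapat.

```python
import shutil


def missing_tools(tools=("python", "ffmpeg")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("python and ffmpeg are both available")
```

On some images the interpreter is only exposed as `python3`; in that case pass `("python3", "ffmpeg")` instead.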

## Step 2: Install the Baidu-enabled Sapat branch

Install the companion provider implementation:

```bash
pip install \
git+https://github.com/mitre88/sapat.git@add-baidu-transcription-provider
```

Confirm that the new provider is available:

```bash
sapat --help
```

The `--api` option should include `baidu` alongside the existing OpenAI, Groq,
and Azure choices.

## Step 3: Configure Baidu credentials safely

Create a local `.env` file in the workspace root:

```bash
cat > .env <<'EOF'
BAIDU_API_KEY=replace_with_your_api_key
BAIDU_SECRET_KEY=replace_with_your_secret_key
BAIDU_DEV_PID=1737
BAIDU_SAMPLE_RATE=16000
EOF
```

Use `BAIDU_DEV_PID=1737` for English clips and `BAIDU_DEV_PID=1537` for
Mandarin clips. Leave `.env` uncommitted. A good workspace habit is to check the
Git state before and after each transcription run:

```bash
git status --short
```

If `.env` appears in the output, add it to `.gitignore` before continuing.
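
Before the first run, it can help to confirm that `.env` holds real values rather than the placeholders from the template. The following is a minimal sketch, assuming simple `KEY=VALUE` lines with no quoting. The helper names `parse_env` and `env_problems` are hypothetical, and the Sapat branch may load and validate its settings differently.

```python
REQUIRED_KEYS = ("BAIDU_API_KEY", "BAIDU_SECRET_KEY")
INTEGER_KEYS = ("BAIDU_DEV_PID", "BAIDU_SAMPLE_RATE")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def env_problems(values):
    """Describe configuration problems without echoing any secret value."""
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value or value.startswith("replace_with_"):
            problems.append(f"{key} is missing or still a placeholder")
    for key in INTEGER_KEYS:
        if key in values and not values[key].isdigit():
            problems.append(f"{key} should be an integer")
    return problems


if __name__ == "__main__":
    try:
        with open(".env", encoding="utf-8") as handle:
            found = env_problems(parse_env(handle.read()))
    except FileNotFoundError:
        found = [".env not found in the current directory"]
    print("\n".join(found) if found else "Baidu settings look complete")
```

The check only reports key names, so its output is safe to paste into a ticket or run note.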

## Step 4: Prepare a short clip

Baidu's short speech endpoint accepts complete audio files of 60 seconds or
less. If the source video is longer, cut a smaller sample first:

```bash
ffmpeg -i long-demo.mp4 -t 45 -c copy baidu-demo-clip.mp4
```

For production workflows, split longer recordings into reviewed chunks and keep
a manifest with the source file, clip window, language, and expected topic.
That makes transcript review and reruns traceable.
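
One way to plan the chunks and the manifest is sketched below. It only computes clip windows and manifest entries; it does not call ffmpeg. The 45-second default, the `-partNN` naming, and the manifest field names are assumptions chosen for illustration, not a Sapat convention.

```python
def clip_windows(total_seconds, max_clip_seconds=45):
    """Split [0, total_seconds) into consecutive windows of at most max_clip_seconds."""
    if total_seconds <= 0 or max_clip_seconds <= 0:
        raise ValueError("durations must be positive")
    windows = []
    start = 0
    while start < total_seconds:
        end = min(start + max_clip_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows


def build_manifest(source, total_seconds, language, topic, max_clip_seconds=45):
    """Return one manifest entry per clip window of the source recording."""
    stem = source.rsplit(".", 1)[0]
    windows = clip_windows(total_seconds, max_clip_seconds)
    return [
        {
            "source": source,
            "clip": f"{stem}-part{index:02d}.mp4",
            "start_seconds": start,
            "end_seconds": end,
            "language": language,
            "expected_topic": topic,
        }
        for index, (start, end) in enumerate(windows, start=1)
    ]


if __name__ == "__main__":
    for entry in build_manifest("long-demo.mp4", 130, "en", "product demo"):
        print(entry)
```

Fixed windows can cut a sentence in half, so treat the boundaries as a starting point and adjust them during review.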

## Step 5: Run Sapat with Baidu

Run Sapat on the prepared clip:

```bash
sapat baidu-demo-clip.mp4 --api baidu --language en --quality M
```

Sapat will create `baidu-demo-clip.txt` next to the input file. The Baidu
provider converts audio to mono 16 kHz MP3 before sending the request, fetches a
Baidu access token from the configured API key and secret, and then submits the
base64-encoded audio to Baidu's short speech recognition endpoint.
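
To make the request shape concrete, here is a rough sketch of how a JSON body for the short speech endpoint could be assembled. The field names (`format`, `rate`, `channel`, `cuid`, `token`, `dev_pid`, `speech`, `len`) reflect my reading of Baidu's JSON upload format and should be checked against the short speech API reference linked below. The function name, the `cuid` default, and the omission of the token request and HTTP call are choices made for this illustration; the Sapat branch may build its request differently.

```python
import base64
import json


def build_short_speech_payload(audio_bytes, token, audio_format,
                               dev_pid=1737, rate=16000, cuid="sapat-demo"):
    """Assemble a JSON-serializable request body for one short audio clip."""
    return {
        "format": audio_format,  # must match the encoding actually sent
        "rate": rate,            # sample rate in Hz
        "channel": 1,            # mono
        "cuid": cuid,            # caller-chosen client identifier (assumed value)
        "token": token,          # access token obtained from the API key and secret
        "dev_pid": dev_pid,      # 1737 for English, 1537 for Mandarin
        "speech": base64.b64encode(audio_bytes).decode("ascii"),
        "len": len(audio_bytes), # byte length of the audio before base64 encoding
    }


if __name__ == "__main__":
    payload = build_short_speech_payload(b"\x00\x01\x02\x03", "example-token", "pcm")
    print(json.dumps({**payload, "token": "<redacted>"}, indent=2))
```

Note that `len` counts the raw audio bytes, not the length of the base64 string, and that the token is redacted before printing.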

For Mandarin clips, use:

```bash
sapat baidu-demo-clip.mp4 --api baidu --language zh-CN --quality M
```

## Step 6: Review and record the result

Open the transcript and check it against the source clip:

```bash
sed -n '1,120p' baidu-demo-clip.txt
```

Record a small run note with the command, clip duration, language, provider, and
review status. For example:

```text
source: baidu-demo-clip.mp4
duration: 45 seconds
provider: baidu
language: en
command: sapat baidu-demo-clip.mp4 --api baidu --language en --quality M
review: checked for speaker names, product names, and missing sentences
```

That run note is useful when the transcript becomes input to a bug report,
customer summary, release note, or retrieval dataset.
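
If you produce run notes for many clips, generating them from the same values you pass to Sapat keeps the note and the command in sync. This is a minimal sketch; `sapat_command` and `format_run_note` are hypothetical helpers, and the flags are the ones used in Step 5.

```python
def sapat_command(clip, api, language, quality):
    """Build the command line string recorded in the run note."""
    return f"sapat {clip} --api {api} --language {language} --quality {quality}"


def format_run_note(fields):
    """Render a mapping as 'key: value' lines, in insertion order."""
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


note = format_run_note({
    "source": "baidu-demo-clip.mp4",
    "duration": "45 seconds",
    "provider": "baidu",
    "language": "en",
    "command": sapat_command("baidu-demo-clip.mp4", "baidu", "en", "M"),
    "review": "pending",
})

if __name__ == "__main__":
    print(note, end="")
```

Set `review` by hand once someone has compared the transcript with the clip, so the note never claims a review that did not happen.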

## Common Issues and Troubleshooting

**Problem:** `BAIDU_API_KEY and BAIDU_SECRET_KEY must be set.`

**Solution:** Confirm the `.env` file is in the workspace root and that the key
names match exactly. If you export the variables from a shell profile instead of
`.env`, open a new shell so the updated values take effect.

**Problem:** The API returns an audio quality or format error.

**Solution:** Keep clips short, use one audio channel, and let the Baidu
provider convert the input to 16 kHz MP3. If the source file has unusual audio,
normalize it first with ffmpeg and rerun Sapat.

**Problem:** The transcript is empty or misses domain words.

**Solution:** Verify that the `BAIDU_DEV_PID` matches the spoken language.
Then rerun with a smaller clip and compare the result before processing a larger
batch.
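
When debugging empty output, it helps to separate "the API reported an error" from "the API succeeded but recognized nothing". The sketch below assumes the JSON response carries `err_no`, `err_msg`, and a `result` list of strings, which is my understanding of Baidu's short speech response format. Confirm the field names against the API reference. The function name and the sample responses are illustrative only.

```python
import json


def extract_transcript(response_text):
    """Return the recognized text, or raise with a reason when there is none."""
    data = json.loads(response_text)
    err_no = data.get("err_no")
    if err_no != 0:
        message = data.get("err_msg", "no message")
        raise RuntimeError(f"Baidu returned err_no={err_no}: {message}")
    transcript = "".join(data.get("result") or []).strip()
    if not transcript:
        raise ValueError("request succeeded but no text was recognized")
    return transcript


if __name__ == "__main__":
    # Illustrative responses, not captured from a real request.
    ok = '{"err_no": 0, "err_msg": "success.", "result": ["hello world"]}'
    print(extract_transcript(ok))
    try:
        extract_transcript('{"err_no": 0, "err_msg": "success.", "result": [""]}')
    except ValueError as error:
        print("empty transcript:", error)
```

A `RuntimeError` points at credentials, format, or quota, while a `ValueError` suggests checking the language setting or the audio itself.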

## Conclusion

You now have a reproducible Daytona workflow for short Baidu-backed
transcription jobs in Sapat. The important parts are keeping credentials out of
Git, using short source clips that fit Baidu's API model, validating the output
before reuse, and keeping the command plus review notes with the generated
transcript.

## References

- [Companion Sapat PR](https://github.com/nibzard/sapat/pull/47)
- [Baidu short speech API](https://cloud.baidu.com/doc/SPEECH/s/Jlbxdezuf)
- [Baidu Python SDK reference](https://cloud.baidu.com/doc/SPEECH/s/0lbxfnc9b)
- [Sapat repository](https://github.com/nkkko/sapat)
- [Daytona repository](https://github.com/daytonaio/daytona)

![Baidu Sapat workflow](/assets/20260524_baidu_sapat_transcription_flow.svg)
28 changes: 28 additions & 0 deletions guides/assets/20260524_baidu_sapat_transcription_flow.svg