diff --git a/authors/dr_alex_mitre.md b/authors/dr_alex_mitre.md
new file mode 100644
index 00000000..ef2da11b
--- /dev/null
+++ b/authors/dr_alex_mitre.md
@@ -0,0 +1,6 @@
+Author: Dr Alex Mitre Title: Software Developer Description: Dr Alex Mitre
+builds web, Swift, JavaScript, and Python applications, with experience in
+public-sector software, education technology, and reproducible AI workflows.
+Author Image: [https://avatars.githubusercontent.com/u/30060514?v=4] Author
+LinkedIn: Author Twitter: Company Name: Independent Company Description:
+Independent software development and AI workflow automation.
diff --git a/definitions/20260524_definition_baidu_speech_recognition.md b/definitions/20260524_definition_baidu_speech_recognition.md
new file mode 100644
index 00000000..de1e7f71
--- /dev/null
+++ b/definitions/20260524_definition_baidu_speech_recognition.md
@@ -0,0 +1,25 @@
+---
+title: 'Baidu Speech Recognition'
+description:
+  'A Baidu Cloud speech-to-text service for converting short audio clips into
+  text through REST APIs and SDKs.'
+date: 2026-05-24
+author: 'Dr Alex Mitre'
+---
+
+# Baidu Speech Recognition
+
+## Definition
+
+Baidu Speech Recognition is a Baidu Cloud speech-to-text service that converts
+spoken audio into text. Its short speech recognition API accepts complete audio
+clips, uses a language model selected by `dev_pid`, and returns recognized text
+in a JSON response.
+
+## Context and Usage
+
+In developer workflows, Baidu Speech Recognition can be used as a transcription
+backend for short demo clips, command recordings, meeting snippets, and
+language-specific speech-to-text tests. A reproducible setup stores credentials
+in environment variables, normalizes audio before submission, and keeps the
+generated transcript with the source clip and validation notes.
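The request and response shape described in the definition above can be sketched in a few lines of Python. This is a minimal sketch, not the Sapat provider's actual code: the function names and the `cuid` default are hypothetical, and the JSON field names (`format`, `rate`, `channel`, `cuid`, `token`, `dev_pid`, `speech`, `len`, `err_no`, `err_msg`, `result`) are assumptions based on Baidu's short speech REST documentation that should be checked against the current reference. It performs no network calls.

```python
import base64
import json


def build_short_speech_payload(audio, token, fmt, dev_pid=1737, rate=16000,
                               cuid="sapat-demo"):
    """Build the JSON body for one short speech request (hypothetical helper).

    Assumption: the audio travels base64-encoded in `speech`, and `len` is the
    size of the raw audio in bytes before encoding.
    """
    if not audio:
        raise ValueError("audio must not be empty")
    return {
        "format": fmt,        # container/codec name expected by the API
        "rate": rate,         # sample rate in Hz
        "channel": 1,         # mono audio
        "cuid": cuid,         # caller-chosen client identifier
        "token": token,       # access token obtained from the API key and secret
        "dev_pid": dev_pid,   # language model selector, e.g. 1737 or 1537
        "speech": base64.b64encode(audio).decode("ascii"),
        "len": len(audio),
    }


def extract_transcript(response_body):
    """Return the first recognized text, or raise if the API reported an error."""
    data = json.loads(response_body)
    if data.get("err_no") != 0:
        raise RuntimeError(f"Baidu error {data.get('err_no')}: {data.get('err_msg')}")
    results = data.get("result", [])
    return results[0] if results else ""


if __name__ == "__main__":
    payload = build_short_speech_payload(b"\x00\x01\x02\x03", token="demo-token", fmt="pcm")
    print(json.dumps({k: v for k, v in payload.items() if k != "token"}, indent=2))
```

Keeping payload construction and response parsing as pure functions makes both easy to test without credentials or network access.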
diff --git a/guides/20260524_baidu_speech_sapat_daytona.md b/guides/20260524_baidu_speech_sapat_daytona.md
new file mode 100644
index 00000000..341b92a2
--- /dev/null
+++ b/guides/20260524_baidu_speech_sapat_daytona.md
@@ -0,0 +1,186 @@
+---
+title: 'Run Baidu Speech Recognition with Sapat'
+description:
+  'Use Daytona and Sapat to transcribe short clips with Baidu Speech
+  Recognition from a reproducible workspace.'
+date: 2026-05-24
+author: 'Dr Alex Mitre'
+tags: ['ai', 'speech-to-text', 'daytona']
+---
+
+# Run Baidu Speech Recognition with Sapat
+
+## Introduction
+
+Sapat is a small command-line transcription tool that converts video files to
+MP3, sends the audio to a speech-to-text provider, and writes a sidecar text
+file. That makes it useful for demos, meeting clips, QA recordings, and short
+research notes where the transcript should be reproducible from one command.
+
+This guide shows how to run Sapat with a Baidu Speech Recognition provider from
+a Daytona workspace. The workflow keeps provider secrets in environment
+variables, uses Sapat's existing file and directory processing model, and
+documents the validation path before you use the transcript downstream.
+
+## TL;DR
+
+- Use a Daytona workspace so the ffmpeg and Python setup is repeatable.
+- Install the Sapat branch that adds `--api baidu` support.
+- Store `BAIDU_API_KEY` and `BAIDU_SECRET_KEY` in `.env`, not in Git.
+- Use short audio clips because Baidu's short speech API is designed for files
+  up to 60 seconds.
+- Verify the generated `.txt` output before using it in summaries, tickets, or
+  customer-facing notes.
+
+## Step 1: Create a Daytona workspace
+
+Create a workspace from the Sapat repository:
+
+```bash
+daytona create https://github.com/nkkko/sapat --code
+```
+
+Open the workspace terminal and confirm that Python and ffmpeg are available:
+
+```bash
+python --version
+ffmpeg -version
+```
+
+Sapat uses ffmpeg to extract audio from the input video before it calls the
+speech-to-text provider. Keeping that dependency inside the workspace makes the
+same command easier to rerun later.
+
+## Step 2: Install the Baidu-enabled Sapat branch
+
+Install the companion provider implementation:
+
+```bash
+pip install \
+  git+https://github.com/mitre88/sapat.git@add-baidu-transcription-provider
+```
+
+Confirm that the new provider is available:
+
+```bash
+sapat --help
+```
+
+The `--api` option should include `baidu` alongside the existing OpenAI, Groq,
+and Azure choices.
+
+## Step 3: Configure Baidu credentials safely
+
+Create a local `.env` file in the workspace root:
+
+```bash
+cat > .env <<'EOF'
+BAIDU_API_KEY=replace_with_your_api_key
+BAIDU_SECRET_KEY=replace_with_your_secret_key
+BAIDU_DEV_PID=1737
+BAIDU_SAMPLE_RATE=16000
+EOF
+```
+
+Use `BAIDU_DEV_PID=1737` for English clips and `BAIDU_DEV_PID=1537` for
+Mandarin clips. Leave `.env` uncommitted. A good workspace habit is to check the
+Git state before and after each transcription run:
+
+```bash
+git status --short
+```
+
+If `.env` appears in the output, add it to `.gitignore` before continuing.
+
+## Step 4: Prepare a short clip
+
+Baidu's short speech endpoint is meant for complete audio files under 60
+seconds. If the source video is longer, cut a smaller sample first:
+
+```bash
+ffmpeg -i long-demo.mp4 -t 45 -c copy baidu-demo-clip.mp4
+```
+
+For production workflows, split longer recordings into reviewed chunks and keep
+a manifest with the source file, clip window, language, and expected topic.
+That makes transcript review and reruns traceable.
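The chunk-and-manifest idea from Step 4 can be sketched as a small planning helper. This is an illustrative sketch rather than part of Sapat: `ClipEntry`, `plan_clips`, `ffmpeg_args`, and the `-partNNN` naming scheme are hypothetical, and the 45-second window simply mirrors the example command above to stay well under the 60-second limit. The helper only plans windows and builds ffmpeg argument lists; it does not run ffmpeg.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import PurePosixPath


@dataclass
class ClipEntry:
    """One manifest row: where a clip came from and what it should contain."""
    source: str
    clip: str
    start_seconds: float
    end_seconds: float
    language: str
    expected_topic: str


def plan_clips(source, total_seconds, language, expected_topic, max_seconds=45.0):
    """Split a recording of known duration into consecutive clip windows."""
    if total_seconds <= 0 or max_seconds <= 0:
        raise ValueError("durations must be positive")
    path = PurePosixPath(source)
    entries = []
    index = 0
    # Multiply by the index instead of accumulating to avoid float drift.
    while index * max_seconds < total_seconds:
        start = index * max_seconds
        end = min(start + max_seconds, total_seconds)
        clip = str(path.with_name(f"{path.stem}-part{index + 1:03d}{path.suffix}"))
        entries.append(ClipEntry(source, clip, start, end, language, expected_topic))
        index += 1
    return entries


def ffmpeg_args(entry):
    """Build the ffmpeg argument list that would cut one planned clip."""
    duration = entry.end_seconds - entry.start_seconds
    return ["ffmpeg", "-ss", str(entry.start_seconds), "-i", entry.source,
            "-t", str(duration), "-c", "copy", entry.clip]


if __name__ == "__main__":
    manifest = plan_clips("long-demo.mp4", 100.0, "en", "product demo")
    print(json.dumps([asdict(entry) for entry in manifest], indent=2))
    for entry in manifest:
        print(" ".join(ffmpeg_args(entry)))
```

With `-c copy`, ffmpeg cuts on keyframes, so the real clip boundaries may differ slightly from the planned windows. Re-encode instead of stream-copying if exact boundaries matter for review.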
+
+## Step 5: Run Sapat with Baidu
+
+Run Sapat on the prepared clip:
+
+```bash
+sapat baidu-demo-clip.mp4 --api baidu --language en --quality M
+```
+
+Sapat will create `baidu-demo-clip.txt` next to the input file. The Baidu
+provider converts audio to mono 16 kHz MP3 before sending the request, fetches a
+Baidu access token from the configured API key and secret, and then submits the
+base64-encoded audio to Baidu's short speech recognition endpoint.
+
+For Mandarin clips, use:
+
+```bash
+sapat baidu-demo-clip.mp4 --api baidu --language zh-CN --quality M
+```
+
+## Step 6: Review and record the result
+
+Open the transcript and check it against the source clip:
+
+```bash
+sed -n '1,120p' baidu-demo-clip.txt
+```
+
+Record a small run note with the command, clip duration, language, provider, and
+review status. For example:
+
+```text
+source: baidu-demo-clip.mp4
+duration: 45 seconds
+provider: baidu
+language: en
+command: sapat baidu-demo-clip.mp4 --api baidu --language en --quality M
+review: checked for speaker names, product names, and missing sentences
+```
+
+That run note is useful when the transcript becomes input to a bug report,
+customer summary, release note, or retrieval dataset.
+
+## Common Issues and Troubleshooting
+
+**Problem:** `BAIDU_API_KEY and BAIDU_SECRET_KEY must be set.`
+
+**Solution:** Confirm the `.env` file is in the workspace root and that the key
+names match exactly. Restart the shell if your workflow exports variables
+outside `.env`.
+
+**Problem:** The API returns an audio quality or format error.
+
+**Solution:** Keep clips short, use one audio channel, and let the Baidu
+provider convert the input to 16 kHz MP3. If the source file has unusual audio,
+normalize it first with ffmpeg and rerun Sapat.
+
+**Problem:** The transcript is empty or misses domain words.
+
+**Solution:** Verify that the `BAIDU_DEV_PID` matches the spoken language.
+Then rerun with a smaller clip and compare the result before processing a larger
+batch.
+
+## Conclusion
+
+You now have a reproducible Daytona workflow for short Baidu-backed
+transcription jobs in Sapat. The important parts are keeping credentials out of
+Git, using short source clips that fit Baidu's API model, validating the output
+before reuse, and keeping the command plus review notes with the generated
+transcript.
+
+## References
+
+- [Companion Sapat PR](https://github.com/nibzard/sapat/pull/47)
+- [Baidu short speech API](https://cloud.baidu.com/doc/SPEECH/s/Jlbxdezuf)
+- [Baidu Python SDK reference](https://cloud.baidu.com/doc/SPEECH/s/0lbxfnc9b)
+- [Sapat repository](https://github.com/nkkko/sapat)
+- [Daytona repository](https://github.com/daytonaio/daytona)
+
+![Baidu Sapat workflow](/assets/20260524_baidu_sapat_transcription_flow.svg)
diff --git a/guides/assets/20260524_baidu_sapat_transcription_flow.svg b/guides/assets/20260524_baidu_sapat_transcription_flow.svg
new file mode 100644
index 00000000..42cbeccc
--- /dev/null
+++ b/guides/assets/20260524_baidu_sapat_transcription_flow.svg
@@ -0,0 +1,28 @@
+[SVG markup not preserved; recoverable text only]
+Title: Baidu Sapat transcription flow
+Description: A Daytona workspace runs Sapat, converts a short video clip, sends audio to Baidu Speech Recognition, and saves a reviewed transcript.
+Labels: Daytona Workspace; Sapat, 16 kHz MP3; Baidu Speech API; Transcript, reviewed .txt; video clip; API request; result