# Audio Transcription
Transcribe audio files to text with optional diarization and known-speaker hints.
## Content
Transcribe audio using OpenAI, with optional speaker diarization when requested. Prefer the bundled CLI for deterministic, repeatable runs.
## Workflow
1. Collect inputs: audio file path(s), desired response format (`text`/`json`/`diarized_json`), optional language hint, and any known-speaker references.
2. Verify `OPENAI_API_KEY` is set. If it is missing, ask the user to set it locally (do not ask them to paste the key).
3. Run the bundled `transcribe_diarize.py` CLI with sensible defaults (fast text transcription).
4. Validate the output: transcription quality, speaker labels, and segment boundaries; if something is off, iterate with a single targeted change.
5. Save outputs under `output/transcribe/` when working in this repo.
## Decision rules

- Default to `gpt-4o-mini-transcribe` with `--response-format text` for fast transcription.
- If the user wants speaker labels or diarization, use `--model gpt-4o-transcribe-diarize --response-format diarized_json`.
- If audio is longer than ~30 seconds, keep `--chunking-strategy auto`.
- Prompting is not supported for `gpt-4o-transcribe-diarize`.
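The rules above can be sketched as a tiny helper that maps the requested output to CLI flags. This is illustrative only, not part of the bundled CLI:

```shell
# Illustrative sketch: map the decision rules to CLI flags.
pick_flags() {
  if [ "$1" = "diarize" ]; then
    # Speaker labels requested: diarization model + diarized JSON output.
    echo "--model gpt-4o-transcribe-diarize --response-format diarized_json"
  else
    # Default: fast text transcription.
    echo "--model gpt-4o-mini-transcribe --response-format text"
  fi
}
```

Call it as `pick_flags diarize` or `pick_flags text` and splice the result into the command line.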
## Output conventions

- Use `output/transcribe/<job-id>/` for evaluation runs.
- Use `--out-dir` for multiple files to avoid overwriting.
## Dependencies (install if missing)

Prefer `uv` for dependency management.
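With `uv`, a minimal sketch, assuming the `openai` package is the CLI's only third-party dependency:

```shell
uv venv                  # create a local virtual environment
uv pip install openai    # assumed sole third-party dependency
uv run python transcribe_diarize.py --help
```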
If `uv` is unavailable:
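A plain-`pip` fallback sketch, under the same assumption that `openai` is the only third-party dependency:

```shell
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade openai   # assumed sole third-party dependency
```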
## Environment

- `OPENAI_API_KEY` must be set for live API calls.
- If the key is missing, instruct the user to create one in the OpenAI platform UI and export it in their shell.
- Never ask the user to paste the full key in chat.
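A small sketch of the key check: it confirms the variable is set without ever printing its value (the function name is ours, not part of the CLI):

```shell
# Fail fast when OPENAI_API_KEY is unset; never echo the key itself.
check_openai_key() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set; export it in your shell (do not paste it into chat)." >&2
    return 1
  fi
  echo "OPENAI_API_KEY is set (value not shown)."
}
```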
## Skill path (set once)

User-scoped skills install under `$CODEX_HOME/skills` (default: `~/.codex/skills`).
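The resolution rule above amounts to one parameter expansion: honor `$CODEX_HOME` when set, otherwise fall back to `~/.codex`:

```shell
# Resolve the user-scoped skills directory.
skills_dir="${CODEX_HOME:-$HOME/.codex}/skills"
echo "$skills_dir"
```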
## CLI quick start
Single file (fast text default):
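A sketch, assuming the script sits at the repo root and takes the audio path as a positional argument (an assumption; check `--help`). The defaults already select `gpt-4o-mini-transcribe` with text output:

```shell
python transcribe_diarize.py audio/meeting.wav
```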
Diarization with known speakers (up to 4):
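A sketch using the flags named in the decision rules above; the `--known-speaker NAME=FILE` spelling and the positional audio argument are hypothetical, so confirm them against the CLI's `--help`:

```shell
python transcribe_diarize.py audio/meeting.wav \
  --model gpt-4o-transcribe-diarize \
  --response-format diarized_json \
  --chunking-strategy auto \
  --known-speaker "alice=refs/alice.wav" \
  --known-speaker "bob=refs/bob.wav"
```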
Plain text output (explicit):
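A sketch spelling out the defaults explicitly, plus `--out-dir` from the output conventions above (the positional audio argument is an assumption):

```shell
python transcribe_diarize.py audio/meeting.wav \
  --model gpt-4o-mini-transcribe \
  --response-format text \
  --out-dir output/transcribe/demo
```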
## Reference map

- `references/api.md`: supported formats, limits, response formats, and known-speaker notes.
## FAQ