5-Second Audiobook Passage Transcription: Plex + ffmpeg + mlx_whisper on Apple Silicon

Posted by Weixiang Zhang on May 05, 2026 · Topic: Audio Engineering


TL;DR

“Hey Siri, audiomark” triggers a POST to /api/ingest/audiomark. The endpoint queries Plex for what’s playing, calculates album-wide progress, extracts a 2-minute audio clip via ffmpeg, transcribes it with mlx_whisper, and saves the passage as a searchable highlight. Total pipeline: ~5 seconds.

The Insight

Plex exposes three things via /status/sessions:

The file path on the NAS (/volume1/Everything/Audiobooks/...)
The exact playback offset in milliseconds
The parent metadata key (album-level, for multi-file books)
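As a rough illustration of pulling those three fields out of a session response, here is a minimal sketch using only the standard library. It assumes the XML form of the response, with a Track element carrying viewOffset and parentRatingKey and a nested Media/Part element carrying the file path. The NowPlaying dataclass, the parse_sessions helper, and the sample document are hypothetical and hand-written for this example; they are not taken from the actual endpoint code.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class NowPlaying:
    file_path: str   # server-side path, as Plex sees it
    offset_ms: int   # playback position within the current track
    parent_key: str  # album-level ratingKey
    title: str       # album (book) title


def parse_sessions(xml_text: str) -> list[NowPlaying]:
    """Extract audio sessions from a /status/sessions XML response."""
    root = ET.fromstring(xml_text)
    sessions = []
    for track in root.iter("Track"):
        part = track.find("./Media/Part")
        if part is None or "file" not in part.attrib:
            continue  # no file path exposed for this session
        sessions.append(
            NowPlaying(
                file_path=part.attrib["file"],
                offset_ms=int(track.attrib.get("viewOffset", 0)),
                parent_key=track.attrib.get("parentRatingKey", ""),
                title=track.attrib.get("parentTitle", ""),
            )
        )
    return sessions


# Hand-written sample, trimmed to the attributes used above.
SAMPLE = """
<MediaContainer size="1">
  <Track ratingKey="4012" parentRatingKey="4000" parentTitle="Dune"
         title="Chapter 12" viewOffset="754000" duration="1800000">
    <Media><Part file="/volume1/Everything/Audiobooks/Dune/dune.m4b"/></Media>
  </Track>
</MediaContainer>
"""

now = parse_sessions(SAMPLE)[0]
print(now.file_path, now.offset_ms, now.parent_key)
```

Sessions without a file path are skipped rather than raising, since the rest of the pipeline cannot do anything with them.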

The NAS is mounted locally at /Volumes/Everything/.... That means ffmpeg can seek directly into a 15-hour .m4b file without downloading anything. And mlx_whisper runs natively on Apple Silicon through MLX, which executes on the GPU.

The entire pipeline is local. No cloud APIs. No network latency beyond the Plex status check.
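The translation from the server-side path to the local mount can be sketched as follows. The two prefixes are assumptions based on the paths quoted in this post, and to_local_path is a hypothetical helper name. Matching on path components rather than calling str.replace avoids accidentally rewriting a sibling such as /volume10.

```python
from pathlib import PurePosixPath

# Assumed prefixes, taken from the paths quoted in this post.
SERVER_PREFIX = PurePosixPath("/volume1/Everything")
LOCAL_PREFIX = PurePosixPath("/Volumes/Everything")


def to_local_path(server_path: str) -> str:
    """Translate a Plex server-side path to the local NAS mount."""
    path = PurePosixPath(server_path)
    try:
        relative = path.relative_to(SERVER_PREFIX)
    except ValueError:
        # Fail loudly instead of handing ffmpeg a path that does not exist.
        raise ValueError(f"{server_path!r} is not under {SERVER_PREFIX}") from None
    return str(LOCAL_PREFIX / relative)


print(to_local_path("/volume1/Everything/Audiobooks/Dune/dune.m4b"))
# /Volumes/Everything/Audiobooks/Dune/dune.m4b
```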

Architecture

Siri Shortcut (POST /api/ingest/audiomark)
    |
    v
Plex API (/status/sessions)  -->  file path + offset + metadata
    |
    v
Path mapping: /volume1/... --> /Volumes/...
    |
    v
ffmpeg -ss {offset} -t 120 -i {path} /tmp/clip.wav   (0.15s)
    |
    v
mlx_whisper --model whisper-base-mlx /tmp/clip.wav    (4s)
    |
    v
Store as UnifiedEvent (type="audiomark", source="bookmarks")

Album-Wide Progress Calculation

Audiobooks on Plex are often split across 50+ files. Plex reports progress within the current track, not the album. To get true progress:

Fetch /library/metadata/{parentRatingKey}/children — all tracks
Sum durations of all tracks preceding the current one
Add current track offset
Divide by total album duration

This gives consistent percentage progress regardless of whether a book is a single 15-hour m4b or 47 separate MP3s.

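# Durations and offsets are in milliseconds, as Plex reports them.
tracks = await plex_get_children(parent_key)  # all tracks in the album
elapsed = sum(t.duration for t in tracks if t.index < current_index)
elapsed += current_offset  # position within the current track
total = sum(t.duration for t in tracks)
progress = elapsed / total if total else 0.0  # 0.0 - 1.0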
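For readers who want to try the arithmetic without a Plex server, here is a self-contained variant. It assumes the children response is XML in which each Track element has index and duration attributes; the album_progress function and the four-track sample are hypothetical, written only to make the numbers easy to check by hand.

```python
import xml.etree.ElementTree as ET


def album_progress(children_xml: str, current_index: int, offset_ms: int) -> float:
    """Return album-wide progress in [0.0, 1.0]; all times in milliseconds."""
    root = ET.fromstring(children_xml)
    # (index, duration) for every track in the album
    tracks = [
        (int(t.attrib["index"]), int(t.attrib["duration"]))
        for t in root.iter("Track")
    ]
    total = sum(duration for _, duration in tracks)
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty album
    elapsed = sum(d for i, d in tracks if i < current_index) + offset_ms
    return min(elapsed / total, 1.0)


# Hand-written sample: four 30-minute tracks.
CHILDREN = """
<MediaContainer size="4">
  <Track index="1" duration="1800000"/>
  <Track index="2" duration="1800000"/>
  <Track index="3" duration="1800000"/>
  <Track index="4" duration="1800000"/>
</MediaContainer>
"""

# 15 minutes into track 3: (2 * 30 + 15) / 120 minutes
print(album_progress(CHILDREN, current_index=3, offset_ms=900_000))  # 0.625
```

A single-file book is just the one-track case: nothing precedes the current track, so progress reduces to offset divided by duration.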

ffmpeg: Why It’s Instant

For indexed container formats like m4b/mp4, ffmpeg can seek without decoding the audio that precedes the target. Placing the -ss flag before -i performs an input seek using the container’s index, so seek time does not depend on the offset. For a 15-hour file, seeking to hour 12 takes 0.15 seconds, the same as seeking to second 5.

ffmpeg -ss 43200 -t 120 -i /Volumes/Everything/Audiobooks/Dune/dune.m4b \
       -ar 16000 -ac 1 /tmp/audiomark_clip.wav

The output is downsampled to 16 kHz mono (Whisper’s expected input format), which also keeps the clip file small.
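One way to drive the command above from Python is sketched below. Plex reports the offset in milliseconds while -ss takes seconds, so the conversion happens when the argument list is built. The function names are hypothetical, and the -y flag (overwrite the previous clip) is an addition for convenience, not part of the command shown above.

```python
import subprocess


def build_clip_command(path: str, offset_ms: int, out_path: str,
                       duration_s: int = 120) -> list[str]:
    """Build the ffmpeg argv for a 16 kHz mono clip starting at offset_ms."""
    start_s = max(offset_ms, 0) / 1000  # Plex reports ms; -ss takes seconds
    return [
        "ffmpeg", "-y",              # -y: overwrite a previous clip
        "-ss", f"{start_s:.3f}",     # before -i, so this is an input seek
        "-t", str(duration_s),
        "-i", path,
        "-ar", "16000", "-ac", "1",  # 16 kHz, mono
        out_path,
    ]


def extract_clip(path: str, offset_ms: int, out_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_clip_command(path, offset_ms, out_path),
                   check=True, capture_output=True)


cmd = build_clip_command(
    "/Volumes/Everything/Audiobooks/Dune/dune.m4b", 43_200_000,
    "/tmp/audiomark_clip.wav",
)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a shell string means file names with spaces or quotes need no escaping.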

mlx_whisper: Apple Silicon Native

Using mlx-community/whisper-base-mlx (the Whisper base model, roughly 74M parameters). On an M2 Pro:

2 minutes of audio: ~4 seconds wall time
Runs on the Apple Silicon GPU via the MLX framework
No PyTorch or CUDA dependency

Trade-off: base model mangles proper nouns. “Crysknife” becomes “Christ’s knife”, “Leto” becomes “Latos”. For passage bookmarking this is acceptable — you’re marking where something interesting was said, not creating a publication-quality transcript. The combination of book title + progress percentage + approximate text is enough to relocate any passage.
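The transcription step can also be called from Python instead of the CLI. The sketch below assumes the mlx_whisper package's transcribe function with its path_or_hf_repo argument; the transcribe_clip wrapper and the injectable transcribe parameter are hypothetical, added so the example runs on machines without MLX by substituting a stand-in.

```python
from typing import Callable, Optional

MODEL_REPO = "mlx-community/whisper-base-mlx"


def transcribe_clip(wav_path: str,
                    transcribe: Optional[Callable[..., dict]] = None) -> str:
    """Transcribe a clip and return whitespace-normalized text."""
    if transcribe is None:
        # Imported lazily so this module loads on machines without MLX.
        import mlx_whisper
        transcribe = mlx_whisper.transcribe
    result = transcribe(wav_path, path_or_hf_repo=MODEL_REPO)
    # Collapse stray leading spaces and line breaks in the raw text.
    return " ".join(result["text"].split())


# Stand-in for the real model, so the example runs anywhere.
def fake_transcribe(path: str, **kwargs) -> dict:
    return {"text": "  The mystery of life isn't a problem to solve,\n"
                    " but a reality to experience. "}


print(transcribe_clip("/tmp/audiomark_clip.wav", transcribe=fake_transcribe))
```

On an Apple Silicon machine with mlx_whisper installed, omitting the transcribe argument uses the real model.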

Output Schema

{
  "event_id": "sha256(bookmarks:1714924800000:dune-chapter-12...)",
  "ts": 1714924800000,
  "source": "bookmarks",
  "type": "audiomark",
  "payload": {
    "title": "Dune",
    "author": "Frank Herbert",
    "progress": 0.73,
    "progress_display": "73% (11h 02m / 15h 06m)",
    "highlights": "The mystery of life isn't a problem to solve, but a reality to experience...",
    "clip_duration_s": 120
  }
}
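The progress_display string in the schema above could be produced along these lines. This is a sketch with hypothetical helper names, assuming elapsed and total times are held in milliseconds and that leftover seconds are truncated rather than rounded.

```python
def format_hm(ms: int) -> str:
    """Format milliseconds as 'Hh MMm', truncating leftover seconds."""
    total_minutes = ms // 60_000
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


def progress_display(elapsed_ms: int, total_ms: int) -> str:
    """Render e.g. '73% (11h 02m / 15h 06m)'."""
    percent = round(100 * elapsed_ms / total_ms) if total_ms else 0
    return f"{percent}% ({format_hm(elapsed_ms)} / {format_hm(total_ms)})"


elapsed_ms = (11 * 60 + 2) * 60_000  # 11h 02m
total_ms = (15 * 60 + 6) * 60_000    # 15h 06m
print(progress_display(elapsed_ms, total_ms))  # 73% (11h 02m / 15h 06m)
```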

Plex API Gotchas

accountID filtering: Multi-user servers need accountID=1 (or your user ID) to filter sessions
Pagination: History endpoint requires X-Plex-Container-Start and X-Plex-Container-Size headers
Session keys are ephemeral: Use parentRatingKey (album) or ratingKey (track) for stable references
NAS path mapping: Plex stores the server-side path. You need a simple string replacement for local access
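To make the accountID and pagination points concrete, here is a sketch that builds the URL and headers for one page of history without sending anything. The host name, token value, and helper names are placeholders. The Accept header is an assumption for clients that prefer JSON over the default XML.

```python
from urllib.parse import urlencode


def history_request(base_url: str, token: str, account_id: int,
                    start: int, size: int) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for one page of listening history."""
    query = urlencode({"accountID": account_id})
    url = f"{base_url.rstrip('/')}/status/sessions/history/all?{query}"
    headers = {
        "X-Plex-Token": token,
        "X-Plex-Container-Start": str(start),  # index of the first item
        "X-Plex-Container-Size": str(size),    # number of items per page
        "Accept": "application/json",
    }
    return url, headers


def page_starts(total: int, size: int) -> list[int]:
    """Start offsets needed to cover `total` items in pages of `size`."""
    return list(range(0, total, size))


# Placeholder host and token, for illustration only.
url, headers = history_request("http://plex.local:32400", "TOKEN", 1, 0, 100)
print(url)
print(page_starts(1061, 500))  # [0, 500, 1000]
```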

Backfill: 1,061 Events from History

Plex also exposes /status/sessions/history/all with full listening history. We backfilled 1,061 audiobook listening events spanning September 2022 through May 2026: 205 unique books, 204 with descriptions. Because event IDs are derived deterministically from the event itself, re-running the backfill is idempotent.
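The idempotency property follows from hashing the same fields every time. A minimal sketch, assuming the ID is a SHA-256 over source, timestamp, and a per-item key joined with colons: the exact fields hashed in the real pipeline are not spelled out here, so the key values and the in-memory dict store below are hypothetical stand-ins.

```python
import hashlib


def event_id(source: str, ts_ms: int, key: str) -> str:
    """Deterministic ID: the same inputs always hash to the same value."""
    raw = f"{source}:{ts_ms}:{key}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def upsert(store: dict[str, dict], event: dict) -> bool:
    """Insert the event keyed by its ID; return True only if it was new."""
    eid = event_id(event["source"], event["ts"], event["key"])
    is_new = eid not in store
    store[eid] = {**event, "event_id": eid}
    return is_new


# Hypothetical history items; the keys are made up for this example.
history = [
    {"source": "bookmarks", "ts": 1714924800000, "key": "dune-chapter-12"},
    {"source": "bookmarks", "ts": 1714928400000, "key": "dune-chapter-13"},
]
store: dict[str, dict] = {}
first_run = sum(upsert(store, e) for e in history)
second_run = sum(upsert(store, e) for e in history)
print(first_run, second_run, len(store))  # 2 0 2
```

The second pass overwrites each record with identical content, so the store ends up the same size no matter how many times the backfill runs.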

What Makes This Work

The convergence of four things that each seem unremarkable alone:

Plex exposes exact file paths (not just stream URLs)
The NAS is mounted locally (no download step)
ffmpeg’s container-aware seeking takes effectively constant time, regardless of offset
Apple Silicon runs whisper inference without GPU drivers or cloud calls

Remove any one of these and the pipeline becomes either slow (cloud transcription), complex (download-then-process), or impossible (no file path access).