5-Second Audiobook Passage Transcription: Plex + ffmpeg + mlx_whisper on Apple Silicon
TL;DR
“Hey Siri, audiomark” triggers a POST to /api/ingest/audiomark. The endpoint queries Plex for what’s playing, calculates album-wide progress, extracts a 2-minute audio clip via ffmpeg, transcribes it with mlx_whisper, and saves the passage as a searchable highlight. Total pipeline: ~5 seconds.
The Insight
Plex exposes three things via /status/sessions:
- The file path on the NAS (/volume1/Everything/Audiobooks/...)
- The exact playback offset in milliseconds
- The parent metadata key (album-level, for multi-file books)
The NAS is mounted locally at /Volumes/Everything/.... That means ffmpeg can seek directly into a 15-hour .m4b file without downloading it first. And mlx_whisper runs natively on Apple Silicon through the MLX framework, which executes on the GPU.
The entire pipeline is local. No cloud APIs. No network latency beyond the Plex status check.
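As a rough sketch of how those three fields might be pulled out of a /status/sessions response, the snippet below parses Plex’s default XML with the standard library. The sample XML is made up for illustration and parse_session is a hypothetical helper name. The attribute names (viewOffset, parentRatingKey, and file on the Part element) are what Plex is expected to return for audio tracks, so check them against your own server’s output.

```python
import xml.etree.ElementTree as ET

# Made-up sample of a /status/sessions response, trimmed to the relevant fields.
SAMPLE_XML = """
<MediaContainer size="1">
  <Track ratingKey="1201" parentRatingKey="1200" parentTitle="Dune"
         viewOffset="754000" duration="1800000">
    <Media>
      <Part file="/volume1/Everything/Audiobooks/Dune/dune.m4b" />
    </Media>
  </Track>
</MediaContainer>
"""


def parse_session(xml_text):
    """Return path, offset (ms), and album key for the first audio session, or None."""
    root = ET.fromstring(xml_text)
    track = root.find("Track")
    if track is None:
        return None  # nothing is playing
    part = track.find("./Media/Part")
    if part is None:
        return None
    return {
        "file": part.get("file"),
        "offset_ms": int(track.get("viewOffset", "0")),
        "parent_key": track.get("parentRatingKey"),
    }


session = parse_session(SAMPLE_XML)
```

Returning None when no Track element is present lets the endpoint answer "nothing is playing" instead of failing further down the pipeline.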
Architecture
Siri Shortcut (POST /api/ingest/audiomark)
|
v
Plex API (/status/sessions) --> file path + offset + metadata
|
v
Path mapping: /volume1/... --> /Volumes/...
|
v
ffmpeg -ss {offset} -t 120 -i {path} /tmp/clip.wav (0.15s)
|
v
mlx_whisper --model whisper-base-mlx /tmp/clip.wav (4s)
|
v
Store as UnifiedEvent (type="audiomark", source="bookmarks")
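The path-mapping step in the diagram can be as small as a prefix swap. Here is a minimal sketch, assuming the NAS prefix is /volume1/ and the local mount is /Volumes/ as described above. The function name to_local_path is hypothetical.

```python
# Assumed prefixes, taken from the paths mentioned in this post.
NAS_PREFIX = "/volume1/"
LOCAL_PREFIX = "/Volumes/"


def to_local_path(server_path):
    """Swap the NAS-side prefix for the local mount point."""
    if not server_path.startswith(NAS_PREFIX):
        # Fail loudly rather than hand ffmpeg a path that cannot exist locally.
        raise ValueError(f"unexpected server path: {server_path}")
    return LOCAL_PREFIX + server_path[len(NAS_PREFIX):]


local = to_local_path("/volume1/Everything/Audiobooks/Dune/dune.m4b")
```

Replacing only the leading prefix, rather than calling str.replace on the whole string, avoids touching a folder that happens to be named volume1 deeper in the path.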
Album-Wide Progress Calculation
Audiobooks on Plex are often split across 50+ files. Plex reports progress within the current track, not the album. To get true progress:
- Fetch /library/metadata/{parentRatingKey}/children (all tracks)
- Sum durations of all tracks preceding the current one
- Add current track offset
- Divide by total album duration
This gives consistent percentage progress regardless of whether a book is a single 15-hour m4b or 47 separate MP3s.
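# Durations and offsets are in milliseconds, as Plex reports them.
tracks = await plex_get_children(parent_key)
# Full duration of every track before the current one...
elapsed = sum(t.duration for t in tracks if t.index < current_index)
# ...plus the position inside the current track.
elapsed += current_offset
total = sum(t.duration for t in tracks)
# Guard against missing duration metadata (total == 0).
progress = elapsed / total if total else 0.0  # 0.0 - 1.0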
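To check the arithmetic with concrete numbers, here is a self-contained version of the same calculation. The Track dataclass and the album_progress name are stand-ins invented for this example, not the types the real pipeline uses.

```python
from dataclasses import dataclass


@dataclass
class Track:
    index: int      # track number within the album
    duration: int   # milliseconds


def album_progress(tracks, current_index, current_offset_ms):
    """Fraction of the whole album played so far, in the range 0.0 to 1.0."""
    total = sum(t.duration for t in tracks)
    if total == 0:
        return 0.0  # avoid dividing by zero on missing metadata
    elapsed = sum(t.duration for t in tracks if t.index < current_index)
    elapsed += current_offset_ms
    return min(elapsed / total, 1.0)


# Three 10-minute tracks, 5 minutes into the second one.
tracks = [Track(1, 600_000), Track(2, 600_000), Track(3, 600_000)]
progress = album_progress(tracks, current_index=2, current_offset_ms=300_000)
```

Here 15 of 30 minutes have elapsed, so progress comes out as 0.5. A single-file book is the degenerate case with one track and nothing preceding it.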
ffmpeg: Why It’s Instant
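For container formats like m4b/mp4, ffmpeg can seek without decoding everything before the target. Placing the -ss flag before -i makes it an input seek: ffmpeg uses the container’s index to jump close to the requested timestamp instead of reading from the start. For a 15-hour file, seeking to hour 12 takes about 0.15 seconds, the same as seeking to second 5.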
ffmpeg -ss 43200 -t 120 -i /Volumes/Everything/Audiobooks/Dune/dune.m4b \
-ar 16000 -ac 1 /tmp/audiomark_clip.wav
The output is downsampled to 16 kHz mono, the format whisper expects, which also keeps the clip file small.
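One detail the command hides: Plex reports the offset in milliseconds, while -ss takes seconds. Below is a small sketch of building the argument list from Python. The function name build_clip_command and the -y overwrite flag are additions for this example.

```python
def build_clip_command(path, offset_ms, duration_s=120,
                       out="/tmp/audiomark_clip.wav"):
    """Build the ffmpeg argument list for a 16 kHz mono clip starting at offset_ms."""
    start_s = max(offset_ms, 0) / 1000  # Plex gives milliseconds; -ss wants seconds
    return [
        "ffmpeg", "-y",            # -y: overwrite the previous clip
        "-ss", f"{start_s:.3f}",   # placed before -i, so this is an input seek
        "-t", str(duration_s),
        "-i", path,
        "-ar", "16000", "-ac", "1",
        out,
    ]


cmd = build_clip_command("/Volumes/Everything/Audiobooks/Dune/dune.m4b", 43_200_000)
# To run it: import subprocess; subprocess.run(cmd, check=True, capture_output=True)
```

Passing a list instead of a shell string means file names with spaces or quotes need no escaping.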
mlx_whisper: Apple Silicon Native
Using mlx-community/whisper-base-mlx (the Whisper base model, about 74M parameters). On M2 Pro:
- 2 minutes of audio: ~4 seconds wall time
- Runs on the GPU via the MLX framework
- No PyTorch or CUDA dependency
Trade-off: base model mangles proper nouns. “Crysknife” becomes “Christ’s knife”, “Leto” becomes “Latos”. For passage bookmarking this is acceptable — you’re marking where something interesting was said, not creating a publication-quality transcript. The combination of book title + progress percentage + approximate text is enough to relocate any passage.
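For reference, the transcription step can also be called from Python instead of the CLI. This is a sketch that assumes the mlx_whisper package’s transcribe function, which takes an audio path plus a path_or_hf_repo keyword and returns a dict with a "text" key. The wrapper name transcribe_clip and the injectable transcribe argument are inventions for this example, so that the call shape can be exercised without loading a model.

```python
def transcribe_clip(path, transcribe=None, model="mlx-community/whisper-base-mlx"):
    """Transcribe an audio file and return whitespace-normalized text."""
    if transcribe is None:
        # Imported lazily so the rest of the pipeline loads without MLX installed.
        import mlx_whisper
        transcribe = mlx_whisper.transcribe
    result = transcribe(path, path_or_hf_repo=model)
    return " ".join(result["text"].split())


# Stand-in for the real model, used here only to show the call shape.
def fake_transcribe(path, path_or_hf_repo):
    return {"text": "  The mystery of life\nisn't a problem to solve  "}


text = transcribe_clip("/tmp/audiomark_clip.wav", transcribe=fake_transcribe)
```

Swapping in a larger model would only mean changing the model argument, at the cost of a longer wall time than the ~4 seconds quoted above.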
Output Schema
{
  "event_id": "sha256(bookmarks:1714924800000:dune-chapter-12...)",
  "ts": 1714924800000,
  "source": "bookmarks",
  "type": "audiomark",
  "payload": {
    "title": "Dune",
    "author": "Frank Herbert",
    "progress": 0.73,
    "progress_display": "73% (11h 02m / 15h 06m)",
    "highlights": "The mystery of life isn't a problem to solve, but a reality to experience...",
    "clip_duration_s": 120
  }
}
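The progress_display string can be derived from the same millisecond totals used for progress. A minimal sketch follows; the helper names format_hm and progress_display are hypothetical, and the rounding rule is an assumption.

```python
def format_hm(ms):
    """Render milliseconds as hours and zero-padded minutes, e.g. '11h 02m'."""
    minutes = ms // 60_000
    return f"{minutes // 60}h {minutes % 60:02d}m"


def progress_display(elapsed_ms, total_ms):
    """Build a string shaped like the progress_display field in the schema."""
    pct = round(100 * elapsed_ms / total_ms) if total_ms else 0
    return f"{pct}% ({format_hm(elapsed_ms)} / {format_hm(total_ms)})"


elapsed = (11 * 60 + 2) * 60_000   # 11h 02m in milliseconds
total = (15 * 60 + 6) * 60_000     # 15h 06m in milliseconds
display = progress_display(elapsed, total)
```

With these inputs the result matches the example payload: 662 of 906 minutes is 73%.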
Plex API Gotchas
- accountID filtering: Multi-user servers need accountID=1 (or your user ID) to filter sessions
- Pagination: History endpoint requires X-Plex-Container-Start and X-Plex-Container-Size headers
- Session keys are ephemeral: Use parentRatingKey (album) or ratingKey (track) for stable references
- NAS path mapping: Plex stores the server-side path. You need a simple string replacement for local access
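For the pagination gotcha, the two headers simply describe a window over the result set. Below is a sketch of generating one header dict per page. The total is assumed to be known already (for example, read from the first response), and page_headers is a hypothetical helper.

```python
def page_headers(total, page_size=100):
    """Yield one header dict per page needed to cover `total` items."""
    for start in range(0, total, page_size):
        yield {
            "X-Plex-Container-Start": str(start),
            "X-Plex-Container-Size": str(page_size),
        }


# 1,061 history items at 500 per page means three requests.
pages = list(page_headers(total=1061, page_size=500))
```

Each dict would be merged into the request headers alongside the Plex token. The last page asks for a full window and simply gets fewer items back.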
Backfill: 1,061 Events from History
Plex also exposes /status/sessions/history/all with full listening history. We backfilled 1,061 audiobook listening events spanning September 2022 through May 2026: 205 unique books, 204 with descriptions. Because the backfill uses the same deterministic event ID pattern, re-running it is idempotent.
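Why a deterministic ID makes re-runs safe can be shown in a few lines. This is a sketch under assumptions: the exact fields that go into the hash are not spelled out here, so the source:ts:key layout and the "example-key" value are placeholders, and a plain dict stands in for the event store.

```python
import hashlib


def event_id(source, ts_ms, key):
    """Hash the identifying fields so the same listen always yields the same ID."""
    raw = f"{source}:{ts_ms}:{key}"  # assumed layout, for illustration only
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


store = {}
for _ in range(2):  # simulate running the backfill twice
    eid = event_id("bookmarks", 1714924800000, "example-key")
    store[eid] = {"ts": 1714924800000, "source": "bookmarks"}
```

The second pass overwrites the same key instead of adding a row, so the store still holds one event.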
What Makes This Work
The convergence of four things that each seem unremarkable alone:
- Plex exposes exact file paths (not just stream URLs)
- The NAS is mounted locally (no download step)
- ffmpeg’s container-aware seeking is effectively constant-time
- Apple Silicon runs whisper inference without GPU drivers or cloud calls
Remove any one of these and the pipeline becomes either slow (cloud transcription), complex (download-then-process), or impossible (no file path access).