5-Second Audiobook Passage Transcription: Plex + ffmpeg + mlx_whisper on Apple Silicon

TL;DR

“Hey Siri, audiomark” triggers a POST to /api/ingest/audiomark. The endpoint queries Plex for what’s playing, calculates album-wide progress, extracts a 2-minute audio clip via ffmpeg, transcribes it with mlx_whisper, and saves the passage as a searchable highlight. Total pipeline: ~5 seconds.

The Insight

Plex exposes three things via /status/sessions:

  1. The file path on the NAS (/volume1/Everything/Audiobooks/...)
  2. The exact playback offset in milliseconds
  3. The parent metadata key (album-level, for multi-file books)
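
As a rough sketch, the three fields can be pulled out of a /status/sessions response like this. It assumes the XML shape Plex returns by default (one Track element per active audio session, with the file path on a nested Part element). The sample payload and its key values are made up for illustration, and the attribute names should be checked against your own server's output.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Illustrative payload only, trimmed to the attributes this pipeline reads.
SAMPLE_SESSIONS_XML = """
<MediaContainer size="1">
  <Track ratingKey="4001" parentRatingKey="4000" index="1"
         title="Dune" parentTitle="Dune" grandparentTitle="Frank Herbert"
         viewOffset="43200000" duration="54360000">
    <Media>
      <Part file="/volume1/Everything/Audiobooks/Dune/dune.m4b" />
    </Media>
    <User id="1" title="me" />
  </Track>
</MediaContainer>
"""

@dataclass
class NowPlaying:
    file_path: str          # server-side path as Plex stores it
    offset_ms: int          # playback position within the current track
    parent_rating_key: str  # album-level key, used to list sibling tracks
    track_index: int        # position of the track within the album

def parse_now_playing(xml_text: str, account_id: str = "1") -> Optional[NowPlaying]:
    """Return the first audio session belonging to account_id, or None."""
    root = ET.fromstring(xml_text.strip())
    for track in root.iter("Track"):
        user = track.find("User")
        if user is not None and user.get("id") != account_id:
            continue  # someone else's session on a shared server
        part = track.find("./Media/Part")
        if part is None or part.get("file") is None:
            continue  # no file path exposed for this session
        return NowPlaying(
            file_path=part.get("file"),
            offset_ms=int(track.get("viewOffset", "0")),
            parent_rating_key=track.get("parentRatingKey", ""),
            track_index=int(track.get("index", "0")),
        )
    return None
```

Returning None when nothing is playing lets the endpoint answer with a clear "no active session" error instead of failing further down the pipeline.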

The NAS is mounted locally at /Volumes/Everything/..., so ffmpeg can seek directly into a 15-hour .m4b file without downloading it first. And mlx_whisper runs natively on the Apple Silicon GPU via MLX.
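
A minimal sketch of the path mapping, assuming the two prefixes quoted in this post. The function name is hypothetical. It compares whole path components rather than doing a raw string replace, so a sibling share such as /volume1/EverythingElse is rejected instead of being silently rewritten.

```python
from pathlib import PurePosixPath

# Assumed prefixes, taken from the paths quoted in this post.
SERVER_PREFIX = "/volume1/Everything"
LOCAL_PREFIX = "/Volumes/Everything"

def to_local_path(server_path: str) -> str:
    """Map a server-side Plex path onto the local NAS mount."""
    try:
        relative = PurePosixPath(server_path).relative_to(SERVER_PREFIX)
    except ValueError:
        raise ValueError(f"{server_path!r} is not under {SERVER_PREFIX}") from None
    return str(PurePosixPath(LOCAL_PREFIX) / relative)
```

For example, to_local_path("/volume1/Everything/Audiobooks/Dune/dune.m4b") yields the /Volumes/Everything/Audiobooks/Dune/dune.m4b path used in the ffmpeg command.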

The entire pipeline stays on the local network. No cloud APIs, and no network round trips beyond the Plex status check and the read from the NAS mount.

Architecture

Siri Shortcut (POST /api/ingest/audiomark)
    |
    v
Plex API (/status/sessions)  -->  file path + offset + metadata
    |
    v
Path mapping: /volume1/... --> /Volumes/...
    |
    v
ffmpeg -ss {offset} -t 120 -i {path} /tmp/clip.wav   (0.15s)
    |
    v
mlx_whisper --model mlx-community/whisper-base-mlx /tmp/clip.wav    (4s)
    |
    v
Store as UnifiedEvent (type="audiomark", source="bookmarks")
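
One way to check the per-stage timings in the diagram is to chain the stages through a small runner that records wall time for each. This is only a sketch: run_pipeline and the stand-in stages are hypothetical, and the real stages would call Plex, ffmpeg and mlx_whisper.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]

def run_pipeline(stages: List[Stage], initial: Any = None) -> Tuple[Any, Dict[str, float]]:
    """Feed each stage's output into the next and record wall time per stage."""
    timings: Dict[str, float] = {}
    value = initial
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings[name] = time.perf_counter() - start
    return value, timings

# Stand-in stages that only pass a dict along; none of them touch the network.
demo_stages: List[Stage] = [
    ("plex", lambda _: {"path": "/volume1/Everything/Audiobooks/Dune/dune.m4b",
                        "offset_ms": 43_200_000}),
    ("map_path", lambda s: {**s, "path": s["path"].replace("/volume1/", "/Volumes/", 1)}),
    ("clip", lambda s: {**s, "clip": "/tmp/clip.wav"}),
    ("transcribe", lambda s: {**s, "text": "(transcript goes here)"}),
]

result, timings = run_pipeline(demo_stages)
```

Logging the timings dict per request makes it easy to see whether the ffmpeg seek or the transcription dominates on a given machine.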

Album-Wide Progress Calculation

Audiobooks on Plex are often split across 50+ files. Plex reports progress within the current track, not the album. To get true progress:

  1. Fetch /library/metadata/{parentRatingKey}/children — all tracks
  2. Sum durations of all tracks preceding the current one
  3. Add current track offset
  4. Divide by total album duration

This gives consistent percentage progress regardless of whether a book is a single 15-hour m4b or 47 separate MP3s.

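# Runs inside the async request handler. Durations and offsets are in milliseconds.
tracks = await plex_get_children(parent_key)  # every track in the album
elapsed = sum(t.duration for t in tracks if t.index < current_index)
elapsed += current_offset  # position within the current track
total = sum(t.duration for t in tracks)
progress = elapsed / total if total else 0.0  # 0.0 - 1.0; guards an empty album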
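
The same calculation as a self-contained sketch, extended to produce a display string in the style of the progress_display field. The Track dataclass and function names are hypothetical stand-ins for whatever the Plex client returns.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class Track:
    index: int        # position within the album
    duration_ms: int  # Plex reports durations in milliseconds

def album_progress(tracks: Sequence[Track], current_index: int,
                   offset_ms: int) -> Tuple[int, int, float]:
    """Return (elapsed_ms, total_ms, fraction) across the whole album."""
    total = sum(t.duration_ms for t in tracks)
    elapsed = sum(t.duration_ms for t in tracks if t.index < current_index) + offset_ms
    fraction = elapsed / total if total else 0.0
    return elapsed, total, min(fraction, 1.0)

def hm(ms: int) -> str:
    """Format milliseconds as 'Hh MMm'."""
    minutes = ms // 60_000
    return f"{minutes // 60}h {minutes % 60:02d}m"

def progress_display(elapsed_ms: int, total_ms: int) -> str:
    """Format progress as 'NN% (Hh MMm / Hh MMm)'."""
    pct = round(100 * elapsed_ms / total_ms) if total_ms else 0
    return f"{pct}% ({hm(elapsed_ms)} / {hm(total_ms)})"

# Three equal tracks of 302 minutes each, 58 minutes into the third.
tracks = [Track(index=i, duration_ms=302 * 60_000) for i in (1, 2, 3)]
elapsed, total, fraction = album_progress(tracks, current_index=3, offset_ms=58 * 60_000)
print(progress_display(elapsed, total))  # 73% (11h 02m / 15h 06m)
```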

ffmpeg: Why It’s Instant

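For container formats like m4b/mp4, ffmpeg can seek without decoding. Placing -ss before -i performs an input seek using the container’s index, so for a 15-hour file, seeking to hour 12 takes about 0.15 seconds, the same as seeking to second 5. Note that -ss takes seconds (or an HH:MM:SS timestamp), while Plex reports the offset in milliseconds, so the value has to be divided by 1000 first.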

ffmpeg -ss 43200 -t 120 -i /Volumes/Everything/Audiobooks/Dune/dune.m4b \
       -ar 16000 -ac 1 /tmp/audiomark_clip.wav

The output is downsampled to 16 kHz mono, the format Whisper expects, which also keeps the clip small: about 3.8 MB for 120 seconds of 16-bit PCM.
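
A sketch of how the command above might be built and run from Python. The function names are hypothetical. Building the argument list separately from running it keeps the millisecond-to-second conversion easy to test without ffmpeg installed.

```python
import subprocess
from typing import List

def build_clip_command(path: str, offset_ms: int, out_path: str,
                       duration_s: int = 120) -> List[str]:
    """Build the ffmpeg argv for an input-seeked, 16 kHz mono WAV clip."""
    start_s = max(offset_ms, 0) / 1000  # Plex offset is in ms; -ss takes seconds
    return [
        "ffmpeg", "-y",            # overwrite any previous clip
        "-ss", f"{start_s:.3f}",   # before -i: input seek via the container index
        "-t", str(duration_s),
        "-i", path,
        "-ar", "16000", "-ac", "1",
        out_path,
    ]

def extract_clip(path: str, offset_ms: int,
                 out_path: str = "/tmp/audiomark_clip.wav") -> str:
    """Run ffmpeg and return the clip path; raises CalledProcessError on failure."""
    subprocess.run(build_clip_command(path, offset_ms, out_path),
                   check=True, capture_output=True)
    return out_path
```

Passing the arguments as a list rather than a shell string avoids quoting problems with audiobook filenames that contain spaces or apostrophes.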

mlx_whisper: Apple Silicon Native

Using mlx-community/whisper-base-mlx (Whisper base, about 74M parameters). On an M2 Pro:

  • 2 minutes of audio: ~4 seconds wall time
  • Runs on the GPU via the MLX framework
  • No PyTorch or CUDA dependencies

Trade-off: base model mangles proper nouns. “Crysknife” becomes “Christ’s knife”, “Leto” becomes “Latos”. For passage bookmarking this is acceptable — you’re marking where something interesting was said, not creating a publication-quality transcript. The combination of book title + progress percentage + approximate text is enough to relocate any passage.
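
The transcription step can be sketched with the mlx_whisper Python API instead of the CLI. The helper names are hypothetical, and the import is deferred so the module still loads on machines without mlx_whisper installed.

```python
import re

MODEL_REPO = "mlx-community/whisper-base-mlx"  # model named in this post

def tidy_transcript(text: str) -> str:
    """Collapse whitespace runs so the passage stores as a single line."""
    return re.sub(r"\s+", " ", text).strip()

def transcribe_clip(wav_path: str, model_repo: str = MODEL_REPO) -> str:
    """Transcribe a clip with mlx_whisper and return cleaned text."""
    import mlx_whisper  # deferred: only present on Apple Silicon installs
    result = mlx_whisper.transcribe(wav_path, path_or_hf_repo=model_repo)
    return tidy_transcript(result["text"])
```

Calling the library in-process avoids spawning a second Python interpreter per request, although whether that changes the ~4 second figure was not measured here.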

Output Schema

{
  "event_id": "sha256(bookmarks:1714924800000:dune-chapter-12...)",
  "ts": 1714924800000,
  "source": "bookmarks",
  "type": "audiomark",
  "payload": {
    "title": "Dune",
    "author": "Frank Herbert",
    "progress": 0.73,
    "progress_display": "73% (11h 02m / 15h 06m)",
    "highlights": "The mystery of life isn't a problem to solve, but a reality to experience...",
    "clip_duration_s": 120
  }
}
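
A minimal sketch of a deterministic event ID. The exact string that gets hashed is elided in the example above, so the layout used here (source, timestamp, and a caller-supplied key joined by colons) is an assumption, and make_event_id is a hypothetical name.

```python
import hashlib

def make_event_id(source: str, ts_ms: int, key: str) -> str:
    """Hash the identifying fields so the same event always gets the same ID."""
    raw = f"{source}:{ts_ms}:{key}"  # assumed layout, not the author's exact format
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def make_audiomark_event(ts_ms: int, key: str, payload: dict) -> dict:
    """Assemble an event in the shape of the schema above."""
    return {
        "event_id": make_event_id("bookmarks", ts_ms, key),
        "ts": ts_ms,
        "source": "bookmarks",
        "type": "audiomark",
        "payload": payload,
    }
```

Because the ID depends only on its inputs, ingesting the same bookmark twice produces the same event_id, which lets the store deduplicate on that field.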

Plex API Gotchas

  • accountID filtering: Multi-user servers need accountID=1 (or your user ID) to filter sessions
  • Pagination: History endpoint requires X-Plex-Container-Start and X-Plex-Container-Size headers
  • Session keys are ephemeral: Use parentRatingKey (album) or ratingKey (track) for stable references
  • NAS path mapping: Plex stores the server-side path. You need a simple string replacement for local access
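
The pagination gotcha can be sketched as a generator that advances the two container headers until a short page comes back. fetch_page is a hypothetical callable (for example, a thin wrapper around an HTTP GET) that takes the request headers and returns the list of items for that window.

```python
from typing import Callable, Dict, Iterator, List

FetchPage = Callable[[Dict[str, str]], List[dict]]

def iter_history(fetch_page: FetchPage, page_size: int = 100) -> Iterator[dict]:
    """Yield history items page by page until a short or empty page arrives."""
    start = 0
    while True:
        headers = {
            "X-Plex-Container-Start": str(start),
            "X-Plex-Container-Size": str(page_size),
        }
        items = fetch_page(headers)
        yield from items
        if len(items) < page_size:
            break  # last page reached
        start += page_size
```

Injecting the fetch function keeps the paging logic independent of the HTTP client and makes it testable against an in-memory list.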

Backfill: 1,061 Events from History

Plex also exposes /status/sessions/history/all with full listening history. We backfilled 1,061 audiobook listening events spanning September 2022 through May 2026: 205 unique books, 204 with descriptions. Because the backfill uses the same deterministic event ID pattern, re-running it is idempotent.
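
To illustrate why deterministic IDs make the backfill idempotent, here is a sketch using an in-memory dict as a stand-in for the real event store. The backfill function and the store shape are assumptions.

```python
from typing import Dict, Iterable

def backfill(events: Iterable[dict], store: Dict[str, dict]) -> int:
    """Insert events keyed by event_id; return how many were new."""
    added = 0
    for event in events:
        if event["event_id"] not in store:
            store[event["event_id"]] = event
            added += 1
    return added

# Running the same batch twice adds nothing the second time.
store: Dict[str, dict] = {}
batch = [{"event_id": "a1", "type": "listen"}, {"event_id": "b2", "type": "listen"}]
first = backfill(batch, store)
second = backfill(batch, store)
```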

What Makes This Work

The convergence of four things that each seem unremarkable alone:

  1. Plex exposes exact file paths (not just stream URLs)
  2. The NAS is mounted locally (no download step)
  3. ffmpeg’s container-aware seeking is effectively constant-time, regardless of position in the file
  4. Apple Silicon runs whisper inference without GPU drivers or cloud calls

Remove any one of these and the pipeline becomes either slow (cloud transcription), complex (download-then-process), or impossible (no file path access).