Building a Personal Telemetry Platform in One Session

claude-code · life-dashboard · data-engineering · postgres


I have been collecting data about my life for five years. Clipboard copies with health snapshots. Driving logs. Sleep records. Book highlights. Journal entries. Location check-ins. All of it flowing into Data Jar on my iPhone, serialized into nested JSON structures that only iOS Shortcuts could love.

Sprint 1 mapped out the territory: pydantic models, config, package structure, a rough schema for what a “unified event” should look like. But nothing was connected. No database. No API. No visualization. Just a well-organized skeleton.

This session made it real.

SQLite to Postgres

The original plan was aiosqlite. Simple, single-file, no dependencies. It lasted about twenty minutes.

The moment I needed JSONB queries against nested health data and location context, SQLite’s JSON support felt like writing with mittens on. I switched to asyncpg with a connection pool (min 2, max 10) and never looked back.

The migration itself was straightforward, but the JSONB codec setup was the kind of thing that eats an hour if you don’t know the trick. Python dicts don’t serialize to JSONB automatically in asyncpg — you need a custom codec registered on every connection in the pool’s init callback:

import json

async def _init_conn(conn):
    # Register JSON codecs so Python dicts round-trip to JSONB transparently.
    await conn.set_type_codec(
        'jsonb', encoder=json.dumps, decoder=json.loads, schema='pg_catalog'
    )

Two lines. But without them, every insert throws a type error that points you in exactly the wrong direction.

The Adapter Pipeline

Nine adapters. Eleven data sources. Each one a different flavor of messy.

Data Jar wraps every value in a {"type": ..., "value": ...} envelope. Recursively. A list of dictionaries becomes a type/value wrapper containing type/value wrappers containing type/value wrappers. The datajar.py unwrapper peels all of that off recursively before any adapter touches the data.
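The recursive unwrapping can be sketched like this — a minimal stand-in for what datajar.py does, assuming a plain `{"type": ..., "value": ...}` envelope with no other keys (the real unwrapper may handle more cases):

```python
def unwrap(node):
    # Hypothetical sketch: peel Data Jar's {"type": ..., "value": ...}
    # envelopes recursively, leaving plain values untouched.
    if isinstance(node, dict) and set(node) == {"type", "value"}:
        return unwrap(node["value"])
    if isinstance(node, dict):
        return {k: unwrap(v) for k, v in node.items()}
    if isinstance(node, list):
        return [unwrap(v) for v in node]
    return node

wrapped = {"type": "list", "value": [
    {"type": "dict", "value": {"hr": {"type": "number", "value": 62}}}
]}
assert unwrap(wrapped) == [{"hr": 62}]
```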

After unwrapping, each adapter converts raw records into UnifiedEvent objects with a deterministic event ID (SHA256 of source + timestamp + content prefix, truncated to 16 hex chars). This is the single best design decision in the project. Every event is idempotent. Ingest the same data ten times, get the same result. INSERT ... ON CONFLICT (event_id) DO NOTHING handles dedup at the database level.
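A minimal sketch of that ID scheme, with the function name and field separator as illustrative assumptions (the post specifies only SHA-256 over source + timestamp + content prefix, truncated to 16 hex chars):

```python
import hashlib

def make_event_id(source: str, timestamp: str, content: str) -> str:
    # Hypothetical sketch: hash source + timestamp + a content prefix,
    # truncate to 16 hex chars. Same inputs always yield the same ID,
    # which is what makes re-ingestion idempotent.
    raw = f"{source}|{timestamp}|{content[:64]}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

a = make_event_id("clipboard", "2025-06-20T12:53:00", "heart rate 62")
b = make_event_id("clipboard", "2025-06-20T12:53:00", "heart rate 62")
assert a == b and len(a) == 16
```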

The adapters themselves:

| Adapter | Source | Events | Notes |
| --- | --- | --- | --- |
| lifetracker | LifeTrackerLog | charger, driving, bedtime | 8K+ events, the backbone of the dataset |
| clipboard | clipboard to text | clipboard_copy | Health data embedded in every copy event |
| journal | dailyjournal | journal_entry | Freeform text, sometimes serialized BookmarkRecords |
| bookmarks | BookmarkRecords | highlight | Book highlights with author/title metadata |
| driving | driving_records | drive_start, drive_end | Paired events with location, from a later schema |
| milage | Historical Milage | odometer_reading | EV odometer snapshots |
| locations | Locations Parked | location_parked | GPS-tagged parking events |
| sleep | ByTheWeiCo sleep.json | sleep_night | Nightly sleep data with bedtime/wakeup |
| bytheweico | per-day JSON | social, audiobook, reading | Cross-project events from the ByTheWeiCo pipeline |

The Unicode Bug

The clipboard adapter was the most painful. Not because of complexity, but because of a single invisible character.

The timestamps come in two formats: 24-hour (Apr 24, 2026 at 17:20) and 12-hour with AM/PM (Jun 20, 2025 at 12:53 PM). Standard strptime parsing, two format strings, try both. Easy.

Except 91.8% of the 12-hour timestamps were silently failing and falling through to the epoch fallback (January 1, 2020). That is 4,185 events with wrong timestamps.

The cause: Unicode character U+202F, the “narrow no-break space,” sitting invisibly between the minutes and AM/PM. Apple’s date formatter inserts it by default. It looks identical to a regular space in every editor and terminal. strptime does not match it.

The fix:

normalized = dt_str.replace("\u202f", " ")

One line. 4,185 events recovered.
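Putting the normalization and the two-format parse together looks roughly like this (the format strings match the example timestamps above; the fallback behavior is a sketch):

```python
from datetime import datetime

# 24-hour first, then 12-hour with AM/PM. The 24-hour format cannot
# accidentally match a 12-hour string because the trailing " PM" would
# be left unconsumed and strptime raises ValueError.
FORMATS = ("%b %d, %Y at %H:%M", "%b %d, %Y at %I:%M %p")

def parse_ts(dt_str):
    # Normalize Apple's narrow no-break space (U+202F) before AM/PM.
    normalized = dt_str.replace("\u202f", " ")
    for fmt in FORMATS:
        try:
            return datetime.strptime(normalized, fmt)
        except ValueError:
            continue
    return None  # caller falls back to the epoch default

assert parse_ts("Apr 24, 2026 at 17:20").hour == 17
assert parse_ts("Jun 20, 2025 at 12:53\u202fPM").minute == 53
```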

Old Backup Archaeology

The active Data Jar store only goes back to mid-2023 — at some point the Shortcut that manages it crashed and restarted with a fresh store. But the old data still existed.

I found it in three places: an old iCloud sync of store.json frozen at November 2022, a zip backup from June 2025 sitting in iCloud Drive, and the ByTheWeiCo project’s exported daily JSON files. Each one had a slightly different schema. The 2025 zip had driving_records that didn’t exist in the other stores. The old iCloud store had pre-2023 clipboard entries with the v1 health format (nested dict with getHRV/getHR/getStep keys instead of flat fields).

The loader tries all candidate paths for each backup source. The deterministic event IDs meant I could point it at everything and let the database sort out duplicates. No dedup logic needed — just ON CONFLICT DO NOTHING.

The Globe

Globe.gl rendering real geocoded locations was the visual payoff for all the data plumbing.

The geocoding pipeline runs as a background task on every server boot. It pulls all unique addresses from event context fields, checks them against a geocache Postgres table, and batch-resolves unknowns through Nominatim at one request per second (their rate limit). Results cache permanently — subsequent boots skip already-resolved addresses.
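The cache-then-resolve loop can be sketched as a synchronous stand-in (the real pipeline is an async background task backed by a Postgres geocache table; the class name, in-memory cache, and `resolve` callback here are illustrative assumptions):

```python
import time

class Geocoder:
    # Hypothetical sketch: check the cache first, and only call the
    # resolver for unknown addresses, spaced at least min_interval
    # seconds apart to respect Nominatim's one-request-per-second limit.
    def __init__(self, resolve, min_interval=1.0):
        self.cache = {}
        self.resolve = resolve
        self.min_interval = min_interval
        self._last = float("-inf")

    def lookup(self, address):
        if address in self.cache:
            return self.cache[address]  # cache hit: no network call
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)  # enforce the rate limit
        self._last = time.monotonic()
        self.cache[address] = self.resolve(address)
        return self.cache[address]
```

Because results cache permanently, a second boot re-resolves nothing: every known address is a dictionary hit.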

The globe shows three layers:

  • Points: color-coded by dominant event type (green for charger events, blue for driving, orange for clipboard copies, purple for bedtime). Sized on a log scale by event count.
  • Arcs: driving routes reconstructed from sequential drive_start / drive_end pairs. Deduplicated visually so repeated commutes show as a single arc with a count. 85 unique routes in the current dataset.
  • Rings: animated pulsing rings at locations with activity in the last 24 hours.
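The log-scale point sizing mentioned above can be sketched like this (the base size and exact formula are illustrative assumptions, not the project's actual constants):

```python
import math

def point_size(count, base=0.15):
    # Hypothetical log-scale sizing: radius grows with the log of the
    # event count, so high-traffic locations don't dwarf everything else.
    return base * (1 + math.log10(max(count, 1)))

assert point_size(100) > point_size(10) > point_size(1)
```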

All three layers are backed by named SQL queries that join the events table against the geocache table. No application-level geocoding at request time.

SQL-First Architecture

Midway through the build, I noticed I was writing the same query logic in two places: once in the API route handlers, once in the SQL explorer presets. Classic drift.

The fix was queries.py — a single registry of named SQL queries. Every API endpoint calls run_query("globe_points") or run_query_val("event_count"). The SQL explorer has preset buttons that map to the exact same queries the API uses. One source of truth.

The explorer itself is a single HTML page served from the query route module. Monospace font, dark theme, preset buttons for every data domain (sources, locations, drives, health, reading, milage, audit) plus buttons that mirror every API endpoint. Read-only transactions enforced at the database level. Dangerous keywords blocked before the query reaches Postgres.
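The keyword pre-flight can be sketched as a simple token check (the actual blocklist and matching rules are assumptions; per the post, the real enforcement is the read-only transaction at the database level, so this is belt-and-braces):

```python
# Hypothetical sketch of the explorer's pre-flight guard: reject queries
# containing mutating keywords before they ever reach Postgres.
BLOCKED = {"insert", "update", "delete", "drop", "alter", "truncate", "grant"}

def is_safe(sql: str) -> bool:
    tokens = set(sql.lower().replace(";", " ").split())
    return not (tokens & BLOCKED)

assert is_safe("SELECT count(*) FROM events")
assert not is_safe("DROP TABLE events")
```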

The Audit System

The audit runs on every boot, right after ingest. It uses Fibonacci-batch random sampling: batches of 1, 1, 2, 3, 5, 8, 13 events (33 samples total), each pulled with ORDER BY random().

Every sampled event gets seven checks:

  1. Event ID format (11-16 character hex string)
  2. Timestamp within valid range (2020-2028)
  3. ISO timestamp matches millisecond timestamp (within 1 minute drift)
  4. Source is in the known set of 11 valid sources
  5. Focus mode field doesn’t have trailing brace corruption
  6. Health field types are numeric (not strings or objects)
  7. Payload is a dict, not a string or null
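The Fibonacci-batch sampling can be sketched in a few lines (the real version pulls each batch with `ORDER BY random()` in SQL; `random.sample` here is a stand-in, and overlap between batches is possible just as it would be across independent random queries):

```python
import random

FIB_BATCHES = [1, 1, 2, 3, 5, 8, 13]  # 33 samples total

def sample_batches(event_ids, rng=random):
    # Hypothetical sketch: draw one random batch per Fibonacci size,
    # mirroring a per-batch ORDER BY random() LIMIT n.
    return [rng.sample(event_ids, min(n, len(event_ids)))
            for n in FIB_BATCHES]

batches = sample_batches(list(range(1000)))
assert [len(b) for b in batches] == FIB_BATCHES
```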

The full audit adds source-level aggregate checks: how many events are stuck at the 2020 epoch fallback, how many have the trailing-brace focus mode bug, how many have sub-2020 timestamps.

The first run found three bugs. The trailing-brace issue was a Data Jar unwrapping edge case where a JSON fragment leaked into the focus mode field. The stuck-2020 timestamps were the Unicode bug. The third was a batch of milage events where the timestamp parser was receiving an integer instead of a string.

The Numbers

| Metric | Value |
| --- | --- |
| Total events | 55,354 |
| Data sources | 11 |
| Adapters | 9 |
| Days in daily summary | 1,618 |
| API endpoints | 30+ |
| Named SQL queries | 18 |
| Geocached addresses | 1,225 |
| Unique driving routes | 85 |
| Audit checks per sample | 7 |
| Background tasks on boot | ingest, audit, daily summary refresh, geocode pipeline |

The daily summary is a Postgres materialized view that aggregates per-day stats: steps, average HRV, drive count, clipboard copies, unique locations, highlights, social posts, audiobook listens, sleep hours, bedtime, wakeup time, total events, and dominant focus mode. 1,618 days of data. Refreshed concurrently on every boot and reload.

API Call Logging

Every API request (except static files) gets logged to an api_log table with method, path, query params, status code, response time in milliseconds, client IP, and user agent. A middleware handles it asynchronously so logging failures never block responses. The /api/logs endpoint lets me query the log with optional path filtering.

This was not in the original plan. I added it after realizing I had no way to know which endpoints were actually being hit by the Globe.gl frontend. It took ten minutes and immediately paid for itself.
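Framework aside, the middleware's core contract is small enough to sketch: time the handler, record the request, and never let the recording fail the response. The wrapper shape and field names below are illustrative assumptions:

```python
import time

def log_request(handler, log_sink):
    # Hypothetical framework-agnostic sketch of the logging middleware:
    # time the handler, then hand a record to log_sink. A failing sink
    # is swallowed so logging can never block or break a response.
    def wrapped(method, path):
        start = time.perf_counter()
        status = handler(method, path)
        elapsed_ms = (time.perf_counter() - start) * 1000
        try:
            log_sink({"method": method, "path": path,
                      "status": status, "ms": round(elapsed_ms, 2)})
        except Exception:
            pass  # logging failures must never surface to the client
        return status

    return wrapped
```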

What Is Next

The kanban has ten surveillance features queued:

  • Sleep quality score — composite metric from duration, bedtime consistency, and wakeup time
  • Daily life score — weighted rollup of steps, HRV, sleep, productivity signals
  • HRV stress alerts — threshold-based notifications via ntfy
  • Location anomaly detection — flag days where location patterns deviate from baseline
  • Predictive daily briefing — morning summary generated from historical patterns for the same day of week
  • Commute analytics — drive time trends, route frequency analysis
  • Health correlation matrix — which context variables (weather, focus mode, location) correlate with health metrics
  • Reading velocity tracking — highlights per day, books per month, reading streaks
  • Social activity patterns — posting cadence, engagement trends from Threads proxy
  • Battery life modeling — predict end-of-day battery from morning charge and usage patterns

The foundation is solid. Fifty-five thousand events with clean schemas, deterministic IDs, JSONB flexibility, and an audit system that catches problems before they compound. Everything else is just queries.