Building a Personal Telemetry Platform in One Session
I have been collecting data about my life for five years. Clipboard copies with health snapshots. Driving logs. Sleep records. Book highlights. Journal entries. Location check-ins. All of it flowing into Data Jar on my iPhone, serialized into nested JSON structures that only iOS Shortcuts could love.
Sprint 1 mapped out the territory: pydantic models, config, package structure, a rough schema for what a “unified event” should look like. But nothing was connected. No database. No API. No visualization. Just a well-organized skeleton.
This session made it real.
SQLite to Postgres
The original plan was aiosqlite. Simple, single-file, no dependencies. It lasted about twenty minutes.
The moment I needed JSONB queries against nested health data and location context, SQLite’s JSON support felt like writing with mittens on. I switched to asyncpg with a connection pool (min 2, max 10) and never looked back.
The migration itself was straightforward, but the JSONB codec setup was the kind of thing that eats an hour if you don’t know the trick. Python dicts don’t serialize to JSONB automatically in asyncpg — you need a custom codec registered on every connection in the pool’s init callback:
```python
import json

async def _init_conn(conn):
    # Teach asyncpg to encode Python dicts as JSONB (and decode them back)
    await conn.set_type_codec(
        'jsonb', encoder=json.dumps, decoder=json.loads, schema='pg_catalog'
    )
```
A few lines. But without them, every insert throws a type error that points you in exactly the wrong direction.
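For context, here is roughly how the pool ties together — a minimal sketch, where the DSN and function name are placeholders rather than the project's real config:

```python
import asyncpg

async def create_db_pool():
    # Pool bounds from the post: min 2, max 10 connections.
    # The DSN is a placeholder for illustration.
    return await asyncpg.create_pool(
        "postgresql://localhost/telemetry",
        min_size=2,
        max_size=10,
        init=_init_conn,  # registers the JSONB codec on every new connection
    )
```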
The Adapter Pipeline
Nine adapters. Eleven data sources. Each one a different flavor of messy.
Data Jar wraps every value in a {"type": ..., "value": ...} envelope. Recursively. A list of dictionaries becomes a type/value wrapper containing type/value wrappers containing type/value wrappers. The datajar.py unwrapper peels all of that off recursively before any adapter touches the data.
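The core of the unwrapping is a short recursive walk. A minimal sketch of the idea — the real datajar.py presumably handles more edge cases:

```python
def unwrap(node):
    """Recursively strip Data Jar's {"type": ..., "value": ...} envelopes."""
    if isinstance(node, dict):
        if set(node) == {"type", "value"}:
            return unwrap(node["value"])
        return {key: unwrap(val) for key, val in node.items()}
    if isinstance(node, list):
        return [unwrap(item) for item in node]
    return node
```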
After unwrapping, each adapter converts raw records into UnifiedEvent objects with a deterministic event ID (SHA256 of source + timestamp + content prefix, truncated to 16 hex chars). This is the single best design decision in the project. Every event is idempotent. Ingest the same data ten times, get the same result. INSERT ... ON CONFLICT (event_id) DO NOTHING handles dedup at the database level.
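A sketch of the ID scheme as described — the content prefix length and the table's column names are assumptions, not the project's actual values:

```python
import hashlib

def make_event_id(source: str, timestamp_ms: int, content: str) -> str:
    # SHA256 of source + timestamp + a content prefix, truncated to 16 hex chars.
    # The 64-char prefix length is an assumed value for illustration.
    raw = f"{source}:{timestamp_ms}:{content[:64]}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

# Dedup happens entirely in Postgres (column names assumed):
INSERT_EVENT = """
    INSERT INTO events (event_id, source, ts, payload)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (event_id) DO NOTHING
"""
```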
The adapters themselves:
| Adapter | Source | Events | Notes |
|---|---|---|---|
| lifetracker | LifeTrackerLog | charger, driving, bedtime | 8K+ events, the backbone of the dataset |
| clipboard | clipboard to text | clipboard_copy | Health data embedded in every copy event |
| journal | dailyjournal | journal_entry | Freeform text, sometimes serialized BookmarkRecords |
| bookmarks | BookmarkRecords | highlight | Book highlights with author/title metadata |
| driving | driving_records | drive_start, drive_end | Paired events with location, from a later schema |
| milage | Historical Milage | odometer_reading | EV odometer snapshots |
| locations | Locations Parked | location_parked | GPS-tagged parking events |
| sleep | ByTheWeiCo sleep.json | sleep_night | Nightly sleep data with bedtime/wakeup |
| bytheweico | per-day JSON | social, audiobook, reading | Cross-project events from the ByTheWeiCo pipeline |
The Unicode Bug
The clipboard adapter was the most painful. Not because of complexity, but because of a single invisible character.
The timestamps come in two formats: 24-hour (Apr 24, 2026 at 17:20) and 12-hour with AM/PM (Jun 20, 2025 at 12:53 PM). Standard strptime parsing, two format strings, try both. Easy.
Except 91.8% of the 12-hour timestamps were silently failing and falling through to the epoch fallback (January 1, 2020). That is 4,185 events with wrong timestamps.
The cause: Unicode character U+202F, the “narrow no-break space,” sitting invisibly between the minutes and AM/PM. Apple’s date formatter inserts it by default. It looks identical to a regular space in every editor and terminal. strptime does not match it.
The fix:
```python
normalized = dt_str.replace("\u202f", " ")
```
One line. 4,185 events recovered.
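Putting the whole parse path together — a sketch using the two format strings implied by the examples above and the epoch fallback the audit later flags:

```python
from datetime import datetime

FORMATS = ("%b %d, %Y at %H:%M", "%b %d, %Y at %I:%M %p")
EPOCH_FALLBACK = datetime(2020, 1, 1)

def parse_clipboard_ts(dt_str: str) -> datetime:
    # Apple's formatter puts a narrow no-break space before AM/PM
    normalized = dt_str.replace("\u202f", " ")
    for fmt in FORMATS:
        try:
            return datetime.strptime(normalized, fmt)
        except ValueError:
            continue
    return EPOCH_FALLBACK  # sentinel value the audit can flag later
```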
Old Backup Archaeology
The active Data Jar store only goes back to mid-2023 — at some point the Shortcut that manages it crashed and restarted with a fresh store. But the old data still existed.
I found it in three places: an old iCloud sync of store.json frozen at November 2022, a zip backup from June 2025 sitting in iCloud Drive, and the ByTheWeiCo project’s exported daily JSON files. Each one had a slightly different schema. The 2025 zip had driving_records that didn’t exist in the other stores. The old iCloud store had pre-2023 clipboard entries with the v1 health format (nested dict with getHRV/getHR/getStep keys instead of flat fields).
The loader tries all candidate paths for each backup source. The deterministic event IDs meant I could point it at everything and let the database sort out duplicates. No dedup logic needed — just ON CONFLICT DO NOTHING.
The Globe
Globe.gl rendering real geocoded locations was the visual payoff for all the data plumbing.
The geocoding pipeline runs as a background task on every server boot. It pulls all unique addresses from event context fields, checks them against a geocache Postgres table, and batch-resolves unknowns through Nominatim at one request per second (their rate limit). Results cache permanently — subsequent boots skip already-resolved addresses.
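In outline, the resolver looks something like this — the geocache schema, the HTTP client, and the User-Agent string are all assumptions:

```python
import asyncio
import httpx

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

async def geocode_unknowns(pool, addresses):
    # Resolve only addresses missing from the geocache table,
    # at Nominatim's limit of one request per second.
    async with httpx.AsyncClient(headers={"User-Agent": "telemetry-bot"}) as client:
        for addr in addresses:
            cached = await pool.fetchrow(
                "SELECT 1 FROM geocache WHERE address = $1", addr
            )
            if cached:
                continue  # permanent cache: never re-resolve
            resp = await client.get(
                NOMINATIM_URL, params={"q": addr, "format": "json", "limit": 1}
            )
            results = resp.json()
            if results:
                await pool.execute(
                    "INSERT INTO geocache (address, lat, lon) VALUES ($1, $2, $3)",
                    addr, float(results[0]["lat"]), float(results[0]["lon"]),
                )
            await asyncio.sleep(1.0)  # rate limit
```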
The globe shows three layers:
- Points: color-coded by dominant event type (green for charger events, blue for driving, orange for clipboard copies, purple for bedtime). Sized on a log scale by event count.
- Arcs: driving routes reconstructed from sequential drive_start/drive_end pairs. Deduplicated visually so repeated commutes show as a single arc with a count. 85 unique routes in the current dataset.
- Rings: animated pulsing rings at locations with activity in the last 24 hours.
All three layers are backed by named SQL queries that join the events table against the geocache table. No application-level geocoding at request time.
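As an illustration, the points query probably looks something like this — the column names and the context layout are guesses, not the project's real schema:

```python
# Hypothetical shape of the globe_points query; the real schema may differ.
GLOBE_POINTS_SQL = """
    SELECT g.lat,
           g.lon,
           COUNT(*) AS event_count,
           MODE() WITHIN GROUP (ORDER BY e.event_type) AS dominant_type
    FROM events e
    JOIN geocache g ON g.address = e.context->>'address'
    GROUP BY g.lat, g.lon
"""
```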
SQL-First Architecture
Midway through the build, I noticed I was writing the same query logic in two places: once in the API route handlers, once in the SQL explorer presets. Classic drift.
The fix was queries.py — a single registry of named SQL queries. Every API endpoint calls run_query("globe_points") or run_query_val("event_count"). The SQL explorer has preset buttons that map to the exact same queries the API uses. One source of truth.
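The registry pattern is simple enough to sketch in a few lines — the signatures are assumed; the real run_query may resolve the pool internally:

```python
# queries.py — minimal sketch of the named-query registry
QUERIES: dict[str, str] = {
    "event_count": "SELECT COUNT(*) FROM events",
    "sources": "SELECT source, COUNT(*) AS n FROM events GROUP BY source",
}

async def run_query(pool, name: str, *args):
    # API routes and explorer presets both resolve SQL through this one map
    return await pool.fetch(QUERIES[name], *args)

async def run_query_val(pool, name: str, *args):
    return await pool.fetchval(QUERIES[name], *args)
```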
The explorer itself is a single HTML page served from the query route module. Monospace font, dark theme, preset buttons for every data domain (sources, locations, drives, health, reading, milage, audit) plus buttons that mirror every API endpoint. Read-only transactions enforced at the database level. Dangerous keywords blocked before the query reaches Postgres.
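The two guardrails compose naturally. A sketch, with an assumed keyword blocklist — asyncpg can open the transaction read-only, so Postgres itself rejects anything that slips past the keyword check:

```python
BLOCKED_KEYWORDS = ("insert", "update", "delete", "drop", "alter", "truncate")

async def run_explorer_query(pool, sql: str):
    lowered = sql.lower()
    if any(word in lowered for word in BLOCKED_KEYWORDS):
        raise ValueError("write keywords are blocked in the explorer")
    async with pool.acquire() as conn:
        async with conn.transaction(readonly=True):  # enforced by Postgres
            return await conn.fetch(sql)
```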
The Audit System
The audit runs on every boot, right after ingest. It uses Fibonacci-batch random sampling: batches of 1, 1, 2, 3, 5, 8, 13 events (33 samples total), each pulled with ORDER BY random().
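A sketch of the sampling loop as described:

```python
FIB_BATCHES = (1, 1, 2, 3, 5, 8, 13)  # 33 samples total

async def sample_events(pool):
    samples = []
    for batch_size in FIB_BATCHES:
        rows = await pool.fetch(
            "SELECT * FROM events ORDER BY random() LIMIT $1", batch_size
        )
        samples.extend(rows)
    return samples
```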
Every sampled event gets seven checks (two are sketched after the list):
- Event ID format (11-16 character hex string)
- Timestamp within valid range (2020-2028)
- ISO timestamp matches millisecond timestamp (within 1 minute drift)
- Source is in the known set of 11 valid sources
- Focus mode field doesn’t have trailing brace corruption
- Health field types are numeric (not strings or objects)
- Payload is a dict, not a string or null
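Two of the checks, sketched with assumed field names:

```python
import re
from datetime import datetime

HEX_ID = re.compile(r"[0-9a-f]{11,16}")

def check_event_id(event: dict) -> bool:
    # 11-16 character lowercase hex string
    return bool(HEX_ID.fullmatch(event["event_id"]))

def check_timestamp_range(event: dict) -> bool:
    # Timestamps outside 2020-2028 indicate parser failures
    return datetime(2020, 1, 1) <= event["ts"] <= datetime(2028, 12, 31)
```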
The full audit adds source-level aggregate checks: how many events are stuck at the 2020 epoch fallback, how many have the trailing-brace focus mode bug, how many have sub-2020 timestamps.
The first run found three bugs. The trailing-brace issue was a Data Jar unwrapping edge case where a JSON fragment leaked into the focus mode field. The stuck-2020 timestamps were the Unicode bug. The third was a batch of milage events where the timestamp parser was receiving an integer instead of a string.
The Numbers
| Metric | Value |
|---|---|
| Total events | 55,354 |
| Data sources | 11 |
| Adapters | 9 |
| Days in daily summary | 1,618 |
| API endpoints | 30+ |
| Named SQL queries | 18 |
| Geocached addresses | 1,225 |
| Unique driving routes | 85 |
| Audit checks per sample | 7 |
| Background tasks on boot | ingest, audit, daily summary refresh, geocode pipeline |
The daily summary is a Postgres materialized view that aggregates per-day stats: steps, average HRV, drive count, clipboard copies, unique locations, highlights, social posts, audiobook listens, sleep hours, bedtime, wakeup time, total events, and dominant focus mode. 1,618 days of data. Refreshed concurrently on every boot and reload.
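The refresh itself is one statement (the view name here is assumed). CONCURRENTLY keeps the view queryable during the rebuild, at the cost of requiring a unique index on the view:

```python
async def refresh_daily_summary(pool):
    # CONCURRENTLY requires a unique index on the materialized view
    await pool.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY daily_summary")
```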
API Call Logging
Every API request (except static files) gets logged to an api_log table with method, path, query params, status code, response time in milliseconds, client IP, and user agent. A middleware handles it asynchronously so logging failures never block responses. The /api/logs endpoint lets me query the log with optional path filtering.
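The post doesn't name the web framework, but with a FastAPI/Starlette-style stack the middleware might look like this sketch — the table columns and the static-path prefix are assumptions:

```python
import time
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def log_request(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if not request.url.path.startswith("/static"):
        try:
            await app.state.pool.execute(
                """INSERT INTO api_log
                     (method, path, query, status, ms, client_ip, user_agent)
                   VALUES ($1, $2, $3, $4, $5, $6, $7)""",
                request.method,
                request.url.path,
                str(request.url.query),
                response.status_code,
                elapsed_ms,
                request.client.host if request.client else None,
                request.headers.get("user-agent"),
            )
        except Exception:
            pass  # a logging failure must never block the response
    return response
```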
This was not in the original plan. I added it after realizing I had no way to know which endpoints were actually being hit by the Globe.gl frontend. It took ten minutes and immediately paid for itself.
What Is Next
The kanban has ten surveillance features queued:
- Sleep quality score — composite metric from duration, bedtime consistency, and wakeup time
- Daily life score — weighted rollup of steps, HRV, sleep, productivity signals
- HRV stress alerts — threshold-based notifications via ntfy
- Location anomaly detection — flag days where location patterns deviate from baseline
- Predictive daily briefing — morning summary generated from historical patterns for the same day of week
- Commute analytics — drive time trends, route frequency analysis
- Health correlation matrix — which context variables (weather, focus mode, location) correlate with health metrics
- Reading velocity tracking — highlights per day, books per month, reading streaks
- Social activity patterns — posting cadence, engagement trends from Threads proxy
- Battery life modeling — predict end-of-day battery from morning charge and usage patterns
The foundation is solid. Fifty-five thousand events with clean schemas, deterministic IDs, JSONB flexibility, and an audit system that catches problems before they compound. Everything else is just queries.