Building a Personal Telemetry Platform in One Session
I have been collecting data about my life for five years. Clipboard copies with health snapshots. Driving logs. Sleep records. Book highlights. Journal entries. Location check-ins. All of it flowing into Data Jar on my iPhone, serialized into nested JSON structures that only iOS Shortcuts could love.
Sprint 1 mapped out the territory: pydantic models, config, package structure, a rough schema for what a “unified event” should look like. But nothing was connected. No database. No API. No visualization. Just a well-organized skeleton.
This session made it real.
SQLite to Postgres
The original plan was aiosqlite. Simple, single-file, no dependencies. It lasted about twenty minutes.
The moment I needed JSONB queries against nested health data and location context, SQLite’s JSON support felt like writing with mittens on. I switched to asyncpg with a connection pool (min 2, max 10) and never looked back.
The migration itself was straightforward, but the JSONB codec setup was the kind of thing that eats an hour if you don’t know the trick. Python dicts don’t serialize to JSONB automatically in asyncpg — you need a custom codec registered on every connection in the pool’s init callback:
```python
import json

async def _init_conn(conn):
    # Teach asyncpg to encode Python dicts as JSONB (and decode them back)
    await conn.set_type_codec(
        'jsonb', encoder=json.dumps, decoder=json.loads, schema='pg_catalog'
    )
```
A few lines. But without them, every insert throws a type error that points you in exactly the wrong direction.
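For context, here is roughly how the pool ties together — a minimal sketch, where the DSN and function name are placeholders rather than the project's real config:

```python
import asyncpg

async def create_db_pool():
    # Pool bounds from the post: min 2, max 10 connections.
    # The DSN is a placeholder for illustration.
    return await asyncpg.create_pool(
        "postgresql://localhost/telemetry",
        min_size=2,
        max_size=10,
        init=_init_conn,  # registers the JSONB codec on every new connection
    )
```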
The Adapter Pipeline
Nine adapters. Eleven data sources. Each one a different flavor of messy.
Data Jar wraps every value in a {"type": ..., "value": ...} envelope. Recursively. A list of dictionaries becomes a type/value wrapper containing type/value wrappers containing type/value wrappers. The datajar.py unwrapper peels all of that off recursively before any adapter touches the data.
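The core of the unwrapping is a short recursive walk. A minimal sketch of the idea — the real datajar.py presumably handles more edge cases:

```python
def unwrap(node):
    """Recursively strip Data Jar's {"type": ..., "value": ...} envelopes."""
    if isinstance(node, dict):
        if set(node) == {"type", "value"}:
            return unwrap(node["value"])
        return {key: unwrap(val) for key, val in node.items()}
    if isinstance(node, list):
        return [unwrap(item) for item in node]
    return node
```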
After unwrapping, each adapter converts raw records into UnifiedEvent objects with a deterministic event ID (SHA256 of source + timestamp + content prefix, truncated to 16 hex chars). This is the single best design decision in the project. Every event is idempotent. Ingest the same data ten times, get the same result. INSERT ... ON CONFLICT (event_id) DO NOTHING handles dedup at the database level.
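A sketch of the ID scheme as described — the content prefix length and the table's column names are assumptions, not the project's actual values:

```python
import hashlib

def make_event_id(source: str, timestamp_ms: int, content: str) -> str:
    # SHA256 of source + timestamp + a content prefix, truncated to 16 hex chars.
    # The 64-char prefix length is an assumed value for illustration.
    raw = f"{source}:{timestamp_ms}:{content[:64]}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

# Dedup happens entirely in Postgres (column names assumed):
INSERT_EVENT = """
    INSERT INTO events (event_id, source, ts, payload)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (event_id) DO NOTHING
"""
```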
The adapters themselves:
| Adapter | Source | Events | Notes |
|---|---|---|---|
| lifetracker | LifeTrackerLog | charger, driving, bedtime | 8K+ events, the backbone of the dataset |
| clipboard | clipboard to text | clipboard_copy | Health data embedded in every copy event |
| journal | dailyjournal | journal_entry | Freeform text, sometimes serialized BookmarkRecords |
| bookmarks | BookmarkRecords | highlight | Book highlights with author/title metadata |
| driving | driving_records | drive_start, drive_end | Paired events with location, from a later schema |
| milage | Historical Milage | odometer_reading | EV odometer snapshots |
| locations | Locations Parked | location_parked | GPS-tagged parking events |
| sleep | ByTheWeiCo sleep.json | sleep_night | Nightly sleep data with bedtime/wakeup |
| bytheweico | per-day JSON | social, audiobook, reading | Cross-project events from the ByTheWeiCo pipeline |
The Unicode Bug
The clipboard adapter was the most painful. Not because of complexity, but because of a single invisible character.
The timestamps come in two formats: 24-hour (Apr 24, 2026 at 17:20) and 12-hour with AM/PM (Jun 20, 2025 at 12:53 PM). Standard strptime parsing, two format strings, try both. Easy.
Except 91.8% of the 12-hour timestamps were silently failing and falling through to the epoch fallback (January 1, 2020). That is 4,185 events with wrong timestamps.
The cause: Unicode character U+202F, the “narrow no-break space,” sitting invisibly between the minutes and AM/PM. Apple’s date formatter inserts it by default. It looks identical to a regular space in every editor and terminal. strptime does not match it.
The fix:
```python
normalized = dt_str.replace("\u202f", " ")
```
One line. 4,185 events recovered.
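Putting the whole parse path together — a sketch using the two format strings implied by the examples above and the epoch fallback the audit later flags:

```python
from datetime import datetime

FORMATS = ("%b %d, %Y at %H:%M", "%b %d, %Y at %I:%M %p")
EPOCH_FALLBACK = datetime(2020, 1, 1)

def parse_clipboard_ts(dt_str: str) -> datetime:
    # Apple's formatter puts a narrow no-break space before AM/PM
    normalized = dt_str.replace("\u202f", " ")
    for fmt in FORMATS:
        try:
            return datetime.strptime(normalized, fmt)
        except ValueError:
            continue
    return EPOCH_FALLBACK  # sentinel value the audit can flag later
```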
Old Backup Archaeology
The active Data Jar store only goes back to mid-2023 — at some point the Shortcut that manages it crashed and restarted with a fresh store. But the old data still existed.
I found it in three places: an old iCloud sync of store.json frozen at November 2022, a zip backup from June 2025 sitting in iCloud Drive, and the ByTheWeiCo project’s exported daily JSON files. Each one had a slightly different schema. The 2025 zip had driving_records that didn’t exist in the other stores. The old iCloud store had pre-2023 clipboard entries with the v1 health format (nested dict with getHRV/getHR/getStep keys instead of flat fields).
The loader tries all candidate paths for each backup source. The deterministic event IDs meant I could point it at everything and let the database sort out duplicates. No dedup logic needed — just ON CONFLICT DO NOTHING.
The Globe
Globe.gl rendering real geocoded locations was the visual payoff for all the data plumbing.
The geocoding pipeline runs as a background task on every server boot. It pulls all unique addresses from event context fields, checks them against a geocache Postgres table, and batch-resolves unknowns through Nominatim at one request per second (their rate limit). Results cache permanently — subsequent boots skip already-resolved addresses.
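In outline, the resolver looks something like this — the geocache schema, the HTTP client, and the User-Agent string are all assumptions:

```python
import asyncio
import httpx

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

async def geocode_unknowns(pool, addresses):
    # Resolve only addresses missing from the geocache table,
    # at Nominatim's limit of one request per second.
    async with httpx.AsyncClient(headers={"User-Agent": "telemetry-bot"}) as client:
        for addr in addresses:
            cached = await pool.fetchrow(
                "SELECT 1 FROM geocache WHERE address = $1", addr
            )
            if cached:
                continue  # permanent cache: never re-resolve
            resp = await client.get(
                NOMINATIM_URL, params={"q": addr, "format": "json", "limit": 1}
            )
            results = resp.json()
            if results:
                await pool.execute(
                    "INSERT INTO geocache (address, lat, lon) VALUES ($1, $2, $3)",
                    addr, float(results[0]["lat"]), float(results[0]["lon"]),
                )
            await asyncio.sleep(1.0)  # rate limit
```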
The globe shows three layers:
- Points: color-coded by dominant event type (green for charger events, blue for driving, orange for clipboard copies, purple for bedtime). Sized on a log scale by event count.
- Arcs: driving routes reconstructed from sequential drive_start/drive_end pairs. Deduplicated visually so repeated commutes show as a single arc with a count. 85 unique routes in the current dataset.
- Rings: animated pulsing rings at locations with activity in the last 24 hours.
All three layers are backed by named SQL queries that join the events table against the geocache table. No application-level geocoding at request time.
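As an illustration, the points query probably looks something like this — the column names and the context layout are guesses, not the project's real schema:

```python
# Hypothetical shape of the globe_points query; the real schema may differ.
GLOBE_POINTS_SQL = """
    SELECT g.lat,
           g.lon,
           COUNT(*) AS event_count,
           MODE() WITHIN GROUP (ORDER BY e.event_type) AS dominant_type
    FROM events e
    JOIN geocache g ON g.address = e.context->>'address'
    GROUP BY g.lat, g.lon
"""
```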
SQL-First Architecture
Midway through the build, I noticed I was writing the same query logic in two places: once in the API route handlers, once in the SQL explorer presets. Classic drift.
The fix was queries.py — a single registry of named SQL queries. Every API endpoint calls run_query("globe_points") or run_query_val("event_count"). The SQL explorer has preset buttons that map to the exact same queries the API uses. One source of truth.
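The registry pattern is simple enough to sketch in a few lines — the signatures are assumed; the real run_query may resolve the pool internally:

```python
# queries.py — minimal sketch of the named-query registry
QUERIES: dict[str, str] = {
    "event_count": "SELECT COUNT(*) FROM events",
    "sources": "SELECT source, COUNT(*) AS n FROM events GROUP BY source",
}

async def run_query(pool, name: str, *args):
    # API routes and explorer presets both resolve SQL through this one map
    return await pool.fetch(QUERIES[name], *args)

async def run_query_val(pool, name: str, *args):
    return await pool.fetchval(QUERIES[name], *args)
```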
The explorer itself is a single HTML page served from the query route module. Monospace font, dark theme, preset buttons for every data domain (sources, locations, drives, health, reading, milage, audit) plus buttons that mirror every API endpoint. Read-only transactions enforced at the database level. Dangerous keywords blocked before the query reaches Postgres.
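The two guardrails compose naturally. A sketch, with an assumed keyword blocklist — asyncpg can open the transaction read-only, so Postgres itself rejects anything that slips past the keyword check:

```python
BLOCKED_KEYWORDS = ("insert", "update", "delete", "drop", "alter", "truncate")

async def run_explorer_query(pool, sql: str):
    lowered = sql.lower()
    if any(word in lowered for word in BLOCKED_KEYWORDS):
        raise ValueError("write keywords are blocked in the explorer")
    async with pool.acquire() as conn:
        async with conn.transaction(readonly=True):  # enforced by Postgres
            return await conn.fetch(sql)
```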
The Audit System
The audit runs on every boot, right after ingest. It uses Fibonacci-batch random sampling: batches of 1, 1, 2, 3, 5, 8, 13 events (33 samples total), each pulled with ORDER BY random().
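A sketch of the sampling loop as described:

```python
FIB_BATCHES = (1, 1, 2, 3, 5, 8, 13)  # 33 samples total

async def sample_events(pool):
    samples = []
    for batch_size in FIB_BATCHES:
        rows = await pool.fetch(
            "SELECT * FROM events ORDER BY random() LIMIT $1", batch_size
        )
        samples.extend(rows)
    return samples
```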
Every sampled event gets seven checks (two are sketched after the list):
- Event ID format (11-16 character hex string)
- Timestamp within valid range (2020-2028)
- ISO timestamp matches millisecond timestamp (within 1 minute drift)
- Source is in the known set of 11 valid sources
- Focus mode field doesn’t have trailing brace corruption
- Health field types are numeric (not strings or objects)
- Payload is a dict, not a string or null
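Two of the checks, sketched with assumed field names:

```python
import re
from datetime import datetime

HEX_ID = re.compile(r"[0-9a-f]{11,16}")

def check_event_id(event: dict) -> bool:
    # 11-16 character lowercase hex string
    return bool(HEX_ID.fullmatch(event["event_id"]))

def check_timestamp_range(event: dict) -> bool:
    # Timestamps outside 2020-2028 indicate parser failures
    return datetime(2020, 1, 1) <= event["ts"] <= datetime(2028, 12, 31)
```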
The full audit adds source-level aggregate checks: how many events are stuck at the 2020 epoch fallback, how many have the trailing-brace focus mode bug, how many have sub-2020 timestamps.
The first run found three bugs. The trailing-brace issue was a Data Jar unwrapping edge case where a JSON fragment leaked into the focus mode field. The stuck-2020 timestamps were the Unicode bug. The third was a batch of milage events where the timestamp parser was receiving an integer instead of a string.
The Numbers
| Metric | Value |
|---|---|
| Total events | 55,354 |
| Data sources | 11 |
| Adapters | 9 |
| Days in daily summary | 1,618 |
| API endpoints | 30+ |
| Named SQL queries | 18 |
| Geocached addresses | 1,225 |
| Unique driving routes | 85 |
| Audit checks per sample | 7 |
| Background tasks on boot | ingest, audit, daily summary refresh, geocode pipeline |
The daily summary is a Postgres materialized view that aggregates per-day stats: steps, average HRV, drive count, clipboard copies, unique locations, highlights, social posts, audiobook listens, sleep hours, bedtime, wakeup time, total events, and dominant focus mode. 1,618 days of data. Refreshed concurrently on every boot and reload.
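The refresh itself is one statement (the view name here is assumed). CONCURRENTLY keeps the view queryable during the rebuild, at the cost of requiring a unique index on the view:

```python
async def refresh_daily_summary(pool):
    # CONCURRENTLY requires a unique index on the materialized view
    await pool.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY daily_summary")
```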
API Call Logging
Every API request (except static files) gets logged to an api_log table with method, path, query params, status code, response time in milliseconds, client IP, and user agent. A middleware handles it asynchronously so logging failures never block responses. The /api/logs endpoint lets me query the log with optional path filtering.
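The post doesn't name the web framework, but with a FastAPI/Starlette-style stack the middleware might look like this sketch — the table columns and the static-path prefix are assumptions:

```python
import time
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def log_request(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if not request.url.path.startswith("/static"):
        try:
            await app.state.pool.execute(
                """INSERT INTO api_log
                     (method, path, query, status, ms, client_ip, user_agent)
                   VALUES ($1, $2, $3, $4, $5, $6, $7)""",
                request.method,
                request.url.path,
                str(request.url.query),
                response.status_code,
                elapsed_ms,
                request.client.host if request.client else None,
                request.headers.get("user-agent"),
            )
        except Exception:
            pass  # a logging failure must never block the response
    return response
```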
This was not in the original plan. I added it after realizing I had no way to know which endpoints were actually being hit by the Globe.gl frontend. It took ten minutes and immediately paid for itself.
What Is Next
The kanban has ten surveillance features queued:
- Sleep quality score — composite metric from duration, bedtime consistency, and wakeup time
- Daily life score — weighted rollup of steps, HRV, sleep, productivity signals
- HRV stress alerts — threshold-based notifications via ntfy
- Location anomaly detection — flag days where location patterns deviate from baseline
- Predictive daily briefing — morning summary generated from historical patterns for the same day of week
- Commute analytics — drive time trends, route frequency analysis
- Health correlation matrix — which context variables (weather, focus mode, location) correlate with health metrics
- Reading velocity tracking — highlights per day, books per month, reading streaks
- Social activity patterns — posting cadence, engagement trends from Threads proxy
- Battery life modeling — predict end-of-day battery from morning charge and usage patterns
The foundation is solid. Fifty-five thousand events with clean schemas, deterministic IDs, JSONB flexibility, and an audit system that catches problems before they compound. Everything else is just queries.