What If You Treated Your Personal Data Like Credit Card Numbers?


TL;DR

Build a personal telemetry pipeline the way you’d build a payment processing system. Deterministic event IDs for idempotent ingestion. Random sampling audits on every boot. SQL as the single source of truth. API call logging for a full audit trail. The data is yours. Treat it like it matters.

The Premise

Financial compliance frameworks like PCI DSS exist because credit card data is valuable enough to protect. Your personal data — five years of driving records, sleep patterns, location history, clipboard contents, health snapshots — is arguably more sensitive. Nobody applies the same rigor to it.

This post covers the patterns that emerged from building a personal telemetry dashboard with enterprise-grade data engineering practices. FastAPI + asyncpg + Postgres + Globe.gl + HTMX. 55,354 events across 11 sources spanning August 2021 through April 2026.

Deterministic Event IDs

Every event gets a SHA256 hash derived from source + timestamp + content_prefix. This makes ingestion idempotent: INSERT ... ON CONFLICT DO NOTHING. You can re-run the entire ingest pipeline as many times as you want and the database stays clean. No duplicates, no upsert logic, no “did this already get imported?” bookkeeping.

This is the same principle behind idempotency keys in payment APIs. If a network timeout means you retry a charge, you don’t want the customer billed twice. Same energy here: if you re-import your Data Jar backup, you don’t want 55K duplicate events.

import hashlib

def make_event_id(source: str, ts: int, content: str) -> str:
    # source + timestamp + a 64-char content prefix uniquely identify an event
    raw = f"{source}:{ts}:{content[:64]}"
    return hashlib.sha256(raw.encode()).hexdigest()

The content prefix is capped at 64 chars. Long enough to differentiate entries that share a timestamp, short enough that minor trailing edits don’t create phantom duplicates.
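
On the database side, idempotency is a single conflict clause. A minimal sketch, assuming an asyncpg connection and an events table keyed on event_id (only the universal columns are shown; the real adapters likely pass more):

async def ingest_events(conn, events: list[dict]) -> None:
    # Re-importing the same export is a no-op: duplicate event_ids collide
    # on the primary key and ON CONFLICT DO NOTHING drops them silently.
    await conn.executemany(
        """
        INSERT INTO events (event_id, ts, source, type)
        VALUES ($1, $2, $3, $4)
        ON CONFLICT (event_id) DO NOTHING
        """,
        [(e["event_id"], e["ts"], e["source"], e["type"]) for e in events],
    )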

Fibonacci-Batch Random Sampling

Data quality audits need to be cheap enough to run on every server boot, but thorough enough to catch real problems. Fibonacci batching gives you both.

The idea: sample at Fibonacci-spaced offsets through your sorted event list. Batch sizes of 1, 1, 2, 3, 5, 8, 13 yield 33 samples from any dataset size. Each sample gets validated against schema rules, timestamp sanity checks, and source-specific invariants.

Why Fibonacci? The spacing is non-uniform by design. It over-samples the boundaries (the first and last events, which are the most likely to carry edge-case timestamps) while still covering the middle. A uniform random sample of 33 would usually miss the exact boundaries; a sequential scan of the first 33 would miss everything else.
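
A minimal sketch of one way to implement the batching, assuming events is the full list sorted by timestamp (the exact offset scheme here is illustrative, not the pipeline’s literal code):

import random

FIB_BATCHES = [1, 1, 2, 3, 5, 8, 13]  # 7 batches, 33 samples total

def fibonacci_sample(events: list) -> list:
    # Small datasets get audited in full.
    if len(events) <= sum(FIB_BATCHES):
        return list(events)
    stratum = len(events) / len(FIB_BATCHES)
    samples = []
    for i, size in enumerate(FIB_BATCHES):
        if i == 0:
            start = 0                          # always cover the first events
        elif i == len(FIB_BATCHES) - 1:
            start = len(events) - size         # always cover the last events
        else:
            lo = int(i * stratum)
            hi = max(lo + 1, int((i + 1) * stratum) - size)
            start = random.randrange(lo, hi)   # random offset within the stratum
        samples.extend(events[start:start + size])
    return samples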

This runs on every boot. If any sample fails validation, the dashboard flags it before serving data. You find out about broken timestamps at startup, not when a user reports a weird gap in their timeline.

SQL as Single Source of Truth

Every API endpoint is a thin wrapper around a named SQL query registered in queries.py. The route handler does three things: parse parameters, call the query, format the response. No ORM. No query builder. No business logic in the route layer.

QUERIES = {
    "today_summary": """
        SELECT * FROM daily_summary
        WHERE day = CURRENT_DATE
    """,
    "timeline_range": """
        SELECT * FROM events
        WHERE ts BETWEEN $1 AND $2
        ORDER BY ts DESC
        LIMIT $3
    """,
}
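
The route handlers stay boring on purpose. A sketch, assuming an asyncpg pool hanging off app.state and the registry above (endpoint and parameter names are illustrative):

from fastapi import FastAPI, Request

app = FastAPI()

@app.get("/api/timeline")
async def timeline(request: Request, start: int, end: int, limit: int = 100):
    # Parse parameters -> run the named query -> format the response.
    async with request.app.state.pool.acquire() as conn:
        rows = await conn.fetch(QUERIES["timeline_range"], start, end, limit)
    return [dict(row) for row in rows]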

This pattern has a name in enterprise architecture: the query registry. It makes every data access auditable, testable, and explainable. When someone asks “where does this number come from?” the answer is always a SQL string you can run directly against the database.

The alternative — scattering query construction across route handlers, service layers, and ORM model methods — is how you end up with three different ways to count “active users” that all return different numbers.

JSONB for Flexible Telemetry Schemas

Personal data is messy. A clipboard event has health data (steps, heart rate, battery). A driving event has distance and duration. A journal entry is freeform text. A bookmark record has title, author, highlight, page number.

Postgres JSONB handles this without a 47-column table or a rat’s nest of junction tables. The events table has typed columns for the universal fields (event_id, timestamp, source, type) and a JSONB payload column for everything source-specific.

CREATE TABLE events (
    event_id  TEXT PRIMARY KEY,
    ts        BIGINT NOT NULL,
    source    TEXT NOT NULL,
    type      TEXT NOT NULL,
    context   JSONB,
    health    JSONB,
    payload   JSONB
);

Custom asyncpg codecs handle the serialization. JSONB indexes on frequently queried payload fields keep performance reasonable. The schema stays stable while the data stays flexible.
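
A sketch of the codec setup, assuming asyncpg’s per-connection init hook (function names are illustrative):

import json
import asyncpg

async def _init_connection(conn: asyncpg.Connection) -> None:
    # Hand JSONB columns to application code as dicts instead of raw strings.
    await conn.set_type_codec(
        "jsonb", schema="pg_catalog",
        encoder=json.dumps, decoder=json.loads,
    )

async def make_pool(dsn: str) -> asyncpg.Pool:
    # Every pooled connection gets the codec applied as it is created.
    return await asyncpg.create_pool(dsn, init=_init_connection)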

Background Geocoding Pipeline

Location data arrives as lat/lng pairs. Humans think in addresses. A background task runs Nominatim reverse geocoding with rate limiting (1 request/second, per the usage policy) and caches results.

525+ addresses resolved so far. The pipeline is resumable: it tracks which coordinates have been geocoded and picks up where it left off. If the server restarts mid-batch, no work is lost and no coordinates are re-queried.

This is the “eventual consistency” pattern from distributed systems applied to a single-machine pipeline. The data is correct immediately (coordinates are there), and it becomes more useful over time (addresses fill in).
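
A sketch of the resumable loop, assuming a geocode_cache table keyed on coordinates and httpx as the HTTP client (both are assumptions; the real pipeline may differ):

import asyncio
import httpx

NOMINATIM_URL = "https://nominatim.openstreetmap.org/reverse"

async def geocode_pending(pool) -> None:
    async with httpx.AsyncClient(headers={"User-Agent": "personal-dashboard"}) as client:
        # Only coordinates without a cached address are fetched, so a restart
        # resumes exactly where the last run stopped.
        rows = await pool.fetch(
            "SELECT lat, lng FROM geocode_cache WHERE address IS NULL"
        )
        for row in rows:
            resp = await client.get(
                NOMINATIM_URL,
                params={"lat": row["lat"], "lon": row["lng"], "format": "jsonv2"},
            )
            await pool.execute(
                "UPDATE geocode_cache SET address = $1 WHERE lat = $2 AND lng = $3",
                resp.json().get("display_name"), row["lat"], row["lng"],
            )
            await asyncio.sleep(1)  # Nominatim usage policy: at most 1 request/second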

Daily Summary Materialized View

Eleven data sources. Five years. The question “what happened on March 14, 2024?” requires joining across all of them. A materialized view pre-computes one row per day, fusing all sources:

CREATE MATERIALIZED VIEW daily_summary AS
SELECT
    date_trunc('day', to_timestamp(ts/1000)) AS day,
    count(*) FILTER (WHERE source = 'lifetracker') AS tracker_events,
    count(*) FILTER (WHERE source = 'clipboard') AS clipboard_events,
    count(*) FILTER (WHERE source = 'driving') AS driving_events,
    -- ... 8 more sources
    count(*) AS total_events
FROM events
GROUP BY 1;

1,618 days covered. The view refreshes on ingest. Queries against it are instant because the aggregation is pre-computed. This is the same pattern data warehouses use for summary tables, just at personal scale.
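
Refreshing it after each ingest run is a single statement; a sketch:

async def refresh_daily_summary(conn) -> None:
    # Recompute the per-day rollup after new events land. A plain
    # (non-concurrent) refresh is cheap at this scale.
    await conn.execute("REFRESH MATERIALIZED VIEW daily_summary")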

API Call Logging

Every API request gets logged to an api_log table: timestamp, path, method, response time, status code. This is the audit trail pattern from enterprise SAD methodology.

Why bother for a personal dashboard? Because “self-surveillance as a service” means you should know who (or what) is querying your data. When you proxy 16 upstream service endpoints, you want to see the access patterns. When you add an iOS Shortcut that POSTs telemetry, you want to verify it’s actually calling home.
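
A sketch of the logging middleware, assuming FastAPI’s http middleware hook, a pool on app.state, and an api_log table with matching columns (the real schema may differ):

import time
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def log_request(request: Request, call_next):
    started = time.monotonic()
    response = await call_next(request)
    elapsed_ms = (time.monotonic() - started) * 1000
    # One row per request: the audit trail the SQL explorer reads back.
    await request.app.state.pool.execute(
        "INSERT INTO api_log (ts, path, method, duration_ms, status)"
        " VALUES (now(), $1, $2, $3, $4)",
        request.url.path, request.method, elapsed_ms, response.status_code,
    )
    return response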

The SQL explorer with presets matching every API endpoint closes the loop: you can audit the auditor.

Broken Timestamps and the 4,185 Fix

Data quality is never “done.” The ingest pipeline discovered 4,185 timestamps with Unicode narrow no-break spaces (U+202F) embedded in AM/PM formatted time strings. Python’s strptime silently failed on these, producing wrong-day timestamps.

The fix was a normalize-before-parse step that strips non-ASCII whitespace. But the real lesson is that this bug existed across years of data. Without the Fibonacci sampling audit catching suspicious timestamp distributions, it would have silently corrupted every daily summary.
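
A sketch of the normalize-before-parse step (the whitespace characters are the relevant ones; the format string is illustrative):

from datetime import datetime

NON_ASCII_SPACES = {"\u202f": " ", "\u00a0": " "}  # narrow no-break space, no-break space

def parse_local_time(raw: str) -> datetime:
    # Replace exotic whitespace with plain spaces before strptime sees it.
    cleaned = raw
    for bad, good in NON_ASCII_SPACES.items():
        cleaned = cleaned.replace(bad, good)
    return datetime.strptime(cleaned, "%b %d, %Y at %I:%M %p")

# parse_local_time("Mar 14, 2024 at 9:30\u202fPM") -> datetime(2024, 3, 14, 21, 30)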

This is why financial compliance frameworks mandate data quality checks at ingestion boundaries. The data is only as good as your worst parser.

The Stack

Layer       Tech
API         FastAPI + asyncpg
DB          Postgres (JSONB, materialized views)
Frontend    HTMX fragments + Globe.gl
Geocoding   Nominatim (background, rate-limited)
Ingest      9 adapters, deterministic IDs
Audit       Fibonacci sampling, API call logging
Docs        Enterprise SAD methodology (500-line architecture doc)

Self-Surveillance as a Service

The irony is real. You carry a device that logs your location, health, driving, sleep, and reading habits. That data sits in proprietary silos — Apple Health, Data Jar, various apps — with no unified view and no quality guarantees.

Building the unified view with the same rigor you’d apply to a payment processing system is overkill by any reasonable measure. But “reasonable” is how you end up with five years of data you can’t query, timestamps you can’t trust, and no way to answer “what was I doing on this day two years ago?”

The compliance framework is the point. Not because personal data needs PCI certification, but because the engineering patterns that protect credit card numbers also happen to make personal data actually useful. Deterministic IDs prevent duplicates. Random sampling catches corruption. SQL registries make queries auditable. Materialized views make aggregation instant.

Your data. Your infrastructure. Your rules. Might as well make the rules good ones.