Treating Your Personal Data Like It's PCI-Scoped

data engineering web

TL;DR

PCI DSS exists because financial institutions learned — through catastrophic breaches — that undisciplined data handling kills companies. The engineering patterns that emerged from that pain are equally useful for personal projects. Not because your step count needs a SOC 2 report, but because the discipline produces systems that actually work six months later.

Why Compliance Patterns Transfer

PCI DSS is a 12-requirement framework for protecting cardholder data. It covers network segmentation, access control, monitoring, testing, and policy. Most developers encounter it as “that thing that makes deploying to production annoying.” But strip away the audit theater and what remains is a set of engineering principles:

  • Know where your data lives
  • Log every access
  • Validate data integrity automatically
  • Make ingestion idempotent
  • Document your architecture before you build it

These aren’t compliance requirements. They’re just good engineering. The compliance industry formalized them because the consequences of ignoring them in finance are measured in millions of dollars. In a personal project, the consequences are measured in hours of debugging — but the fix is the same.

The SAD as a Design Tool

Enterprise shops produce Software Architecture Documents. Fifteen sections covering system context, data flows, security boundaries, deployment topology, operational procedures, disaster recovery. Most developers see these as bureaucratic overhead generated after the system is built to satisfy an auditor.

Flip it around. Write the SAD before you write code.

Not a 40-page Word document. A structured markdown file in your repo that forces you to answer questions you’d otherwise skip:

  • What are the data sources and their schemas?
  • What are the trust boundaries?
  • How does data flow from ingestion to presentation?
  • What happens when a dependency is unavailable?
  • What are the operational procedures for backup and recovery?

You don’t need all 15 sections. But writing even five of them before touching code changes the architecture you end up with. It’s a forcing function. The SAD doesn’t describe the system — it shapes it.

In practice, this looks like a CLAUDE.md or ARCHITECTURE.md at the repo root. Every future session — human or AI — reads it first and builds within the constraints it defines.
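A minimal skeleton, with headings adapted from the questions above (the names are just a starting point), might look like:

## ARCHITECTURE.md

1. System context: what the system is and what talks to it
2. Data sources: each source, its schema, and how it arrives
3. Trust boundaries: what is exposed publicly, what stays local
4. Data flow: ingestion to storage to rollup to presentation
5. Failure modes: what happens when each dependency is unavailable
6. Operations: backup, recovery, how to re-run ingestion from scratch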

Fibonacci-Batch Random Sampling

In financial compliance, you validate transaction data by sampling. Check every record and you’ll never ship. Check none and you’ll ship garbage. The standard approach is statistical sampling with defined confidence levels.

For a personal telemetry pipeline, I use Fibonacci-sized batches: sample 33 random events across 7 batches of sizes 1, 1, 2, 3, 5, 8, 13. Run it on every server boot.

FIBONACCI_BATCHES = [1, 1, 2, 3, 5, 8, 13]  # 33 total samples

async def validate_on_boot(db):
    # Grab 33 random rows; if the table is smaller, the later slices are
    # simply empty and whatever exists still gets checked.
    events = await db.fetch("SELECT * FROM events ORDER BY RANDOM() LIMIT 33")
    offset = 0
    for batch_size in FIBONACCI_BATCHES:
        batch = events[offset:offset + batch_size]
        offset += batch_size
        for event in batch:
            # Envelope checks: ID present, plausible timestamp, known source.
            assert event["event_id"] is not None
            assert event["ts"] > 0
            assert event["source"] in KNOWN_SOURCES
            # Payload check: JSONB body matches the schema registered for its source.
            validate_payload_schema(event["source"], event["payload"])

Why Fibonacci? The increasing batch sizes catch both one-off corruption (the single-element batches) and systematic issues (the 13-element batch). It’s not statistically rigorous in the way a compliance auditor would demand, but it catches the classes of errors that actually occur in personal pipelines: null fields from a bad adapter, timestamps in seconds instead of milliseconds, payloads that changed schema between app versions.

The key insight from PCI: automated validation that runs without human intervention. You don’t “remember to check.” The system checks itself.
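Wiring that in takes a few lines in the app's startup hook. A sketch, assuming FastAPI and the validate_on_boot function above (get_db_pool is a hypothetical helper); a failed assertion here stops the server from coming up at all:

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    db = await get_db_pool()      # hypothetical helper that opens the pool
    await validate_on_boot(db)    # an AssertionError here aborts startup
    app.state.db = db
    yield
    await db.close()

app = FastAPI(lifespan=lifespan)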

SQL as the Single Source of Truth

PCI Requirement 10 is about logging and monitoring. The principle underneath it: every data access path should be auditable and consistent. In practice, that means you don’t write ad-hoc queries scattered across route handlers.

Define every query once in a registry:

# queries.py
QUERIES = {
    "events_today": """
        SELECT source, type, ts, payload
        FROM events
        WHERE ts >= :start_of_day
        ORDER BY ts DESC
    """,
    "daily_summary": """
        SELECT date, steps, hrv_avg, drive_count, sleep_hours
        FROM daily_rollup
        WHERE date >= :since
        ORDER BY date DESC
    """,
    "source_counts": """
        SELECT source, COUNT(*) as n
        FROM events
        GROUP BY source
        ORDER BY n DESC
    """,
}

Routes become thin wrappers:

@router.get("/today")
async def today(db=Depends(get_db)):
    rows = await db.fetch(QUERIES["events_today"], start_of_day=today_ms())
    return rows

The SQL explorer endpoint runs the exact same queries. When you debug from the browser, you see the same data the API returns. One source of truth. No divergence between “what the dashboard shows” and “what the database contains.”

This pattern comes directly from financial systems where auditors need to verify that reports match the underlying data. In a personal project, it saves you from the classic bug: the API returns one thing, the debug query returns another, and you spend an hour discovering they’re hitting different tables.
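A minimal sketch of the explorer endpoint mentioned above, assuming the QUERIES registry and get_db dependency already shown; the whitelist check is what keeps it from becoming an arbitrary-SQL hole:

from fastapi import Depends, HTTPException, Request

@router.get("/explore/{query_name}")
async def explore(query_name: str, request: Request, db=Depends(get_db)):
    # Only named, pre-reviewed queries from the registry can run.
    if query_name not in QUERIES:
        raise HTTPException(status_code=404, detail="unknown query")
    # Bind parameters (e.g. ?since=...) from the query string; values arrive as strings.
    return await db.fetch(QUERIES[query_name], **request.query_params)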

API Call Audit Logging

Every HTTP request to your server gets logged to a database table. Not to a rotating log file that gets deleted. To a queryable table.

CREATE TABLE api_log (
    id SERIAL PRIMARY KEY,
    ts TIMESTAMPTZ DEFAULT NOW(),
    method TEXT,
    path TEXT,
    query_params JSONB,
    status INT,
    duration_ms FLOAT,
    client_ip TEXT
);

Middleware captures it automatically:

import json
import time

@app.middleware("http")
async def log_requests(request, call_next):
    start = time.monotonic()
    response = await call_next(request)
    duration = (time.monotonic() - start) * 1000
    # Insert after the handler returns so duration_ms covers the full request.
    # Most async drivers expect JSON text for a JSONB parameter, hence json.dumps.
    await db.execute(
        "INSERT INTO api_log (method, path, query_params, status, duration_ms, client_ip) "
        "VALUES ($1, $2, $3, $4, $5, $6)",
        request.method, request.url.path,
        json.dumps(dict(request.query_params)), response.status_code,
        duration, request.client.host
    )
    return response

This isn’t security theater. It’s operational visibility. When your dashboard loads slowly on Tuesday, you query the audit log and discover that /timeline is taking 1,200ms because a geocoding cache miss is blocking the response. Without the log, you guess. With the log, you know.
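For example, finding the slowest paths over the last day is one query against that table:

SELECT path,
       COUNT(*) AS hits,
       AVG(duration_ms) AS avg_ms,
       MAX(duration_ms) AS max_ms
FROM api_log
WHERE ts > NOW() - INTERVAL '1 day'
GROUP BY path
ORDER BY avg_ms DESC
LIMIT 10;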

PCI Requirement 10.2 mandates logging all access to cardholder data. The personal-project version: log all access to everything. Storage is cheap. Debugging without data is expensive.

Deterministic Event IDs

Financial systems use deterministic transaction IDs to prevent double-processing. A payment processed twice is a catastrophe. The same principle applies to data ingestion: if you run your pipeline twice, you should get the same result.

Generate event IDs from the content itself:

import hashlib

def make_event_id(source: str, ts: int, content: str) -> str:
    raw = f"{source}:{ts}:{content[:64]}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

Sixteen hex characters from a SHA-256 of source, timestamp, and the first 64 characters of content. The truncation leaves 64 bits, so by the birthday bound collisions only become a real possibility as you approach billions of events; at the millions-of-events scale of a personal pipeline, the probability is vanishingly small.

Ingestion becomes idempotent:

INSERT INTO events (event_id, ts, source, type, payload)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (event_id) DO NOTHING;

Run it 100 times. Same result. No deduplication logic. No “check if exists first” queries. No race conditions. The database handles it with a unique constraint and a single atomic operation.

This is the single most valuable pattern I borrowed from financial systems. It transforms data ingestion from a fragile, order-dependent process into something you can run on a cron job without worrying.
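Putting the two pieces together, an ingestion pass is just a loop. A sketch, assuming a source adapter that yields (ts, type, content, payload) tuples; the names here are illustrative:

import json

async def ingest(db, source: str, records):
    # records: iterable of (ts_ms, event_type, content, payload) from one adapter
    for ts, event_type, content, payload in records:
        event_id = make_event_id(source, ts, content)
        await db.execute(
            "INSERT INTO events (event_id, ts, source, type, payload) "
            "VALUES ($1, $2, $3, $4, $5) ON CONFLICT (event_id) DO NOTHING",
            event_id, ts, source, event_type, json.dumps(payload)
        )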

JSONB for Heterogeneous Telemetry

When your data comes from four different sources with incompatible schemas — two datetime formats, nested dictionaries in some records and flat fields in others, schema changes between app versions — you have two options: fight the schema or embrace the flexibility.

CREATE TABLE events (
    event_id TEXT PRIMARY KEY,
    ts BIGINT NOT NULL,
    source TEXT NOT NULL,
    type TEXT NOT NULL,
    payload JSONB
);

The fixed columns (event_id, ts, source, type) give you fast indexed queries for the operations you always need: filter by time range, filter by source, filter by type. The payload column absorbs everything else.

A charger event’s payload might look like:

{"action": "plug", "location_id": "loc_a7f3", "battery_pct": 42}

A sleep event’s payload:

{"duration_hrs": 7.2, "quality": "fair", "hr_avg": 58}

No schema migration when you add a new data source. No nullable columns proliferating across the table. The tradeoff is that you lose column-level type checking on the payload, but for telemetry data that changes shape regularly, that tradeoff is correct.
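When a query does need to reach inside the payload, Postgres's JSON operators handle it. The source and action values below are illustrative, matching the charger payload above:

-- index the fixed columns that every query filters on
CREATE INDEX IF NOT EXISTS events_source_ts_idx ON events (source, ts);

-- reach into the payload only when a specific query needs it
SELECT AVG((payload->>'battery_pct')::int) AS avg_battery_at_plugin
FROM events
WHERE source = 'charger'
  AND payload->>'action' = 'plug';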

PCI systems use similar patterns for transaction metadata — a fixed envelope with a flexible payload — because the alternative is a migration every time a payment processor changes their response format.

Background Geocoding Pipelines

If your telemetry includes location data, you’ll want human-readable place names. Geocoding APIs (like Nominatim) are rate-limited — typically 1 request per second for free tiers. If you have 1,200+ unique addresses to resolve, that’s 20 minutes of geocoding.

Do not block server startup on this.

import asyncio

async def geocode_background(db):
    # Find location hashes that haven't been resolved into the cache yet.
    uncached = await db.fetch(
        "SELECT DISTINCT location_hash FROM events "
        "WHERE location_hash NOT IN (SELECT hash FROM geocode_cache)"
    )
    for loc in uncached:
        result = await nominatim_lookup(loc["location_hash"])
        await db.execute(
            "INSERT INTO geocode_cache (hash, display_name, lat, lon) "
            "VALUES ($1, $2, $3, $4) ON CONFLICT DO NOTHING",
            loc["location_hash"], result["display_name"],
            result["lat"], result["lon"]
        )
        await asyncio.sleep(1.0)  # respect Nominatim's ~1 req/sec limit

Launch it as a background task on startup. The server responds immediately. Geocoded names appear as the cache fills. Pages that need location names fall back to the hash ID until the cache catches up.
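Extending the lifespan hook sketched earlier, the launch is one create_task call. Keeping a reference on app.state prevents the task from being garbage-collected mid-run and lets shutdown cancel it:

import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    db = await get_db_pool()      # hypothetical helper, as before
    await validate_on_boot(db)    # fast: 33 rows
    app.state.geocode_task = asyncio.create_task(geocode_background(db))
    app.state.db = db
    yield
    app.state.geocode_task.cancel()
    await db.close()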

The compliance parallel: PCI requires that security scans don’t impact production availability. The engineering principle is the same — long-running data operations should never block the critical path.

Materialized Views as Analytics Layer

You have 11 data sources. Steps from one, heart rate from another, driving events from a third, social posts from a fourth. Querying across all of them for a daily summary means joining 11 tables or running 11 queries. Both are slow and fragile.

Build a materialized view:

CREATE MATERIALIZED VIEW daily_rollup AS
SELECT
    date_trunc('day', to_timestamp(ts / 1000)) AS date,
    SUM(CASE WHEN source = 'health' AND type = 'steps'
        THEN (payload->>'value')::int END) AS steps,
    AVG(CASE WHEN source = 'health' AND type = 'hrv'
        THEN (payload->>'value')::float END) AS hrv_avg,
    COUNT(CASE WHEN source = 'vehicle' AND type = 'drive' THEN 1 END) AS drive_count,
    MAX(CASE WHEN source = 'sleep' AND type = 'night'
        THEN (payload->>'duration_hrs')::float END) AS sleep_hours
FROM events
GROUP BY 1
ORDER BY 1 DESC;

One row per day. Every metric in one place. Refresh it on a schedule or after ingestion:

REFRESH MATERIALIZED VIEW CONCURRENTLY daily_rollup;

The CONCURRENTLY keyword means reads aren’t blocked during refresh. Your dashboard stays responsive while the view rebuilds.
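One caveat: Postgres only allows a concurrent refresh when the materialized view has a unique index covering all rows. For this rollup, a unique index on the date column is enough:

CREATE UNIQUE INDEX daily_rollup_date_idx ON daily_rollup (date);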

Financial reporting systems have used this pattern for decades — pre-aggregated summary tables that can be queried without touching the raw transaction log. The raw data stays intact for auditing. The rollup exists for speed.

The Build Commandments Pattern

PCI DSS Requirement 12 is about maintaining a security policy. The personal-project equivalent: write down your architecture decisions as numbered rules and put them where every future contributor (including your future self, including an AI agent) will read them.

## Build Commandments

1. Every query lives in queries.py. Routes are thin wrappers.
2. Event IDs are deterministic SHA-256 hashes. No UUIDs.
3. Ingestion is idempotent. ON CONFLICT DO NOTHING.
4. All timestamps are milliseconds since epoch.
5. JSONB for flexible payload. Fixed columns for indexed fields.
6. Background tasks never block server startup.
7. API audit log captures every request.
8. Fibonacci sampling validates data on every boot.
9. Materialized views for cross-source analytics.
10. No raw coordinates or addresses in API responses.

This isn’t documentation in the traditional sense. It’s a contract. When you come back to the project in three months and can’t remember why you chose millisecond timestamps, rule 4 tells you. When an AI agent is making changes, it reads the commandments first and builds within them.

The pattern works because it’s small enough to read in 30 seconds and specific enough to be actionable. A 40-page architecture document gets ignored. Ten numbered rules get followed.

The Takeaway

PCI DSS is a response to real engineering failures in systems that process billions of dollars. The patterns it codified — deterministic IDs, audit logging, automated validation, documented architecture, idempotent operations — aren’t specific to credit cards. They’re specific to systems that need to work reliably with data that matters.

Your personal telemetry pipeline processes data that matters to you. Treat it accordingly. Not with compliance paperwork, but with the engineering discipline that compliance paperwork was trying to enforce in the first place.