ReAct Agent Architecture for Personal Data

Agent Architecture + LLM Tool Calling web

TL;DR: A ReAct loop (observe, think, act, repeat) over 39 tools turns a regex chatbot into a genuine data agent. The key insight: user questions map to three cognitive axes — immediate (page context, free), vertical (one tool call, fast), horizontal (multi-tool search, expensive). Design your tool set around this cost model, not around your API surface.

The Pattern

ReAct (Reason + Act) is the simplest useful agent architecture. The LLM receives a system prompt containing tool descriptions, structured page context, and conversation history. It decides whether to answer directly or call a tool. If it calls a tool, the server executes it, feeds the result back as a new prompt, and the LLM decides again. Loop until the LLM produces plain text (no JSON tool call) or hits the step cap.

User question + page context + tool descriptions
  -> LLM Step 0: sees everything, decides action
     Option A: answer directly (plain text) -> done
     Option B: emit {"tool": "name", "args": {...}} -> execute
  -> Server: _parse_tool_call() extracts JSON
  -> Server: _normalize_tool_call() handles format variants
  -> Server: PARAM_ALIASES fixes parameter name mismatches
  -> Server: tool_def["execute"](**args) runs the tool
  -> Server: _truncate_result(2000 chars) caps output
  -> LLM Step 1: sees user question + tool result
     "give final answer or call another tool"
  -> ... (max 6 steps, token budget: 800 for tool selection, 2000 for synthesis)
  -> Final answer returned with {response, steps, tools_used, elapsed_ms}
  -> _log_conversation() fires as ensure_future (non-blocking DB write)

This is structurally identical to Vercel AI SDK’s stopWhen: stepCountIs(N) pattern, Intercom Fin’s three-phase pipeline (query refinement, RAG retrieval, accuracy validation), and Sprinklr Copilot’s dashboard summarization architecture. The data layer changes. The loop doesn’t.

The key difference from Vercel’s approach: we can’t use native function calling because DeepSeek V4 Flash on Ollama Cloud doesn’t support it. So we simulate it with structured prompting and a JSON parser. This forces us to handle format drift, which turns out to be the hardest part of the whole system.

Three Cognitive Axes

The most useful framing that emerged from building this: every user question maps to one of three cost levels.

Immediate (L1) — answerable from page context alone. “What book is playing?” The LLM already has this from the structured page context injected into the system prompt. Zero tool calls. Free.

Vertical (L2) — requires depth into one data source. “What are my highlights from Dune?” One tool call to get_highlights, done. Fast, single round-trip.

Horizontal (L3) — requires synthesis across sources. “What themes connect my journal entries this week to what I’m reading?” Multiple tool calls: get_journal_entries + get_highlights + maybe search_library. The LLM synthesizes across results. Expensive, 3-4 steps.

Design your tools to serve these axes, not to mirror your REST API. Combine endpoints that always get called together. Split endpoints that serve different cognitive intents.

Structured Page Context (Not DOM Scraping)

The first instinct is to scrape the DOM and dump it into the prompt. That’s what most open-source chat widgets do (ai-chat-widget by gmen1057 sends URL + title + headings + body text). It works for simple pages but breaks for rich applications where the same page has wildly different states.

Instead, each page has a dedicated context provider in _getPageContext() that returns structured JSON metadata. The LLM gets a curated summary, not raw HTML.

// Reading page context (what the LLM actually sees)
{
  "page": "/reading",
  "title": "The Chronology of Water",
  "author": "Lidia Yuknavitch",
  "playing": true,
  "position_secs": 2341,
  "duration_secs": 18000,
  "progress_pct": 13,
  "chapter": "The Living Book",
  "reading_mode": "epub",
  "reading_window": [
    "paragraph two before the active one...",
    "paragraph right before...",
    ">>> through your body. A sound, a smell, an image, and your body becomes a quivering wobble. Memory, for me, poses a kind of crisis in representation...",
    "paragraph right after...",
    "paragraph two after..."
  ],
  "active_paragraph_index": 142,
  "total_paragraphs": 6520
}

The >>> marker flags the active paragraph. The LLM gets 5 paragraphs of context (2 before, active, 2 after) — enough to discuss the passage meaningfully without flooding the context window. Position data comes directly from the <audio> element (audio.currentTime, audio.duration), not from any API.

Eight page-specific providers cover every page in the dashboard:

PageContext injected
Home /now-playing hero, today stats grid, recently-played shelf (6 books), timeline preview (3 events)
Research /researchselected book + author, all highlight books (title/author/meta), highlight count, 5 passage previews (80 chars each)
Reading /readingrecently-played (8 books), cached books count, reading window, playback state
Book /book/*rating key, genre, duration, summary, active tab, chapter/passage counts, epub state
Log /logtotal entries, visible count, entry type breakdown (data-etype counts), active facets
Map /static/globe.htmlHUD stats, time filter, active/inactive type pills, trip previews (3), drive stats panel
Journal /journalselected day, entry count, 3 entry previews (text + time + location), calendar state
Social /socialKPI values, active period filter, visible posts, semantic search query

Context placement follows the “lost in the middle” research: static instructions (personality, tool descriptions) go at the top of the system prompt (cacheable by KV cache), dynamic page context goes at the bottom (near the user’s question, where LLM attention is strongest).

Reading Window Persistence

The reading window is ephemeral — it changes every 30 seconds as audio plays. But the conversation history needs to remember what was playing during each exchange. Solution: when the user sends a message, the active paragraph text gets embedded into the message history:

// In weiBotSend()
if (ctx.reading_window) {
  const activeText = ctx.reading_window.find(p => p.startsWith('>>>'));
  if (activeText) {
    enrichedContent = text + ` [listening to: "${activeText.slice(4).slice(0, 300)}"]`;
  }
}
messages.push({ role: 'user', content: enrichedContent });

So at turn 3, the LLM sees the conversation history with turn 1’s passage embedded in the user message and turn 2’s passage in its message. It can trace thematic threads across a listening session without the page context having to preserve old state.

Tool Call Normalization

DeepSeek V4 Flash does not follow tool call format consistently. Across a single conversation, it will emit:

{"function": "get_highlights", "params": {"title": "Dune"}}
{"name": "get_highlights", "args": {"book_title": "Dune"}}
{"action": "get_highlights", "parameters": {"search": "Dune"}}

Three different key names for the function, three for the parameters, three for the same parameter (title vs book_title vs search). You need a normalization layer:

def _normalize_tool_call(raw: dict) -> tuple[str, dict]:
    name = raw.get("function") or raw.get("name") or raw.get("action")
    params = raw.get("params") or raw.get("args") or raw.get("parameters", {})
    # Alias common mismatches
    for alias, canonical in PARAM_ALIASES.items():
        if alias in params and canonical not in params:
            params[canonical] = params.pop(alias)
    return name, params

PARAM_ALIASES maps title to book_title, search to query, n to limit. This catches 90% of format drift. Without it, half your tool calls silently fail.

Hallucinated Tool Results

The critical prompt engineering lesson: DeepSeek will fabricate tool results. Given “search for books about consciousness,” it will output the tool call AND a made-up response in the same message. The agent loop then skips actual execution because it thinks the tool already ran.

The fix is a two-part prompt constraint:

  1. System prompt: “Output ONLY a JSON tool call. No other text. Do not fabricate results.”
  2. After each real tool result, reinforce: “The above is the REAL result. Base your next action on this data only.”

Strict output-only prompting drops hallucinated results from ~40% to under 5%.

Prefetch Engine

Loading all 39 tools’ data on every page load is wasteful. Most users never open the chat. The insight from Gmail’s attachment pre-upload pattern: start work when intent is signaled, not when the page loads.

The activation signal is the first message. Before that, the bot is just a floating button — zero network cost.

Bot closed -> nothing happens
User opens panel -> still nothing (just UI toggle)
User sends first message -> _activatePrefetch() fires
  |
  +-- VERTICAL: fetch get_highlights(current_book) via /api/agent/prefetch
  |   (5 recent highlights, ready for "analyze my passages")
  |
  +-- HORIZONTAL: fetch search_library(active_paragraph_keywords)
  |   (2 LoB connections, ready for "what connects to this?")
  |
  +-- CONTEXTUAL: fetch based on page
  |   /social -> social_overview + social_drift
  |   / -> today_summary
  |
  +-- Start paragraph observer (setInterval, 15s)
      On active paragraph change -> refresh LoB connections cache

The prefetch endpoint (GET /api/agent/prefetch?tool=X&param=Y) runs tools directly without the LLM — pure data fetch for caching. Parameters are validated against the tool’s declared schema and capped at limit=50 to prevent abuse.

Prefetched data gets injected into the system prompt with an explicit instruction:

[PRELOADED -- recent highlights for this book (5)]
- The Chronology of Water (13%): "through your body. A sound, a smell..."
- The Chronology of Water (11%): "the desire to capture what really happened..."

IMPORTANT: PRELOADED data is already available above. If it answers the
user's question, respond directly WITHOUT calling any tools.

The LLM sees the data before the tool descriptions. If the preloaded data answers the question, it skips the tool call entirely — zero round-trip latency. First question: ~8 seconds. Third question about the same book: instant.

The paragraph observer is the most interesting piece. As the narrator reads, it fires every 15 seconds, checks if the active paragraph changed, and silently refreshes the Library of Babel connections cache with keywords from the new paragraph. By the time you ask “what connects to this?”, the answer is already loaded. The observer cleans up on beforeunload to prevent stacking across navigations.

Conversation Logging as Training Corpus

Every agent exchange gets persisted as an agent_chat event in PostgreSQL via insert_events() (the same pipeline as audiomarks, journal entries, and social posts). The payload captures the full interaction:

{
  "type": "agent_chat",
  "source": "weibot",
  "payload": {
    "prompt": "What is she talking about?",
    "response": "She's laying out the central question of the memoir...",
    "tools_used": [],
    "steps": 0,
    "elapsed_ms": 4200,
    "page": "/reading",
    "book_title": "The Chronology of Water",
    "reading_passage": "through your body. A sound, a smell, an image..."
  }
}

The embedding worker (nomic-embed-text-v2-moe, 768d, 60s poll cycle) auto-embeds these alongside the 73K+ existing events. So agent conversations become vector-searchable across your entire life timeline. A future query like “what questions have I asked about memory and identity?” will find both your audiomark passages AND your bot conversations about those passages.

This is the training data flywheel: every question you ask teaches the system what questions get asked. Every answer (good or bad) is a labeled example. The fine-tuning dataset builds itself through normal usage. When LoRA on Gemma/Qwen via MLX is ready, the corpus is already waiting.

The Enterprise Transfer

The architecture maps directly to enterprise agent products. Swap the data layer:

PersonalEnterprise
Plex audiobook APICRM/ERP APIs
PostgreSQL eventsData warehouse
Library of BabelKnowledge base
39 personal toolsN domain tools
1 userMulti-tenant

The ReAct loop, tool normalization, prefetch engine, and conversation logging are all portable. The reading companion pattern (injecting active context from what the user is currently doing) generalizes to any product where the agent should be aware of the user’s current task.

DB-Direct Tools vs HTTP Proxies

Early version routed all tool calls through internal HTTP endpoints via httpx ASGITransport. Five tools silently broke — the endpoints accepted ?q= parameters but ignored them, returning unfiltered results. The timeline_search tool with q=anxiety returned random events. The get_highlights tool with book_title=White Album returned Han Kang.

The fix: bypass HTTP for any tool that needs search. Hit PostgreSQL directly with ILIKE queries on JSONB fields:

@tool("get_highlights", "Get highlight passages for a specific book title.",
      params={"book_title": "title of the book", "limit": "max passages"})
async def get_highlights(book_title: str = "", limit: int = 10, **kw):
    pool = get_pool()
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            """SELECT event_id, ts, payload, context
               FROM events
               WHERE type = 'audiomark'
                 AND payload->>'book_title' ILIKE $1
               ORDER BY ts DESC LIMIT $2""",
            f"%{book_title}%", limit,
        )
    return [
        {
            "book_title": _safe_payload(r["payload"], "book_title"),
            "author": _safe_payload(r["payload"], "author"),
            "transcript": _safe_payload(r["payload"], "transcript")[:300],
            "progress_pct": _safe_payload(r["payload"], "progress_pct"),
        }
        for r in rows
    ]

The _safe_payload() helper extracted during the code review replaces 13 repetitive isinstance(r["payload"], dict) checks across the codebase.

Rule of thumb: use HTTP proxies for read-only aggregate endpoints (stats, overviews). Use DB-direct for anything with user-specified search parameters.

Key Constraints

  • Max 6 steps. Prevents runaway tool chains. If the LLM can’t answer in 6 steps, the question is too broad.
  • Token budget split. 800 max_tokens for tool selection steps (the LLM only needs to emit a JSON object). 2000 for synthesis steps (the final answer needs room).
  • Strict JSON output. No prose in tool call responses. The LLM fabricates results when given room. Reinforce after every real tool result.
  • Parameter aliases. LLMs don’t respect your schema. _normalize_tool_call() + PARAM_ALIASES catches 90% of format drift.
  • Prefetch on engagement, not on load. Zero cost for users who never chat.
  • Sanitize prefetch params. The /api/agent/prefetch endpoint validates incoming params against the tool’s declared schema and caps limit at 50.
  • Log everything. The conversation history is more valuable than the conversation. agent_chat events get embedded alongside 73K other events.
  • Cross-page memory. sessionStorage for in-session persistence (30 messages). PostgreSQL for permanent record. Navigation recovery re-fires pending requests on page change.