asyncpg Dual Pools for Cross-DB Vector Search


TL;DR

asyncpg can maintain separate connection pools to different databases on the same Postgres instance. pgvector columns come back as raw strings like '[0.1,0.2,...]' — not Python lists. Use json.loads() or pass them straight to SQL.

The Setup

life-dashboard stores events with 768-dimensional embeddings in life_dashboard. LibraryOfBabel stores 31k+ book chunk embeddings in knowledge_base. Both live on the same Postgres instance, both use pgvector.

The goal: find library passages similar to a user’s highlights. That means querying both databases from one FastAPI process.

Dual Pool Pattern

# app/db.py
import asyncpg

_ld_pool: asyncpg.Pool | None = None   # life_dashboard
_kb_pool: asyncpg.Pool | None = None   # knowledge_base

async def get_pool() -> asyncpg.Pool:
    global _ld_pool
    if _ld_pool is None:
        _ld_pool = await asyncpg.create_pool(
            "postgresql://user@localhost:5432/life_dashboard"
        )
    return _ld_pool

async def get_kb_pool() -> asyncpg.Pool:
    global _kb_pool
    if _kb_pool is None:
        _kb_pool = await asyncpg.create_pool(
            "postgresql://user@localhost:5432/knowledge_base"
        )
    return _kb_pool

Each pool is independent. No cross-database joins, no foreign data wrappers, no dblink. You fetch the embedding from DB A, then query DB B with it. Simple, no Postgres extensions needed beyond pgvector itself.

The pgvector String Gotcha

asyncpg doesn’t have a native pgvector type codec. When you SELECT embedding FROM events, you get a Python string:

row['embedding']  # '[0.0123,0.0456,...,0.0789]'  -- a string, not a list

This works fine for the common case: passing it straight back into another SQL query. With an explicit $1::vector cast, Postgres parses the string back into a vector:

SELECT * FROM chunk_embeddings
ORDER BY embedding <=> $1::vector
LIMIT 5

If you need the values in Python (for averaging, etc.), json.loads() gives you a float list. Don’t try to register a custom asyncpg codec for pgvector — it’s not worth the complexity.
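When you do need the floats in Python, the bracketed text is valid JSON. A sketch of parsing and averaging (the helper names are mine; averaging is one plausible use, e.g. building a centroid of several highlight embeddings):

```python
import json

def parse_vector(raw: str) -> list[float]:
    """pgvector's text format '[0.1,0.2,...]' happens to be valid JSON."""
    return json.loads(raw)

def mean_vector(raws: list[str]) -> str:
    """Average several pgvector strings and re-serialize the result
    in the same bracketed format, ready to pass back as $1::vector."""
    vectors = [parse_vector(r) for r in raws]
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return "[" + ",".join(f"{x:g}" for x in mean) + "]"
```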

Embedding Model Alignment

The cross-DB query only works if both databases used the same embedding model. LibraryOfBabel has chunks embedded with both nomic-embed-text (v1) and nomic-embed-text-v2-moe. These produce different vector spaces. Always filter:

WHERE embedding_model = 'nomic-embed-text-v2-moe'

Mixing models gives you results that look plausible but are semantically wrong. There’s no error, no warning — just bad similarity scores that are hard to distinguish from “nothing relevant found.”

Patch Matching: Finding Exact Passages

Whisper transcriptions are approximate. To find the exact passage in the ebook, use a sliding window approach:

  1. Split the transcription into 6-word windows
  2. Start from the middle (whisper is most accurate there, edges drift)
  3. ILIKE '%six word window%' against the book’s chunks
  4. Fallback: pg_trgm <-> distance operator for fuzzy matches

Build the ILIKE pattern in Python: pass f"%{window}%" as the value of $1 rather than interpolating the window into the SQL string. The wildcards travel inside the parameter, so the query stays fully parameterized and safe from injection.
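The middle-out window generation from the steps above can be sketched like this (the function name and the interleaving order are my own; only the 6-word window and middle-first ordering come from the text):

```python
def middle_out_windows(text: str, size: int = 6) -> list[str]:
    """Return fixed-size word windows ordered from the middle of the
    transcription outward, since Whisper is most accurate mid-utterance
    and drifts at the edges."""
    words = text.split()
    starts = list(range(max(len(words) - size + 1, 1)))
    mid = len(starts) // 2
    # sort start positions by distance from the middle, nearest first
    order = sorted(starts, key=lambda s: (abs(s - mid), s))
    return [" ".join(words[s:s + size]) for s in order]
```

Each window then feeds a parameterized query such as `WHERE chunk_text ILIKE $1`, with `f"%{window}%"` as the argument, stopping at the first hit.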

In practice this yielded 655 exact substring matches and 15 trigram fallback matches across 1,408 highlights. The middle-out strategy matters: starting from the edges drops the match rate by ~30%.

Practical Takeaways

  • Dual asyncpg pools are the simplest way to query across databases. No Postgres config changes needed.
  • pgvector values are strings in asyncpg. Don’t fight it.
  • Always filter by embedding model name when querying shared vector tables.
  • For fuzzy text matching, middle-out sliding windows beat sequential scanning.
  • Build ILIKE wildcards in Python, not SQL, when using asyncpg parameterized queries.

Developer Perspective

The temptation with cross-database queries is to reach for foreign data wrappers or some unified schema. Resist it. Two pools, two queries, combine in Python. It’s 10 lines of code, zero Postgres config, and trivially debuggable. The “proper” solutions add operational complexity for zero user-facing benefit.

The pgvector string thing will trip you up exactly once. After that it’s a non-issue. The real trap is model alignment — silently wrong results from mismatched embedding spaces will waste hours before you think to check.