Gemini Embeddings: $3.50 to Re-embed 362K Chunks
TL;DR
Gemini’s embedding API (gemini-embedding-001, 768d) replaced local Ollama inference for 362K+ text chunks at ~$3.50 total cost. The bigger win: removing YAKE keyword extraction and sending raw prose directly to the embedder produced far better semantic search results.
The Migration: Ollama to Gemini
The Library of Babel indexes ~5,000 books and 2.4M text passages. The original pipeline used nomic-embed-text-v2-moe running locally via Ollama. It worked, but local inference on an M2 Pro is slow at scale, and the semantic quality had a ceiling.
Gemini’s embedding API changed the math:
- Batch endpoint: 100 texts per call, paid tier ~200 RPM
- Cost: $25 prepaid Google Cloud credits, actual spend ~$3.50 for 362K chunks
- Throttle: 0.3s between calls after upgrading from the free tier (which rate-limits aggressively); that spacing caps throughput at about 200 calls per minute, in line with the paid-tier limit
- Dimensions: 768d vectors, same as what pgvector was already storing
The batch re-embed script handles rate limits with exponential backoff, budget caps to prevent runaway spend, and checkpoint resumption so a killed process picks up where it left off.
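The three safeguards named above (backoff, budget cap, checkpoint resumption) can be sketched roughly as follows. This is a minimal illustration, not the actual script: `embed_batch`, `store`, `RateLimited`, and the checkpoint file layout are hypothetical placeholders, and the per-chunk cost is only an estimate derived from the figures quoted in this post.

```python
import json
import os
import random
import time
from typing import Callable, List, Sequence

BATCH_SIZE = 100                 # texts per call, per the limit quoted above
THROTTLE_SECONDS = 0.3           # pause between calls (about 200 calls/minute)
COST_PER_CHUNK = 3.50 / 362_000  # rough estimate from this post's totals


class RateLimited(Exception):
    """Hypothetical error raised by embed_batch when the API throttles a call."""


def load_checkpoint(path: str) -> int:
    """Return the index of the next unprocessed chunk (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path: str, next_index: int) -> None:
    """Write to a temp file, then rename, so a kill mid-write leaves the old file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def reembed(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],  # assumed API wrapper
    store: Callable[[int, List[List[float]]], None],            # assumed DB writer
    checkpoint_path: str,
    budget_usd: float,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Embed chunks in batches; return the estimated spend for this run."""
    start = load_checkpoint(checkpoint_path)
    spent = 0.0  # tracks this run only; a resumed run starts again from zero
    for i in range(start, len(chunks), BATCH_SIZE):
        batch = chunks[i:i + BATCH_SIZE]
        batch_cost = len(batch) * COST_PER_CHUNK
        if spent + batch_cost > budget_usd:
            break  # budget cap: stop before the next call would exceed it
        for attempt in range(max_retries):
            try:
                vectors = embed_batch(batch)
                break
            except RateLimited:
                # Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at 60s.
                sleep(min(60.0, 2.0 ** attempt) + random.random())
        else:
            raise RuntimeError(f"batch at index {i} still rate-limited after {max_retries} tries")
        store(i, vectors)
        spent += batch_cost
        save_checkpoint(checkpoint_path, i + len(batch))  # resume point
        sleep(THROTTLE_SECONDS)
    return spent
```

Saving the checkpoint only after `store` succeeds means a killed process repeats at most one batch on restart, which is harmless as long as the write is idempotent (for example, an upsert keyed by chunk).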
YAKE Extraction Was Actively Harmful
This was the session’s most counterintuitive finding. The pipeline previously ran YAKE keyword extraction on passages before embedding — the theory being that keywords would produce tighter, more relevant vectors.
Wrong. Removing YAKE and embedding raw prose produced dramatically better semantic search. The clearest example: the query “religious prohibition against thinking machines”, which contains no Dune-specific words, returns the Dune books that deal with the Butlerian Jihad. YAKE was stripping the semantic context that made this match possible.
The lesson: modern embedding models understand prose. Preprocessing that reduces text to keywords throws away exactly the contextual signal the model needs.
Multi-Model pgvector Queries
After migration, the database contains embeddings from two models: Gemini for re-embedded chunks and nomic-embed-text-v2-moe for anything not yet migrated. You cannot compare vectors across models: each model defines its own vector space, so a query must be embedded with the same model as the stored vectors it is scored against.
The solution: query each model’s embeddings separately, merge results, deduplicate by book, and apply a minimum similarity threshold (0.45). This lets the system degrade gracefully during migration rather than requiring a complete cutover.
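-- Query Gemini embeddings with the Gemini-embedded query vector
-- (<=> is pgvector's cosine distance, so similarity = 1 - distance)
SELECT ... FROM chunks WHERE model = 'gemini' ORDER BY embedding <=> $query_gemini LIMIT 20;
-- Query nomic embeddings separately, with the nomic-embedded query vector
SELECT ... FROM chunks WHERE model = 'nomic' ORDER BY embedding <=> $query_nomic LIMIT 20;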
-- Merge in application code, dedup by book
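The application-side merge might look something like the Python sketch below. The `Hit` record, its field names, and `merge_hits` are hypothetical, chosen only for illustration; the sketch assumes each query returns the cosine distance alongside the book and chunk identifiers. It keeps the best-scoring chunk per book and drops anything under the 0.45 floor.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

MIN_SIMILARITY = 0.45  # the floor quoted in the text


@dataclass(frozen=True)
class Hit:
    """One row from either query (hypothetical shape)."""
    book_id: str
    chunk_id: str
    distance: float  # cosine distance, as returned by pgvector's <=> operator
    model: str

    @property
    def similarity(self) -> float:
        # pgvector reports cosine distance; similarity is 1 - distance.
        return 1.0 - self.distance


def merge_hits(
    *result_sets: Iterable[Hit],
    min_similarity: float = MIN_SIMILARITY,
    limit: int = 20,
) -> List[Hit]:
    """Merge per-model results, keep the best hit per book, apply the floor."""
    best: Dict[str, Hit] = {}
    for hits in result_sets:
        for hit in hits:
            if hit.similarity < min_similarity:
                continue  # below the floor: treat as noise
            current = best.get(hit.book_id)
            if current is None or hit.similarity > current.similarity:
                best[hit.book_id] = hit
    # Rank the surviving books by similarity, best first.
    return sorted(best.values(), key=lambda h: h.similarity, reverse=True)[:limit]


gemini_hits = [Hit("dune", "c1", 0.30, "gemini"), Hit("emma", "c2", 0.70, "gemini")]
nomic_hits = [Hit("dune", "c9", 0.40, "nomic"), Hit("ubik", "c3", 0.50, "nomic")]
merged = merge_hits(gemini_hits, nomic_hits)
print([(h.book_id, h.chunk_id) for h in merged])  # [('dune', 'c1'), ('ubik', 'c3')]
```

One simplification to be aware of: ranking the merged list by raw similarity treats scores from the two models as comparable, which is only an approximation. It is tolerable as a stopgap during migration, but the ordering between a Gemini hit and a nomic hit should not be read too closely.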
Practical Takeaways
- Gemini embedding API batch endpoint: max 100 texts per call. Budget ~$0.01 per 1,000 chunks, based on the ~$3.50 spent on 362K chunks here; pricing may change.
- Rate limits: Free tier is nearly unusable for batch work. Paid tier at 200 RPM with 0.3s throttle is stable.
- screen > nohup: Long-running embed jobs need screen sessions. nohup only shields a process from SIGHUP, and in this setup nohup jobs were still killed by signals from other tools. Jobs running inside screen survived.
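- chunk_id string comparison: When start_position columns are unpopulated, ordering by chunk_id as a string can still preserve passage sequence within a book. This relies on the IDs sorting lexicographically in passage order (for example, zero-padded numeric suffixes); without padding, "10" sorts before "2".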
- Similarity threshold: 0.45 cosine similarity was a reasonable floor for “related” results on this corpus. Below that, results were mostly noise. Treat the value as a starting point to tune rather than a universal constant.
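The budget and rate-limit figures in the list above can be cross-checked with a few lines of arithmetic. The inputs are the numbers quoted in this post; the runtime is a lower bound, since it counts only the throttle pauses and ignores network latency and retries.

```python
import math

total_chunks = 362_000
total_cost_usd = 3.50
batch_size = 100        # texts per call
throttle_seconds = 0.3  # pause between calls

cost_per_1000 = total_cost_usd / total_chunks * 1000
calls_needed = math.ceil(total_chunks / batch_size)
max_calls_per_minute = 60 / throttle_seconds
min_runtime_minutes = calls_needed * throttle_seconds / 60

print(f"cost per 1,000 chunks: ${cost_per_1000:.4f}")        # $0.0097
print(f"API calls needed:      {calls_needed}")              # 3620
print(f"max calls per minute:  {max_calls_per_minute:.0f}")  # 200
print(f"minimum runtime:       {min_runtime_minutes:.1f} minutes")  # 18.1 minutes
```

So the "~$0.01 per 1,000 chunks" rule of thumb rounds up slightly from the observed spend, and a 0.3s throttle sits right at the 200 RPM ceiling with no headroom, which is why backoff on rate-limit errors still matters.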
Developer Perspective
The embedding API market has commoditized faster than anyone expected. Local inference made sense in 2024 when API costs were high and privacy mattered. In 2026, embedding 362K chunks costs less than a coffee. The competitive advantage has shifted entirely to what you do with the vectors — the search architecture, the deduplication logic, the multi-model merge strategy.
The YAKE finding is the one I keep coming back to. We spent years building keyword extraction pipelines on the theory that they “help the model focus.” Modern embedding models do not need help focusing. They need the full text. Every preprocessing step you add is a bet that you understand language better than the model. That bet increasingly loses.