From Retrieval to Production: Reranking, Caching, and the Streaming Architecture Behind Real-Scale RAG
Part 1 covered what to store and how to retrieve it. Part 2 covers what breaks when real users arrive — and how production systems like Perplexity and ChatGPT are actually wired to handle it
Part 2 of 2 — Running at Scale
Part 1 covered what to store and how to retrieve it.
Part 2 covers what happens when a real user hits your system — and ten thousand more arrive right behind them.
The gap between a working RAG pipeline and a production RAG system is not the graph. It's everything around the graph.
After finishing the LangGraph retrieval pipeline for this system — question rewriting, domain routing, vector search, document grading — I ran it against real queries. The architecture was correct. The numbers were not.
Vector similarity scores were hovering between 0.18 and 0.23. The LLM document grader was rejecting 4 out of every 5 retrieved chunks, then billing for the GPT call anyway. And every query was taking 1.5–3 seconds before the user saw a single character of response.
These are not edge cases. They're the default state of a RAG system before you've tuned it for production. This post walks through the three changes that actually moved the needle, and the architecture that makes the whole thing scalable.
The problem with cosine similarity as your only relevance signal
Before getting to solutions, it's worth being precise about why the system was returning low scores in the first place.
A bi-encoder embedding model — whether that's text-embedding-3-small or BAAI/bge-m3 — encodes the query into a vector independently of any document. It has no idea what documents exist. It's projecting language into a geometric space, and cosine distance in that space approximates semantic relatedness — not true task-level relevance.
This is a fundamental limitation, not a bug.
When a user asks a specific question, the query embedding lands in a broad semantic region. Nearby documents may be topically related, but still fail to precisely answer the intent due to phrasing mismatch, granularity mismatch, or missing alignment between query and document structure.
Cosine similarity is a recall mechanism. It is designed to cast a wide net. The precision problem — identifying which of those retrieved chunks actually answers the question — requires something different.
That something is a reranker.
Stage 1: Replace the LLM grader with a cross-encoder reranker
The LLM document grader I had written was doing exactly one thing: taking each retrieved chunk and asking GPT-4o-mini whether it was relevant. Five chunks, five separate LLM calls, five round trips. At ~$0.40 per 1k input tokens for GPT-4o-mini, grading 5 documents per query at 1000 characters each costs roughly $2.00 per 1,000 queries. A cross-encoder reranker does the same job for $0.10 per 1,000 queries with better accuracy.
Here's the architectural difference that matters:
Bi-encoder (embedding model): Encodes query and document separately into vectors. Fast. Scalable. Used at retrieval time. Loses the semantic relationship between the two.
Cross-encoder (reranker): Takes query and document together as a single input. Full attention across both sequences. The model can see how specific words in the query relate to specific words in the document. Used at reranking time. 50–400ms for a batch. Not scalable as a retriever, but perfect as a precision filter.
This is the canonical two-stage architecture in information retrieval: bi-encoder for recall, cross-encoder for precision.
Databricks benchmarks show reranked results reduce LLM hallucinations by 35% compared to raw embedding similarity. Research comparing dedicated rerankers against LLM-as-grader approaches consistently finds that specialized cross-encoders are more accurate, faster, and 20× cheaper than using a general-purpose LLM for this task.
Choosing the right reranker model
Model selection matters more when dealing with multilingual or domain-specific data:
| Model | Languages | Latency (30 docs, GPU) | Cost | Notes |
|---|---|---|---|---|
| BAAI/bge-reranker-v2-m3 | 100+ | ~50ms | Free (self-hosted) | Best open-source multilingual option |
| Cohere rerank-multilingual-v3.0 | 100+ | ~200ms (API) | $0.10/1k queries | Managed, SLA-backed |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | English | ~30ms | Free | Strong for English only |
| LLM grader (GPT-4o-mini) | Any | 1–3s (sequential) | ~$2.00/1k | What we're replacing |
Reranking 30 documents takes approximately 100–150ms on CPU, or 30–50ms on a T4 GPU. Even CPU latency is a fraction of the time saved by removing multiple sequential LLM calls from your hot path.
The updated graph node
Remove grade_documents entirely. The new rerank_documents node replaces it:
import cohere
co = cohere.Client(api_key=os.getenv("COHERE_API_KEY"))
def rerank_documents(state: QueryState):
documents = state.get('documents', [])
if not documents:
state["proceed_to_generate"] = False
return state
question = state['rephrased_query']
doc_texts = [
doc.get("document", {}).get("text", "")[:1000]
for doc in documents
]
# Single API call — scores all docs against the query simultaneously
response = co.rerank(
model="rerank-multilingual-v3.0",
query=question,
documents=doc_texts,
top_n=3,
return_documents=False
)
THRESHOLD = 0.30 # calibrate against your domain
reranked = [
{**documents[r.index], "rerank_score": r.relevance_score}
for r in response.results
if r.relevance_score >= THRESHOLD
]
if not reranked:
best = response.results[0]
reranked = [{**documents[best.index], "rerank_score": best.relevance_score}]
state["documents"] = reranked
state["proceed_to_generate"] = True
return state
Also update fiqh_retrieve to request more candidates — the reranker handles precision:
results = qdrant_client.search(
collection_name=COLLECTION_NAME,
query_vector=query_embedding,
limit=20, # was 5 — more candidates improves reranker effectiveness
score_threshold=0.15
)What your logs look like after this change
Before:
cosine_score=0.21 cosine_score=0.19
After:
rerank_score=0.847 rerank_score=0.612
Rerank scores are calibrated relevance. A 0.847 indicates the document is strongly relevant to the query. A cosine score of 0.21 provides very little actionable signal.
The rerank score is also useful for observability and can be surfaced in UI layers as a confidence indicator alongside citations.
Stage 3: The streaming architecture — how Perplexity handles 780M monthly queries
When Perplexity reached scale, they faced the same question every LLM product eventually hits: how do you keep the HTTP thread from blocking while a pipeline that takes 1–3 seconds runs in the background?
Their answer — and the industry answer — is a pattern that has three moving parts: a task queue that decouples request acceptance from processing, a pub/sub channel that decouples token generation from token delivery, and Server-Sent Events that deliver the stream to the browser over a standard HTTP connection.
This is exactly how ChatGPT's streaming works. When you send a message, the client opens an SSE connection. The server generates tokens and pushes them as newline-delimited strings. The browser receives each chunk and appends it to the UI. You perceive it as the model "thinking" in real time.
The difference between a toy implementation and a production one is what sits between the request and the model call.
Why SSE instead of WebSockets
SSE is server-to-client only — unidirectional. That's exactly what LLM streaming needs. WebSockets are bidirectional, which means heavier connection management, more infrastructure complexity, and a protocol designed for problems you don't have. SSE runs over HTTP/2, which multiplexes it cleanly across the same connection as your other requests. LinkedIn uses SSE for their real-time feed updates. OpenAI uses it for ChatGPT. For token delivery, it's the right choice.
The request lifecycle — end to end
Step 1 — POST /query returns 202 in under 5ms.
The FastAPI server does exactly one thing: generate a task_id, push the task onto the Celery queue, and return. No blocking, no LLM call, no graph execution.
@app.post("/query")
async def submit_query(request: QueryRequest):
task_id = str(uuid.uuid4())
run_rag_pipeline.apply_async(
args=[task_id, request.question, request.session_id],
task_id=task_id
)
return {"task_id": task_id, "status": "queued"}Step 2 — The client immediately opens an SSE stream.
FastAPI subscribes to a Redis channel named after the task_id and holds the connection open. No token is consumed on the API server while waiting — it's pure async I/O.
@app.get("/stream/{task_id}")
async def stream_response(task_id: str):
async def generator():
pubsub = redis_client.pubsub()
await pubsub.subscribe(f"rag:stream:{task_id}")
async for message in pubsub.listen():
if message["type"] == "message":
data = json.loads(message["data"])
yield f"event: {data['event']}\ndata: {json.dumps(data)}\n\n"
if data["event"] == "done":
break
return EventSourceResponse(generator())Step 3 — A Celery worker picks up the task. This is where the LangGraph pipeline runs. The semantic cache is checked first — a hit skips the entire graph and publishes the stored answer directly to the Redis channel. A miss runs the full pipeline.
Inside the generate_answer node, instead of collecting the full response and returning it, the LLM is streamed and each token is published immediately:
def generate_answer(state: QueryState):
task_id = state.get("task_id")
for chunk in llm.stream(prompt):
redis_client.publish(
f"rag:stream:{task_id}",
json.dumps({"event": "token", "data": chunk.content})
)
redis_client.publish(
f"rag:stream:{task_id}",
json.dumps({
"event": "done",
"sources": [doc["document"]["reference"] for doc in state["documents"]]
})
)Step 4 — Tokens flow from Redis to SSE to browser.
The FastAPI server was subscribed to that channel. Every publish in the worker flows through Redis pub/sub to the open SSE connection to the browser. The user sees text appearing word by word.
The reason Redis pub/sub is essential — not optional — is that the Celery worker and the FastAPI server are on different processes, potentially different machines. The worker has no reference to any HTTP connection. Redis is the shared bus they both see. Without it, the worker would publish tokens into a void.
Why this scales
The FastAPI servers are stateless. They hold no pipeline state, just open SSE connections. Each server instance can handle thousands of concurrent SSE connections because async I/O doesn't consume a thread per connection. You can run 50 of them behind a load balancer.
The Celery worker pool scales independently, driven by queue depth. When the task queue has 500 items, Kubernetes HPA spins up more worker pods. Workers are the expensive part — they call Qdrant, the reranker, the LLM. Scaling them on actual demand, not predicted CPU load, is how you control cost at scale.
This is precisely Perplexity's architecture. They built their retrieval layer on Vespa.ai for hybrid search at scale, but the request/worker/stream pattern is identical. By May 2025, they were handling 780 million monthly queries — a 239% increase from the year prior — on this fundamental architecture.
Observability — LangSmith for every node
Every node in your LangGraph graph should emit a trace. LangSmith provides this natively: you can see exactly which embedding model was used, what Qdrant returned, how many documents passed the reranker threshold, what prompt was constructed, and what the LLM generated — down to the millisecond.
This is how you find slow nodes in production. It's also how you build your evaluation dataset: export traces where the rerank scores were low, or where the user sent a follow-up question (a signal the first answer was unsatisfying), and use those as your hardest test cases.
Measuring whether your RAG is actually working — RAGAS
Adding a reranker and semantic cache improves the system. But "improves" is not a production metric. You need numbers — and those numbers need to come from the domain you're operating in, not a generic benchmark.
RAGAS (Retrieval-Augmented Generation Assessment) provides four reference-free metrics that work without manually labeled ground truth:
Faithfulness — does every claim in the generated answer appear in the retrieved context? This is your hallucination detector. A score below 0.85 means the model is confabulating from its weights, not from your documents.
Answer relevancy — does the answer actually address the question asked? A system can be faithful (all claims are in the context) but still irrelevant (it answered a different question).
Context precision — of the chunks passed to the LLM, what fraction were actually useful? Low precision means your reranker threshold is too permissive.
Context recall — were the chunks containing the correct answer actually retrieved? Low recall means your vector search or hybrid retrieval is missing relevant documents.
from ragas import evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision, ContextRecall
from langchain_openai import ChatOpenAI
evaluator_llm = ChatOpenAI(model="gpt-4o-mini")
results = evaluate(
dataset=your_test_dataset, # {question, contexts, answer, ground_truth}
metrics=[
Faithfulness(),
AnswerRelevancy(),
ContextPrecision(),
ContextRecall()
],
llm=evaluator_llm
)
# Example output:
# faithfulness: 0.93
# answer_relevancy: 0.88
# context_precision: 0.71
# context_recall: 0.84Building your evaluation dataset without manual labeling
RAGAS can generate test questions directly from your knowledge base:
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import DirectoryLoader
generator = TestsetGenerator.from_langchain(
generator_llm=ChatOpenAI(model="gpt-4o"),
critic_llm=ChatOpenAI(model="gpt-4o-mini"),
embeddings=your_embedding_model
)
testset = generator.generate_with_langchain_docs(
documents=your_chunks,
test_size=100, # 100 generated Q&A pairs
distributions={"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
)For an chatbot like Islamic, supplement generated questions with a manual set of known-hard cases: questions that use Arabic terms your embedding model may not handle well, questions requiring multi-hop reasoning across chapters, and questions where the correct answer explicitly says "this is not permitted" (negation handling is where many RAG systems silently fail).
The one number to track in production
If you can only monitor one metric after deployment, make it faithfulness on a rolling window of recent queries. A drop in faithfulness is the earliest signal that something has changed — whether that's a new document that conflicts with existing content, a retrieval failure that's sending the LLM irrelevant context, or a model update that changed generation behavior.
LangSmith integrates with RAGAS directly. Every trace can be evaluated against these metrics automatically and the results surface in your dashboard. The combination of node-level tracing (what happened) and RAGAS metrics (how well it worked) gives you complete observability over a system that would otherwise be a black box.
Incremental ingestion — only re-embed what actually changed
Every optimization above operates at query time. This one operates at ingestion time, and the savings compound with every document update.
The naive approach to updating a PDF source is to delete all its chunks and re-process the entire document. For a 200-chunk document, that's 200 embedding API calls regardless of how many chunks actually changed. In practice, a document revision often changes 5–15% of chunks — a corrected ruling, an added footnote, an updated reference.
The production approach: store a SHA-256 hash of each chunk's text at ingestion time. When a document is updated, diff the incoming chunks against stored hashes. Only chunks with changed content get re-embedded.
def _hash_text(self, text: str) -> str:
return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
async def ingest(self, pdf_id, source_name, chunks):
incoming = {
i: {"text": text, "hash": self._hash_text(text)}
for i, text in enumerate(chunks)
}
existing = self._fetch_existing_hashes(pdf_id)
# existing: {chunk_index: {hash, qdrant_id}}
to_embed = [
chunk for i, chunk in incoming.items()
if i not in existing or existing[i]["hash"] != chunk["hash"]
]
to_delete = [i for i in existing if i not in incoming]
# One batched embedding call for only the changed chunks
if to_embed:
embeddings = await batch_embedder.embed_many([c["text"] for c in to_embed])
# upsert into Qdrant, update MongoDB hashesOn a 200-chunk PDF where 10 chunks changed, this reduces embedding calls from 200 to 10. At scale, across frequent document updates, this is where real cost savings compound.
After ingestion, call semantic_cache.invalidate_by_source(source_name) to prevent stale cache hits from serving outdated content.
The numbers that matter for the blog
| Scenario | API calls | Time | Embedding cost | |---|---|---|---| | Naive full re-index (200 chunks) | 200 × 1 | ~10s | $0.004 | | Batch only (no diff) | 2 × 100 | ~1.2s | $0.004 | | Incremental + batch (10 changed) | 1 × 10 | ~0.2s | $0.0002 |
The cost column is the one that compounds. If you update documents weekly, incremental ingestion saves 95% of embedding costs indefinitely.
What the system looks like now
The LangGraph graph is largely unchanged structurally — the same nodes, the same routing logic. What changed is everything around it:
At retrieval: grade_documents → rerank_documents. 5 serial LLM calls → 1 batch cross-encoder call. Relevance scores went from meaningless cosine floats (0.18–0.23) to calibrated reranker scores (0.84, 0.61).
At the request boundary: Every query now checks the semantic cache before touching LangGraph. A 40% cache hit rate means 40% of queries return in under 10ms.
At the infrastructure layer: FastAPI returns 202 in 5ms. Celery workers run the graph. Redis pub/sub carries tokens back to SSE. The browser sees a streaming response regardless of where in the worker pool the query ran.
At ingestion: Content hashing means document updates re-embed only what changed. The semantic cache is invalidated for affected sources automatically.
At observability: LangSmith traces every node. RAGAS measures faithfulness and relevancy on a rolling sample. You have numbers, not intuitions.
The honest version of this post is that none of these are novel ideas. Two-stage retrieval with a cross-encoder is a 2019 technique from the MS MARCO paper. Semantic caching predates LLMs. SSE is older than most of the engineers building on it.
What makes them production-grade isn't invention — it's knowing which ones to apply, in which order, and why the others can wait.
The question rewriter node and the fiqh retrieval route were the right things to build first. These are the right things to build next. The Quran and Hadith routes, multi-tenant namespace isolation, and evaluation pipelines against Arabic NLP benchmarks — those come after you've confirmed the core loop is working at scale.
Measure faithfulness. Watch your rerank scores. Check your cache hit rate after a week of traffic. Those three numbers will tell you where to go next.
Subscribe to Updates
Get notified about new projects and articles.
Comments
Loading comments...