May 18, 2026

Debug a slow RAG pipeline with nested traces

A RAG answer that takes four seconds could be slow retrieval, a fat prompt, or the model itself. Nested traces tell you which one — here's how to find the bottleneck.

Tutorial · 6 min read · The Currai team / Engineering

"The chatbot feels slow" is the least actionable bug report in the world. A retrieval-augmented answer is at least three steps — embed the query, search the vector store, call the model — and any one of them can be the culprit. Nested traces turn that vague complaint into a timeline you can point at.

Instrument each step as its own span

The trick is to wrap every stage, not just the model call. Each span records its own start and end, so the trace becomes a waterfall.

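# Assumes currai, embedder, vector_store, build_prompt and question already exist.
trace = currai.trace(name="rag-answer", user_id="user-1")

# Step 1: embed the query.
embed = trace.span(name="embed-query", input={"query": question})
qvec = embedder.embed(question)
embed.end(output={"dims": len(qvec)})

# Step 2: search the vector store.
search = trace.span(name="vector-search", input={"k": 8})
docs = vector_store.search(qvec, k=8)
search.end(output={"doc_ids": [d.id for d in docs]})

# Step 3: call the model inside the generation span, so the span times the call.
prompt = build_prompt(docs)
gen = trace.generation(name="answer", model="gpt-4o", input=prompt)
answer, usage = call_model(prompt)  # call_model is a placeholder for your own model client
gen.end(output=answer, usage=usage)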
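One weakness of explicit end() calls is that an exception raised between span() and end() leaves the span unclosed, so the step that failed is exactly the one missing from the waterfall. A small context manager is one way to guard against that. The sketch below assumes only what the snippet above shows, namely trace.span(name=..., input=...) and span.end(output=...); traced_span, FakeSpan and FakeTrace are hypothetical names used here so the example runs without the SDK.

```python
from contextlib import contextmanager


@contextmanager
def traced_span(trace, name, **span_kwargs):
    """Open a span on `trace` and close it even if the wrapped code raises."""
    span = trace.span(name=name, **span_kwargs)
    output = {}  # the wrapped code fills this in; it is handed to span.end()
    try:
        yield output
    except Exception as exc:
        output["error"] = repr(exc)  # keep the failure visible in the trace
        raise
    finally:
        span.end(output=output)


# Stand-ins so the sketch runs on its own (hypothetical, not Currai's classes).
class FakeSpan:
    def __init__(self, name, log):
        self.name, self.log = name, log

    def end(self, output=None):
        self.log.append((self.name, output))


class FakeTrace:
    def __init__(self):
        self.ended = []

    def span(self, name, **kwargs):
        return FakeSpan(name, self.ended)


trace = FakeTrace()

with traced_span(trace, "vector-search", input={"k": 8}) as out:
    out["doc_ids"] = ["a", "b"]

try:
    with traced_span(trace, "embed-query") as out:
        raise TimeoutError("embedder unreachable")
except TimeoutError:
    pass  # the span was still closed, with the error recorded
```

With a real trace object you would pass it in place of FakeTrace; whether the SDK already ships a context-manager form is something to check in its documentation.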

Read the waterfall

Open the trace and the timing of each span is laid out end to end. Now "it's slow" resolves to one of a few concrete stories:

Search dominates. Your vector store is the bottleneck — check the index, the k value, or whether you're searching cold storage.
Generation dominates and the token count is huge. You're stuffing too many documents into the prompt; the model is slow because the prompt is fat. If generation dominates on a modest prompt, the model itself is the slow part.
Embedding dominates. You're paying a network round-trip per query — batch or cache it.
The gaps between spans dominate. The time isn't in any step; it's in your own glue code between them.

That last case is the one logs almost never reveal, because the time is lost between the calls you log, not inside any one of them.
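The same reading can be done in code if you export span timings. The sketch below assumes each span comes with a start and end offset in seconds from the start of the trace, and that the steps run one after another rather than in parallel. SpanTiming, summarize_waterfall and bottleneck are hypothetical names, and the numbers are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class SpanTiming:
    name: str
    start: float  # seconds since the trace began
    end: float


def summarize_waterfall(spans, trace_end=None):
    """Return seconds spent in each span, plus the untraced time between them."""
    ordered = sorted(spans, key=lambda s: s.start)
    breakdown = {s.name: s.end - s.start for s in ordered}

    gaps = 0.0
    cursor = 0.0  # end of the last traced work seen so far
    for s in ordered:
        gaps += max(0.0, s.start - cursor)
        cursor = max(cursor, s.end)
    if trace_end is not None:
        gaps += max(0.0, trace_end - cursor)

    breakdown["(gaps)"] = gaps
    return breakdown


def bottleneck(breakdown):
    """Name of the entry that accounts for the most time."""
    return max(breakdown, key=breakdown.get)


# A four-second answer where no single step is the problem.
spans = [
    SpanTiming("embed-query", 0.0, 0.25),
    SpanTiming("vector-search", 0.5, 1.0),
    SpanTiming("answer", 3.0, 4.0),
]
breakdown = summarize_waterfall(spans, trace_end=4.0)
print(breakdown)
print("bottleneck:", bottleneck(breakdown))
```

Here the three spans add up to 1.75 seconds and the gaps to 2.25, so the largest entry is the glue code between search and generation rather than any instrumented step.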

Tighten the loop

Once you can see the bottleneck, the fix is usually small. Cut k from 8 to 4 and watch the prompt shrink. Cache the query embedding for repeated questions. Move the vector store to the same region as your app. After each change, compare the new trace against the slow one — the waterfall makes the improvement obvious.
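Caching the query embedding can be as small as a memoized function. This is a minimal sketch using the standard library's functools.lru_cache; CountingEmbedder is a stand-in for your real embedder, and normalize, cached_embed and the cache size of 1024 are assumptions to adapt. Note that normalizing changes the text the embedder sees, so only normalize in ways your embedding model is insensitive to.

```python
from functools import lru_cache


class CountingEmbedder:
    """Stand-in embedder that counts calls (hypothetical, for illustration)."""

    def __init__(self):
        self.calls = 0

    def embed(self, text):
        self.calls += 1
        return [float(len(text)), 1.0]


embedder = CountingEmbedder()


def normalize(question):
    """Collapse whitespace and case so trivially different questions share a key."""
    return " ".join(question.lower().split())


@lru_cache(maxsize=1024)
def _embed_normalized(text):
    # Return a tuple so callers cannot mutate the cached vector.
    return tuple(embedder.embed(text))


def cached_embed(question):
    return _embed_normalized(normalize(question))


v1 = cached_embed("How do I reset my password?")
v2 = cached_embed("  how do I reset my   password? ")
print(embedder.calls)  # the second call is served from the cache
```

In the trace, a cache hit should show up as an embed-query span that shrinks to almost nothing, which is the comparison against the slow trace that confirms the fix.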

Keep the spans in production

It's tempting to add this instrumentation only while debugging and rip it out after. Don't. Retrieval gets slower as your corpus grows, and prompts creep upward as features are added. Leaving the spans in means the next slowdown shows up as a shifting waterfall on a dashboard — long before a user files the vague report that started this whole exercise.
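One way to turn "a shifting waterfall" into something a dashboard or alert can act on is to compare a per-span percentile against a baseline. The sketch below assumes you can export (span name, duration in seconds) pairs from your traces; p95_by_span, regressions and the 1.5x tolerance are hypothetical choices, and the durations are illustrative rather than measured.

```python
from collections import defaultdict
from statistics import quantiles


def p95_by_span(records):
    """records: iterable of (span_name, duration_seconds) pairs."""
    by_name = defaultdict(list)
    for name, duration in records:
        by_name[name].append(duration)
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    # It needs at least two samples, so sparser spans are skipped.
    return {
        name: quantiles(durations, n=20)[-1]
        for name, durations in by_name.items()
        if len(durations) >= 2
    }


def regressions(baseline, current, tolerance=1.5):
    """Span names whose p95 grew past tolerance times the baseline p95."""
    return sorted(
        name
        for name, p95 in current.items()
        if name in baseline and p95 > tolerance * baseline[name]
    )


# Illustrative data: search has slowed for half of requests, generation barely moved.
last_month = [("vector-search", 0.2)] * 20 + [("answer", 1.0)] * 20
this_week = (
    [("vector-search", 0.2)] * 10
    + [("vector-search", 0.9)] * 10
    + [("answer", 1.1)] * 20
)

slower = regressions(p95_by_span(last_month), p95_by_span(this_week))
print(slower)
```

A percentile is used instead of the mean because slowdowns from corpus growth tend to hit the slowest requests first; the exact threshold is a judgment call for your traffic.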
