Debug a slow RAG pipeline with nested traces
The Currai team, Engineering — May 18, 2026
"The chatbot feels slow" is the least actionable bug report in the world. A retrieval-augmented answer is at least three steps — embed the query, search the vector store, call the model — and any one of them can be the culprit. Nested traces turn that vague complaint into a timeline you can point at.
Instrument each step as its own span
The trick is to wrap every stage, not just the model call. Each span records its own start and end, so the trace becomes a waterfall.
Read the waterfall
Open the trace and the timing of each span is laid out end to end. Now "it's slow" resolves to one of a few concrete stories:
- Search dominates. Your vector store is the bottleneck: check the index, the k value, or whether you're searching cold storage.
- The generation dominates but tokens are huge. You're stuffing too many documents into the prompt. The model is slow because the prompt is fat.
- Embedding dominates. You're paying a network round-trip per query — batch or cache it.
- The gaps between spans dominate. The time isn't in any step; it's in your own glue code between them.
That last one is the case logs almost never reveal, because the slow part is the time spent between instrumented calls, not inside any single one.
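The same reading can be done in code. The sketch below assumes a span is a plain dictionary with `name`, `start`, and `end` fields (a hypothetical shape chosen for this example) and that child spans run one after another without overlapping. The timings are made up for illustration and are not measurements.

```python
def summarize(root, children):
    """Return total seconds, seconds per child span, and uncovered gap time.

    Assumes the children run sequentially and do not overlap.
    """
    total = root["end"] - root["start"]
    per_span = {c["name"]: c["end"] - c["start"] for c in children}
    gaps = total - sum(per_span.values())
    return total, per_span, gaps

def dominant(root, children):
    """Name the largest contributor, treating gap time as its own entry."""
    _, per_span, gaps = summarize(root, children)
    candidates = dict(per_span, gaps=gaps)
    return max(candidates, key=candidates.get)

# Made-up timings (seconds) for illustration only.
root = {"name": "rag_request", "start": 0.0, "end": 2.0}
children = [
    {"name": "embed", "start": 0.0, "end": 0.1},
    {"name": "search", "start": 0.5, "end": 0.75},
    {"name": "generate", "start": 1.5, "end": 2.0},
]

print(dominant(root, children))
```

In this invented trace the three stages account for well under half of the request, so the largest entry is the gap time, which is the glue-code story from the list above.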
Tighten the loop
Once you can see the bottleneck, the fix is usually small. Cut k from 8 to 4
and watch the prompt shrink. Cache the query embedding for repeated questions.
Move the vector store to the same region as your app. After each change, compare
the new trace against the slow one — the waterfall makes the improvement obvious.
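One of these fixes, caching the query embedding, can be sketched with the standard library's `functools.lru_cache`. The `embed_remote` function below is a hypothetical placeholder for a network call to an embedding model, and the cache size and normalization rule are assumptions chosen for illustration.

```python
from functools import lru_cache

calls = {"embed": 0}  # counts trips to the (pretend) embedding service

def embed_remote(text):
    """Placeholder for a network call to an embedding model."""
    calls["embed"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])

@lru_cache(maxsize=1024)
def embed_cached(normalized):
    # Only runs on a cache miss; repeated questions skip the round-trip.
    return embed_remote(normalized)

def embed_query(question):
    # Normalize so trivially different spellings share one cache entry.
    return embed_cached(" ".join(question.lower().split()))
```

Returning a tuple keeps the cached value immutable, so one caller cannot accidentally alter the embedding another caller receives. An in-process cache like this only helps within a single worker; a shared cache would be a separate design decision.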
Keep the spans in production
It's tempting to add this instrumentation only while debugging and rip it out after. Don't. Retrieval gets slower as your corpus grows, and prompts creep upward as features are added. Leaving the spans in means the next slowdown shows up as a shifting waterfall on a dashboard — long before a user files the vague report that started this whole exercise.