From my master’s thesis in Artificial Intelligence.

Abstract

Retrieval-augmented generation (RAG) is widely used to ground large language models in external knowledge by retrieving relevant documents from a knowledge base and using them as context. The quality of the response depends on the quality of the retrieval step, since the generator cannot ground its answer in evidence the retriever did not surface. Agentic retrieval extends RAG by adding a language model to the retrieval process itself, letting it grade the retrieved documents, rewrite the query when retrieval fails, and reorder the final ranking. These extra calls aim to improve retrieval performance, but add latency and token cost. This thesis tests whether agentic retrieval is worth the extra cost in a technical-support setting, where retrieval failures are common and operational cost matters. Three retrieval families (BM25, dense, and hybrid) are evaluated on the WixQA benchmark in both agentic and non-agentic configurations, with each agentic pipeline compared against the strongest non-agentic alternative in the same family rather than against weaker baselines. On the hybrid retriever, agentic retrieval lifts nDCG@10 over this baseline by 13.08 percent on expert-written queries and 11.95 percent on simulated queries. A per-query breakdown shows the listwise reranker driving both the gain and the regressions, since it runs unconditionally and can reorder a correct ranking into a worse one. The pipeline adds two to eight seconds of latency per query and roughly 16,000 language-model tokens.

Keywords: Retrieval-augmented generation, Agentic retrieval, Information retrieval, Technical support, Reranking
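
For readers unfamiliar with the metric reported in the abstract, here is a minimal sketch of nDCG@k using the common linear-gain, log2-discount formulation. The function names and the toy relevance lists are illustrative assumptions, not taken from the thesis code, and the thesis may use a different gain function or evaluation library.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades.

    relevances[i] is the grade of the document at rank i + 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the produced ranking divided by the ideal DCG.

    all_relevances holds the grades of every judged document for the
    query, so relevant documents the retriever missed still count
    against the score.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy query with a single relevant document (hypothetical data).
judged = [1, 0, 0]
correct_ranking = [1, 0, 0]   # relevant document at rank 1
reordered_worse = [0, 0, 1]   # same documents, relevant one pushed to rank 3

score_before = ndcg_at_k(correct_ranking, judged)
score_after = ndcg_at_k(reordered_worse, judged)
```

In this toy case the score drops from 1.0 to 0.5 when the relevant document moves from rank 1 to rank 3, which is the kind of per-query regression an unconditional reranker can introduce.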

Read the full study and find the code and experiments here: https://github.com/MaiHenry/evaluating-agentic-retrieval