
GARAGe: how ada uses LLMs to make retrieval smarter

Christos Melidis
Staff Machine Learning Scientist

Everyone’s talking about how to make generative AI more accurate. Fewer hallucinations. Better answers. Faster response times. But here’s the thing: you can’t fix bad answers with better models alone. You need better inputs. And that starts with better retrieval.

At Ada, we don’t just build AI agents—we build the platform that empowers companies to deploy their own. That means we give businesses the tools to create, train, and manage agents that not only generate responses, but generate them with precision.

Retrieval is a critical part of that equation, because how you find, rank, and pass context to a model determines how effective the agent will be in production. Standard Retrieval-Augmented Generation (RAG) methods aren’t always enough. That’s where GARAGe comes in.

GARAGe (Generative-Augmented Retrieval-Augmented Generation) flips the RAG script. Instead of using retrieval to improve generation, we use generation to improve retrieval. Sounds recursive? It is. And it works.

Let’s break it down.

the problem: RAG isn’t always retrieval-ready

RAG is a technique where a model retrieves relevant content—like knowledge base articles or documents—to help generate more accurate, grounded responses.

Instead of relying solely on what the model already “knows,” RAG gives it the ability to look things up, using real, retrievable content as context. This dramatically reduces hallucinations and ensures answers are aligned with up-to-date information.

Traditionally, RAG systems rely heavily on embedding models to match customer queries with chunks of relevant content. But there’s a catch. Customer queries and knowledge base content don’t always align neatly:

  • Queries are short, articles are long.
  • Queries are in first person, articles are not.
  • Queries are casual, content is formal.
  • Queries expect coverage, content assumes structure.

This semantic mismatch leads to imperfect retrieval, and imperfect answers. Even the best model can’t perform if it’s handed the wrong context.

So we asked ourselves: what if we didn’t wait for the query? What if we could pre-generate the kinds of questions customers are likely to ask, and then pre-index their best answers?

That’s what GARAGe does.

What is GARAGe?

To understand what makes GARAGe different, it helps to look at how retrieval typically works in a standard RAG setup. RAG uses text embedding models to convert document content—split into smaller components called “chunks”—into vector representations. When a user asks a question, that question is also embedded, and the system retrieves the chunks with the most similar embeddings to use as context for the model’s response.

Retrieving chunks from a standard RAG setup. The query is embedded and used to retrieve the closest matching chunks from the index.
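To make that concrete, here’s a minimal sketch of the standard retrieval step. It’s an illustration rather than Ada’s production code, and `embed` stands in for whatever text embedding model is in use:

```python
# Minimal sketch of standard RAG retrieval (illustrative, not Ada's production code).
from typing import Callable, List
import numpy as np

def retrieve_chunks(
    query: str,
    chunks: List[str],
    chunk_vecs: np.ndarray,               # precomputed chunk embeddings, shape (n_chunks, dim)
    embed: Callable[[str], np.ndarray],   # placeholder for the embedding model
    top_k: int = 3,
) -> List[str]:
    """Embed the query and return the top_k most similar chunks."""
    q = embed(query)
    # Cosine similarity between the query and every chunk embedding.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-sims)[:top_k]
    return [chunks[i] for i in best]
```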

GARAGe takes a straightforward yet innovative approach inspired by dataset creation methodologies:

  1. Generate representative questions and topics targeting specific content chunks.
  2. Ask those questions back against the knowledge base to find each one’s optimal answer.
  3. Index the questions, and retrieve their corresponding answers at runtime.
  4. Align questions and inference queries to mirror each other in length, language, and perspective for better semantic matching.

Instead of using retrieval to help make generations better, GARAGe focuses on using generation to make retrieval better.

It’s a system that proactively imagines the universe of questions customers might ask, answers those questions offline, and builds a new retrieval index from that process.

Offline population of the GARAGe Index from a document. The document is fed to an LLM and question-answer pairs are generated. The questions are then embedded and inserted into the retrieval index, acting as reference points for retrieving their respective answers.
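Here’s a rough sketch of that offline step. The `generate_qa_pairs` call is a placeholder for an LLM prompt that produces question-answer pairs from a document, and `embed` for the embedding model; only the questions are embedded, and each one points back at its answer:

```python
# Sketch of offline GARAGe index population (function names here are placeholders).
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class GarageIndex:
    question_vecs: np.ndarray   # embedded synthetic questions, shape (n_questions, dim)
    answers: List[str]          # answers[i] is the chunk that answers question i

def build_garage_index(
    documents: List[str],
    generate_qa_pairs: Callable[[str], List[Tuple[str, str]]],  # doc -> [(question, answer), ...]
    embed: Callable[[str], np.ndarray],
) -> GarageIndex:
    questions, answers = [], []
    for doc in documents:
        for question, answer in generate_qa_pairs(doc):
            questions.append(question)
            answers.append(answer)
    # Only the questions go into the index; each embedding maps to its answer chunk.
    return GarageIndex(
        question_vecs=np.stack([embed(q) for q in questions]),
        answers=answers,
    )
```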

Retrieving chunks from a GARAGe setup. The query is embedded and used to retrieve both matching chunks and questions from the GARAGe Index. Matched questions are then mapped to their corresponding GARAGe chunks.

Why it works: The benefits of generative-first retrieval

GARAGe flips the traditional RAG workflow by using generation to improve retrieval—imagining questions, answering them offline, and indexing them for smarter runtime performance. This shift unlocks major gains in semantic accuracy, cost-efficiency, and scalability.

Here’s what makes it powerful:

1. two-hop retrieval

GARAGe enables a two-hop retrieval system where embeddings don't just link queries to chunks directly—they route them through embedded, representative questions. This indirect approach ensures greater semantic alignment, enhancing the accuracy and relevance of retrieved chunks. Think of it like a semantic translator between the customer and the content.

Here’s a simple breakdown of how the two-hop retrieval system works:

  • Match the incoming query to a synthetic question.
  • Retrieve the pre-selected chunk that best answers that question.

This indirection improves semantic alignment. It smooths over the language mismatch between natural user input and knowledge base formalities, giving the model a cleaner starting point.
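Continuing the `GarageIndex` sketch above, the runtime path might look like this (again, an illustration rather than Ada’s exact implementation):

```python
# Sketch of the two-hop lookup at query time.
import numpy as np

def retrieve_via_questions(query, index, embed, top_k=3):
    """Hop 1: match the query to the closest synthetic questions.
    Hop 2: return the answer chunks those questions point to."""
    q = embed(query)
    sims = index.question_vecs @ q / (
        np.linalg.norm(index.question_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    best_questions = np.argsort(-sims)[:top_k]
    return [index.answers[i] for i in best_questions]
```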

2. optimal semantic chunks

Chunking content for retrieval is notoriously tricky. When you split documents into arbitrary sections, it’s easy to lose context or break up meaning in ways that confuse the model. Traditional RAG systems rely heavily on getting these chunks just right—but that’s a fragile setup.

GARAGe breaks that dependency.

By working on the answers offline, we can restructure the contents of the knowledge base in ways that chunking methods cannot. We can focus on questions and answers instead of document sections, and GARAGe can combine content across sections—or even across documents—to generate semantically complete answers.

That makes the retrieval process less brittle and far more flexible.

3. offline preprocessing efficiency

Much of the computational effort and cost can be shifted offline during knowledge base ingestion and updates. Here’s how GARAGe does most of the heavy lifting offline:

  • Generate questions
  • Rank answers
  • Build the index

That means the runtime cost is significantly lower. When a query comes in, GARAGe only needs to run a single embedding comparison and fetch the result. This makes it ideal for high-volume, low-latency applications.
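Putting the earlier sketches together, a toy end-to-end run shows where the cost sits. The stubs below stand in for a real LLM and embedding model and are purely illustrative:

```python
# Toy run of the sketches above: everything heavy happens in build_garage_index
# (offline); the per-query path is one embedding call plus a similarity lookup.
import numpy as np

def stub_embed(text: str) -> np.ndarray:
    # Deterministic bag-of-characters vector; a real system would call an embedding model.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def stub_generate_qa_pairs(doc: str):
    # A real system would prompt an LLM; here we fake a single Q&A pair per document.
    return [("How do I reset my password?", doc)]

docs = ["To reset your password, open Settings and choose 'Reset password'."]

# Offline: build the index once, at ingestion or knowledge base update time.
index = build_garage_index(docs, stub_generate_qa_pairs, stub_embed)

# Runtime: a single embed + lookup per query.
print(retrieve_via_questions("how can I reset my password", index, stub_embed, top_k=1))
```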

4. feedback mechanisms

GARAGe isn’t static. It continuously improves via real-world feedback. AI Managers can take queries seen in production, map them to the correct answer, and have them added to the GARAGe index automatically. From then on, similar questions retrieve the correct documents. Over time, this creates a virtuous cycle: the more the system is used, the better its retrieval gets.
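Mechanically, folding that feedback in can be as simple as indexing the real query as a new “question” that points at the verified answer. A sketch, reusing the `GarageIndex` structure from above:

```python
# Sketch of adding a reviewed production query to the index.
import numpy as np

def add_feedback(index, production_query, correct_answer, embed):
    """Index the real query as a new question pointing at the verified answer,
    so future similar queries retrieve the right chunk directly."""
    q_vec = embed(production_query)[None, :]                      # shape (1, dim)
    index.question_vecs = np.vstack([index.question_vecs, q_vec])
    index.answers.append(correct_answer)
    return index
```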

5. negative indexing: what isn’t covered?

This one’s unique: GARAGe doesn’t just index what it knows—it also tracks what it doesn’t.

When a customer query doesn’t closely match any of the pre-generated questions in the index, GARAGe flags it as a potential knowledge gap. These “misses” aren’t ignored—they’re collected, reviewed, and used to improve the system. That might mean creating a new question-answer pair, updating outdated content, or identifying areas where the knowledge base needs expansion.

This process turns retrieval failures into insight, enabling proactive content improvements, reducing the risk of hallucinations, and ultimately helping AI agents build and maintain trust with users.
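One way to implement this check is to compare each incoming query against every indexed question and flag it when nothing comes close. A sketch, with an illustrative threshold rather than a tuned one:

```python
# Sketch of knowledge-gap detection; the threshold is illustrative, not from the article.
import numpy as np

def detect_knowledge_gap(query, index, embed, min_similarity=0.6):
    """Return a gap record if no indexed question is sufficiently close to the query."""
    q = embed(query)
    sims = index.question_vecs @ q / (
        np.linalg.norm(index.question_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    if sims.max() < min_similarity:
        return {"query": query, "best_similarity": float(sims.max()), "status": "potential_gap"}
    return None  # covered by the existing index
```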

Measuring success: how GARAGe performs in the real world

So, how does this generative-first approach hold up in practice? To evaluate GARAGe, we focused on two key metrics:

  • Mean Average Semantic Similarity (MASS): Measures how close the retrieved context is to the ideal context for a given question, penalizing context that is either too short or too long. This keeps the score from being inflated by simply pulling in more of the knowledge base (a sketch of a MASS-style score follows this list).
  • Title Match: Checks whether the retrieved chunk came from the correct knowledge base article. This measure alone could be inflated by retrieving more chunks, which is why MASS is needed alongside it.
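The exact penalty terms in MASS are internal to Ada, but a rough sketch of a MASS-style score is just the average embedding similarity between the retrieved context and the ideal context for each evaluation question; that similarity drops whenever retrieval brings back too little, or too much, material:

```python
# Rough sketch of a MASS-style score (not Ada's exact formula).
import numpy as np

def mass_score(retrieved_contexts, ideal_contexts, embed):
    """Average cosine similarity between retrieved and ideal context across an eval set."""
    scores = []
    for retrieved, ideal in zip(retrieved_contexts, ideal_contexts):
        r, g = embed(retrieved), embed(ideal)
        scores.append(float(r @ g / (np.linalg.norm(r) * np.linalg.norm(g) + 1e-9)))
    return float(np.mean(scores))
```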

When we started, there was clear room for improvement. Our baseline MASS score was 14%, and Title Match sat at 80%. That meant a lot of the context our agents were using wasn’t just imperfect—it was often from the wrong place entirely.

GARAGe changed that. Here’s how:

stage 1: generate and index smarter chunks

We began by proactively generating and indexing additional, more contextually relevant chunks. That alone gave us a considerable lift in precision:

  • MASS jumped from 14% to 25%
  • Title Match improved slightly from 80% to 82%

These improvements showed that GARAGe was effective at reducing irrelevant context and slightly improving article accuracy.

stage 2: rerank dynamically at runtime

Next, we introduced a dynamic reranking of retrieved chunks—choosing the best matching chunks at the time of the query, not just relying on the static index. This selective approach led to substantial enhancements:

  • MASS surged again to 33%
  • Title Match climbed to 88%

Reranking demonstrated its value in dynamically filtering and refining context, resulting in fewer irrelevant chunks and a much higher likelihood of retrieving the correct article.
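One common way to implement this kind of runtime reranking is with a relevance model, such as a cross-encoder, that scores each (query, chunk) pair. The article doesn’t specify Ada’s scorer, so `score_pair` below is a placeholder:

```python
# Sketch of dynamic reranking at query time.
from typing import Callable, List

def rerank(
    query: str,
    candidate_chunks: List[str],
    score_pair: Callable[[str, str], float],   # placeholder relevance scorer
    top_k: int = 3,
) -> List[str]:
    """Re-order the retrieved candidates for this specific query and keep the best few."""
    ranked = sorted(candidate_chunks, key=lambda chunk: score_pair(query, chunk), reverse=True)
    return ranked[:top_k]
```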

stage 3: make the context even tighter

Finally, we shortened chunk lengths and added summaries of preceding context. That helped the agent zero in on exactly the right information.

  • MASS hit 36%
  • Title Match rose to 89%

Shorter chunks accompanied by contextual summaries proved especially beneficial in pinpointing precise information and ensuring correct article retrieval.
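A sketch of that idea: prepend a short summary of everything that came before each chunk before it’s embedded, with `summarize` standing in for an LLM summarization call (the exact prompts and chunk sizes are assumptions, not Ada’s settings):

```python
# Sketch of pairing shorter chunks with summaries of their preceding context.
from typing import Callable, List

def contextualize_chunks(chunks: List[str], summarize: Callable[[str], str]) -> List[str]:
    """Shorter chunks lose surrounding context, so carry a one-line summary
    of everything that came before each one."""
    contextualized = []
    preceding = ""
    for chunk in chunks:
        summary = summarize(preceding) if preceding else ""
        contextualized.append((summary + "\n" + chunk).strip())
        preceding += "\n" + chunk
    return contextualized
```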

These may look like incremental gains, but in the world of generative AI, they’re significant. We more than doubled semantic relevance and significantly improved retrieval precision. And because these improvements happen before the model even starts generating, every response gets better without adding latency or compute costs at runtime.

Ablation analysis of the performance gains from the GARAGe pipeline. Each stage contributes to increases in both Title Match and MASS, whether retrieving 3 or 5 items from the index.

Conclusion

GARAGe is more than a clever acronym. It’s a fundamental rethinking of retrieval for LLM-based systems. By flipping the traditional RAG approach on its head, Ada's ML team has found a way to make generative models not just smarter, but also more grounded, more scalable, and more efficient.

Because the best answers don’t just come from the best models. They come from the best context.
