Optimizing Query Context Handling in RAG Systems: Embedding vs. LLM Rewriting

Overview
Efficiently handling follow-up or context-dependent questions is crucial for building high-performing Retrieval-Augmented Generation (RAG) systems. Users often provide only partial or ambiguous queries—particularly in conversational or UI-driven contexts—making it essential to supply the model with sufficient background information. This guide explores two main strategies to address this challenge:
- Embedding the Full Query Context
- LLM-Based Query Rewriting
We’ll examine the advantages, limitations, and practical use cases of each, as well as a hybrid approach that can balance speed and accuracy.
Strategy 1: Embedding the Full Query Context
How It Works
This method combines all past user interactions with the latest query into a single text string before generating an embedding. For example:
- QUERY1: How should I write my project proposal?
- QUERY2: What structure should the proposal follow?
- QUERY3: Do I need to add references?
These three queries are concatenated into a single block of text, which is then converted into an embedding (using a model such as a sentence transformer or OpenAI embeddings) and used to search a vector store (e.g., FAISS).
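As a concrete illustration, here is a minimal Python sketch of this approach. It assumes the sentence-transformers and faiss packages; the model name, the toy random index, and the sample conversation are placeholders for your own pipeline.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model; any text-embedding model works here.
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

# Sample conversation history plus the latest query.
history = [
    "How should I write my project proposal?",
    "What structure should the proposal follow?",
]
latest_query = "Do I need to add references?"

# Concatenate everything into one block of text and embed it.
full_context = "\n".join(history + [latest_query])
query_vector = model.encode([full_context]).astype("float32")

# Toy FAISS index over random vectors, standing in for real document embeddings.
index = faiss.IndexFlatL2(384)
index.add(np.random.rand(100, 384).astype("float32"))

distances, doc_ids = index.search(query_vector, 5)
print(doc_ids[0])  # IDs of the 5 nearest documents
```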
Advantages
- Efficiency: Avoids additional LLM calls, thus reducing latency.
- Simplicity: Easy to implement in existing retrieval pipelines.
- Context Preservation: Maintains the natural progression of conversation history.
Limitations
- Semantic Dilution: Multiple combined queries can dilute the specific intent of the latest question.
- Ambiguity Handling: Vague references (“this,” “that,” etc.) may remain unresolved.
- Noise Sensitivity: Off-topic or irrelevant content in the history can negatively impact retrieval.
Strategy 2: LLM-Based Query Rewriting
How It Works
Instead of embedding the raw conversation history, you first prompt an LLM to rewrite the user’s latest query into a fully self-contained question. For example:
Previous Questions:
- How should I write my project proposal?
- What structure should the proposal follow?
Current Question:
- Do I need to add references?
Instruction:
Rewrite the current question into a standalone, context-rich query.
The LLM might respond:
“Do I need to include references in my project proposal?”
Only this rewritten query is then embedded and used to query your vector database.
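For illustration, here is a minimal sketch of the rewriting step using the OpenAI Python SDK. The model name, the prompt wording, and the rewrite_query helper are illustrative assumptions, not a prescribed implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_query(history: list[str], latest_query: str) -> str:
    """Ask the LLM to turn the latest query into a standalone question."""
    prompt = (
        "Previous questions:\n"
        + "\n".join(f"- {q}" for q in history)
        + f"\n\nCurrent question:\n- {latest_query}\n\n"
        "Rewrite the current question as a single standalone, "
        "context-rich query. Return only the rewritten query."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic rewrites are easier to debug
    )
    return response.choices[0].message.content.strip()

history = [
    "How should I write my project proposal?",
    "What structure should the proposal follow?",
]
standalone = rewrite_query(history, "Do I need to add references?")
# `standalone` (not the raw history) is what gets embedded and searched.
```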
Advantages
- Disambiguation: Resolves unclear or informal prompts by adding explicit context.
- Improved Retrieval Accuracy: Focuses on the core intent of the question.
- User-Friendly: Ideal for conversational interfaces where follow-up prompts may be minimal or ambiguous.
Limitations
- Additional Latency: Each rewrite requires an extra LLM call, adding cost and processing time.
- Potential Over-Correction: Poorly tuned prompts may lead to rewriting with unintended details.
- Implementation Complexity: Requires integrating another LLM call and related logic into your pipeline.
When to Use Each Strategy
| Scenario | Full Context Embedding | LLM Rewriting |
| --- | --- | --- |
| Low-latency applications | ✅ Optimal | ❌ May add delay |
| Well-formed, coherent queries | ✅ Suitable | ❌ Unnecessary |
| Short, vague user inputs | ⚠️ Potential issues | ✅ Highly effective |
| High-accuracy retrieval requirements | ❌ May dilute intent | ✅ Offers clarity |
| Conversational UIs with multiple follow-up questions | ⚠️ Context can degrade | ✅ Clarifies ambiguity |
| Cost-sensitive environments | ✅ Lower cost | ❌ Additional expense |
A Hybrid Approach: Combining the Best of Both Worlds
Many real-world scenarios benefit from a hybrid strategy:
- Primary Path: Embed the full query context as the default approach to keep latency and costs low.
- Fallback Mechanism: If the system detects a poor match or low-confidence retrieval, invoke the LLM to rewrite the query. This second pass can clarify the request and lead to better results.
By blending speed (embedding) with precision (LLM rewriting), you can handle a broad range of user queries effectively.
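As a sketch of what this fallback logic can look like, the snippet below assumes embed, search, and rewrite_query helpers wrapping the components from the earlier examples, search results that carry a similarity score, and a hypothetical SCORE_THRESHOLD you would tune for your embedding model.

```python
SCORE_THRESHOLD = 0.6  # hypothetical cutoff; tune against your own retrieval data

def retrieve(history: list[str], latest_query: str, k: int = 5):
    # Primary path: cheap full-context embedding, no extra LLM call.
    full_context = "\n".join(history + [latest_query])
    hits = search(embed(full_context), k=k)  # assumed helpers from earlier sketches

    # Fallback: if the best match is weak, rewrite the query and retry.
    if not hits or hits[0].score < SCORE_THRESHOLD:
        standalone = rewrite_query(history, latest_query)
        hits = search(embed(standalone), k=k)
    return hits
```

The threshold controls how often you pay for the second pass: a lower value keeps latency and cost down, while a higher one favors retrieval accuracy.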
Conclusion
The choice between embedding the full context and using an LLM to rewrite the query depends on your specific requirements for speed, accuracy, and cost. Embedding the entire conversation is simple and low-latency but can dilute focus on the latest query. LLM rewriting excels in clarifying ambiguous prompts but introduces added cost and complexity. A hybrid strategy often delivers the best balance, harnessing both speed and precision to maintain robust performance.
If you need guidance on implementing or fine-tuning these strategies, feel free to get in touch!