Frequently asked questions
What is eval-driven development for RAG?
It is a development approach where you define measurable tests for retrieval and answer quality before and during implementation, then use those evals to guide each release.
Why is eval-driven development important for RAG in 2026?
RAG systems now power customer support, internal knowledge search, and agent workflows, so teams need repeatable ways to catch hallucinations, poor retrieval, and regressions before users do.
What should a RAG eval set include?
A practical eval set should include real user questions, expected source documents, answer quality criteria, failure cases, and edge cases such as ambiguous queries or missing context.
How do teams in Indonesia apply this approach?
Teams in Jakarta and across Indonesia often start with a small, high-value use case, define a gold set from local documents or support tickets, and iterate on retrieval, prompts, and ranking based on measured results.
Can eval-driven development guarantee a safe or correct RAG system?
No. It reduces risk and improves confidence, but you still need human review, production monitoring, and domain experts for sensitive use cases.
What is eval-driven development for RAG?
Eval-driven development means you design measurable checks before you scale a retrieval-augmented generation system. Instead of asking, “Does this feel better?”, you ask, “Did retrieval improve, did answer quality rise, and did failure rates drop on a known test set?”
For RAG features, that matters because the system has multiple moving parts: query rewriting, chunking, embedding choice, vector search, reranking, prompt design, and the final generation step. A change in any one layer can improve one metric while breaking another. Eval-driven development gives you a way to see those tradeoffs early.
In 2026, this approach is becoming common practice for funded startups and enterprises building AI products, whether in Jakarta, elsewhere in Indonesia, or in global markets. Teams no longer want a demo that works once. They want a release process that shows, with evidence, whether a RAG feature is actually better.
Why RAG needs evals more than traditional features
Traditional software features usually have deterministic outputs. If a payment button is clicked, you can verify whether the transaction started. RAG is different. The system may answer correctly, partially correctly, or confidently incorrectly, and the quality depends on both the retrieved context and the generated response.
That makes manual review alone too slow and too subjective. You need a mix of automated and human evals to answer questions like:
- Did the retriever fetch the right documents?
- Did the model use those documents instead of inventing facts?
- Did the answer stay grounded in the source material?
- Did latency or cost rise too much after the latest change?
- Did the system become worse for Indonesian-language queries, code-mixed queries, or long enterprise documents?
If you are shipping a support assistant, an internal policy assistant, or a sales enablement copilot, these questions directly affect trust and adoption.
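The last question in the list above, whether quality dropped for Indonesian-language or code-mixed queries, is easier to answer when scores are broken down by slice instead of averaged over the whole set. The following is a minimal sketch, assuming each eval result is a dictionary with a hypothetical `language` tag and a numeric `score`; neither field name comes from a specific library.

```python
from collections import defaultdict
from typing import Dict, List


def score_by_slice(results: List[dict], slice_key: str = "language") -> Dict[str, float]:
    """Average the per-item score within each slice (for example, per language)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for result in results:
        buckets[result[slice_key]].append(result["score"])
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}


# Hypothetical results: 1.0 means the item passed, 0.0 means it failed.
results = [
    {"language": "en", "score": 1.0},
    {"language": "en", "score": 1.0},
    {"language": "id", "score": 1.0},
    {"language": "id", "score": 0.0},
    {"language": "mixed", "score": 0.0},
]

for language, mean_score in sorted(score_by_slice(results).items()):
    print(f"{language}: {mean_score:.2f}")
```

An overall average of 0.60 would hide the fact that the mixed-language slice fails entirely in this toy data, which is the kind of gap a sliced report makes visible.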
What should you measure in a RAG eval?
A useful eval suite usually covers three layers: retrieval, generation, and system behavior.
Retrieval metrics
These tell you whether the right information was found.
- Recall@k: Did the relevant document appear in the top k results?
- Precision@k: How many of the returned chunks were actually useful?
- MRR (mean reciprocal rank) or nDCG (normalized discounted cumulative gain): How close to the top were the most relevant results ranked?
- Context coverage: Did the retrieved chunks contain enough evidence to answer the query?
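The first three metrics above can be computed directly from a ranked list of document IDs and a set of IDs labeled as relevant. This is a minimal sketch with hypothetical document IDs. It assumes precision is divided by k (one common convention) and that recall is the fraction of relevant documents found, which reduces to a hit-or-miss check when each query has a single expected document.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots that hold a relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked output for one query, plus its labeled relevant documents.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(retrieved, relevant, k=3))     # 0.5: one of two relevant docs is in the top 3
print(precision_at_k(retrieved, relevant, k=3))  # one relevant doc in three slots
print(reciprocal_rank(retrieved, relevant))      # 0.5: first relevant doc is at rank 2
```

MRR is the mean of `reciprocal_rank` over all queries in the eval set.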
Generation metrics
These tell you whether the answer was grounded and useful.
- Faithfulness: Is the answer supported by the retrieved context?
- Answer relevance: Does it actually answer the question?
- Completeness: Does it cover the important parts?
- Citation quality: Are references accurate and traceable?
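Faithfulness and relevance usually need a human or model-based grader, but part of citation quality can be checked mechanically. The sketch below assumes answers cite sources with bracketed IDs such as `[policy-7]`. That format, the IDs, and the function name are illustrative assumptions, not a standard. It only verifies that every cited ID was actually retrieved; it does not verify that the cited text supports the claim.

```python
import re
from typing import Dict, Set

# Assumed citation format: an ID in square brackets, e.g. "[policy-7]".
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def citation_report(answer: str, retrieved_ids: Set[str]) -> Dict[str, object]:
    """Compare the IDs cited in an answer against the IDs that were retrieved."""
    cited = set(CITATION_PATTERN.findall(answer))
    valid = cited & retrieved_ids
    return {
        "cited": sorted(cited),
        "unsupported": sorted(cited - retrieved_ids),
        # Share of cited IDs that point at a retrieved document.
        "citation_precision": len(valid) / len(cited) if cited else 0.0,
    }


answer = "Refunds take 5 working days [policy-7]. Transfer fees are waived [faq-99]."
report = citation_report(answer, retrieved_ids={"policy-7", "faq-2"})
print(report)
```

Here `faq-99` is flagged as unsupported because it was never part of the retrieved context, which is a signal worth reviewing by hand.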
System metrics
These tell you whether the feature is viable in production.
- Latency end to end
- Token and inference cost
- Refusal rate on unsafe or unsupported queries
- Escalation rate to human support
- User satisfaction or task success
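Several of these system metrics can be summarized from per-request logs. The following sketch assumes each logged run carries hypothetical `latency_s`, `cost_usd`, and `refused` fields, and uses a nearest-rank percentile so that tail latency is reported alongside the median.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_runs(runs: List[dict]) -> Dict[str, float]:
    """Aggregate latency, cost, and refusal rate over a batch of logged runs."""
    latencies = [run["latency_s"] for run in runs]
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "mean_cost_usd": sum(run["cost_usd"] for run in runs) / len(runs),
        "refusal_rate": sum(1 for run in runs if run["refused"]) / len(runs),
    }


# Hypothetical log records for four requests.
runs = [
    {"latency_s": 1.2, "cost_usd": 0.004, "refused": False},
    {"latency_s": 0.9, "cost_usd": 0.003, "refused": False},
    {"latency_s": 4.6, "cost_usd": 0.009, "refused": True},
    {"latency_s": 2.1, "cost_usd": 0.004, "refused": False},
]
print(summarize_runs(runs))
```

Reporting p95 as well as the median matters because a few slow requests can break a latency target that the average comfortably meets.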
A common mistake is focusing only on answer quality. If retrieval is weak, prompt tuning can mask the problem temporarily, but it will not fix the system.
How do you build an eval set for a RAG feature?
Start with real user intent. Pull questions from support tickets, sales conversations, internal search logs, or onboarding workflows. In Indonesia, this often means combining English queries with Bahasa Indonesia, and sometimes mixed-language phrasing from actual users.
A practical eval set should include:
- Common questions that represent the main use case.
- Hard questions that require multiple documents or careful ranking.
- Ambiguous questions that should trigger clarification.
- Missing-knowledge questions that the system should refuse or escalate.
- Edge cases involving outdated policy, duplicate documents, or conflicting sources.
Each item should include the query, the expected source documents, and the ideal answer criteria. If possible, include a short rationale from a subject matter expert. That makes review faster and more consistent.
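One way to keep eval items consistent is to give them an explicit schema. The sketch below is an assumption about how such a record could look; the field names, category labels, and document IDs are hypothetical and should be adapted to your own data.

```python
from dataclasses import dataclass
from typing import List

# Categories mirror the list above; the exact labels are an assumption.
CATEGORIES = ("common", "hard", "ambiguous", "missing_knowledge", "edge_case")


@dataclass
class EvalItem:
    query: str
    expected_source_ids: List[str]
    answer_criteria: List[str]
    category: str = "common"
    language: str = "en"
    expert_rationale: str = ""  # optional note from a subject matter expert

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the item is usable."""
        problems = []
        if not self.query.strip():
            problems.append("empty query")
        if self.category not in CATEGORIES:
            problems.append(f"unknown category: {self.category}")
        # Missing-knowledge items are expected to have no source documents.
        if self.category != "missing_knowledge" and not self.expected_source_ids:
            problems.append("no expected source documents")
        if not self.answer_criteria:
            problems.append("no answer criteria")
        return problems


item = EvalItem(
    query="Bagaimana cara reset PIN?",
    expected_source_ids=["faq-pin-reset"],
    answer_criteria=["Lists the reset steps", "Mentions identity verification"],
    language="id",
    expert_rationale="Top support ticket topic.",
)
print(item.validate())  # [] because every required field is present
```

Running `validate()` over the whole set before each eval run catches incomplete items early, which keeps scores comparable between runs.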
For example, a Jakarta-based fintech team might create an eval set from KYC support questions, product FAQs, and compliance documentation. A B2B SaaS company might use onboarding questions, integration docs, and account-management policies. The exact domain matters less than the discipline of turning real usage into repeatable tests.
A practical eval-driven workflow
A good workflow is simple enough to repeat every week.
1. Define the success criteria first
Before writing code, decide what “good” means. For example: the assistant must cite the correct policy document, answer in under 4 seconds, and refuse unsupported legal advice.
2. Create a baseline
Run the initial system against your eval set and record the results. This becomes the benchmark for all future changes.
3. Change one layer at a time
Test retrieval separately from generation, so that any change in scores can be attributed to a single cause. For example, compare chunk sizes, embedding models, or rerankers before changing the prompt.
4. Review failures manually
Look at the worst cases, not just the average score. Often the most valuable insight is that the retriever is missing a document because of formatting, metadata, or language mismatch.
5. Gate releases on eval thresholds
Set minimum scores that a change must meet before it ships. If a change improves faithfulness but hurts recall or latency, decide whether the tradeoff is acceptable. This is where product judgment matters.
6. Monitor production drift
User behavior changes over time. New documents are added, policies change, and query patterns shift. Re-run evals regularly, and compare the eval set against samples of live traffic so that it still reflects what users actually ask.
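Steps 2 and 5 above can be combined into a simple automated check that compares a candidate run against the recorded baseline. This is a sketch under stated assumptions: the metric names, threshold values, and the 0.02 regression tolerance are illustrative, and the 4-second ceiling simply reuses the example from step 1.

```python
from typing import Dict, List


def gate_release(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    min_thresholds: Dict[str, float],
    max_thresholds: Dict[str, float],
    max_regression: float = 0.02,
) -> List[str]:
    """Return the reasons a candidate should be blocked; an empty list means it passes."""
    failures = []
    for metric, floor in min_thresholds.items():
        if candidate[metric] < floor:
            failures.append(f"{metric} {candidate[metric]:.2f} is below the floor {floor:.2f}")
        # Also block quality metrics that drop too far relative to the baseline.
        drop = baseline[metric] - candidate[metric]
        if drop > max_regression:
            failures.append(f"{metric} regressed by {drop:.2f} against the baseline")
    for metric, ceiling in max_thresholds.items():
        if candidate[metric] > ceiling:
            failures.append(f"{metric} {candidate[metric]:.2f} is above the ceiling {ceiling:.2f}")
    return failures


# Hypothetical scores: the candidate improves faithfulness but loses recall.
baseline = {"recall_at_5": 0.82, "faithfulness": 0.90, "p95_latency_s": 3.1}
candidate = {"recall_at_5": 0.78, "faithfulness": 0.94, "p95_latency_s": 3.6}

failures = gate_release(
    baseline,
    candidate,
    min_thresholds={"recall_at_5": 0.75, "faithfulness": 0.85},
    max_thresholds={"p95_latency_s": 4.0},
)
for reason in failures:
    print("BLOCKED:", reason)
```

In this toy run the candidate clears every absolute threshold but is still flagged for the recall regression. A gate like this does not make the decision; it surfaces the tradeoff so that someone can decide whether it is acceptable.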
What changes in 2026?
By 2026, RAG systems are more mature, but user expectations are also higher. Teams are moving beyond “chat over documents” and into agentic workflows, customer operations, and decision support. That raises the bar for observability and reliability.
Three trends stand out:
- Better automated judges: Teams increasingly use model-based evaluators, but they still validate them against human review to avoid blind spots.
- More domain-specific evals: Generic benchmarks are not enough for enterprise use cases in healthcare, finance, legal, or procurement.
- Stronger production monitoring: Teams track retrieval drift, citation failures, and user correction patterns after launch.
This is especially relevant for companies in Indonesia that serve regulated industries or multilingual users. A system that works in a demo may fail once it meets real documents, real support tickets, and real operational pressure.
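Validating a model-based evaluator against human review, the first trend above, can start with a plain agreement check on a shared sample. The sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The pass/fail labels are made-up illustration data, not results from any real system.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled at random with their own label frequencies.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same eight answers.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

print(f"kappa = {cohens_kappa(human, judge):.3f}")
```

Raw agreement here is 75 percent, but kappa is noticeably lower because both raters say "pass" most of the time. A low kappa suggests the automated judge needs a revised rubric or prompt before its scores are used to gate releases.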
Common mistakes to avoid
The biggest mistake is treating evals as a one-time QA task. RAG quality is not static.
Other common mistakes include:
- Using synthetic questions only, with no real user data
- Measuring final answers without checking retrieval quality
- Optimizing for one metric while ignoring latency or cost
- Letting the eval set become stale as documents change
- Assuming a high score means the system is safe for every use case
If your RAG feature supports customer-facing decisions, compliance workflows, or internal approvals, bring in domain experts and run a professional audit where needed. Evals improve confidence, but they do not replace accountability.
Key takeaways
- Eval-driven development makes RAG quality measurable before release.
- Test retrieval, generation, and system behavior separately.
- Build eval sets from real user questions and domain documents.
- Re-run evals as data, policies, and user behavior change.
- Use human review for sensitive use cases; do not rely on scores alone.
How APLINDO approaches RAG engineering
At APLINDO, we help teams design production-ready AI systems with clear evaluation loops. As a Jakarta-headquartered, remote-first engineering partner, we work with funded startups and enterprises on SaaS engineering, applied AI, Fractional CTO support, and ISO/compliance consulting.
For RAG features, that usually means helping teams define the right evals, improve retrieval pipelines, and set up release gates that are realistic for their product and risk profile. In practice, the goal is not just a smarter demo. It is a system your team can maintain, measure, and improve over time.
FAQ
Is eval-driven development only for large AI teams?
No. Small teams benefit even more because evals help them avoid expensive guesswork and focus on the changes that matter most.
Should I use automated or human evals for RAG?
Use both. Automated evals are fast and scalable, while human review catches nuance, policy issues, and domain-specific mistakes.
How often should RAG evals be updated?
Update them whenever your documents, user behavior, prompts, or retrieval stack change, and review them on a regular schedule after launch.
Can evals help with multilingual RAG?
Yes. They are especially useful for multilingual systems because they reveal retrieval gaps, translation issues, and language-specific failure patterns.
What is the first step if my RAG system is already live?
Start by sampling real queries, labeling a small gold set, and measuring retrieval and answer quality against your current production behavior.
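The sampling step can be sketched as a deduplicated, reproducible draw from query logs. The log contents and function name below are hypothetical. A fixed seed is assumed so that the same sample can be regenerated when labels are reviewed later.

```python
import random
from typing import List


def sample_queries(logged_queries: List[str], sample_size: int, seed: int = 0) -> List[str]:
    """Drop empty and near-duplicate queries, then draw a reproducible random sample."""
    seen = set()
    unique = []
    for query in logged_queries:
        # Normalize case and whitespace so trivial variants count as one query.
        key = " ".join(query.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(query.strip())
    rng = random.Random(seed)
    return rng.sample(unique, min(sample_size, len(unique)))


# Hypothetical production log with a duplicate and an empty entry.
logs = [
    "How do I reset my PIN?",
    "how do i reset my pin? ",
    "Cara upgrade akun?",
    "",
    "What are the transfer limits?",
]
print(sample_queries(logs, sample_size=2))
```

The sampled queries still need expected source documents and answer criteria from someone who knows the domain before they can serve as a gold set.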

