RAG Systems That Don't Hallucinate: Engineering Zero-Trust AI
How we engineered the Aya Knowledge Base with grounding scores, provenance tracking, and hallucination detection to build enterprise RAG that clients actually trust.
The Trust Problem in Enterprise RAG
Every enterprise wants RAG. Very few trust the answers it produces. When we started building the Aya Knowledge Base at AlysAI -- a retrieval-augmented generation system for enterprise clients handling compliance-sensitive documents -- the primary concern from every stakeholder was the same: "How do I know this answer is not hallucinated?"
It is a fair question. Standard RAG implementations retrieve relevant chunks, stuff them into a prompt, and hope the LLM synthesizes a faithful answer. In practice, LLMs confabulate details, merge information from unrelated chunks, and present fabricated content with absolute confidence. For an enterprise dealing with regulatory filings, legal contracts, or medical protocols, a single hallucination can mean compliance violations and real financial damage.
This post describes the zero-trust architecture we built for Aya -- a system where every claim in every answer is verified, scored, and traceable to its source document.
The Zero-Trust RAG Architecture
The core principle is simple: treat the LLM as an untrusted component. The LLM generates candidate answers, but a separate verification pipeline validates every factual claim before the answer reaches the user. The architecture has four stages (a rough end-to-end sketch follows the list):
- Retrieval with relevance scoring and chunk provenance
- Generation with inline citation requirements
- Verification with grounding score computation
- Filtering with hallucination threshold enforcement
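Before diving into each stage, here is a rough sketch of how the pieces fit together. GENERATION_PROMPT, GroundingVerifier, and assemble_verified_response appear later in this post; the retriever and llm objects, the format_chunks helper, the top_k value, and the chunk-tagging scheme (first eight characters of the chunk hash) are placeholders for illustration, not our exact interfaces.

def answer_query(question: str, retriever, llm, verifier: GroundingVerifier) -> dict:
    # Stage 1: hybrid retrieval with provenance (retriever is assumed to
    # return RetrievedChunk objects).
    chunks = retriever.retrieve(question, top_k=8)
    chunk_map = {c.chunk_hash[:8]: c for c in chunks}

    # Stage 2: constrained generation with inline [ChunkID] citations.
    prompt = GENERATION_PROMPT.format(
        formatted_chunks=format_chunks(chunk_map),
        user_question=question,
    )
    draft_answer = llm.generate(prompt)

    # Stage 3: decompose the draft into claims and score each claim
    # against the chunk it cites.
    grounding_results = verifier.verify_answer(draft_answer, chunk_map)
    claims = [r.claim for r in grounding_results]

    # Stage 4: drop or flag ungrounded claims and enforce the aggregate
    # hallucination threshold.
    return assemble_verified_response(claims, grounding_results)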
Stage 1: Retrieval with Provenance
We use a hybrid retrieval strategy combining dense embeddings (via a fine-tuned BGE-large model) and sparse BM25 scoring, fused with Reciprocal Rank Fusion. But retrieval alone is not enough -- we attach full provenance metadata to every chunk:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RetrievedChunk:
    content: str
    document_id: str
    document_title: str
    page_number: int
    paragraph_index: int
    chunk_hash: str              # SHA-256 of content for integrity verification
    ingestion_timestamp: datetime
    relevance_score: float       # hybrid retrieval score
    source_classification: str   # "primary", "secondary", "derived"
    pii_redacted: bool
    redaction_log: list[str]     # which PII categories were redacted

Every chunk carries its complete lineage. When the system produces an answer, a user can trace any claim back to a specific paragraph on a specific page of a specific document, ingested at a specific time. This is not just a nice feature -- for several of our clients in regulated industries, it is a compliance requirement.
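For completeness, the Reciprocal Rank Fusion step mentioned above can be sketched as follows. The inputs are assumed to be ranked lists of chunk IDs from the dense and sparse retrievers, and k=60 is the conventional RRF constant rather than necessarily our production setting.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in, so
    chunks ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = reciprocal_rank_fusion([dense_ranking, bm25_ranking])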
Stage 2: Constrained Generation
The generation prompt explicitly instructs the LLM to cite its sources using chunk identifiers. We format the prompt so that each retrieved chunk has a unique tag, and the model must reference these tags inline:
GENERATION_PROMPT = """
You are answering questions using ONLY the provided source documents.
RULES:
1. Every factual claim MUST include a citation tag [ChunkID].
2. If the sources do not contain information to answer the question, say "I cannot find this information in the provided documents."
3. Do NOT combine information from different chunks to infer new facts.
4. Do NOT add information beyond what is explicitly stated in the sources.
Sources:
{formatted_chunks}
Question: {user_question}
Answer (with citations):
"""This prompt engineering reduces hallucination rate significantly, but does not eliminate it. LLMs still occasionally fabricate citations or misattribute claims. That is why Stage 3 exists.
Stage 3: Grounding Verification
After the LLM generates an answer, a separate verification pipeline decomposes the answer into individual claims and scores each claim against the cited source chunk. We use a fine-tuned NLI (Natural Language Inference) model for this -- specifically, a DeBERTa-v3-large model fine-tuned on a combination of MNLI, FEVER, and our own domain-specific entailment dataset.
class GroundingVerifier:
    def __init__(self, nli_model, threshold=0.7):
        self.nli_model = nli_model
        self.threshold = threshold

    def verify_answer(self, answer: str, chunks: dict[str, RetrievedChunk]):
        # Split the answer into atomic claims, each carrying its citation tag.
        claims = self.decompose_claims(answer)
        results = []
        for claim in claims:
            cited_chunk = chunks.get(claim.citation_id)
            if cited_chunk is None:
                # The model cited a chunk ID that was never retrieved.
                results.append(GroundingResult(
                    claim=claim, score=0.0, status="FABRICATED_CITATION"
                ))
                continue
            # NLI: does the chunk entail the claim?
            score = self.nli_model.entailment_score(
                premise=cited_chunk.content,
                hypothesis=claim.text
            )
            status = "GROUNDED" if score >= self.threshold else "UNGROUNDED"
            results.append(GroundingResult(
                claim=claim, score=score, status=status
            ))
        return results

Each claim receives a grounding score between 0.0 and 1.0. Scores above 0.7 are considered grounded. Scores between 0.3 and 0.7 are flagged for review. Scores below 0.3 trigger automatic removal of the claim from the answer.
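For illustration, entailment_score could be implemented roughly as follows, assuming a HuggingFace cross-encoder NLI checkpoint. The model name stands in for the fine-tuned model described above, and the entailment label index varies by checkpoint, so treat both as placeholders.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class NLIScorer:
    def __init__(self, model_name: str = "microsoft/deberta-v3-large"):
        # Substitute your own fine-tuned NLI checkpoint here.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def entailment_score(self, premise: str, hypothesis: str) -> float:
        inputs = self.tokenizer(
            premise, hypothesis, return_tensors="pt",
            truncation=True, max_length=512,
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
        # Label order differs between checkpoints; index 2 is assumed to be
        # "entailment" here (check model.config.id2label before relying on it).
        return probs[2].item()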
Stage 4: Hallucination Filtering and Response Assembly
The final stage assembles the verified answer. Ungrounded claims are either removed or replaced with a disclaimer. The overall answer receives an aggregate grounding score -- the average of its individual claim scores. If the aggregate score falls below our hallucination threshold of 0.3, the entire answer is rejected and the user receives a message explaining that the system could not produce a sufficiently reliable answer.
from statistics import mean

def assemble_verified_response(claims, grounding_results, threshold=0.3):
    verified_claims = []
    warnings = []
    for claim, result in zip(claims, grounding_results):
        if result.status == "GROUNDED":
            verified_claims.append(claim.text)
        elif result.status == "UNGROUNDED" and result.score >= 0.3:
            # Partially supported: keep the claim but flag it for the user.
            verified_claims.append(
                f"{claim.text} [Low confidence - verify against source]"
            )
            warnings.append(f"Claim partially supported: {claim.text[:80]}...")
        else:
            # Fabricated citations and low-scoring claims are dropped entirely.
            warnings.append(f"Removed unverified claim: {claim.text[:80]}...")
    aggregate_score = mean([r.score for r in grounding_results])
    if aggregate_score < threshold:
        return {
            "answer": None,
            "message": "Unable to produce a sufficiently reliable answer.",
            "grounding_score": aggregate_score,
            "warnings": warnings,
        }
    return {
        "answer": " ".join(verified_claims),
        "grounding_score": aggregate_score,
        "warnings": warnings,
        "provenance": [r.to_dict() for r in grounding_results],
    }

PII Redaction as a First-Class Concern
Enterprise documents contain personally identifiable information -- names, addresses, social security numbers, medical record numbers. Our ingestion pipeline runs PII detection before chunking, using a combination of Microsoft Presidio and custom regex patterns for domain-specific identifiers.
Redacted content is replaced with typed placeholders ([PERSON_NAME_1], [SSN_REDACTED]) that preserve semantic structure while removing sensitive data. The redaction log is attached to each chunk's provenance metadata, so compliance teams can audit exactly what was redacted and why.
Critically, PII redaction happens before embeddings are computed. This means the vector store never contains PII in any form -- neither in the stored text nor in the embedding vectors that could theoretically be inverted.
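A simplified sketch of the redaction step, assuming Presidio's AnalyzerEngine; the placeholder naming and the point where custom domain recognizers would be registered are illustrative rather than our exact configuration.

from presidio_analyzer import AnalyzerEngine

def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace detected PII spans with typed, numbered placeholders and
    return the redacted text plus a log of redacted categories."""
    analyzer = AnalyzerEngine()  # custom domain recognizers would be added here
    findings = sorted(analyzer.analyze(text=text, language="en"),
                      key=lambda f: f.start)
    counters: dict[str, int] = {}
    placeholders = []
    for finding in findings:
        counters[finding.entity_type] = counters.get(finding.entity_type, 0) + 1
        placeholders.append(f"[{finding.entity_type}_{counters[finding.entity_type]}]")
    # Replace from the end of the string so earlier character offsets stay valid.
    for finding, placeholder in zip(reversed(findings), reversed(placeholders)):
        text = text[:finding.start] + placeholder + text[finding.end:]
    return text, sorted(counters)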
OWASP LLM Top 10 Compliance
We audited the Aya system against the OWASP Top 10 for LLM Applications:
- LLM01 (Prompt Injection): An input sanitization layer strips known injection patterns (a sketch follows this list). The generation prompt uses XML delimiters that are validated before LLM submission.
- LLM02 (Insecure Output Handling): All LLM outputs are treated as untrusted. HTML is escaped, and outputs are validated against expected schemas before rendering.
- LLM06 (Sensitive Information Disclosure): PII redaction pipeline described above, plus output scanning that catches any PII that bypasses ingestion-time redaction.
- LLM09 (Overreliance): The grounding score system explicitly communicates confidence levels to users, discouraging blind trust in AI outputs.
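As an example of the LLM01 mitigation, the sanitization layer can be as simple as a pattern filter plus delimiter escaping. The patterns below are illustrative, not our production deny-list.

import re

# Illustrative patterns only; a real deny-list is broader and maintained
# alongside red-team findings.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now (in )?developer mode", re.IGNORECASE),
]

def sanitize_user_input(text: str) -> str:
    """Reject inputs matching known injection patterns and neutralize any
    XML-style delimiters reserved for the generation prompt."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            raise ValueError("Input rejected: possible prompt injection")
    # Escape angle brackets so user text cannot open or close prompt sections.
    return text.replace("<", "&lt;").replace(">", "&gt;")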
Production Results
After three months in production with four enterprise clients, the system processes approximately 15,000 queries per day with the following metrics:
- Average grounding score: 0.82
- Full rejection rate (aggregate score below 0.3): 3.2% of queries
- User-reported inaccuracies: 0.4% of queries (down from 11% with standard RAG)
- Average response latency: 2.8 seconds (acceptable for the document analysis use case)
The latency cost of verification is real -- approximately 800ms is spent on claim decomposition and NLI scoring. But every client we have spoken to accepts this trade-off. In their words: "A slower correct answer is infinitely more valuable than a fast wrong one."
Lessons Learned
The NLI model is the linchpin. Off-the-shelf NLI models work reasonably well, but fine-tuning on domain-specific entailment pairs improved grounding accuracy by 15 percentage points. Invest in building a high-quality entailment dataset for your domain.
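For concreteness, a domain entailment pair is just a premise (chunk text), a hypothesis (a claim), and a label; the example below is invented for illustration.

entailment_example = {
    "premise": "Section 4.2: The data retention period for customer records is seven years.",
    "hypothesis": "Customer records must be retained for seven years.",
    "label": "entailment",  # other labels: "neutral", "contradiction"
}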
Chunk boundaries matter more than you think. A claim that spans two chunks will fail grounding verification even if both chunks support it. We implemented a chunk merging strategy for adjacent chunks from the same document section, which reduced false-negative grounding failures by 22%.
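The merging heuristic can be simple. The sketch below approximates "same section" by consecutive paragraph_index values within the same document, which is an assumption rather than our exact rule.

from dataclasses import replace

def merge_adjacent_chunks(chunks: list[RetrievedChunk]) -> list[RetrievedChunk]:
    """Merge retrieved chunks that are consecutive paragraphs of the same
    document, so a claim spanning a chunk boundary can still be verified
    against a single premise (chunk_hash is left stale in this sketch)."""
    ordered = sorted(chunks, key=lambda c: (c.document_id, c.paragraph_index))
    merged: list[RetrievedChunk] = []
    for chunk in ordered:
        previous = merged[-1] if merged else None
        if (previous is not None
                and chunk.document_id == previous.document_id
                and chunk.paragraph_index == previous.paragraph_index + 1):
            merged[-1] = replace(
                previous,
                content=previous.content + "\n" + chunk.content,
                paragraph_index=chunk.paragraph_index,
                relevance_score=max(previous.relevance_score, chunk.relevance_score),
            )
        else:
            merged.append(chunk)
    return merged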
Users trust the system more when they can see the scores. Exposing grounding scores and provenance links in the UI, rather than hiding them, dramatically increased user adoption. Transparency builds trust faster than accuracy alone.