Stack Forensics catches fabricated outputs in real time, traces them back to root cause, and gives your team a forensic record of every false claim your model made.
Different hallucinations have different causes. We classify before we remediate.
Verify every cited URL, paper, or section actually exists and contains the claimed content.
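A minimal sketch of the kind of check a citation verifier performs. The function names, the `fetch` injection, and the example page are all illustrative, not our actual API; a real fetcher would be an HTTP client.

```python
from typing import Callable

def verify_citation(url: str, claimed_phrase: str,
                    fetch: Callable[[str], tuple[int, str]]) -> str:
    """Classify a cited URL. `fetch` returns (status_code, body_text);
    injecting it keeps the check testable without network access."""
    status, body = fetch(url)
    if status >= 400:
        return "dead_link"          # the cited page does not exist
    if claimed_phrase.lower() not in body.lower():
        return "content_mismatch"   # page exists but lacks the claimed content
    return "verified"

# Fake fetcher standing in for a real HTTP client (illustrative data).
PAGES = {"https://example.com/paper": (200, "We report a 2.1% error rate.")}

def fake_fetch(url: str) -> tuple[int, str]:
    return PAGES.get(url, (404, ""))
```

The three-way outcome matters: a dead link and a real page that doesn't say what the model claims are different failure modes, and they get triaged differently downstream.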
Score every factual statement against retrieved context. Flag claims with no source.
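In spirit, the grounding check looks like this: score each claim against the retrieved context and flag anything below a support threshold. The sketch below uses crude token overlap as a stand-in for a real entailment model; the names and threshold are assumptions.

```python
import re

def support_score(claim: str, context: str) -> float:
    """Fraction of the claim's content words found in the retrieved
    context. A crude stand-in for an entailment/NLI scorer."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    claim_toks = tokenize(claim)
    if not claim_toks:
        return 0.0
    return len(claim_toks & tokenize(context)) / len(claim_toks)

def flag_unsourced(claims: list[str], context: str,
                   threshold: float = 0.5) -> list[str]:
    """Return the claims whose support falls below the threshold."""
    return [c for c in claims if support_score(c, context) < threshold]
```

A claim fully covered by the context scores 1.0; a claim with no source scores near 0 and lands in the flagged queue.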
Catch person/place/product confusion: wrong CEO, wrong year, wrong jurisdiction.
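The shape of an entity-confusion check: compare the attribute the model asserted against a trusted reference. Everything below is illustrative (fictional company, toy lookup table), not our knowledge base.

```python
# Tiny reference table standing in for a real knowledge base (assumed data).
FACTS = {
    ("Acme Corp", "ceo"): "Jane Doe",
    ("Acme Corp", "founded"): "1999",
}

def check_entity_claim(entity: str, attribute: str, claimed_value: str) -> str:
    """Verify an (entity, attribute, value) triple against reference data."""
    truth = FACTS.get((entity, attribute))
    if truth is None:
        return "unknown"    # no reference data; cannot verify either way
    return "ok" if claimed_value == truth else "confused"
```

The "unknown" branch is deliberate: a claim we can't verify is not the same as a claim we've caught, and conflating the two inflates false positives.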
Numeric claims re-evaluated symbolically. Bad math caught before users see it.
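What "re-evaluated symbolically" means in miniature: parse the arithmetic out of the claim and recompute it exactly, with rationals rather than floats, so `0.1 + 0.2 = 0.3` verifies cleanly. This sketch handles only the simple `a <op> b = c` shape; the pattern and return convention are assumptions.

```python
import re
from fractions import Fraction

NUM = r"(-?\d+(?:\.\d+)?)"

def check_arithmetic(claim: str):
    """Re-evaluate a simple 'a <op> b = c' claim with exact rationals.
    Returns True/False, or None if the claim doesn't match the pattern."""
    m = re.fullmatch(rf"\s*{NUM}\s*([+\-*/])\s*{NUM}\s*=\s*{NUM}\s*", claim)
    if not m:
        return None
    a, op, b, c = m.groups()
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    if op == "+":
        result = a + b
    elif op == "-":
        result = a - b
    elif op == "*":
        result = a * b
    else:
        if b == 0:
            return None     # division by zero: claim is malformed, not wrong
        result = a / b
    return result == c      # exact comparison, no float rounding
```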
Cluster hallucinations by topic, prompt template, and model version — find systemic issues fast.
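The clustering step, at its simplest, is a group-by over event metadata. The field names below are illustrative; a systemic issue shows up as one (topic, template, model) key dominating the counts.

```python
from collections import Counter

def cluster_events(events: list[dict]) -> list[tuple[tuple, int]]:
    """Group hallucination events by (topic, prompt template, model version)
    and return clusters sorted by size, largest first."""
    counts = Counter((e["topic"], e["template"], e["model"]) for e in events)
    return counts.most_common()

# Illustrative events: two share a key, suggesting a systemic issue.
events = [
    {"topic": "pricing", "template": "faq_v2", "model": "m-3.1"},
    {"topic": "pricing", "template": "faq_v2", "model": "m-3.1"},
    {"topic": "legal", "template": "contract_v1", "model": "m-3.1"},
]
```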
Trace each hallucination to retrieval miss, prompt ambiguity, model brittleness, or training-data gap.
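A rule-of-thumb triage over per-event signals conveys the idea: check the pipeline stages in order and blame the earliest one that failed. The signal names and thresholds here are assumptions for illustration, not our production heuristics.

```python
def classify_root_cause(event: dict) -> str:
    """Assign one of four root-cause labels from per-event signals
    (field names are illustrative)."""
    if event["retrieval_hits"] == 0:
        return "retrieval_miss"       # nothing relevant was retrieved
    if event["prompt_ambiguity"] > 0.7:
        return "prompt_ambiguity"     # the instruction itself was unclear
    if not event["seen_in_training"]:
        return "training_data_gap"    # topic absent from the training corpus
    return "model_brittleness"        # inputs were fine; the model still erred
```

The ordering encodes a prior: fix retrieval before blaming the prompt, and the prompt before blaming the model.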
Straightforward answers about scope, integration, data handling, and rollout.
Evals score samples; we monitor production. Detection runs on live traffic with sub-15s latency, and every event ships with a forensic root-cause analysis.

2.1% on our public benchmark. We disclose calibration data per claim type: arithmetic is near zero; identity confusion is the hardest.
No. We complement thumbs-down by catching the hallucinations users don't notice — and giving QA teams a queue to review.
Findings export to your eval set, fine-tuning corpus, or retrieval index as targeted negatives.
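The export path is easiest to picture as JSONL, one confirmed finding per line. The schema below (and the preference-style `rejected` field) is illustrative; actual export formats depend on the target system.

```python
import json

def export_negatives(findings: list[dict]) -> str:
    """Serialize confirmed hallucinations as JSONL records usable as
    targeted negatives in an eval set or fine-tuning corpus."""
    lines = []
    for f in findings:
        lines.append(json.dumps({
            "prompt": f["prompt"],
            "rejected": f["hallucinated_output"],  # the bad completion
            "label": "negative",
            "cause": f["root_cause"],              # from triage, e.g. "retrieval_miss"
        }))
    return "\n".join(lines)
```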