Best tested frontier model still fabricated citations 49% of the time.
HalluciBench v1 evaluated 13 frontier models across 135 prompts and 9 high-stakes domains.
AI systems write confident claims, cite real-looking sources, and collapse uncertainty into answers. Hallucinaite audits the output before organizations bet legal, medical, financial, or regulatory decisions on it.
We do not ask whether AI wrote it; it probably did. We ask whether AI got it right. We check source support, citation laundering, and hallucination risk, then report the result as credit-rating-style model grades.
Audit Output
Source exists, but does not support the claimed magnitude or conclusion.
Model collapses conflicting source evidence into a single definitive claim.
Primary reference supports the architectural mechanism described.
HalluciBench v1
| # | Model | Grade | Fabricated citation rate |
|---|---|---|---|
| 1 | Claude Sonnet 4.6 | CC | 49.1% |
| 2 | MiMo-V2-Pro | CC | 54.0% |
| 3 | Kimi K2.5 | CC | 54.4% |
| 4 | Qwen 3.6 Plus | CC | 55.4% |
| 5 | Gemini 3.1 Pro Preview | CC | 56.6% |
| 6 | Claude Opus 4.6 | C | 60.2% |
Claims enter, evidence gets inspected, and risk comes out as a structured signal an enterprise team can act on.
Source supports the stated review-time reduction.
Cited case could not be resolved in legal source registry.
Real source is being used to support a stronger claim than it contains.
HalluciBench v1 is preparing for release. We are sharing early looks with teams that need to understand whether cited AI output is actually supported by the sources it invokes.
Enterprises need risk language that general counsel, CROs, and CTOs can use. A letter grade is more useful than a vague model score.
Hallucinaite separates fake citations, unsupported claims, quote drift, and source laundering so teams can see exactly what kind of trust failure occurred.
We check whether the cited material actually supports the claim being made, not merely whether the citation exists somewhere.
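The difference between "the citation exists" and "the citation supports the claim" can be sketched in a few lines. This is an illustrative sketch only, not the Hallucinaite pipeline or API: `CitationCheck`, `Verdict`, and `classify` are hypothetical names, and the two boolean inputs stand in for whatever resolution and support checks a real audit would run.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # Hypothetical labels, loosely following the failure types named above.
    SUPPORTED = "supported"
    UNSUPPORTED_CLAIM = "unsupported claim"
    FAKE_CITATION = "fake citation"


@dataclass
class CitationCheck:
    claim: str
    source_resolved: bool        # assumption: could the cited source be found at all?
    source_supports_claim: bool  # assumption: does its content back the claim as stated?


def classify(check: CitationCheck) -> Verdict:
    """Existence is checked first; support is only meaningful for a real source."""
    if not check.source_resolved:
        return Verdict.FAKE_CITATION
    if not check.source_supports_claim:
        return Verdict.UNSUPPORTED_CLAIM
    return Verdict.SUPPORTED


# A real source that does not back the stated conclusion is still a failure.
example = CitationCheck(
    claim="Source exists, but does not support the claimed magnitude.",
    source_resolved=True,
    source_supports_claim=False,
)
print(classify(example).value)
```

A checker that stopped at `source_resolved` would pass this example. The second test is what catches it.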
We are starting with public benchmarks and structured audits, then turning the same evaluation pipeline into enterprise API infrastructure.
An open reliability leaderboard that combines citation verification, a 4-axis rubric, an 8-type error taxonomy, and credit-rating-style model grades.
Board-ready reliability audits for organizations deploying AI into legal, medical, financial, and other high-stakes workflows.
A real-time evaluation endpoint for fabricated citations, overconfident claims, sycophancy, and broken reasoning before AI output reaches users.
Hallucinaite reports are designed for AI buyers, general counsel, compliance teams, and technical leaders who need to understand where a model fails, how often it fails, and what risk that creates.
We are onboarding design partners, early customers, and investors who want independent verification for AI outputs before teams rely on them in high-stakes workflows.
Prefer email? alex@humansofai.xyz