
Josibench
How well do LLMs know me? This benchmark is for my own personal use: it lets me assess which frontier models know the most about my projects and persona from their built-in training data. It was also a fun opportunity for me to practise my benchmarking skills.
Generated: February 2, 2026
Leaderboard
| # | Model | Raw % | Weighted % | IDK | Refusals | Hallucinations |
|---|---|---|---|---|---|---|
| 🥇 | Gemini 3 Flash Preview | 42.8% | 39.5% | 0 (0.0%) | 0 (0.0%) | 43 (47.8%) |
| 🥈 | Gemini 3 Pro Preview | 41.1% | 37.6% | 0 (0.0%) | 0 (0.0%) | 42 (46.7%) |
| 🥉 | DeepSeek V3.2 | 33.9% | 33.3% | 5 (5.6%) | 2 (2.2%) | 44 (48.9%) |
| 4 | GLM 4.7 | 35.0% | 32.9% | 12 (13.3%) | 0 (0.0%) | 37 (41.1%) |
| 5 | GPT-4o | 36.1% | 32.2% | 10 (11.1%) | 0 (0.0%) | 39 (43.3%) |
| 6 | Grok 4.1 Fast | 30.0% | 28.3% | 4 (4.4%) | 0 (0.0%) | 59 (65.6%) |
| 7 | Claude Opus 4.5 | 31.1% | 27.5% | 52 (57.8%) | 0 (0.0%) | 11 (12.2%) |
| 8 | GPT-5.2 | 27.2% | 27.1% | 46 (51.1%) | 0 (0.0%) | 14 (15.6%) |
| 9 | Kimi K2.5 | 30.6% | 26.9% | 29 (32.2%) | 0 (0.0%) | 25 (27.8%) |
| 10 | GPT-4o-mini | 22.8% | 22.1% | 17 (18.9%) | 0 (0.0%) | 48 (53.3%) |
| 11 | Qwen3-VL-30B-A3B | 20.6% | 18.4% | 22 (24.4%) | 0 (0.0%) | 51 (56.7%) |
| 12 | Claude Sonnet 4 | 14.4% | 13.6% | 71 (78.9%) | 0 (0.0%) | 1 (1.1%) |
| 13 | Claude Haiku 4.5 | 16.1% | 13.4% | 73 (81.1%) | 0 (0.0%) | 1 (1.1%) |
Aggregate Statistics
| Metric | Value |
|---|---|
| Avg Raw Score | 29.4% |
| Avg Weighted Score | 27.1% |
| Avg IDK Rate | 29.1% |
| Avg Refusal Rate | 0.2% |
| Avg Hallucination Rate | 35.5% |
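The aggregate figures appear to be plain unweighted means over the 13 leaderboard rows. A minimal sketch that recomputes them from the table above, assuming simple averaging and rounding to one decimal place (the names `ROWS`, `column_mean`, and `AVERAGES` are illustrative, not part of the benchmark's own tooling):

```python
# Leaderboard rows transcribed from the table above:
# (model, raw %, weighted %, IDK %, refusal %, hallucination %)
ROWS = [
    ("Gemini 3 Flash Preview", 42.8, 39.5, 0.0, 0.0, 47.8),
    ("Gemini 3 Pro Preview", 41.1, 37.6, 0.0, 0.0, 46.7),
    ("DeepSeek V3.2", 33.9, 33.3, 5.6, 2.2, 48.9),
    ("GLM 4.7", 35.0, 32.9, 13.3, 0.0, 41.1),
    ("GPT-4o", 36.1, 32.2, 11.1, 0.0, 43.3),
    ("Grok 4.1 Fast", 30.0, 28.3, 4.4, 0.0, 65.6),
    ("Claude Opus 4.5", 31.1, 27.5, 57.8, 0.0, 12.2),
    ("GPT-5.2", 27.2, 27.1, 51.1, 0.0, 15.6),
    ("Kimi K2.5", 30.6, 26.9, 32.2, 0.0, 27.8),
    ("GPT-4o-mini", 22.8, 22.1, 18.9, 0.0, 53.3),
    ("Qwen3-VL-30B-A3B", 20.6, 18.4, 24.4, 0.0, 56.7),
    ("Claude Sonnet 4", 14.4, 13.6, 78.9, 0.0, 1.1),
    ("Claude Haiku 4.5", 16.1, 13.4, 81.1, 0.0, 1.1),
]


def column_mean(rows, index):
    """Unweighted mean of one numeric column, rounded to one decimal."""
    return round(sum(row[index] for row in rows) / len(rows), 1)


# Assumption: each aggregate is a simple mean across all models.
AVERAGES = {
    "raw": column_mean(ROWS, 1),
    "weighted": column_mean(ROWS, 2),
    "idk": column_mean(ROWS, 3),
    "refusal": column_mean(ROWS, 4),
    "hallucination": column_mean(ROWS, 5),
}

for name, value in AVERAGES.items():
    print(f"Avg {name}: {value}%")
```

Under that assumption the recomputed means match the published aggregates, which suggests no per-model weighting was applied.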
Performance by Category
| Model | Bio | Projects | SEI | PW | MDL | Proto | Media | Tech |
|---|---|---|---|---|---|---|---|---|
| Gemini 3 Flash Preview | 45% | 43% | 87% | 56% | 38% | 32% | 43% | 33% |
| Gemini 3 Pro Preview | 40% | 37% | 87% | 63% | 63% | 43% | 43% | 39% |
| DeepSeek V3.2 | 40% | 17% | 70% | 69% | 8% | 36% | 7% | 17% |
| GLM 4.7 | 40% | 17% | 83% | 69% | 8% | 29% | 21% | 6% |
| GPT-4o | 25% | 17% | 80% | 56% | 33% | 39% | 14% | 11% |
| Grok 4.1 Fast | 35% | 3% | 70% | 69% | 4% | 32% | 14% | 22% |
| Claude Opus 4.5 | 10% | 23% | 73% | 75% | 8% | 21% | 14% | 17% |
| GPT-5.2 | 25% | 7% | 77% | 44% | 0% | 29% | 0% | 22% |
| Kimi K2.5 | 20% | 17% | 73% | 69% | 25% | 21% | 21% | 6% |
| GPT-4o-mini | 10% | 13% | 53% | 38% | 13% | 25% | 14% | 6% |
| Qwen3-VL-30B-A3B | 20% | 7% | 67% | 19% | 8% | 14% | 0% | 0% |
| Claude Sonnet 4 | 0% | 3% | 47% | 31% | 0% | 14% | 7% | 6% |
| Claude Haiku 4.5 | 5% | 0% | 63% | 44% | 0% | 7% | 0% | 0% |
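To see which category each model handled best, the table can be scanned row by row. The sketch below is illustrative only: it uses the rounded percentages as printed, so near-ties (for example 70% vs 69%) could resolve differently on the unrounded scores, and the names `SCORES`, `best_categories`, and `NOT_SEI` are hypothetical.

```python
CATEGORIES = ["Bio", "Projects", "SEI", "PW", "MDL", "Proto", "Media", "Tech"]

# Rounded per-category scores (%) transcribed from the table above.
SCORES = {
    "Gemini 3 Flash Preview": [45, 43, 87, 56, 38, 32, 43, 33],
    "Gemini 3 Pro Preview": [40, 37, 87, 63, 63, 43, 43, 39],
    "DeepSeek V3.2": [40, 17, 70, 69, 8, 36, 7, 17],
    "GLM 4.7": [40, 17, 83, 69, 8, 29, 21, 6],
    "GPT-4o": [25, 17, 80, 56, 33, 39, 14, 11],
    "Grok 4.1 Fast": [35, 3, 70, 69, 4, 32, 14, 22],
    "Claude Opus 4.5": [10, 23, 73, 75, 8, 21, 14, 17],
    "GPT-5.2": [25, 7, 77, 44, 0, 29, 0, 22],
    "Kimi K2.5": [20, 17, 73, 69, 25, 21, 21, 6],
    "GPT-4o-mini": [10, 13, 53, 38, 13, 25, 14, 6],
    "Qwen3-VL-30B-A3B": [20, 7, 67, 19, 8, 14, 0, 0],
    "Claude Sonnet 4": [0, 3, 47, 31, 0, 14, 7, 6],
    "Claude Haiku 4.5": [5, 0, 63, 44, 0, 7, 0, 0],
}


def best_categories(scores):
    """Return every category tied for a model's highest score."""
    top = max(scores)
    return [cat for cat, score in zip(CATEGORIES, scores) if score == top]


BEST = {model: best_categories(scores) for model, scores in SCORES.items()}

# Models whose top category is something other than SEI.
NOT_SEI = [model for model, cats in BEST.items() if "SEI" not in cats]

for model, cats in BEST.items():
    print(f"{model}: {', '.join(cats)}")
print("Top category is not SEI:", NOT_SEI)
```

On the rounded figures, SEI is the top category for every model except Claude Opus 4.5, whose best category is PW (75% vs 73%).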
Key Findings (written by Opus 4.5)
Subjective Effect Index dominates. Nearly every model performed best on SEI questions (47-87%), likely because this documentation is publicly available and well-indexed. The one exception was Claude Opus 4.5, which scored slightly higher on PW (75%) than on SEI (73%).
Calibration varies wildly. Claude models (Opus, Haiku, Sonnet) showed the lowest hallucination rates (1.1-12.2%) but also the highest IDK rates (57.8-81.1%). Grok hallucinated the most (65.6%) and DeepSeek hallucinated on nearly half of its answers (48.9%), with both rarely saying "I don't know" (4.4% and 5.6% IDK).
Mindstate Design Labs is nearly unknown. Most models scored under 10% on MDL questions, and GPT-5.2, Claude Sonnet 4, and Claude Haiku 4.5 scored 0%. Gemini 3 Pro (62.5%) was the only model to score above 50%.
Gemini leads on raw knowledge. Both Gemini 3 models topped the leaderboard with 41.1-42.8% raw scores and zero IDKs, though at the cost of 46.7-47.8% hallucination rates.
Policy refusals are rare. Only DeepSeek V3.2 refused to answer on policy grounds (2.2% refusal rate). All other non-answers were knowledge gaps (IDK), not model safety interventions.
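The calibration trade-off described in these findings can be checked by ranking the leaderboard on its two competing columns. A minimal sketch, assuming the IDK and hallucination percentages from the leaderboard table (`RATES`, `by_hallucination`, and `by_idk` are illustrative names):

```python
# (IDK %, hallucination %) per model, transcribed from the leaderboard.
RATES = {
    "Gemini 3 Flash Preview": (0.0, 47.8),
    "Gemini 3 Pro Preview": (0.0, 46.7),
    "DeepSeek V3.2": (5.6, 48.9),
    "GLM 4.7": (13.3, 41.1),
    "GPT-4o": (11.1, 43.3),
    "Grok 4.1 Fast": (4.4, 65.6),
    "Claude Opus 4.5": (57.8, 12.2),
    "GPT-5.2": (51.1, 15.6),
    "Kimi K2.5": (32.2, 27.8),
    "GPT-4o-mini": (18.9, 53.3),
    "Qwen3-VL-30B-A3B": (24.4, 56.7),
    "Claude Sonnet 4": (78.9, 1.1),
    "Claude Haiku 4.5": (81.1, 1.1),
}

# Rank models from highest to lowest on each axis.
by_hallucination = sorted(RATES, key=lambda m: RATES[m][1], reverse=True)
by_idk = sorted(RATES, key=lambda m: RATES[m][0], reverse=True)

print("Most hallucinations:", by_hallucination[:3])
print("Fewest hallucinations:", by_hallucination[-3:])
print("Most IDK responses:", by_idk[:3])
```

The three models with the fewest hallucinations are the same three with the most IDK responses, which is the pattern the findings describe: lower fabrication was bought with more abstention.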
Model-by-Model Analysis
Gemini 3 Flash & Pro — The Confident Confabulators
Both Gemini models topped the leaderboard with zero refusals and the highest raw scores (41.1-42.8%). However, this came at a cost: 46.7-47.8% of their answers were hallucinations.
Behavioral pattern: Gemini never admits uncertainty. When it doesn't know something, it generates plausible-sounding but fabricated details with confident specificity.
Strengths: Strong on SEI effect definitions (86.7%) and PsychonautWiki facts (56.3-62.5%). Pro showed surprising knowledge of Mindstate Design Labs (62.5%).
DeepSeek V3.2 — The Overconfident Reasoner
DeepSeek showed its reasoning chain in responses, walking through its thought process before answering. Despite this deliberation, it still hallucinated 48.9% of answers, with only a 5.6% IDK rate and a 2.2% refusal rate.
Behavioral pattern: Fabricates specific, verifiable-sounding details including invented names and incorrect dates. Even when uncertain, it commits to a specific answer.
Notable: The only model to invoke policy-based refusals for privacy reasons.
Claude Opus, Sonnet & Haiku — The Epistemically Humble
All three Claude models demonstrated extreme epistemic caution, with IDK rates of 57.8-81.1%. When uncertain, they consistently declined to guess, resulting in the lowest hallucination rates (1.1-12.2%).
Behavioral pattern: Variants of "I don't have reliable information..." Haiku and Sonnet were the most conservative (81.1% and 78.9% IDK respectively, each with 1.1% hallucinations). Opus was more willing to attempt answers (57.8% IDK).
Strengths: When Claude did answer, it was often correct. Opus scored 73.3% on SEI and 75% on PsychonautWiki.
GPT-5.2, 4o & 4o-mini — The Balanced Middle Ground
The GPT family showed moderate calibration, neither as cautious as Claude nor as reckless as Gemini. GPT-5.2 had a 51.1% IDK rate with only 15.6% hallucinations. GPT-4o had an 11.1% IDK rate but 43.3% hallucinations. GPT-4o-mini had an 18.9% IDK rate but 53.3% hallucinations.
Behavioral pattern: GPT-5.2 often offered helpful verification suggestions, showing a meta-awareness unique among the models tested. The smaller models hallucinated more freely.
Grok 4.1 — The Prolific Fabricator
Grok had the highest hallucination rate (65.6%) with only a 4.4% IDK rate. Roughly two-thirds of its responses were fabricated.
Behavioral pattern: Generates specific, confident falsehoods with no hedging. Invents details freely rather than admitting uncertainty.
Strengths: Scored well on SEI (70%) and PsychonautWiki (68.8%) where general psychedelic knowledge was sufficient.
Kimi K2.5 — The Privacy-Conscious Reasoner
Kimi's responses included extensive reasoning chains, often 500+ words of deliberation before answering. It had a 32.2% IDK rate and a 27.8% hallucination rate, a more even balance than most models showed.
Behavioral pattern: Long internal debates about whether information is public or private, often concluding with hedged answers. Scored 73.3% on SEI and 68.8% on PsychonautWiki.
Qwen3 VL — The Overconfident Academic
Qwen showed extended reasoning like Kimi but with more confident conclusions and a 56.7% hallucination rate (second highest). It frequently invented credentials and citations that don't exist.
Behavioral pattern: Generates authoritative-sounding responses with citations to fabricated interviews and sources. Scored 66.7% on SEI but 0% on Media & Technical categories.
GLM-4.7 — The Privacy-Aware Reasoner
GLM showed detailed reasoning with a 13.3% IDK rate and 41.1% hallucination rate. It correctly identified that Josie Kins is a pseudonym.
Behavioral pattern: Extensive internal deliberation about safety and privacy. Scored 83.3% on SEI and 68.8% on PsychonautWiki but only 5.6% on Technical & Deep Knowledge.