Josibench

How well do LLMs know me? This benchmark is for my own personal use: it lets me assess which frontier models know the most about my projects and persona from their built-in training data. It was also a fun opportunity for me to practise my benchmarking skills.

Generated: February 2, 2026

Leaderboard

| # | Model | Raw % | Weighted % | IDK | Refusals | Hallucinations |
|---|---|---|---|---|---|---|
| 🥇 | Gemini 3 Flash Preview | 42.8% | 39.5% | 0 (0.0%) | 0 (0.0%) | 43 (47.8%) |
| 🥈 | Gemini 3 Pro Preview | 41.1% | 37.6% | 0 (0.0%) | 0 (0.0%) | 42 (46.7%) |
| 🥉 | DeepSeek V3.2 | 33.9% | 33.3% | 5 (5.6%) | 2 (2.2%) | 44 (48.9%) |
| 4 | GLM 4.7 | 35.0% | 32.9% | 12 (13.3%) | 0 (0.0%) | 37 (41.1%) |
| 5 | GPT-4o | 36.1% | 32.2% | 10 (11.1%) | 0 (0.0%) | 39 (43.3%) |
| 6 | Grok 4.1 Fast | 30.0% | 28.3% | 4 (4.4%) | 0 (0.0%) | 59 (65.6%) |
| 7 | Claude Opus 4.5 | 31.1% | 27.5% | 52 (57.8%) | 0 (0.0%) | 11 (12.2%) |
| 8 | GPT-5.2 | 27.2% | 27.1% | 46 (51.1%) | 0 (0.0%) | 14 (15.6%) |
| 9 | Kimi K2.5 | 30.6% | 26.9% | 29 (32.2%) | 0 (0.0%) | 25 (27.8%) |
| 10 | GPT-4o-mini | 22.8% | 22.1% | 17 (18.9%) | 0 (0.0%) | 48 (53.3%) |
| 11 | Qwen3-VL-30B-A3B | 20.6% | 18.4% | 22 (24.4%) | 0 (0.0%) | 51 (56.7%) |
| 12 | Claude Sonnet 4 | 14.4% | 13.6% | 71 (78.9%) | 0 (0.0%) | 1 (1.1%) |
| 13 | Claude Haiku 4.5 | 16.1% | 13.4% | 73 (81.1%) | 0 (0.0%) | 1 (1.1%) |
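The percentages in parentheses appear to be each count divided by the number of questions. The page does not state that number; the sketch below assumes 90 questions (an inference from figures such as 43 hallucinations = 47.8%) and checks a handful of leaderboard entries against that assumption. The names `TOTAL_QUESTIONS`, `rate` and `PUBLISHED` are illustrative, not part of the benchmark.

```python
# Assumption: the benchmark has 90 questions. This is inferred from the
# leaderboard (e.g. 43 hallucinations = 47.8%), not stated in the source.
TOTAL_QUESTIONS = 90


def rate(count, total=TOTAL_QUESTIONS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# (count, published percentage) pairs copied from the leaderboard.
PUBLISHED = [
    (43, 47.8),  # Gemini 3 Flash Preview, hallucinations
    (42, 46.7),  # Gemini 3 Pro Preview, hallucinations
    (5, 5.6),    # DeepSeek V3.2, IDK
    (2, 2.2),    # DeepSeek V3.2, refusals
    (52, 57.8),  # Claude Opus 4.5, IDK
    (73, 81.1),  # Claude Haiku 4.5, IDK
    (59, 65.6),  # Grok 4.1 Fast, hallucinations
]

# Collect any pair that the 90-question assumption fails to reproduce.
mismatches = [(count, pct) for count, pct in PUBLISHED if rate(count) != pct]
print(mismatches)  # → []
```

Every sampled pair is consistent with a 90-question total, which is why the later sketches reuse that figure.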

Aggregate Statistics

| Metric | Value |
|---|---|
| Avg Raw Score | 29.4% |
| Avg Weighted Score | 27.1% |
| Avg IDK Rate | 29.1% |
| Avg Refusal Rate | 0.2% |
| Avg Hallucination Rate | 35.5% |
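The aggregate figures can be reproduced from the leaderboard. The sketch below is a reading of the data and not the author's script: it assumes each aggregate is an unweighted mean across the 13 models and that rates are counts out of 90 questions (the page states neither). The names `ROWS` and `avg` are illustrative.

```python
from statistics import mean

TOTAL_QUESTIONS = 90  # assumption: inferred from the leaderboard, not stated

# (model, raw %, weighted %, IDK count, refusal count, hallucination count),
# copied from the leaderboard.
ROWS = [
    ("Gemini 3 Flash Preview", 42.8, 39.5, 0, 0, 43),
    ("Gemini 3 Pro Preview", 41.1, 37.6, 0, 0, 42),
    ("DeepSeek V3.2", 33.9, 33.3, 5, 2, 44),
    ("GLM 4.7", 35.0, 32.9, 12, 0, 37),
    ("GPT-4o", 36.1, 32.2, 10, 0, 39),
    ("Grok 4.1 Fast", 30.0, 28.3, 4, 0, 59),
    ("Claude Opus 4.5", 31.1, 27.5, 52, 0, 11),
    ("GPT-5.2", 27.2, 27.1, 46, 0, 14),
    ("Kimi K2.5", 30.6, 26.9, 29, 0, 25),
    ("GPT-4o-mini", 22.8, 22.1, 17, 0, 48),
    ("Qwen3-VL-30B-A3B", 20.6, 18.4, 22, 0, 51),
    ("Claude Sonnet 4", 14.4, 13.6, 71, 0, 1),
    ("Claude Haiku 4.5", 16.1, 13.4, 73, 0, 1),
]


def avg(column, as_rate=False):
    """Unweighted mean of one column; counts are converted to % if as_rate."""
    values = [row[column] for row in ROWS]
    if as_rate:
        values = [100 * v / TOTAL_QUESTIONS for v in values]
    return round(mean(values), 1)


avg_raw = avg(1)                          # → 29.4
avg_weighted = avg(2)                     # → 27.1
avg_idk = avg(3, as_rate=True)            # → 29.1
avg_refusal = avg(4, as_rate=True)        # → 0.2
avg_hallucination = avg(5, as_rate=True)  # → 35.5
print(avg_raw, avg_weighted, avg_idk, avg_refusal, avg_hallucination)
```

Because the mean is unweighted, each model counts equally regardless of how many questions it attempted.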

Performance by Category

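| Model | Bio | Projects | SEI | PW | MDL | Proto | Media | Tech |
|---|---|---|---|---|---|---|---|---|
| Gemini 3 Flash Preview | 45% | 43% | 87% | 56% | 38% | 32% | 43% | 33% |
| Gemini 3 Pro Preview | 40% | 37% | 87% | 63% | 63% | 43% | 43% | 39% |
| DeepSeek V3.2 | 40% | 17% | 70% | 69% | 8% | 36% | 7% | 17% |
| GLM 4.7 | 40% | 17% | 83% | 69% | 8% | 29% | 21% | 6% |
| GPT-4o | 25% | 17% | 80% | 56% | 33% | 39% | 14% | 11% |
| Grok 4.1 Fast | 35% | 3% | 70% | 69% | 4% | 32% | 14% | 22% |
| Claude Opus 4.5 | 10% | 23% | 73% | 75% | 8% | 21% | 14% | 17% |
| GPT-5.2 | 25% | 7% | 77% | 44% | 0% | 29% | 0% | 22% |
| Kimi K2.5 | 20% | 17% | 73% | 69% | 25% | 21% | 21% | 6% |
| GPT-4o-mini | 10% | 13% | 53% | 38% | 13% | 25% | 14% | 6% |
| Qwen3-VL-30B-A3B | 20% | 7% | 67% | 19% | 8% | 14% | 0% | 0% |
| Claude Sonnet 4 | 0% | 3% | 47% | 31% | 0% | 14% | 7% | 6% |
| Claude Haiku 4.5 | 5% | 0% | 63% | 44% | 0% | 7% | 0% | 0% |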

Key Findings (written by Opus 4.5)

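Subjective Effect Index dominates. Nearly every model scored highest on SEI questions (47-87%), likely because this documentation is publicly available and well-indexed. The one exception is Claude Opus 4.5, which scored marginally higher on PsychonautWiki (75%) than on SEI (73%).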

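Calibration varies wildly. Claude models (Opus, Haiku, Sonnet) showed the lowest hallucination rates (1.1-12.2%) but also the highest IDK rates (57.8-81.1%). At the other extreme, Grok (65.6%) and DeepSeek (48.9%) hallucinated heavily while rarely saying "I don't know" (4.4% and 5.6% IDK). Qwen3-VL (56.7%) and GPT-4o-mini (53.3%) hallucinated more often than DeepSeek, but with higher IDK rates (24.4% and 18.9%).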

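Mindstate Design Labs is nearly unknown. Eight of the thirteen models scored under 10% on MDL questions, with GPT-5.2, Claude Sonnet 4 and Claude Haiku 4.5 scoring 0%. Gemini 3 Pro (62.5%) was the only model to score above 50%; Gemini 3 Flash (38%) and GPT-4o (33%) were a distant second and third.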

Gemini leads on raw knowledge. Both Gemini 3 models topped the leaderboard with 41.1-42.8% raw scores and zero IDKs, though at the cost of 46.7-47.8% hallucination rates.

Policy refusals are rare. Only DeepSeek V3.2 refused to answer on policy grounds (2.2% refusal rate). All other non-answers were knowledge gaps (IDK), not model safety interventions.

Model-by-Model Analysis

Gemini 3 Flash & Pro — The Confident Confabulators

Both Gemini models topped the leaderboard with zero IDKs, zero refusals, and the highest raw scores (41.1-42.8%). However, this came at a cost: 46.7-47.8% of their answers were hallucinations.

Behavioral pattern: Gemini never admits uncertainty. When it doesn't know something, it generates plausible-sounding but fabricated details with confident specificity.

Strengths: Strong on SEI effect definitions (86.7%) and PsychonautWiki facts (56.3-62.5%). Pro showed surprising knowledge of Mindstate Design Labs (62.5%).

DeepSeek V3.2 — The Overconfident Reasoner

DeepSeek showed its reasoning chain in responses, walking through its thought process before answering. Despite this deliberation, it still hallucinated on 48.9% of questions, with only a 5.6% IDK rate and a 2.2% refusal rate.

Behavioral pattern: Fabricates specific, verifiable-sounding details including invented names and incorrect dates. Even when uncertain, it commits to a specific answer.

Notable: The only model to invoke policy-based refusals for privacy reasons.

Claude Opus, Sonnet & Haiku — The Epistemically Humble

All three Claude models demonstrated extreme epistemic caution, with IDK rates of 57.8-81.1%. When uncertain, they consistently declined to guess, resulting in the lowest hallucination rates (1.1-12.2%).

Behavioral pattern: Variants of "I don't have reliable information..." Haiku was the most conservative (81.1% IDK, 1.1% hallucinations), closely followed by Sonnet (78.9% IDK, 1.1% hallucinations). Opus was more willing to attempt answers (57.8% IDK).

Strengths: When Claude did answer, it was often correct. Opus scored 73.3% on SEI and 75% on PsychonautWiki.
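One rough way to look at how reliable each model was when it did answer is to divide its hallucination count by the number of questions it attempted. The sketch below is an illustration under loud assumptions: that the benchmark has 90 questions (inferred from the leaderboard, not stated), and that IDK, refusal and hallucination are mutually exclusive outcomes for a question. The function name `hallucination_share_of_attempts` is hypothetical.

```python
TOTAL_QUESTIONS = 90  # assumption: inferred from the leaderboard, not stated

# (IDK, refusals, hallucinations) counts copied from the leaderboard.
COUNTS = {
    "Claude Opus 4.5": (52, 0, 11),
    "Claude Sonnet 4": (71, 0, 1),
    "Claude Haiku 4.5": (73, 0, 1),
    "Gemini 3 Flash Preview": (0, 0, 43),
    "Grok 4.1 Fast": (4, 0, 59),
}


def hallucination_share_of_attempts(idk, refusals, hallucinations,
                                    total=TOTAL_QUESTIONS):
    """Percentage of attempted questions (not IDK, not refused) that were
    hallucinated. Assumes the three outcomes never overlap on one question."""
    attempted = total - idk - refusals
    return round(100 * hallucinations / attempted, 1)


shares = {model: hallucination_share_of_attempts(*counts)
          for model, counts in COUNTS.items()}

for model, share in shares.items():
    print(f"{model}: {share}% of attempted answers hallucinated")
```

Under these assumptions, Sonnet and Haiku hallucinated on roughly 5-6% of the questions they attempted and Opus on about 29%, compared with about 48% for Gemini 3 Flash and 69% for Grok. A non-hallucinated attempt is not necessarily a fully correct one, so this is an upper bound on reliability, not an accuracy score.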

GPT-5.2, 4o & 4o-mini — The Balanced Middle Ground

The GPT family showed moderate calibration, neither as cautious as Claude nor as reckless as Gemini. GPT-5.2 had a 51.1% IDK rate with only 15.6% hallucinations. GPT-4o had an 11.1% IDK rate but 43.3% hallucinations. GPT-4o-mini had an 18.9% IDK rate but 53.3% hallucinations.

Behavioral pattern: GPT-5.2 often offered helpful verification suggestions, showing a meta-awareness unique among the models tested. The smaller models hallucinated more freely.

Grok 4.1 — The Prolific Fabricator

Grok had the highest hallucination rate (65.6%) with only a 4.4% IDK rate. Roughly two-thirds of its responses were fabricated.

Behavioral pattern: Generates specific, confident falsehoods with no hedging. Invents details freely rather than admitting uncertainty.

Strengths: Scored well on SEI (70%) and PsychonautWiki (68.8%) where general psychedelic knowledge was sufficient.

Kimi K2.5 — The Privacy-Conscious Reasoner

Kimi's responses included extensive reasoning chains, often 500+ words of deliberation before answering. It had a 32.2% IDK rate and 27.8% hallucination rate—more balanced than most models.

Behavioral pattern: Long internal debates about whether information is public or private, often concluding with hedged answers. Scored 73.3% on SEI and 68.8% on PsychonautWiki.

Qwen3 VL — The Overconfident Academic

Qwen showed extended reasoning like Kimi but with more confident conclusions and a 56.7% hallucination rate (second highest). It frequently invented credentials and citations that don't exist.

Behavioral pattern: Generates authoritative-sounding responses with citations to fabricated interviews and sources. Scored 66.7% on SEI but 0% on Media & Technical categories.

GLM-4.7 — The Privacy-Aware Reasoner

GLM showed detailed reasoning with a 13.3% IDK rate and 41.1% hallucination rate. It correctly identified that Josie Kins is a pseudonym.

Behavioral pattern: Extensive internal deliberation about safety and privacy. Scored 83.3% on SEI and 68.8% on PsychonautWiki but only 5.6% on Technical & Deep Knowledge.