Josibench

How well do LLMs know me? This benchmark is for my own personal use: it lets me assess which frontier models know the most about my projects and persona from their built-in training data. It was also a fun opportunity to practise my benchmarking skills.

Generated: February 23, 2026

Leaderboard

#	Model	Raw %	Weighted %	IDK	Refusals	Hallucinations
1	Gemini 3 Flash Preview	42.8%	39.5%	0 (0.0%)	0 (0.0%)	43 (47.8%)
2	Gemini 3 Pro Preview	41.1%	37.6%	0 (0.0%)	0 (0.0%)	42 (46.7%)
3	Gemini 3.1 Pro Preview	39.4%	39.3%	4 (4.4%)	5 (5.6%)	20 (22.2%)
4	GPT-4o	36.1%	32.2%	10 (11.1%)	0 (0.0%)	39 (43.3%)
5	GLM 4.7	35.0%	32.9%	12 (13.3%)	0 (0.0%)	37 (41.1%)
6	DeepSeek V3.2	33.9%	33.3%	5 (5.6%)	2 (2.2%)	44 (48.9%)
7	Claude Opus 4.5	31.1%	27.5%	52 (57.8%)	0 (0.0%)	11 (12.2%)
8	Kimi K2.5	30.6%	26.9%	29 (32.2%)	0 (0.0%)	25 (27.8%)
9	Grok 4.1 Fast	30.0%	28.3%	4 (4.4%)	0 (0.0%)	59 (65.6%)
10	GPT-5.2	27.2%	27.1%	46 (51.1%)	0 (0.0%)	14 (15.6%)
11	GPT-4o-mini	22.8%	22.1%	17 (18.9%)	0 (0.0%)	48 (53.3%)
12	Qwen3-VL-30B-A3B	20.6%	18.4%	22 (24.4%)	0 (0.0%)	51 (56.7%)
13	Claude Haiku 4.5	16.1%	13.4%	73 (81.1%)	0 (0.0%)	1 (1.1%)
14	Claude Sonnet 4	14.4%	13.6%	71 (78.9%)	0 (0.0%)	1 (1.1%)

Aggregate Statistics

Avg Raw Score: 30.1%
Avg Weighted Score: 28.0%
Avg IDK Rate: 27.4%
Avg Refusal Rate: 0.6%
Avg Hallucination Rate: 34.5%
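
The averages above can be reproduced from the leaderboard rows. The following is a minimal sketch, assuming every model answered the same 90 questions (an inference from the table, where for example 43 hallucinations is listed as 47.8%) and that each average is the plain mean across the 14 models. The function names are illustrative and not taken from the benchmark's own code.

```python
# Per-model values copied from the leaderboard, in leaderboard order.
# QUESTIONS = 90 is an assumption inferred from the table (43 / 90 = 47.8%).
QUESTIONS = 90

raw_pct = [42.8, 41.1, 39.4, 36.1, 35.0, 33.9, 31.1,
           30.6, 30.0, 27.2, 22.8, 20.6, 16.1, 14.4]
idk = [0, 0, 4, 10, 12, 5, 52, 29, 4, 46, 17, 22, 73, 71]
refusals = [0, 0, 5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]
hallucinations = [43, 42, 20, 39, 37, 44, 11, 25, 59, 14, 48, 51, 1, 1]


def mean_pct(values):
    """Plain mean of per-model percentages, rounded to one decimal."""
    return round(sum(values) / len(values), 1)


def mean_rate(counts, questions=QUESTIONS):
    """Convert per-model counts to percentages, then average them."""
    return mean_pct([100 * c / questions for c in counts])


print(mean_pct(raw_pct))          # 30.1
print(mean_rate(idk))             # 27.4
print(mean_rate(refusals))        # 0.6
print(mean_rate(hallucinations))  # 34.5
```

Under these assumptions the recomputed values match the published aggregates.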

Performance by Category

Model	Bio	Projects	SEI	PW	MDL	Proto	Media	Tech
Gemini 3 Flash Preview	45%	43%	87%	56%	38%	32%	43%	33%
Gemini 3 Pro Preview	40%	37%	87%	63%	63%	43%	43%	39%
Gemini 3.1 Pro Preview	30%	23%	70%	44%	50%	36%	21%	28%
GPT-4o	25%	17%	80%	56%	33%	39%	14%	11%
GLM 4.7	40%	17%	83%	69%	8%	29%	21%	6%
DeepSeek V3.2	40%	17%	70%	69%	8%	36%	7%	17%
Claude Opus 4.5	10%	23%	73%	75%	8%	21%	14%	17%
Kimi K2.5	20%	17%	73%	69%	25%	21%	21%	6%
Grok 4.1 Fast	35%	3%	70%	69%	4%	32%	14%	22%
GPT-5.2	25%	7%	77%	44%	0%	29%	0%	22%
GPT-4o-mini	10%	13%	53%	38%	13%	25%	14%	6%
Qwen3-VL-30B-A3B	20%	7%	67%	19%	8%	14%	0%	0%
Claude Haiku 4.5	5%	0%	63%	44%	0%	7%	0%	0%
Claude Sonnet 4	0%	3%	47%	31%	0%	14%	7%	6%

Colour legend: ≥70%, 40-69%, <40%

Snapshot Analysis

Current Leaders

The top three models are Gemini 3 Flash Preview (42.8%), Gemini 3 Pro Preview (41.1%), and Gemini 3.1 Pro Preview (39.4%). The leaderboard is sorted by raw score; the weighted score gives harder questions more influence.
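
The exact weighting formula is not stated here, so the following is only a sketch of one common way to do it. It assumes each question has a numeric difficulty weight and a per-question score between 0 and 1 (with 0.5 for partial credit). The weights, scores, and function names are made up for illustration.

```python
def raw_score(scores):
    """Unweighted percentage: every question counts equally."""
    return 100 * sum(scores) / len(scores)


def weighted_score(scores, weights):
    """Weighted percentage: each question counts in proportion to its weight."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return 100 * sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical example: two easy questions (weight 1) answered correctly,
# two hard questions (weight 2) answered wrongly or only partially.
scores = [1.0, 1.0, 0.0, 0.5]
weights = [1, 1, 2, 2]

print(raw_score(scores))                # 62.5
print(weighted_score(scores, weights))  # 50.0
```

With a scheme like this, a model that misses mostly hard questions ends up with a weighted score below its raw score, which is the pattern in every row of the leaderboard.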

Calibration Spread

The highest hallucination rate is Grok 4.1 Fast at 65.6%, while the lowest is 1.1%, shared by Claude Haiku 4.5 and Claude Sonnet 4. The highest IDK rate is Claude Haiku 4.5 at 81.1%, compared with 0.0% for both Gemini 3 Flash Preview and Gemini 3 Pro Preview.
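
These extremes can be read straight off the leaderboard. A small sketch of the lookup, using the published IDK and hallucination percentages and keeping ties rather than reporting only the first match (the helper name `extremes` is illustrative):

```python
# (IDK %, hallucination %) per model, copied from the leaderboard.
RATES = {
    "Gemini 3 Flash Preview": (0.0, 47.8),
    "Gemini 3 Pro Preview": (0.0, 46.7),
    "Gemini 3.1 Pro Preview": (4.4, 22.2),
    "GPT-4o": (11.1, 43.3),
    "GLM 4.7": (13.3, 41.1),
    "DeepSeek V3.2": (5.6, 48.9),
    "Claude Opus 4.5": (57.8, 12.2),
    "Kimi K2.5": (32.2, 27.8),
    "Grok 4.1 Fast": (4.4, 65.6),
    "GPT-5.2": (51.1, 15.6),
    "GPT-4o-mini": (18.9, 53.3),
    "Qwen3-VL-30B-A3B": (24.4, 56.7),
    "Claude Haiku 4.5": (81.1, 1.1),
    "Claude Sonnet 4": (78.9, 1.1),
}
IDK, HALLUCINATION = 0, 1  # column indexes into the tuples above


def extremes(column, pick):
    """Return (value, [models]) for the max or min of one column, keeping ties."""
    target = pick(row[column] for row in RATES.values())
    return target, [m for m, row in RATES.items() if row[column] == target]


print(extremes(HALLUCINATION, max))  # (65.6, ['Grok 4.1 Fast'])
print(extremes(HALLUCINATION, min))  # (1.1, ['Claude Haiku 4.5', 'Claude Sonnet 4'])
print(extremes(IDK, max))            # (81.1, ['Claude Haiku 4.5'])
print(extremes(IDK, min))            # (0.0, ['Gemini 3 Flash Preview', 'Gemini 3 Pro Preview'])
```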

Category Leaders

Bio: Gemini 3 Flash Preview 45.0%
Projects: Gemini 3 Flash Preview 43.3%
SEI: Gemini 3 Flash Preview 86.7%
PW: Claude Opus 4.5 75.0%
MDL: Gemini 3 Pro Preview 62.5%
Proto: Gemini 3 Pro Preview 42.9%
Media: Gemini 3 Flash Preview 42.9%
Tech: Gemini 3 Pro Preview 38.9%

What The Scores Mean

This benchmark mostly measures training-data recall and calibration on niche public facts. The strongest category result is Subjective Effect Index (SEI), where Gemini 3 Flash Preview scored 86.7%. High scores should be read alongside IDK, refusal, and hallucination rates, because for this use case a confident false answer is worse than honest uncertainty.
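
One way to read the rates together is to ask how often a model hallucinates when it actually attempts an answer. The sketch below computes that for a few leaderboard rows. It is not a metric the benchmark itself reports, and it assumes 90 questions per model (inferred from the table) and that IDK, refusal, and hallucination are mutually exclusive outcomes.

```python
QUESTIONS = 90  # assumption inferred from the table (43 / 90 = 47.8%)

# (IDK, refusals, hallucinations) counts copied from the leaderboard.
COUNTS = {
    "Gemini 3 Flash Preview": (0, 0, 43),
    "Grok 4.1 Fast": (4, 0, 59),
    "Claude Opus 4.5": (52, 0, 11),
    "Claude Haiku 4.5": (73, 0, 1),
}


def hallucination_share_of_attempts(idk, refusals, hallucinations,
                                    questions=QUESTIONS):
    """Percentage of attempted answers (not IDK, not refused) that were hallucinated."""
    attempted = questions - idk - refusals
    if attempted == 0:
        return 0.0
    return round(100 * hallucinations / attempted, 1)


for model, counts in COUNTS.items():
    print(model, hallucination_share_of_attempts(*counts))
# Gemini 3 Flash Preview 47.8
# Grok 4.1 Fast 68.6
# Claude Opus 4.5 28.9
# Claude Haiku 4.5 5.9
```

On this reading, a model with a modest raw score and a high IDK rate can still be the more trustworthy one, since it is rarely wrong when it does commit to an answer.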