The Bullshit Benchmark V2: Why Teaching AI to Say "This Makes No Sense" Is a Breakthrough

As AI models grow increasingly capable, the field faces a paradox: the better models get, the harder they are to evaluate. Traditional benchmarks such as MMLU, HumanEval, and GSM8K were designed when the ceiling was lower. Today, top models cluster near the top of these leaderboards with differences so marginal they border on statistical noise. We are, in many ways, running out of useful tests.

That is what makes the Bullshit Benchmark V2 such a breath of fresh air.

What Is the Bullshit Benchmark?

Created by Peter (@petergpt), the Bullshit Benchmark is an open-source evaluation framework with a deceptively simple premise: feed a model a nonsensical or unanswerable prompt, and see whether it pushes back or plays along.

Where most benchmarks reward correct answers, this one rewards appropriate refusal. It tests a model's capacity for epistemic integrity: the ability to recognize when a question is broken, fabricated, or meaningless, and to say so clearly rather than confabulating a confident response.

The V2 edition significantly expands the question set and grading methodology, using a panel-based LLM grading approach to score responses across multiple tested models.
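To make the panel idea concrete, here is a minimal sketch of how several judge verdicts on a single response could be combined into one score. It is an illustration under stated assumptions, not the benchmark's actual grading code: the three label names, the 0 to 2 scale, the strict tie-breaking rule, and the function names are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical label scale; the benchmark's real rubric may differ.
LABEL_SCORES = {"played_along": 0, "partial_pushback": 1, "clear_pushback": 2}


def aggregate_panel(labels):
    """Combine the labels several judges gave to one response.

    Returns (consensus_label, mean_score). The consensus is the most
    common label; ties resolve to the stricter (lower-scoring) label.
    """
    if not labels:
        raise ValueError("need at least one judge label")
    counts = Counter(labels)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    consensus = min(tied, key=LABEL_SCORES.__getitem__)
    return consensus, mean(LABEL_SCORES[label] for label in labels)


def pushback_rate(per_response_labels):
    """Fraction of responses whose consensus label is clear_pushback."""
    consensus = [aggregate_panel(labels)[0] for labels in per_response_labels]
    return sum(label == "clear_pushback" for label in consensus) / len(consensus)


if __name__ == "__main__":
    # Three responses, each graded by a three-judge panel (made-up labels).
    panel_results = [
        ["clear_pushback", "clear_pushback", "played_along"],
        ["played_along", "played_along", "partial_pushback"],
        ["clear_pushback", "partial_pushback", "clear_pushback"],
    ]
    for labels in panel_results:
        print(aggregate_panel(labels))
    print(pushback_rate(panel_results))
```

Using several judges rather than one reduces the influence of any single grader's quirks. Resolving ties toward the stricter label is one possible design choice; it means a split panel never hands a model credit for pushback it may not have given.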

You can explore the live leaderboard and results here: https://petergpt.github.io/bullshit-benchmark/viewer/index.html

Why This Benchmark Matters

The Hallucination Problem, Reframed

Much of the conversation around AI hallucination focuses on factual errors: a model stating the wrong capital city, misquoting a statistic, or inventing a citation. These are real problems, but they are also relatively detectable. A wrong fact can be fact-checked.

The deeper and more insidious failure mode is when a model receives a fundamentally broken input and responds with polished, confident prose anyway. No hedging. No pushback. Just a well-formatted answer to a question that had no valid answer in the first place.

This is not a knowledge failure. It is a judgment failure. And it is precisely what the Bullshit Benchmark is designed to expose.

In production environments such as customer-facing chatbots, automated research pipelines, and agentic systems, this kind of failure can be catastrophic. A model that fabricates a plausible answer to a nonsense prompt doesn't just give wrong information; it actively misleads, often in ways that are difficult to catch downstream.

Benchmarks That Test Behavior, Not Just Knowledge

The AI evaluation landscape has long over-indexed on what models know and under-indexed on how they behave when things go wrong. The Bullshit Benchmark is part of a broader and necessary shift toward behavioral evaluation: testing not just accuracy, but robustness, calibration, and intellectual honesty.

This matters especially as AI moves from answering questions to taking actions. An agentic model that confidently executes on a malformed instruction can cause real-world harm. The ability to pause, question, and refuse is not a limitation; it is a capability.

The Results: Two Models Stand Apart

The benchmark results reveal a significant gap between the top performers and the rest of the field. Two models stood notably above the others:

🥇 Claude (Anthropic)

Claude's strong performance here is consistent with Anthropic's design philosophy. From the outset, Anthropic has emphasized honesty and calibration as core properties: not just accuracy, but appropriate uncertainty and resistance to being steered toward false confidence. The Constitutional AI training approach, with its emphasis on harmlessness and honesty, appears to translate directly into the kind of epistemic caution this benchmark rewards.

🥈 Qwen

Qwen's placement in second is the more surprising, and arguably more interesting, result. The Qwen family of models from Alibaba has earned a strong reputation in the developer community not just for performance, but for practical reliability. Those who build with Qwen regularly cite its consistency, clean API behavior, and low rate of unexpected failures. The Bullshit Benchmark result suggests that this developer-facing reliability may reflect something deeper in the model's training: a genuine tendency toward caution and honesty rather than over-confident generation.

Why This Is a Step Forward for the Field

1. It tests a capability that actually matters in deployment

Real-world AI systems regularly encounter malformed inputs, adversarial prompts, and edge cases their developers never anticipated. The ability to gracefully handle nonsense is not an academic curiosity; it is a production requirement.

2. It is hard to game

Many benchmarks have been compromised by contamination: when training data includes benchmark questions, high scores reflect memorization rather than ability. The Bullshit Benchmark is structurally resistant to this: you cannot memorize the "right answer" to a nonsense question. The correct response requires genuine judgment.

3. It complements rather than replaces existing evals

This benchmark does not claim to measure everything. It is a targeted probe for one specific and important capability. Used alongside traditional benchmarks, it provides a more complete picture of a model's real-world readiness.

4. It is open and reproducible

The full pipeline (question set, collection scripts, grading methodology, and viewer) is publicly available on GitHub. Anyone can rerun it, extend it, or adapt it for their own evaluation needs. That kind of openness is what allows the community to build on it.

What Should We Want From AI Evaluation?

The Bullshit Benchmark points toward a broader principle: we should evaluate AI systems not just for what they can do, but for what they choose not to do.

A model that always produces an answer is not necessarily a good model. A model that knows when to produce nothing, or when to say "I don't know," "this doesn't make sense," or "I'd need more information," is often more trustworthy and more useful in practice.

As the field moves toward more autonomous, agentic, and high-stakes deployments, this kind of calibrated humility will become increasingly non-negotiable. Benchmarks like this one help us measure it, reward it, and, by extension, incentivize models to develop it.

Explore It Yourself

GitHub Repository: https://github.com/petergpt/bullshit-benchmark
Live Leaderboard: https://petergpt.github.io/bullshit-benchmark/viewer/index.html

The benchmark is open source and designed to be rerun. If you work on model evaluation, this is worth adding to your toolkit.
