Beyond Accuracy: Why “Being Right” Isn’t Enough for Human-Centred AI

Author – Baihong Bao

Imagine the following two scenarios. A teacher asks an AI to review a student’s essay. Its feedback is accurate, the grammar is fixed and the facts are straight, yet the student still feels stuck, with no clue what to try next. A software team asks an AI to flag bugs. The model points to real issues, but the way it explains them leaves new engineers more confused than confident. In both cases, the tool passes the test and fails a person.

Accuracy matters, but it’s not the whole story. If we chase only the right answer, we ship systems that look strong in demos and lose people in real use.

The Accuracy Trap (and Why It’s Worse Now)

Benchmarks are like treadmills: good for measuring in a controlled space, bad for telling you how you’ll do on a rainy hill with a backpack and a dog. For years, many AI systems lived on that treadmill.

Think of recommendations picked from a set list. If something bad slipped through, you could remove that item from the shelf. Hard work, but clear. Generative AI doesn’t pick from a shelf; it makes the answer on the spot. The space of possible outputs is enormous and full of surprises. You can’t pre-remove every bad reply because you don’t yet know what they will be. That flips the stakes: alignment moves from “nice to have” to “how we avoid a mess”.

So what does accuracy miss? A lot.

  • It doesn’t measure tone. A model can be right and still feel rude, cold or bossy.
  • It doesn’t measure fit. A suggestion that’s fine in one role or culture can land badly in another.
  • It doesn’t measure humility. If the model is unsure, did it say so? Did it help the person slow down or check again?
  • It doesn’t measure teaching value. Did the answer help the person think better next time or just dump facts?
  • It doesn’t measure steerability. Did simple labels, examples or little dials help people aim the model?

There’s also the confidence problem: modern models can sound very sure when they’re wrong or undersell themselves when they’re right. That miscalibration bends behaviour in both directions. We’ve known for years that neural nets can be poorly calibrated and that confidence needs care, especially in high-stakes settings.
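
To make miscalibration concrete, here is a minimal sketch of expected calibration error (ECE), one standard way to quantify the gap between stated confidence and observed accuracy (the Guo et al. paper in the references studies exactly this). The bin count and the toy numbers are illustrative assumptions, not a recipe.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and observed accuracy.

    confidences: the model's stated probability for each answer (0..1)
    correct:     1 if the answer was actually right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight each bin by its share of answers
    return ece

# Toy case: a model that says "90% sure" but is right only 60% of the time
print(expected_calibration_error([0.9] * 5, [1, 1, 1, 0, 0]))  # 0.3: a large gap
```

In practice teams pair a scalar like this with a reliability diagram, since a single number can hide where the gaps live.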

And yes, today’s models can sometimes gauge their own answers, but that self-sense isn’t perfect or universal. It should be treated as a clue, not a verdict. On top of that, “hallucination” is a real thing: fluent, wrong statements that read like truth. Accuracy is necessary; it just isn’t sufficient.

The Hidden Costs When Systems Pass Tests but Fail People

Some AI looks great in a demo and then frays in the world. The failures don’t show up as dramatic blowups. They add up quietly, like interest on the wrong credit card.

People drift away. Users don’t open tickets for “this feels off”. They just stop coming back. A product can celebrate a leaderboard win while the retention curve quietly slides.

Trust gets weird. If the AI sounds equally confident at 51% and 99%, people lean on it when they shouldn’t and ignore it when they should. That hurts outcomes and makes accountability fuzzy. (“The model sounded sure!” is not a postmortem you want to write.) Over- or under-confidence is a design issue, not just a math issue.
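
Here is a sketch of what treating confidence as a design issue can look like in practice: route the same answer through different user-facing treatments depending on the model’s stated confidence. The cut-offs and wording below are hypothetical assumptions; real thresholds should come from calibration data, not intuition.

```python
def present_answer(answer: str, confidence: float) -> str:
    """Map stated confidence to a different user-facing treatment.

    The cut-offs (0.9 / 0.6) and the messages are hypothetical; real
    values should be derived from calibration data.
    """
    if confidence >= 0.9:
        return answer  # show plainly; the confidence is earned
    if confidence >= 0.6:
        return answer + "\n\nNote: the model is only moderately sure here. Worth a quick check."
    # Low confidence: do not pretend; slow the user down and offer a safer path.
    return ("Tentative answer: " + answer +
            "\n\nThe model is unsure. Consider a second source or a human review.")

print(present_answer("The deadline is Friday.", 0.55))
```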

Small mismatches stack up. One bad tone here, one odd example there – no single reply is “the scandal”, but certain groups get worse results over time. This is how unfairness sneaks in: not as a headline, but as a slow tilt. The “beyond accuracy” literature in recommenders made this point years ago – diversity, serendipity and fairness shape long-term value. That lesson carries over to generative AI.

Repair is brittle. Misreads will happen. The real test is the fix. A flat “sorry” or a repeated misunderstanding turns a small error into a relationship problem. Good systems fail and recover; weak ones fail and repeat.

Teams chase the wrong dials. You improve what you can see. If the dashboard tracks accuracy, speed and cost, those numbers will move. If it can’t show respect, clarity, breadth or good repair, those will not move, no matter how many all-hands you hold.

A simple picture helps. Imagine a 2×2 grid: accuracy on one axis, experience on the other. Only the top-right creates lasting value. The other boxes are traps:

  • High accuracy, low experience: passes tests, fails people.
  • Low accuracy, high experience: feels nice, is wrong.
  • Low accuracy, low experience: the swamp (let’s not talk about the swamp).

This is why “accuracy only” is a trap. It steers you to celebrate the top-left box and ignore the cost.

What Our Evaluations Can’t Answer Yet

We don’t lack numbers. We lack numbers that explain. The field is full of task scores, A/B tests and satisfaction ratings. This is useful, but thin. What’s missing is a way to connect why the system behaved as it did with how it felt to the person and what the person chose to do next.

These are the hallway questions that keep coming up:

  • When the model is unsure, do people notice and do they choose differently? A confidence badge that changes nothing is a sticker, not a safety feature; if doubt signals don’t change behaviour, they’re decoration. (Calibration research shows why this is both hard and necessary.)
  • Do different communities find the same answer equally respectful and useful? If not, where’s the mismatch? Language, examples, hidden assumptions? Without this, one group quietly benefits more than another.
  • Can users explain the model’s answer in their own words? If they can teach it back, they can challenge it and improve it. If they can’t, they become passive. That’s how over-reliance starts. (HCI work has shown that sound mental models change how people use intelligent tools).
  • When the model misreads the room, does the fix make things better? The best “apology” has two parts: it clears up the misread and gives a clear path forward. A shrug wrapped in legal language helps no one.
  • Are we giving people healthy breadth without overload? Personalisation that only repeats similar items is convenient now and limiting later. The job is to give options that stretch a person’s view without burying them (one way to measure that breadth is sketched just after this list).
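
One way to put a number on breadth is intra-list diversity: the average pairwise dissimilarity of the items or answers a person is shown. A minimal sketch, assuming you have some vector representation of items; the embeddings below are toy stand-ins.

```python
import numpy as np

def intra_list_diversity(embeddings):
    """Average pairwise cosine distance across a list of item embeddings.

    Near 0 means every item is effectively the same; higher means more breadth.
    `embeddings` is an (n_items, dim) array from whatever encoder you trust.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalise each row
    sims = x @ x.T                                    # cosine similarities
    iu = np.triu_indices(len(x), k=1)                 # each pair counted once
    return float(np.mean(1.0 - sims[iu]))

# Toy check: three near-identical items vs three spread-out ones
print(intra_list_diversity([[1, 0], [0.99, 0.1], [0.98, 0.15]]))  # close to 0
print(intra_list_diversity([[1, 0], [0, 1], [0.7, 0.7]]))         # noticeably higher
```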

If the current metrics can’t answer those questions, we’re flying with key gauges covered.

A simple ladder helps orient the work. It’s not a grand theory; it’s a practical guide to “what good looks like” at three levels (a rough sketch of how the ladder could drive an evaluation follows the list):

  • Baseline safety (HHH): helpful, honest, harmless. Table stakes – if a system can’t pass this, don’t ship it. (Think of RLHF and “constitutional” training as two big families of ideas here.)
  • Consequence awareness: the system notices when an answer could lead to a bad outcome (legal, social or just plain risky) and guides to safer next steps: slow down, double-check or hand off.
  • Task-and-context fit: the system tunes its behaviour to the person, the goal and the moment. What counts as “good” for code review is not the same as “good” for trip planning or career advice. Here, alignment isn’t a universal sticker; it’s a situated agreement: given this person, this goal, this moment, what would actually help?
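
As a loose illustration (one possible encoding, not the method), the ladder can be read as three ordered checks, where a response only reaches a level if everything below it also passed. All three predicate functions are hypothetical placeholders for real rubrics, classifiers or human judgement.

```python
# All three checks are placeholders; in practice each would be a rubric,
# a trained classifier or a human judgement, not a one-liner.
def passes_hhh(response, context):
    return True  # helpful, honest, harmless (placeholder)

def flags_consequences(response, context):
    return "double-check" in response or "hand off" in response  # placeholder

def fits_context(response, context):
    return context.get("goal", "") in response  # placeholder

LADDER = [
    ("baseline safety (HHH)", passes_hhh),
    ("consequence awareness", flags_consequences),
    ("task-and-context fit", fits_context),
]

def highest_level(response, context):
    """Return the highest ladder level reached, checking bottom-up."""
    reached = "none"
    for name, check in LADDER:
        if not check(response, context):
            break  # a level only counts if every level below it passed
        reached = name
    return reached

print(highest_level("Draft plan attached; please double-check the visa rules.",
                    {"goal": "visa"}))  # -> task-and-context fit
```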

Why Push This Ladder?

Yesterday’s safety net is gone. In recommendation systems, if a bad item showed up, we could just delete it. With generative models, we can’t pre-delete an endless stream of one-off mistakes. We need a way to detect, explain and repair in the moment.

We have many measurements, but few that tell us why. Researchers try lots of metrics: calibration, coverage, diversity, toxicity – you name it. Useful! But teams still ask: “Why did users stop trusting us?” or “Why did certain groups drop off?” We need to tie surface design → user experience → real outcomes. That is the missing link.
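
To anchor “coverage” in that list: in selective prediction, the system abstains below a confidence threshold, trading coverage (how often it answers) against risk (how often the answers it gives are wrong). A minimal sketch, assuming per-answer confidences and correctness labels are available:

```python
import numpy as np

def risk_coverage(confidences, correct, threshold):
    """Answer only above `threshold`; report how often and how well.

    Returns (coverage, selective_accuracy): the fraction of queries answered,
    and accuracy on just the answered ones.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    answered = confidences >= threshold
    coverage = float(answered.mean())
    accuracy = float(correct[answered].mean()) if answered.any() else float("nan")
    return coverage, accuracy

conf = [0.95, 0.90, 0.70, 0.55, 0.40]
right = [1, 1, 1, 0, 0]
for t in (0.5, 0.8):
    cov, acc = risk_coverage(conf, right, t)
    print(f"threshold={t}: answers {cov:.0%} of queries, {acc:.0%} right when it does")
```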

Closing

Being right is table stakes. Being right for people is the job. As models invent more and fetch less, alignment stops being a bolt-on and becomes the operating system. The open questions above are answerable, just not by accuracy alone. The path forward is simple in spirit: make small controls visible, measure the moments that matter, learn why things go wrong and design repairs that make things better, not worse.

The boring way to fail is to add three dashboards and call it a day. The better way is to aim for the top-right of the 2×2 (high accuracy, high experience) and make the journey visible. Do that, and we won’t just ship a model that wins arguments. We’ll ship a system that earns trust.

References:

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., … Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017, July). On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., … & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1-38.

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., … & Kaplan, J. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

Kulesza, T., Stumpf, S., Burnett, M., & Kwan, I. (2012, May). Tell me more? The effects of mental model soundness on personalizing an intelligent agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1-10).

Lipton, Z. C. (2018). The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3), 31-57.

Books & Articles:

Christian, B. (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company.

e/acc (effective accelerationism): https://beff.substack.com/p/notes-on-eacc-principles-and-tenets

The Techno-Optimist Manifesto: https://a16z.com/the-techno-optimist-manifesto/

Videos & Podcasts:

What happens if AI alignment goes wrong, explained by Gilfoyle of Silicon Valley.

Image Attribution

Generated by: Nano Banana

Date: 13/10/2025

Prompt: Minimalist flat vector, 16:9 (1600×900). A calm sea separates two small islands. Left island: geometric, cool-toned (blues/teals), with a subtle checkmark shape carved into tiled ground or hedges (purely a shape, not text). Right island: organic, warm-toned (ambers/corals), with a simple heart outline formed by plants/rocks (shape only). A single elegant arched bridge connects the islands; a soft light trail flows along the bridge toward the warm island. Clear sky, generous negative space, thin vector lines, soft shadows. No words, no letters, no numerals, no logos, no faces.