NOTE

Dunning-Kruger, self-assessment, and software work

#software-engineering #psychology

The Dunning-Kruger effect is not a chart about “beginners who think they are experts.” It is a claim about metacognition (how well you can tell when you are wrong) and how that skill can lag the domain skill you are trying to judge.

“Not only do these people reach erroneous conclusions and make unfortunate choices, but their incompetence robs them of the metacognitive ability to realize it.”

That line is from the abstract of “Unskilled and Unaware of It”, the 1999 paper by Kruger and Dunning in the Journal of Personality and Social Psychology. Across four studies in humor, grammar, and logic, people in the bottom quartile on objective tests grossly overestimated how well they had done. Their scores put them near the bottom of the distribution, while their self-ratings looked average or better. The authors tie the pattern to deficits in metacognitive skill (the capacity to distinguish good answers from bad ones), not to a vague story about “ego.” The same program of studies reported that when competence improved, self-ratings tended to move closer to reality, which matches the metacognitive story more than a fixed “personality type” reading would.

What the pop version gets wrong

You will hear the name attached to almost any mismatch between confidence and truth. That usage drifts away from the evidence in two common ways.

First, the original finding is about self-evaluation against an external standard (test performance), not about “seniors who doubt themselves” or “experts who stay humble.” High performers can misread their edge, and low confidence is not proof of insight.

Second, the studies do not license a sneer at junior people. The pattern is about miscalibration when the skill needed to evaluate work is the same skill needed to produce it. Everyone has blind spots. The paper’s point is sharper and narrower. When you lack the models that separate good work from bad work, you can be wrong and lack reliable internal alarms about being wrong.

Code review, design review, and planning estimates all depend on some mix of domain skill and monitoring. When you are new to a stack, a failure mode is not only mistakes but also weak error signals (failing tests you did not think to write, edge cases you never learned to fear).
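
A toy sketch of that failure mode: a function that passes every test its author thought to write, while the dangerous input is the one they never learned to fear. The function and tests here are invented for illustration.

```python
# Illustrative only: mean() passes all the tests its author wrote.
# The weak error signal is the test that was never written at all.

def mean(xs):
    return sum(xs) / len(xs)

assert mean([1, 2, 3]) == 2   # the cases the author thought of
assert mean([10]) == 10

# mean([])  # ZeroDivisionError -- the edge case never tested
```

Nothing in the author's feedback loop fires until someone with a different model of the domain asks about the empty input.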

When introspection is a thin signal, lean on anchors everyone can read in the artifact and the build. Typical moves look like this:

  • Prefer checklists, acceptance criteria, and runnable checks over “how confident do you feel?”
  • Ask reviewers to point at concrete defects (interfaces, invariants, failure modes) rather than trading general self-ratings.
  • Treat big self-judgments after a short exposure as low-resolution data. They can still matter for morale, but they are weak inputs to staffing or priority calls.
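
The first bullet can be made concrete: turn an acceptance criterion into a runnable check instead of asking for a confidence rating. Everything here is hypothetical; `retry_call` and the criterion “retries stop after 3 attempts” are made up for illustration.

```python
# Hypothetical example: the acceptance criterion "retries stop after
# 3 attempts" expressed as a check the build can run, rather than a
# question about how confident the author feels.

def retry_call(fn, max_attempts=3):
    """Call fn until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise

def test_retries_stop_after_three_attempts():
    calls = []
    def always_fails():
        calls.append(1)
        raise RuntimeError("boom")
    try:
        retry_call(always_fails, max_attempts=3)
    except RuntimeError:
        pass
    assert len(calls) == 3   # the criterion, checked mechanically
```

The check reads the same to everyone, regardless of how well anyone in the room can introspect.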

Interview loops often reward fluent narrative. Fluency can track preparation and anxiety as much as depth. If your loop mixes subjective “culture” reads with thin technical signal, you will sometimes hire confident storytellers and pass over quieter operators. That is a measurement problem for your loop, not a moral diagnosis of the candidate. It is also a Goodhart’s law trap when “how they presented” becomes the score.

Before the room converges, bake structure into the debrief itself:

  • Use explicit rubrics tied to behaviors you can observe in the room (or in a take-home artifact).
  • Separate whether they met the task bar from how smooth the narration felt.
  • Write down what would change your vote before you compare notes, so group shifts do not erase the weakest part of the signal.
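
One way to hold that structure is to write the rubric down as data before the debrief: each item names a behavior someone could observe in the room, and the score never takes “how smooth it felt” as an input. The item names below are invented for illustration.

```python
# A hypothetical debrief rubric as data. Each entry is an observable
# behavior; scoring counts bar items met and nothing else.

RUBRIC = [
    "stated assumptions before coding",
    "wrote at least one failing-case test",
    "explained the main trade-off when asked",
]

def score(observations):
    """observations: dict mapping rubric item -> bool (met / not met)."""
    met = sum(1 for item in RUBRIC if observations.get(item, False))
    return met, len(RUBRIC)

obs = {
    "stated assumptions before coding": True,
    "wrote at least one failing-case test": True,
    "explained the main trade-off when asked": False,
}
met, total = score(obs)   # 2 of 3 bar items met; fluency is not a field
```

Because the rubric is fixed before notes are compared, a persuasive narrator in the debrief cannot retroactively become the score.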

Review is a place where metacognition can improve because feedback is specific and tied to shared standards. The useful move is to make feedback legible (what to change, why it matters, how to check it) and to invite the author to say where they are uncertain. You are building the same discrimination skill the tasks in the 1999 studies measured, just on real systems.

If you keep one line from the research program, keep the narrow one. Incompetence in a domain can include incompetence at spotting your own mistakes in that domain. Build processes that do not require people to self-grade with perfect accuracy, and resist using the label as a substitute for better signals.