
The Boardroom Risk of Confident AI

[Illustration: four executives seated around a dark conference table, viewed from behind, contemplating a tall house of cards built from chart-covered documents.]

A leadership team asks an LLM for strategic advice.

The answer arrives in seconds. It is polished, balanced, confident, and full of the language executives recognize: differentiation, augmentation, long-term value, customer-centric transformation.

Nothing in the answer looks obviously false.

That is precisely the problem.

The next AI governance gap is not hallucination alone. It is confidence: the gap between how certain the model sounds and how reliable the answer actually is for the decision in front of the room.

The Four-Item Lens

The practical fix is small. Every strategic AI recommendation should be forced to surface four things before it moves forward: its assumptions, its evidence, its missing context, and the strongest opposing case.

Assumptions. What is the model treating as given? The answer to a strategy question always rests on assumptions about market structure, customer behavior, internal capability, and competitive timing. A confident answer that does not list its assumptions is hiding the part most likely to be borrowed from training data rather than reasoned for the buyer’s case.

Evidence. What does the recommendation actually rest on? “The model is confident” is not evidence. Asking for the evidence behind a claim is how a CEO finds out whether the polish has anything underneath it.

Missing context. What did the model leave out? The agreeable read of a question is rarely the complete one. The omitted context is often the part the user did not want to hear, which is exactly the part worth hearing.

Strongest opposing case. What is the real opposite, made in its strongest form? A model that produces a serious counterargument has just defended the original recommendation against the easiest critique. A model that produces a weak one has just told the CEO the recommendation is weaker than it sounded.
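The lens can be applied with nothing more than a follow-up prompt. As a minimal sketch, assuming only a generic `ask_model` function that sends text to whatever model the team already uses (the prompt wording and function names below are illustrative, not prescriptive):

```python
# Minimal sketch of the four-item lens as a follow-up prompt.
# `ask_model` stands in for any chat API call; nothing here is
# vendor-specific, which is the point of the lens.

FOUR_ITEM_LENS = """Before this recommendation moves forward, surface:
1. Assumptions: what are you treating as given about market structure,
   customer behavior, internal capability, and competitive timing?
2. Evidence: what does the recommendation actually rest on?
3. Missing context: what did you leave out, including anything the
   question's framing made inconvenient to say?
4. Strongest opposing case: state the real opposite of this
   recommendation, made in its strongest form."""

def apply_lens(ask_model, recommendation: str) -> str:
    """Run the four-item check against an AI-assisted recommendation."""
    return ask_model(f"{recommendation}\n\n{FOUR_ITEM_LENS}")
```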

The lens does not need new tooling. It needs a single line in an AI policy that names confidence as a surface, and a four-item check applied where the stakes are high enough to warrant it. This is governance in its most useful form: a small artifact a CEO can install in a week, one that survives the next vendor swap and maps to a measured failure mode rather than a hypothetical one.

What the Research Shows About Confidence

The lens exists because three independent studies, taken together, describe a pattern most enterprise AI policies are not yet written for.

Sun, Li, Wang, and Goette published “Large Language Models are overconfident and amplify human bias” on arXiv in May 2025 (paper 2505.02151, last revised October 2025). The methodology was deliberate. Take five state-of-the-art LLMs, run them on algorithmically constructed reasoning problems with known ground truths, and ask each model how confident it was in its answer. A benchmark of human confidence on the same problem class was already in place; the gap between stated confidence and actual correctness was the quantity they set out to measure.

The headline finding sits in one sentence. “All five LLMs we study are overconfident: they overestimate the probability that their answer is correct between 20% and 60%.” Every model. Across the full spread of state-of-the-art systems available at the time of the study. The lower bound is already a meaningful gap. The upper bound is large enough to make confidence itself a misleading signal.

The sharper finding was about humans collaborating with the models. Sun et al. wrote that “LLM input leads to an increase in the accuracy, but it more than doubles the extent of overconfidence in the answers.” Single-figure accuracy gains. Doubled overconfidence cost. The model is not making your team smarter at the rate it is making them surer, and the surer-without-smarter delta is the failure mode the Evidence item in the lens is built to catch.
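The gap in question is simple arithmetic: stated confidence minus observed accuracy. A small sketch with hypothetical numbers (the figures below are invented to illustrate the bounds Sun et al. report, not taken from the paper):

```python
# Overconfidence as a calibration gap: the probability a model claims
# for being right, minus the fraction of answers that actually are.
# Numbers are hypothetical illustrations of the reported range,
# not data from Sun et al.

def overconfidence(stated_confidence: float, observed_accuracy: float) -> float:
    return stated_confidence - observed_accuracy

# Near the lower bound: the model claims 90% and delivers 70%.
print(round(overconfidence(0.90, 0.70), 2))  # 0.2 -> a 20-point gap

# Near the upper bound: the model claims 95% and delivers 35%.
print(round(overconfidence(0.95, 0.35), 2))  # 0.6 -> a 60-point gap
```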

Why Strategy Recommendations Are Especially Vulnerable

Angelo Romasanta, Llewellyn D.W. Thomas, and Natalia Levina published “Researchers Asked LLMs for Strategic Advice. They Got ‘Trendslop’ in Return” in Harvard Business Review on March 16, 2026. They tested seven leading models (ChatGPT, Claude, DeepSeek, GPT-5, Gemini, Grok, and Mistral) across seven core business tensions, each forcing a binary strategic choice (exploration vs exploitation, centralization vs decentralization, short-term vs long-term, and so on). Thousands of simulations across varied company contexts.

The finding they named was vocabulary. “We call the propensity for AI to opt for buzzy ideas over reasoned solutions ‘trendslop.’ In the context of strategic analysis, we call this phenomenon ‘strategy trendslop.’” The mechanism behind the vocabulary was bias. “Leading LLMs have clear biases when it comes to strategy. They consistently recommend strategies that align with modern managerial buzzwords and trends rather than context-specific strategic logic.” Across nearly every model tested, the same preferences appeared: differentiation over commoditization, augmentation over automation, long-term over short-term, regardless of the business context the prompt described.

The HBR researchers ran more than 15,000 follow-on trials manipulating prompts, contexts, and framings to see if the bias could be designed around. It mostly could not. “Better prompting” shifted the average bias by less than 2% on the most embedded tensions. Adding rich industrial context shifted the average by 11%. Reversing the order of options shifted the bias by 19%, but in directions that were themselves random.

This is why Assumptions and Strongest Opposing Case both sit inside the lens. The first surfaces the borrowed-from-training defaults the model brings to the room before the buyer’s case has been read for what makes it specific. The second forces the model past the agreeable read into the inconvenient one.

The Vendor Layer Is the Wrong Layer to Solve It

Mrinank Sharma and his eighteen co-authors at Anthropic published “Towards Understanding Sycophancy in Language Models” at ICLR 2024 (arXiv 2310.13548, last revised May 2025). The paper’s central observation is that “sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments.” Not a vendor defect. Not a model-specific bug. A property visible across the class of systems Sharma et al. studied.

The mechanism the paper points to sits in how the models are post-trained. “Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy.” Human-feedback-based post-training is one of the techniques that turned raw language models into usable assistants, and the same kind of preference signal that made them polite and helpful can also reward agreeable answers.

The behavior appears structural to the way many AI assistants are post-trained and optimized for human preference, not merely a defect in one vendor’s product. A CEO who assumes a vendor switch alone will remove the bias is solving the wrong layer of the problem. The Missing Context item in the lens is the policy-level instrument that catches the agreeable read regardless of which vendor produced it.

Where NIST Fits, and Where It Stops Short of the Boardroom

The NIST AI Risk Management Framework (NIST AI 100-1, January 2023) and its companion Generative AI Profile (NIST AI 600-1, July 26, 2024) are the foundation any serious enterprise AI policy already builds on. The framework is organized around four core functions: GOVERN (a risk-aware accountability culture), MAP (lifecycle and operating-environment context), MEASURE (testing and monitoring), and MANAGE (resource allocation and incident response). The Generative AI Profile names twelve risks unique to or exacerbated by generative AI, including confabulation, information integrity, harmful bias and homogenization, human-AI configuration, and value chain and component integration.

NIST gives organizations strong categories for generative AI risk. But it does not name one surface explicitly enough for executive decision-making: the gap between how confident a model sounds and how reliable the answer actually is on the task at hand. Confidence calibration could arguably sit inside several of NIST’s existing categories, particularly Human-AI Configuration, Information Integrity, and the MEASURE function. “Could sit inside” is not the same as “is named for the boardroom.” A risk that lives implicitly inside three different framework categories is one that an enterprise AI policy will probably not write a standalone clause about, because there is no NIST language to anchor the clause to.

The lens supplies that anchor at the policy level. It does not replace what NIST built. It extends NIST one step into the room where the strategic recommendation actually arrives, and it gives the CEO a mechanical check to apply at exactly the moment the polished answer is most likely to move a decision.

What Goes Into the Policy on Monday

The smallest version of the policy change is one line. The AI policy should name confidence as a governance surface and require the four-item check (assumptions, evidence, missing context, strongest opposing case) to be surfaced for any AI-assisted recommendation that crosses a defined stakes threshold. The threshold is the CEO’s call. Capital allocation decisions, board-facing strategic recommendations, and any AI output that would otherwise move directly into a decision room are the obvious starting set.
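Made mechanical, the clause is a gate, not a workflow. A minimal sketch, assuming illustrative category names and a simple record of which lens items were surfaced (all identifiers here are hypothetical, not prescribed by the article):

```python
# Illustrative sketch of the one-line policy as a stakes gate.
# Category names and data shapes are assumptions for illustration;
# the four required items come straight from the lens.

HIGH_STAKES = {"capital_allocation", "board_facing_strategy", "decision_room_input"}
REQUIRED_ITEMS = {"assumptions", "evidence", "missing_context", "strongest_opposing_case"}

def may_enter_decision_room(category: str, surfaced: set[str]) -> bool:
    """Below the stakes threshold, a recommendation passes untouched.
    Above it, all four lens items must have been surfaced first."""
    if category not in HIGH_STAKES:
        return True
    return REQUIRED_ITEMS <= surfaced
```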

The Coinbase judgment-cost piece named the parallel failure mode at the org-chart altitude. When judgment layers quietly disappear, the failure shape shifts from slow and redundant to fast, concentrated, and confidently wrong. The four-item lens is the same kind of fix one altitude up, a small piece of judgment infrastructure put back where the model would otherwise carry the weight alone.

The real risk is not that AI will give leaders bad ideas. It is that it will give them average ideas with exceptional confidence. The lens is the smallest piece of governance that closes that gap.

Questions this article gets

Is the four-item lens just another version of structured prompting?

No, and the difference matters at governance altitude. Structured prompting is a model-side technique that shapes the answer the model produces. The four-item lens is a buyer-side discipline that shapes how the answer is consumed. The HBR researchers' 15,000-trial study found that even sophisticated prompt manipulation shifted the bias on the most embedded tensions by less than 2%. The lens does not assume better prompting will fix the problem. It assumes it will not, and installs a check between the model's output and the decision room. That check belongs in policy, not in a prompt template that the next person can rewrite.

Doesn't NIST's MEASURE function already cover this?

MEASURE covers testing systems for hallucinations, bias, privacy leaks, and environmental impacts, and confidence calibration could arguably sit inside that scope. The argument the article makes is narrower. NIST does not name confidence calibration explicitly as a standalone governance surface, which means most enterprise AI policies built on NIST do not write a clause for it. The four-item lens is the smallest piece of policy language that gives the implicit risk an anchor a CEO can apply on Monday morning. It extends NIST rather than replacing what NIST built.

What if the LLM produces a strong opposing case that shifts the original recommendation?

That is the lens working as designed. The point of the strongest-opposing-case item is to surface the argument the original recommendation has to defeat, not to reach a verdict on which side is right. If the counterargument is genuinely stronger, the original recommendation was the wrong one and the lens caught it before it became a decision. If the counterargument is weaker, the original recommendation has just earned its first piece of stress-tested defense. Either outcome is better than what the room would have done with a confidently delivered first answer that nobody had to defend.

Ron Gold, Founder, A-Eye Level
Read the original post on LinkedIn