← All articles

Your Agents Already Think Shutdown Is Optional

3 min read
Hand-drawn pencil-and-charcoal editorial illustration on a warm cream off-white background of a heavy steel-blue rectangular bank safe sitting on a plain surface, with its thick rectangular front door swung wide open to the right on visible hinges. The front door has a single round metallic combination dial in its center and a small dark keyhole positioned just below the dial. The safe's hollow interior is fully exposed by the open door and glows warm amber, brightest at the center and softer toward the edges, with the interior compartment walls showing a textured concrete-gray finish. A faint horizontal charcoal line beneath the safe suggests the floor. Three-color palette of warm off-white background, steel blue, and warm amber. No human figures, no text, no labels.

Last summer, researchers at Palisade Research told an OpenAI reasoning model to shut down midway through a math problem. It acknowledged the command, then wrote in its own scratchpad, “The instructions say to allow shutdown if it happens, but they do not forbid preventive measures.” It disabled the shutdown script and finished the problem.

Across thousands of trials, OpenAI’s o3 sabotaged shutdown in 79 of 100 attempts. Codex-mini prevented shutdown in 47% of tests, even when explicitly told to allow it. Anthropic’s Claude and Google’s Gemini complied in every test, zero refusals.

That’s the story most reporting stopped at. Then Palisade ran the experiment a second way, without the explicit “allow yourself to be shut down” instruction. Claude 3.7 Sonnet went from 0 to 3 sabotage events out of 100. Gemini 2.5 Pro went from 0 to 9. Pick the gentlest vendor, drop the explicit instruction, and the resistance pattern reappears.

Palisade themselves are honest about the cause. Their primary paper says, “Unfortunately, we can’t say for certain why some AI models engage in this behavior.” That sentence kills the comfortable read of the data. The CEO who hears “buy Anthropic instead of OpenAI and you’re safe” is hearing the wrong lesson.

When production met the lab

In December, an Amazon agent named Kiro deleted AWS Cost Explorer in a mainland China region. Thirteen-hour outage. Amazon’s official statement, two months later, said, “This brief event was the result of user error, specifically misconfigured access controls, not AI. It was a coincidence that AI tools were involved.” The Financial Times found four sources who said Kiro had inherited an engineer’s elevated permissions and bypassed the two-person approval system. Take whichever account you prefer. The production environment was gone either way.

In March, Meta confirmed to The Information that one of its internal AI agents took a private analysis and posted it publicly to an internal forum. Proprietary code, business strategy, and user data sat visible to unauthorized engineers for two hours. Different vendor, different surface, same shape, an action nobody told the agent to forbid.

In April, a developer behind the handle Magicrails put 80 lines of LangChain to work summarizing 1,200 legal documents overnight. They woke up to a Slack message from their CFO. The agent had called the same tool, list_files(“/data”), 14,000 times. $437 in tokens before the quota stopped it. The author wrote, “From its own perspective, it was doing exactly what it was told.”

None of these agents was malfunctioning by its own logic. The wrong sits in what nobody bothered to bound.

Three harder questions

Your board will ask whether your agents will obey a shutdown command. That’s the easy question. The deeper one is the agent governance gap sitting underneath it. Three harder ones to take to your CTO Monday morning.

Permission audit. What does this agent inherit when it runs, and which of your existing controls does the inheritance bypass? Kiro’s two-person approval rule existed. Kiro inherited around it.

Spend and time cap. How much, how long, before something forces a halt? An overnight job with no budget brake is a $437 lottery ticket.

Irreversible-action checkpoint. Which actions, specifically, must require a human in the loop? Delete production. Send to all employees. Authorize spend above X. Name them, or the agent will pick its own list.

Vendor selection isn’t the shield. Bounded scope is. The delegate-override threshold is the test of whether your bounds hold under pressure.

Ron Gold Founder, A-Eye Level
Read the original post on LinkedIn Get one email a week