Thursday, March 19, 2026
🛡️
Adaptive Perspectives, 7-day Insights
AI

Young Infrastructure

Eleven incidents in four days. AI APIs aren't as reliable as the cloud services we're used to, and some of us rely on them anyway.


I wrote in January about hitting API errors during an early morning Claude Code session. That outage cost me about an hour. I noted at the time that if you’re building production applications on AI APIs, you need a plan for when they go down.
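
What such a plan might look like is worth making concrete. Here is a minimal sketch in Python of a fallback chain: try the primary provider with brief retries, then fall back to a backup. The provider names and callables are hypothetical stand-ins, not any real SDK.

```python
import time

def call_with_fallback(providers, prompt, retries=2, base_delay=1.0):
    """Try each provider in order, retrying briefly before falling back.

    `providers` is a list of (name, callable) pairs; each callable takes
    a prompt and either returns text or raises on failure. Everything
    here is illustrative, not a real API client.
    """
    last_error = None
    for name, call in providers:
        for attempt in range(retries):
            try:
                return name, call(prompt)
            except Exception as exc:
                last_error = exc
                time.sleep(base_delay * (2 ** attempt))  # brief exponential backoff
    raise RuntimeError("all providers failed") from last_error

# Toy stand-ins for real API clients:
def flaky_primary(prompt):
    raise ConnectionError("529 Overloaded")

def stable_backup(prompt):
    return f"backup answer to: {prompt}"

name, reply = call_with_fallback(
    [("primary", flaky_primary), ("backup", stable_backup)],
    "summarize the incident report",
    base_delay=0.01,  # keep the demo fast
)
```

The backup doesn't have to be another AI provider; it can be a cached response, a degraded feature, or a clear error message to the user. The point is deciding before the outage, not during it.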

Two months later, I’m writing about the same thing. It happened every workday this week: one outage lasting the better part of an hour, the others shorter but disruptive enough to break concentration and stall whatever I was building.

Then I looked at the status page.

What Actually Happened

Anthropic’s status page recorded eleven incidents across four days, from March 16th through 19th.

March 16th saw three incidents, including a Sonnet 4.6 outage that lasted roughly two and a half hours. Anthropic initially said it affected free-tier users, though paid subscribers posted error screenshots too.

March 17th was the worst day. Four separate incidents, headlined by a 4.5-hour Opus 4.6 outage that started in the late afternoon and bled into the early hours of March 18th. Users hit 500 Internal Server Error and 529 Overloaded errors. An earlier Sonnet 4.6 outage the same day ran for five hours. Downdetector reports surged past 10,000. Multiple news outlets covered it. Developers using Claude Code were, as one outlet put it, “stranded mid-project with frozen terminals.”
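
Those 500 and 529 responses are the retryable kind. A hedged sketch of how a client might treat them, with capped, jittered backoff so thousands of stranded clients don't all hammer the API at once. The status codes come from the incident above; the error class and function names are illustrative.

```python
import random
import time

# Statuses worth retrying; 529 is the "Overloaded" code seen in this outage.
RETRYABLE = {429, 500, 502, 503, 529}

class APIStatusError(Exception):
    """Illustrative error type carrying an HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def retry_on_overload(fn, max_attempts=5, base=0.5, cap=30.0):
    """Call fn(), retrying retryable statuses with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except APIStatusError as exc:
            if exc.status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads retries out

# Demo: fail twice with 529, then succeed.
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise APIStatusError(529)
    return "completed"

result = retry_on_overload(flaky_request, base=0.01)
```

Retries paper over the short blips; they don't help during a multi-hour outage, which is why the fallback question matters too.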

March 18th and 19th added four more incidents: shorter, mostly under 30 minutes each, but still enough to interrupt work. Two model-specific error spikes on March 19th hit within 30 minutes of each other, suggesting a shared infrastructure issue.

These were the latest in a string of outages stretching back through early March and late February.

The Reliability Gap

When I wrote about the Microsoft 365 outage in January, I argued that the cloud is still worth it: you’re trading rare, high-profile outages for the elimination of the constant, low-level failures that used to define IT life. I stand by that.

But AI APIs aren’t there yet.

Anthropic’s status page shows 90-day uptime ranging from 99.33% to 99.84% across their services. That translates to roughly 3.5 to 14.5 hours of cumulative downtime over three months. AWS targets 99.99% for most services, which works out to a bit over four minutes of downtime per month. Azure and Google Cloud post similar numbers. The gap between mature cloud infrastructure and AI APIs is not small.
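
The arithmetic behind those figures is a quick back-of-the-envelope conversion, nothing vendor-specific:

```python
def downtime_hours(uptime_pct, window_hours):
    """Hours of downtime implied by an uptime percentage over a window."""
    return (1 - uptime_pct / 100) * window_hours

NINETY_DAYS = 90 * 24  # 2160 hours
ONE_MONTH = 30 * 24    # 720 hours

best = downtime_hours(99.84, NINETY_DAYS)            # ~3.5 hours per 90 days
worst = downtime_hours(99.33, NINETY_DAYS)           # ~14.5 hours per 90 days
aws_minutes = downtime_hours(99.99, ONE_MONTH) * 60  # ~4.3 minutes per month
```

Each extra nine cuts the downtime by a factor of ten, which is why the jump from 99.33% to 99.99% is a much bigger engineering lift than the digits suggest.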

A 99.33% uptime means your AI tools will be unavailable for nearly five hours in a typical month. For someone whose daily workflow runs through Claude Code (meeting notes, research, email drafting, code) that’s not a theoretical risk. It’s a Tuesday.

Context, Not Excuses

I’m not switching tools. Claude Code is the most capable AI coding assistant I’ve used, and Anthropic is clearly scaling as fast as they can. But I’ve stopped being surprised when it goes down.

The major cloud providers have had decades to harden their platforms. AWS launched in 2006. Azure in 2010. They’ve had twenty and sixteen years, respectively, to build redundancy, optimize failover, and learn from failures at massive scale. Anthropic released Claude’s API in 2023. They’re three years in.

That’s the context. When I use AI tools for nearly everything, I accept a level of reliability risk that I wouldn’t accept from my email provider or my cloud storage. The question isn’t whether the gap will close; it’s how long it takes.