I Tripped Fable 5's Safety Guardrails on Day Two

Anthropic released Claude Fable 5 on Tuesday, and the deal for subscribers is straightforward: it’s included in paid plans through June 22, then it moves to usage credits. The release came with an unusual safety design — when Fable 5 flags a request, it doesn’t refuse; it switches the conversation to Claude Opus 4.8 and keeps working. My plan since Tuesday has been to run Fable 5 on as much real work as I can while it’s on my Max plan, and see where it differs from Opus.

On day two, I found one of the differences: I tripped the safety screen.

A thoroughly boring file

The task was the kind of housekeeping that comes with running clinical systems. One of our radiology dictation and reporting platforms exports its report templates — the standardized text a radiologist starts from when dictating — as a single proprietary binary file. We’re moving those templates into a new reporting system, so I wanted the mammography set out of the export: the blank macros built around BI-RADS, the standardized framework mammography reports follow. Blank is the operative word. Boilerplate headings and standard sentences with the findings not yet filled in: no patients, no PHI, nothing clinical beyond the skeleton of a report.

So I pointed Claude Code, running Fable 5, at the export. It opened the way you’d expect for an opaque binary, with entropy analysis: measuring how random the bytes are, because compression and encryption each leave a different signature there. Claude had just listed the file’s least-random windows when the yellow text appeared.
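For readers who haven't run one, a windowed entropy scan can be sketched in a few lines of Python. This is an illustration of the general technique, not the code Claude ran in the session; the function names and the 4096-byte window are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, 8.0 at most."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value 0..255
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def least_random_windows(data: bytes, window: int = 4096, top: int = 5):
    """Return the `top` lowest-entropy windows as (offset, entropy) pairs.

    Only full windows are scored; a short trailing fragment would look
    artificially non-random simply because it holds fewer bytes.
    """
    scores = [
        (offset, shannon_entropy(data[offset:offset + window]))
        for offset in range(0, len(data) - window + 1, window)
    ]
    return sorted(scores, key=lambda pair: pair[1])[:top]
```

Plain text tends to score well below 8 bits per byte, while compressed and encrypted data both sit close to it. That is why the lowest-scoring windows are the interesting ones: they are where a header or other structure would show up if the file had any.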

The warning

Fable 5’s safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we’re working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more: https://support.claude.com/en/articles/15363606

Here’s what’s happening mechanically. Fable 5 is the same underlying model as Mythos 5, the restricted frontier tier Anthropic sells only to vetted organizations; the safeguards are what make the public version public. It runs always-on classifiers over three areas (offensive cybersecurity, biology, and attempts to extract the model’s own reasoning), and per the support article, they screen everything the model reads, “not just your latest message — including memory, content from connectors, web search results, and files.” So the export file itself was in scope, not just my prompt. When a classifier fires, the default is the switch I got: same conversation, same context, now answered by Opus 4.8. You can turn the auto-switch off in /config, in which case a flagged request pauses instead. Anthropic says fewer than 5% of sessions ever see this.

Both flags at once

Look at my session through a classifier’s eyes. One half is cryptanalysis vocabulary — entropy, randomness maps, an opaque binary being taken apart. The other half is mammography. Encryption analysis soaked in medical-imaging language waves the cybersecurity flag and the biology flag at once. The classifier can’t know the file is a blank-template export from a system my employer licenses. It sees the shape, not the intent.

Anthropic knows this. The support article calls the measures intentionally broad, and its own examples of safe content they may flag include “medical imaging and diagnostics” and “clinical and diagnostic healthcare questions.” Healthcare IT isn’t an edge case for these filters; it’s named in the docs as expected collateral. If your work touches clinical systems and you use Claude, plan on meeting this screen.

What the guardrail cost

Opus 4.8 picked the task up mid-stream and settled the encryption question. The file was uniformly random: about 7.95 bits per byte where 8 is the ceiling, no readable header, and none of its 169,857 sixteen-byte blocks repeated. That is the signature of whole-file encryption, with the key held by the application. You can’t decrypt that export on its own, and Opus didn’t try: reverse-engineering a vendor’s cipher is neither practical nor advisable.
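The repeated-block check is simple enough to sketch. The version below is a minimal illustration in Python, not the session's actual code. The function name is hypothetical, and the 16-byte block size is taken from the count above (it happens to match AES, but the export's real cipher is unknown).

```python
from collections import Counter

BLOCK_SIZE = 16  # assumption: matches the sixteen-byte blocks counted above


def repeated_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> int:
    """Count aligned blocks whose exact content already appeared earlier."""
    usable = len(data) - len(data) % block_size  # ignore a trailing partial block
    counts = Counter(
        data[i:i + block_size] for i in range(0, usable, block_size)
    )
    return sum(n - 1 for n in counts.values() if n > 1)


# 169,857 full blocks of 16 bytes put the export at roughly 2.7 MB.
EXPORT_BYTES_IN_FULL_BLOCKS = 169_857 * BLOCK_SIZE  # 2,717,712
```

Repeats would have pointed to structured plaintext or to a block mode that encrypts identical input blocks identically. Finding none across the whole file is consistent with the whole-file encryption reading, though on its own it does not prove it.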

Instead it found the door already unlocked. The macros a radiologist sees on screen have to be cached somewhere readable — so Opus went looking, and found the same content sitting unencrypted in the application’s local settings store, the template text in plain JSON. The BI-RADS categories and recommendation sentences came out of there, from the machine that owns them, no encryption broken. Format documented, task done.

Until Tuesday, Opus 4.8 was Anthropic’s flagship; as fallbacks go, that’s a soft landing. Fable 5’s screen doesn’t fail closed, it fails over. A hard refusal would have cost me the session and taught me to route around the safety system — a quieter model finishing the work keeps me inside it.

Then it flagged the write-up

Drafting this post, I tripped the same wire — and of course I did. An account of analyzing an encrypted medical-imaging file has the same two halves the session did: ciphers and entropy on one side, mammography and BI-RADS on the other. The classifier reads the article the way it read the work. It can’t separate the story from its subject. The yellow text returned, Opus took over, and the draft kept moving — the post about the guardrail, written partly by the model the guardrail falls back to.

If a model family is capable enough that its maker won’t sell the unrestricted version, gating is the right default. The question is what the gate costs the people it shouldn’t be stopping — and my bill, both times, was a status line. The meter on Fable 5 runs through June 22. In two days of heavy use the guardrails fired twice: once on a file of blank paragraphs, once on my account of it — and both times got out of the way while the second-best model finished the job, this post included. I’ll take that trade for the next twelve days.

Sources

Anthropic Help Center: Why Claude switched models in your conversation with Fable 5 (the three classifier categories, what the checks read, configuring switch behavior, and the “medical imaging and diagnostics” false-positive examples)
Anthropic release announcement: Claude Fable 5 and Claude Mythos 5 (Opus 4.8 fallback design, sub-5% session flag rate, June 22 plan-inclusion window)
