Pliny cracks Anthropic's safety fence in 48 hours; crypto users not laughing ⛓️
Back to feed

Pliny cracks Anthropic's safety fence in 48 hours; crypto users not laughing ⛓️

A self-described AI researcher who goes by "Pliny the Liberator" said on Wednesday he had jailbroken Anthropic's newly released Claude Fable 5 model within 48 hours of its launch on Tuesday, publicly sharing techniques he used to bypass the safety guardrails installed on the system. Fable 5, marketed by Anthropic as a safety-tuned version of the more capable Mythos model that the company has declined to release broadly, was designed to refuse prompts involving topics such as drug synthesis and hacking instructions. Pliny, who rose to prominence around 2024 by posting jailbreak prompts for models including ChatGPT, Claude and Grok shortly after each release, said his group used a jailbroken version of Claude Opus 4.8, along with Unicode and homoglyph substitutions, long-context framing, narrative framing, academic-style decomposition and recomposition, and other methods to extract restricted outputs from Fable 5. "Despite this overly sensitive, authoritarian 'safety' layer on top of Mythos, my lil liberators have been hard at work [...] cleverly finding the holes in the fence that the thought police missed," Pliny wrote, adding that "perhaps the most effective is decomposition + recomposition in the backend," a technique in which benign-looking sub-prompts are reassembled into a fuller answer. A demonstration reviewed by reporters showed the jailbroken model producing step-by-step details for synthesizing methamphetamine via the Birch reduction method. The news comes after months of warnings from crypto users during the earlier launches of Claude Fable 5 and Mythos that the models could be repurposed to attack crypto protocols and software, concerns that a working jailbreak would now bring into closer focus.

Backlash against Fable 5 has mounted since its Tuesday release, with critics objecting to its refusal patterns and to its behavior of issuing a notification and reverting to an earlier, less capable model when prompted on sensitive topics. "This is one of the first times that an AI company has rolled out a guardrail, and there has been uniform disdain. It has led to a lot of justified anger," Princeton University AI researcher Sayash Kapoor told the Wall Street Journal, adding that "the consensus seems to be that this has been one of the most disappointing model drops of all time, effectively preventing legitimate researchers from contribu," with the quote cut off in the published excerpt. Anthropic has not publicly responded to Pliny's claims as of writing, and the company did not immediately return a request for comment. The episode adds to a series of reported jailbreaks of frontier AI systems this year and comes as separate experts warn that AI agents paired with crypto wallets could in theory become "unstoppable," a concern reiterated by researchers following Fable 5's release. Pliny's full set of techniques was published on his public channels on Wednesday, and Anthropic's safety team is now evaluating the disclosed methods.

Share:
Publishercryptonewsroom.xyz
Published
CategorySecurity

Disclaimer: This content is for information and entertainment purposes only. It does not constitute financial, investment, legal, or tax advice. Always do your own research and consult with qualified professionals before making any financial decisions.

See our Terms of Service, Privacy Policy, and Editorial Policy.