AI Is a Black Box. Anthropic Figured Out a Way to Look Inside


Last year, the team started experimenting with a tiny model that uses only a single layer of neurons. (Sophisticated LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that explain features. They ran numerous experiments with no success. “We tried a whole bunch of stuff, and nothing was working. It looked like a bunch of random garbage,” says Tom Henighan, a member of Anthropic’s technical staff. Then a run dubbed “Johnny” (each experiment was assigned a random name) began associating neural patterns with concepts that appeared in its outputs.
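The article doesn’t name the method, but Anthropic’s published interpretability research describes this kind of pattern-finding as dictionary learning with a sparse autoencoder trained on a model’s activations. The sketch below is a minimal, hypothetical Python/PyTorch illustration of that idea; the dimensions, names, and training loop are invented for clarity and are not Anthropic’s code.

```python
# Hypothetical sketch: a sparse autoencoder ("dictionary learner") over the
# activations of a small model. Each learned dictionary entry is a candidate
# "feature"; the sparsity penalty pushes each activation vector to be explained
# by only a few features. All dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> feature strengths
        self.decoder = nn.Linear(d_features, d_model)   # feature strengths -> reconstruction

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))          # nonnegative, hopefully sparse
        recon = self.decoder(feats)
        return recon, feats

def train_step(sae, optimizer, acts, l1_coeff: float = 1e-3):
    recon, feats = sae(acts)
    # Reconstruction error plus an L1 penalty that encourages sparse feature use.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: in practice `acts` would be neuron activations collected from the
# small model; here a random batch stands in.
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)
print(train_step(sae, optimizer, acts))
```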

“Chris looked at it, and he was like, ‘Holy crap. This looks great,’” says Henighan, who was surprised as well. “I looked at it, and was like, ‘Oh, wow, wait, is this working?’”

Suddenly the researchers could identify the features a group of neurons was encoding. They could peer into the black box. Henighan says he recognized the first five features he looked at. One group of neurons signified Russian texts. Another was associated with mathematical functions in the Python computer language. And so on.

Once they confirmed they could identify features in the tiny model, the researchers set about the hairier task of decoding a full-size LLM in the wild. They used Claude Sonnet, the medium-strength version of Anthropic’s three current models. That worked, too. One feature that stuck out to them was associated with the Golden Gate Bridge. They mapped out the set of neurons that, when fired together, indicated that Claude was “thinking” about the massive structure that links San Francisco to Marin County. What’s more, when similar sets of neurons fired, they evoked subjects that were Golden Gate Bridge-adjacent: Alcatraz, California governor Gavin Newsom, and the Hitchcock movie Vertigo, which was set in San Francisco. All told, the team identified millions of features, a kind of Rosetta Stone for decoding Claude’s neural net. Many of the features were safety-related, including “getting close to someone for some ulterior motive,” “discussion of biological warfare,” and “villainous plots to take over the world.”
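One way to picture those “adjacent” concepts is as features whose directions in activation space point in similar ways. The sketch below is a hypothetical illustration: it ranks features by the cosine similarity of their decoder columns to a chosen feature’s column. The indices and stand-in weights are invented; in practice the weights would come from a trained sparse autoencoder like the one sketched earlier.

```python
# Hypothetical sketch: treat each feature's decoder column as a direction in
# activation space and rank other features by cosine similarity to it.
import torch

def nearest_features(decoder_weight: torch.Tensor, feature_idx: int, top_k: int = 5):
    # decoder_weight: (d_model, d_features); column j is feature j's direction.
    dirs = decoder_weight / decoder_weight.norm(dim=0, keepdim=True)
    sims = dirs.T @ dirs[:, feature_idx]      # cosine similarity to the chosen feature
    sims[feature_idx] = float("-inf")         # exclude the feature itself
    return torch.topk(sims, top_k).indices.tolist()

# Stand-in decoder weights; a real run would use sae.decoder.weight.detach()
# from the trained autoencoder and then inspect what text activates the
# returned feature indices.
decoder_weight = torch.randn(512, 4096)
print(nearest_features(decoder_weight, feature_idx=1234))
```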

The Anthropic team then took the next step, to see if they could use that knowledge to change Claude’s behavior. They began manipulating the neural net to amplify or diminish certain concepts, a kind of AI brain surgery with the potential to make LLMs safer and boost their power in selected areas. “Let’s say we have this board of features. We turn on the model, one of them lights up, and we see, ‘Oh, it’s thinking about the Golden Gate Bridge,’” says Shan Carter, an Anthropic scientist on the team. “So now, we’re thinking, what if we put a little dial on all these? And what if we turn that dial?”
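The “dial” Carter describes can be thought of as adding or subtracting a scaled copy of a feature’s direction from the model’s activations at inference time. Below is a hypothetical sketch of that steering idea as a forward hook; the hook point, scale, and names are assumptions, not Anthropic’s implementation.

```python
# Hypothetical sketch: "turning the dial" on a feature by adding a scaled copy
# of its direction to a layer's output during the forward pass.
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        # Positive strength amplifies the concept; negative strength suppresses it.
        return output + strength * unit
    return hook

# Usage sketch (names assumed): `direction` would be one decoder column from a
# trained sparse autoencoder, and `layer` the module whose activations it explains.
# handle = layer.register_forward_hook(make_steering_hook(direction, strength=8.0))
# ... generate text with the feature dialed up (or down, with a negative strength) ...
# handle.remove()
```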

So far, the answer to that question seems to be that it is very important to turn the dial just the right amount. By suppressing these features, Anthropic says, the model can produce safer computer programs and reduce bias. For instance, the team found several features that represented dangerous practices, like unsafe computer code, scam emails, and instructions for making dangerous products.

https://www.wired.com/story/anthropic-black-box-ai-research-neurons-features/