Here’s what’s really going on inside an LLM’s neural network


    Anthropic's conceptual mapping helps explain why LLMs behave the way they do.

    With most computer programs—even complex ones—you can meticulously trace through the code and memory usage to figure out why that program generates any specific behavior or output. That's generally not true in the field of generative AI, where the non-interpretable neural networks underlying these models make it hard for even experts to figure out precisely why they often confabulate information, for instance.

     

Now, new research from Anthropic offers a window into what's going on inside the Claude LLM's "black box." The company's paper on "Extracting Interpretable Features from Claude 3 Sonnet" describes a powerful method for at least partially explaining just how the model's millions of artificial neurons fire to create surprisingly lifelike responses to general queries.

    Opening the hood

    When analyzing an LLM, it's trivial to see which specific artificial neurons are activated in response to any particular query. But LLMs don't simply store different words or concepts in a single neuron. Instead, as Anthropic's researchers explain, "it turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts."

     

    To sort out this one-to-many and many-to-one mess, the researchers use sparse autoencoders to run a "dictionary learning" algorithm across the model's activations. This process highlights which combinations of neurons tend to activate most consistently for the specific words and concepts that appear across various text prompts.
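    In code, that dictionary-learning step can be pictured as a sparse autoencoder: a small network trained to reconstruct the LLM's internal activations through a much wider, mostly-zero "feature" layer. The sketch below is only a minimal illustration of that idea, assuming activations captured from the middle of the model are available as plain vectors; the shapes, hyperparameters, and training loop are placeholders, not Anthropic's actual code.

```python
# Minimal sparse-autoencoder sketch of "dictionary learning" over LLM activations.
# All shapes, hyperparameters, and the training data are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # The encoder maps one activation vector to a much wider, mostly-zero
        # vector of "feature" activations.
        self.encoder = nn.Linear(d_model, n_features)
        # The decoder reconstructs the activation from those sparse features;
        # each decoder column is one learned dictionary direction.
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the model's
    # activations; the L1 penalty pushes most feature activations to zero.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy training loop on random stand-in data; a real run would use activations
# captured from the middle of the LLM as it processes large amounts of text.
sae = SparseAutoencoder(d_model=512, n_features=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(100):
    x = torch.randn(64, 512)              # placeholder for captured activations
    recon, feats = sae(x)
    loss = sae_loss(x, recon, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```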

     

    [Image: The same internal LLM "feature" describes the Golden Gate Bridge in multiple languages and modes. Credit: Anthropic]

     

    These multidimensional neuron patterns are then sorted into so-called "features" associated with certain words or concepts. These features can encompass anything from simple proper nouns like the Golden Gate Bridge to more abstract concepts like programming errors or the addition function in computer code, and they often represent the same concept across multiple languages and communication modes (e.g., text and images).

     

    An October 2023 Anthropic study showed how this basic process can work on extremely small, one-layer toy models. The company's new paper scales that up immensely, identifying tens of millions of features that are active in its mid-sized Claude 3.0 Sonnet model. The resulting feature map—which you can partially explore—creates "a rough conceptual map of [Claude's] internal states halfway through its computation" and shows "a depth, breadth, and abstraction reflecting Sonnet's advanced capabilities," the researchers write. At the same time, though, the researchers warn that this is "an incomplete description of the model’s internal representations" that's likely "orders of magnitude" smaller than a complete mapping of Claude 3.

     

    [Image: A simplified map shows some of the concepts that are "near" the "inner conflict" feature in Anthropic's Claude model. Credit: Anthropic]

     

    Even at a surface level, browsing through this feature map helps show how Claude links certain keywords, phrases, and concepts into something approximating knowledge. A feature labeled "Capitals," for instance, tends to activate strongly not only on the phrase "capital city" but also on specific place names like Riga, Berlin, Azerbaijan, Islamabad, and Montpelier, Vermont, to name just a few.
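    Characterizing a feature like "Capitals" essentially comes down to looking at the text where it fires hardest. As a rough illustration, assuming per-token feature activations have already been computed by an autoencoder like the one sketched above, picking out a feature's top-activating tokens might look like this (the tokens, feature index, and activation values are invented):

```python
# Sketch: characterize a feature by the tokens where it activates most strongly.
# The tokens, feature index, and activation values below are invented.
import numpy as np

def top_activating_tokens(tokens, feature_acts, feature_idx, k=5):
    acts = feature_acts[:, feature_idx]
    order = np.argsort(-acts)[:k]         # indices of the strongest activations
    return [(tokens[i], float(acts[i])) for i in order]

tokens = ["The", "capital", "city", "of", "Latvia", "is", "Riga"]
feature_acts = np.zeros((len(tokens), 100))
feature_acts[[1, 2, 6], 42] = [3.1, 2.8, 4.0]   # hypothetical "Capitals" feature at index 42
print(top_activating_tokens(tokens, feature_acts, feature_idx=42, k=3))
# -> [('Riga', 4.0), ('capital', 3.1), ('city', 2.8)]
```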

     

    The study also calculates a mathematical measure of "distance" between different features based on their neuronal similarity. The resulting "feature neighborhoods" found by this process are "often organized in geometrically related clusters that share a semantic relationship," the researchers write, showing that "the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity." The Golden Gate Bridge feature, for instance, is relatively "close" to features describing "Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo."
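    That notion of "distance" can be pictured as cosine similarity between the dictionary directions the autoencoder learns for each feature. The sketch below is one plausible way to compute such a neighborhood; the feature names and vectors are made up for illustration.

```python
# Sketch: find a feature's nearest neighbors by cosine similarity between the
# dictionary (decoder) directions learned for each feature. Names are made up.
import numpy as np

def nearest_features(decoder_dirs, names, query, k=5):
    unit = decoder_dirs / np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    sims = unit @ unit[names.index(query)]      # cosine similarity to the query
    order = np.argsort(-sims)
    return [(names[i], float(sims[i])) for i in order if names[i] != query][:k]

# Toy dictionary of four features living in a 512-dimensional activation space.
rng = np.random.default_rng(0)
decoder_dirs = rng.normal(size=(4, 512))
names = ["Golden Gate Bridge", "Alcatraz Island", "Capitals", "Programming errors"]
print(nearest_features(decoder_dirs, names, "Golden Gate Bridge", k=3))
```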

     

    [Image: Some of the most important features involved in answering a query about the capital of Kobe Bryant's team's state. Credit: Anthropic]

     

    Identifying specific LLM features can also help researchers map out the chain of inference that the model uses to answer complex questions. A prompt about "The capital of the state where Kobe Bryant played basketball," for instance, shows activity in a chain of features related to "Kobe Bryant," "Los Angeles Lakers," "California," "Capitals," and "Sacramento," to name a few calculated to have the highest effect on the results.
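    The paper's attribution methods are more involved, but one generic way to approximate "highest effect on the results" is to weight each feature's activation by the gradient of the answer's logit with respect to that feature. The toy sketch below illustrates that heuristic only; it is not Anthropic's actual procedure, and the features and weights are invented.

```python
# Toy sketch of a generic attribution heuristic: effect = feature activation
# times the gradient of the answer's logit with respect to that feature.
# This is an illustration only, not Anthropic's attribution method.
import torch

def rank_features_by_effect(feature_acts, answer_logit, names):
    (grads,) = torch.autograd.grad(answer_logit, feature_acts)
    effects = (feature_acts * grads).detach()
    order = torch.argsort(effects, descending=True)
    return [(names[i], float(effects[i])) for i in order]

# Fake "model": the answer logit is just a weighted sum of five feature activations.
names = ["Kobe Bryant", "Los Angeles Lakers", "California", "Capitals", "Sacramento"]
acts = torch.tensor([1.0, 0.8, 0.9, 0.7, 1.2], requires_grad=True)
weights = torch.tensor([0.5, 0.6, 0.9, 0.4, 1.0])
answer_logit = (acts * weights).sum()
print(rank_features_by_effect(acts, answer_logit, names))
```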

    Change your (artificial) mind

    More than just mapping how an LLM stores information, this kind of knowledge of Claude's inner workings can also be used to tweak the model's behaviors in very specific ways. When researchers "clamp" the values of specific features artificially high or low, the Claude model can start exhibiting some extremely strange behaviors. Amplifying the Golden Gate Bridge feature, for instance, led the model to start describing itself by saying, "I am the Golden Gate Bridge... my physical form is the iconic bridge itself..."
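    Mechanically, "clamping" can be thought of as overriding one feature's activation before decoding back into the model's activation space, so the steered vector flows into the rest of the forward pass. The sketch below shows that idea with stand-in shapes and an arbitrary feature index; patching the steered result into a real LLM's forward pass is not shown.

```python
# Sketch of "clamping": override one feature's activation, then decode back into
# the model's activation space. The feature index and clamp value are arbitrary;
# patching the steered vector into a real forward pass is not shown.
import torch
import torch.nn as nn

def clamped_reconstruction(encoder, decoder, activations, feature_idx, clamp_value):
    features = torch.relu(encoder(activations))
    features = features.clone()               # avoid editing the autograd graph in place
    features[..., feature_idx] = clamp_value  # pin, e.g., the "Golden Gate Bridge" feature
    return decoder(features)

# Toy demo with the same shapes as the autoencoder sketch above.
enc = nn.Linear(512, 16384)
dec = nn.Linear(16384, 512)
activations = torch.randn(1, 512)             # stand-in for a captured activation
steered = clamped_reconstruction(enc, dec, activations, feature_idx=4217, clamp_value=10.0)
```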

     

    These kinds of behavior-related results suggest that "the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior."

     

    [Image: An example of how "clamping" certain feature values leads to strikingly different results for certain prompts. Credit: Anthropic]

     

    Sometimes, the links between features and behavior are relatively direct: Clamping the "more hateful; bias-related features" at high levels "causes the model to go on hateful screeds," the researchers write. Other times, the effects are more subtle: turning down a feature associated with "sycophantic praise" makes the model assess the prompter's abilities more accurately, while clamping an "internal conflict" feature to a high value gets the model to stop consistently lying about its ability to "forget" information.

     

    We've seen similar surprising behaviors from LLMs in the past when adversarial users engineer prompts that "jailbreak" certain models to ignore pre-set safeguards or behavior modes. By tweaking the relative values in specific parts of a model's internal feature map, though, an LLM maker could potentially be more proactive about "monitor[ing] AI systems for certain dangerous behaviors... steer[ing] them towards desirable outcomes... or remov[ing] certain dangerous subject matter entirely," the researchers write.
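    A monitoring setup built on such a feature map might be as simple as watching a handful of safety-relevant features and flagging any response where one fires above a threshold. The sketch below is purely speculative; the feature names, indices, and thresholds are invented for illustration.

```python
# Purely speculative sketch of feature-level monitoring: flag a response when any
# safety-relevant feature fires above a threshold. Names, indices, and thresholds
# are invented for illustration.
import numpy as np

WATCHLIST = {
    "scam-email feature": (4031, 5.0),            # (feature index, activation threshold)
    "bioweapon-assistance feature": (9922, 3.0),
    "deception feature": (1204, 4.0),
}

def flag_dangerous_features(feature_acts):
    # feature_acts: per-token feature activations, shape (n_tokens, n_features).
    flags = []
    for name, (idx, threshold) in WATCHLIST.items():
        if feature_acts[:, idx].max() >= threshold:
            flags.append(name)
    return flags

# Toy activations for 10 tokens over 16,384 features.
acts = np.zeros((10, 16384))
acts[3, 4031] = 7.5                               # the "scam-email feature" fires at token 3
print(flag_dangerous_features(acts))              # -> ['scam-email feature']
```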

     

    [Image: An example showing which specific part of an LLM neural network helps prevent it from writing scam emails. Credit: Anthropic]

     

    "The interesting thing is not that these features exist, but that they can be discovered at scale and intervened on," the authors write about this early research. "For example, we might hope to reliably know whether a model is being deceptive or lying to us. Or we might hope to ensure that certain categories of very harmful behavior (e.g. helping to create bioweapons) can reliably be detected and stopped."

     

    This kind of behavior-related model tweaking is still very rough, the researchers warn, and more research is needed to identify any potential downstream effects of altering features for safety reasons. Even at this early stage, though, Anthropic's research provides an exciting framework for making an LLM's "black box" results that much more interpretable and, potentially, controllable.

     
