Interpretability & Mechanistic Analysis

The dynaxx map: tracing feature superposition to latent circuit topology


Introduction: The Problem of Feature Superposition

When we inspect the internal representations of a deep neural network, we often find that individual neurons do not correspond to single, interpretable concepts. Instead, a single neuron may activate for multiple unrelated features—a phenomenon known as feature superposition. This poses a fundamental challenge for interpretability: how can we trace the behavior of a model back to its underlying circuit topology when features are entangled in this way? The dynaxx map methodology provides a structured approach to address this exact problem.

In this guide, we assume you are already familiar with basic interpretability concepts like activation maximization and saliency maps. We will focus on the advanced techniques needed to go from observing superposition to constructing a latent circuit map. The key insight is that superposition is not random; it follows patterns dictated by the model's training objective and architecture. By carefully designing probing experiments and analyzing activation covariance, we can begin to separate overlapping features and trace their causal pathways. This work is essential for model debugging, safety analysis, and understanding how models generalize.

We will cover the core concepts behind feature superposition and latent circuits, provide a step-by-step guide to constructing a dynaxx map, compare three probing methods, discuss real-world applications, and address common questions. By the end, you should be able to apply these techniques to your own models and contribute to the growing field of mechanistic interpretability. Remember that this is an active research area; the methods described here reflect best practices as of April 2026, but always verify critical details against the latest literature.


Core Concepts: Understanding Feature Superposition

Feature superposition occurs when a model represents more features than it has dimensions in its internal activation space. This is a consequence of the model's need to compress information efficiently. For example, a single neuron in a transformer's residual stream might respond to both the concept of 'cat' and the concept of 'car', depending on context. This is not a bug but a feature of how neural networks learn to reuse representational capacity. However, it makes post-hoc interpretability difficult because we cannot simply read off feature detectors from individual neurons.

How Superposition Arises: The Polysemantic Neuron Phenomenon

Neurons that respond to multiple unrelated concepts are called polysemantic. Research has shown that polysemanticity emerges naturally when the number of features in the data exceeds the model's representational capacity. In a typical transformer trained on diverse text, we might find that a single neuron activates for both 'Hollywood' and 'Bollywood' in one context, but also for 'star' in an astronomical sense. These are distinct features that happen to share a neuron because the model could not dedicate separate neurons to each. Understanding this phenomenon is the first step towards mapping circuits because we must recognize where superposition is likely occurring.

Latent Circuit Topology: The Hidden Graph of Computation

A latent circuit is the underlying computational graph that the model uses to perform a specific task. Unlike the explicit architecture (e.g., layers and attention heads), the latent circuit is a subgraph of the model's computation that is dedicated to a particular function. For instance, the circuit for 'detecting subject-verb agreement' in a language model might involve a specific set of attention heads and MLP neurons. Feature superposition complicates circuit discovery because a single neuron might participate in multiple circuits simultaneously. The dynaxx map aims to disentangle these overlapping circuits by analyzing how features combine and interact across layers.

Why Tracing Matters: From Debugging to Safety

Understanding the mapping from features to circuits has practical implications. When a model makes an unexpected error, tracing the feature that caused the error through the circuit can reveal the root cause. For safety, knowing whether a model uses a 'honest' feature or a 'deceptive' shortcut is crucial for alignment. Without addressing superposition, our circuit maps will be incomplete and potentially misleading. The dynaxx map provides a systematic way to handle this complexity, making it a valuable tool for any serious interpretability practitioner.


Methodology Overview: The Dynaxx Map Approach

The dynaxx map is a multi-step process that combines activation analysis, causal intervention, and graph construction. The goal is to produce a directed graph where nodes represent features (or groups of features) and edges represent causal dependencies. The methodology is designed to be model-agnostic, though it works best with transformer-based architectures. We break it down into four main phases: probing, disentangling, circuit extraction, and validation.

Phase 1: Sparse Probing to Identify Feature Directions

We begin by training sparse linear probes on the model's internal activations to identify directions in activation space that correspond to specific features. Unlike dense probes, sparse probes impose an L1 penalty to encourage the probe to use only a few neurons. This helps to isolate features even when they are superposed, because the probe will learn to ignore neurons that are not relevant. For example, to probe for the feature 'plural noun', we collect activations from a dataset of singular and plural nouns, then train a sparse logistic regression on the residual stream at a given layer. The resulting weight vector points in the direction of the feature. We repeat this for many features of interest, creating a library of feature directions.

Phase 2: Disentangling Superposed Features via Covariance Analysis

With a set of feature directions, we analyze the covariance matrix of their activations across a diverse dataset. Features that are superposed will exhibit correlated activations because they share neurons. By applying independent component analysis (ICA) or sparse dictionary learning to the probe activations (whose covariance reveals the mixing), we can separate the original features into independent components. This step is crucial because it allows us to 'unmix' overlapping features, producing a set of disentangled features that correspond more closely to individual concepts. We then refine our probe library to use these disentangled directions.

Phase 3: Constructing the Circuit Graph

With disentangled features in hand, we trace their causal influence through the model's layers using activation patching. We systematically replace the activation of a feature at one layer with a baseline activation and observe the effect on downstream feature activations. If patching feature A at layer L causes a significant change in feature B at a later layer L', we add an edge from A to B. By performing this for all pairs of features across all layers, we build a directed graph. The graph is pruned to remove weak edges, typically using a threshold based on the magnitude of the patching effect. The result is a sparse circuit map that shows the main pathways of feature influence.


Step-by-Step Guide: Building a Dynaxx Map

This section provides a detailed, actionable walkthrough for constructing a dynaxx map on a small transformer model, such as a two-layer attention-only architecture. We assume you have access to the model's activations and a dataset of examples relevant to the features you want to study. The steps are designed to be reproducible and to highlight common pitfalls.

Step 1: Define Your Feature Set

Start by selecting a set of features that you hypothesize are present in the model. For a language model, these could be syntactic features (e.g., subject-verb agreement, negation), semantic features (e.g., animal, vehicle), or task-specific features (e.g., correct answer in a QA dataset). It is important to have at least 10-20 features to get a meaningful graph. For each feature, collect a dataset of inputs where the feature is present and absent. For example, for 'negation', use sentences like 'The cat is not on the mat' (present) vs 'The cat is on the mat' (absent). This dataset will be used to train your probes.
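The step above amounts to building a small contrastive corpus per feature. A minimal sketch in Python (the feature names and sentences are illustrative placeholders, not from any particular benchmark):

```python
# Build a contrastive dataset for each hypothesized feature.
# Feature names and example sentences are illustrative placeholders.
feature_datasets = {
    "negation": {
        "present": ["The cat is not on the mat.", "She did not leave."],
        "absent":  ["The cat is on the mat.", "She left."],
    },
    "plural_noun": {
        "present": ["The dogs bark loudly.", "Cars filled the street."],
        "absent":  ["The dog barks loudly.", "A car blocked the street."],
    },
}

def to_labeled_pairs(datasets):
    """Flatten into (text, feature, label) triples for probe training."""
    rows = []
    for feature, splits in datasets.items():
        for text in splits["present"]:
            rows.append((text, feature, 1))
        for text in splits["absent"]:
            rows.append((text, feature, 0))
    return rows

pairs = to_labeled_pairs(feature_datasets)
```

Keeping the 'present' and 'absent' splits balanced makes probe accuracy directly interpretable against the 50% chance baseline.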

Step 2: Train Sparse Probes

For each feature, train a sparse linear probe on the activations at a specific layer. We recommend using the residual stream after the attention and MLP sublayers, as this is where most feature information is concentrated. Use an L1 regularization strength that yields a weight vector with 5-10 non-zero entries; this sparsity encourages the probe to focus on the most relevant neurons. Evaluate the probe's accuracy on a held-out set to ensure it is capturing the feature. If accuracy is below 80%, consider probing a different layer or relaxing the L1 penalty so the probe can use more neurons. Document the probe weights for later analysis.
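Such a probe can be sketched with scikit-learn's L1-penalized logistic regression. The example below uses synthetic activations with the feature planted in three neurons; the layer width, regularization strength `C`, and neuron indices are all illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 400, 128                      # examples, residual-stream width (illustrative)
X = rng.normal(size=(n, d))          # stand-in for residual-stream activations
y = rng.integers(0, 2, size=n)       # feature present / absent labels
# Plant the feature in a few neurons to mimic a localized representation.
X[:, [3, 17, 42]] += 2.0 * y[:, None]

# L1-penalized probe; smaller C -> stronger sparsity pressure.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(X, y)

direction = probe.coef_.ravel()          # feature direction in activation space
nonzero = np.flatnonzero(direction)      # neurons the probe actually uses
acc = probe.score(X, y)
```

In practice you would sweep `C` until the probe lands in the desired 5-10 non-zero range and report accuracy on held-out data rather than the training set, as the text recommends.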

Step 3: Disentangle Features

Collect the activation vectors for all features across a large dataset (at least 1,000 examples). Compute the covariance matrix of these vectors to confirm which probes covary. Then apply ICA to the probe activations to decompose them into independent components; the number of components should equal the number of features you started with. Each component is a new 'disentangled' feature direction. Project your original probe weights onto these components to get the disentangled probes, and verify that the disentangled probes are less correlated with each other than the originals. This step can be computationally intensive but is essential for accurate circuit mapping.
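On synthetic data, the unmixing step can be sketched as follows: correlated probe readouts are simulated as a linear mixture of independent sources, and scikit-learn's FastICA recovers near-uncorrelated components (the dimensions and source distributions are illustrative):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
n_examples, n_features = 2000, 4

# Simulate superposed probe readouts: independent sources mixed linearly.
sources = rng.laplace(size=(n_examples, n_features))   # ground-truth features
mixing = rng.normal(size=(n_features, n_features))
probe_acts = sources @ mixing.T                        # correlated probe activations

ica = FastICA(n_components=n_features, random_state=0)
disentangled = ica.fit_transform(probe_acts)           # recovered components

# Disentangled components should be far less correlated than the raw probes.
raw_corr = np.corrcoef(probe_acts, rowvar=False)
new_corr = np.corrcoef(disentangled, rowvar=False)
off = lambda c: np.abs(c - np.diag(np.diag(c))).max()  # max off-diagonal |corr|
```
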

Step 4: Perform Activation Patching

Now, we need to trace causal connections. For each pair of features (A, B) where A appears at layer L and B appears at layer L' > L, we perform activation patching. Run the model on an input where feature A is active. At layer L, replace the activation of feature A with its mean activation across the dataset (i.e., 'ablate' the feature). Then measure the activation of feature B at layer L'. If the activation of B changes significantly compared to a baseline run, we infer that A causally influences B. Use a statistical test (e.g., t-test) to determine significance. Repeat this for all pairs and layers. This is the most time-consuming step; it may require thousands of forward passes.
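The per-pair significance test can be sketched like this, with the model's activations replaced by simulated baseline and patched readings of feature B (the means, spreads, and alpha level are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def patching_effect(baseline_b, patched_b, alpha=0.01):
    """Compare downstream feature-B activations with and without ablating A.

    baseline_b: B's activation at layer L' on normal runs.
    patched_b:  B's activation when A was mean-ablated at layer L.
    Returns (effect size in baseline-std units, significant?).
    """
    t_stat, p_value = ttest_ind(baseline_b, patched_b, equal_var=False)
    effect = (baseline_b.mean() - patched_b.mean()) / baseline_b.std()
    return effect, p_value < alpha

# Simulated readings where ablating A drops B's mean (an A -> B edge).
baseline = rng.normal(loc=1.0, scale=0.2, size=200)
patched = rng.normal(loc=0.6, scale=0.2, size=200)
effect, significant = patching_effect(baseline, patched)
```
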

Step 5: Construct and Validate the Graph

From the patching results, build a directed graph where nodes are features and edges are significant causal influences. Threshold the edge weights to remove noise—a common choice is to keep only edges where the patching effect size is greater than 0.1 (in units of standard deviation). Validate the graph by checking that it reproduces known behaviors. For example, if you have a feature for 'subject' and a feature for 'verb agreement', there should be an edge from the subject feature to the agreement feature. If not, you may need to refine your probes or feature definitions. Finally, visualize the graph using a tool like Graphviz to inspect the topology.
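The graph-construction step reduces to thresholding a table of patching effects. A minimal sketch in plain Python, with hypothetical feature names and effect sizes:

```python
def build_circuit_graph(patch_results, threshold=0.1):
    """patch_results: {(src, dst): effect size in baseline-std units}.
    Keep only edges whose |effect| clears the noise threshold."""
    graph = {}
    for (src, dst), effect in patch_results.items():
        if abs(effect) > threshold:
            graph.setdefault(src, []).append((dst, round(effect, 2)))
    return graph

# Illustrative patching effects (feature names are hypothetical).
results = {
    ("subject", "subject_number"): 0.82,
    ("subject_number", "verb_agreement"): 0.45,
    ("present_tense", "verb_agreement"): 0.03,   # below threshold: pruned
    ("negation", "verb_agreement"): -0.15,
}
graph = build_circuit_graph(results)
```

The resulting adjacency map can then be exported to DOT format for the Graphviz visualization mentioned above.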


Comparison of Probing Methods

Choosing the right probing method is critical for the success of the dynaxx map. Different methods offer trade-offs between accuracy, computational cost, and interpretability. We compare three common approaches: sparse linear probes, probing with activation clustering, and probing with trained autoencoders. Each has its own strengths and weaknesses depending on the complexity of the feature superposition.

1. Sparse Linear Probes

As described earlier, sparse linear probes use L1 regularization to select a small set of neurons. They are fast to train and easy to interpret, as the weight vector directly indicates which neurons are used. However, they may fail when a feature is distributed: if the feature is encoded as a dense linear combination of dozens of neurons, an aggressively sparse probe may capture only a fraction of the signal or miss it entirely. Sparse probes work best when features are relatively localized, which is often the case in early layers of a transformer. They are our default recommendation for initial experiments.

2. Probing with Activation Clustering

This method involves clustering the activation vectors of neurons across a dataset to identify groups of neurons that co-activate. Features are then assigned to clusters based on which cluster's centroid best predicts the feature. This approach is useful when features are highly superposed and sparse probes fail. For instance, if a feature is encoded by a distributed pattern across 50 neurons, clustering can capture the pattern that sparse regression might miss. However, clustering requires careful choice of the number of clusters and can be sensitive to the clustering algorithm (e.g., k-means vs spectral clustering). It is also more computationally expensive than sparse probes.
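A sketch of the idea with k-means on synthetic activations: two groups of neurons are driven by two latent sources, and clustering neurons by their activation profiles across examples recovers the groups (the sizes and noise scale are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_examples, n_neurons = 500, 60

# Simulate two groups of co-activating neurons plus noise.
drive = rng.normal(size=(n_examples, 2))
acts = rng.normal(scale=0.1, size=(n_examples, n_neurons))
acts[:, :30] += drive[:, [0]]      # neurons 0-29 co-activate with source 0
acts[:, 30:] += drive[:, [1]]      # neurons 30-59 co-activate with source 1

# Cluster neurons (not examples) by their activation profiles.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(acts.T)
labels = km.labels_
```
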

3. Probing with Trained Autoencoders

Autoencoders, particularly sparse autoencoders, can learn a compressed representation of the activation space that separates features. The encoder's hidden layer can be interpreted as a set of feature detectors. This method is powerful because it can handle complex superposition patterns without manual feature engineering. However, training autoencoders is data-intensive and requires hyperparameter tuning (e.g., sparsity penalty, bottleneck size). The resulting features may also be harder to interpret because the autoencoder's weights are not directly tied to original neurons. We recommend autoencoders for advanced users who have large datasets and computational resources.
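To make the mechanism concrete, here is a toy sparse autoencoder in plain NumPy trained on synthetic superposed activations (more ground-truth directions than activation dimensions). All sizes, the learning rate, and the L1 penalty are arbitrary; a real experiment would use a deep-learning framework, far more data, and careful tuning:

```python
import numpy as np

rng = np.random.default_rng(4)
d_act, d_dict, n = 20, 64, 2000    # activation width, dictionary size, examples

# Synthetic activations: sparse combinations of 64 ground-truth directions,
# i.e. more features than dimensions (superposition by construction).
dictionary = rng.normal(size=(d_dict, d_act))
codes = rng.random((n, d_dict)) * (rng.random((n, d_dict)) < 0.05)
X = codes @ dictionary

W_enc = rng.normal(scale=0.1, size=(d_act, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_act))
lr, l1 = 1e-2, 1e-3
losses = []

for step in range(300):
    h = np.maximum(X @ W_enc, 0.0)     # ReLU encoder: candidate feature detectors
    X_hat = h @ W_dec                  # linear decoder
    err = X_hat - X
    losses.append(float(np.mean(err ** 2)))
    # Gradients of reconstruction MSE plus an L1 sparsity penalty on h
    # (constant factors absorbed into the learning rate).
    g_h = (err @ W_dec.T + l1 * np.sign(h)) * (h > 0)
    W_dec -= lr * (h.T @ err) / n
    W_enc -= lr * (X.T @ g_h) / n
```

After training, each row of `W_dec` can be read as a candidate feature direction, and `h` gives the per-example feature activations.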

Comparison Table

Method                | Pros                             | Cons                           | Best Use Case
----------------------|----------------------------------|--------------------------------|---------------------------------
Sparse Linear Probe   | Fast, interpretable, low compute | May miss distributed features  | Early layers, localized features
Activation Clustering | Captures distributed patterns    | Sensitive to hyperparameters   | Highly superposed features
Autoencoder           | Handles complex superposition    | Data-hungry, hard to interpret | Large datasets, advanced users

In practice, we often start with sparse probes and switch to clustering or autoencoders if we suspect the probes are missing features. The choice also depends on the layer depth: deeper layers tend to have more distributed representations, making clustering or autoencoders more appropriate.


Real-World Application: Debugging a Transformer's Syntax Circuit

To illustrate the dynaxx map in action, consider a composite scenario based on experiences from several teams. A research group was investigating a transformer model that sometimes produced grammatically incorrect outputs, such as subject-verb disagreement. They hypothesized that the model's syntax circuit was flawed. Using the dynaxx map methodology, they traced the issue to a specific feature superposition problem.

Initial Observations and Hypothesis

The model was a 6-layer transformer trained on a large corpus of English text. On simple sentences like 'The dog run fast' (incorrect) vs 'The dog runs fast' (correct), the model occasionally output the wrong verb form. The team suspected that the feature for 'singular subject' was being conflated with the feature for 'present tense' in the residual stream. This is a classic case of feature superposition: two distinct syntactic features were being represented by overlapping sets of neurons.

Applying the Dynaxx Map

The team first trained sparse probes for 'singular subject' and 'present tense' using activations from layer 4 (the middle layer). The probes achieved 85% accuracy but were highly correlated (r=0.7), confirming superposition. They then applied ICA to the covariance matrix of probe activations, which produced two independent components. Projecting the probes onto these components yielded disentangled features with only 0.1 correlation. Next, they performed activation patching: ablating the singular subject feature at layer 4 reduced the activation of the verb agreement feature at layer 6 by 30%, while ablating the present tense feature had no effect. This indicated that the singular subject feature was the primary causal driver of verb agreement, but its influence was weakened by superposition with tense.

Outcome and Fix

By removing the superposition effect (essentially amplifying the singular subject signal and suppressing the tense signal in the circuit), the team was able to create a patch that improved subject-verb agreement accuracy from 92% to 97% on a held-out test set. The dynaxx map revealed that the original circuit was using the superposed representation, which sometimes caused the model to rely on the wrong feature. This case demonstrates how tracing feature superposition to latent circuit topology can directly lead to model improvements.

In another composite scenario, a team working on a question-answering model found that the feature for 'entity type' (e.g., person vs location) was superposed with the feature for 'answer type' (e.g., yes/no vs factual). Using the dynaxx map, they discovered that the circuit for factual answers was inadvertently using the entity type feature, causing the model to answer 'What is the capital of France?' with 'Paris' (correct) but also sometimes with 'France' (incorrect) when the entity type feature was dominant. By adjusting the circuit weights, they reduced this error by 15%.


Common Challenges and Pitfalls

While the dynaxx map is a powerful tool, practitioners frequently encounter several challenges. Being aware of these pitfalls can save time and improve the quality of your circuit maps. We discuss the most common issues and how to address them.

Pitfall 1: Incomplete Feature Coverage

If you only probe for a handful of features, you may miss important interactions. The circuit map will be sparse but potentially misleading because you are ignoring features that mediate between your probed features. For example, if you probe for 'noun' and 'verb' but not 'subject', you might incorrectly conclude that nouns directly influence verbs, when in fact the subject feature is the intermediary. To avoid this, we recommend probing for at least 20-30 features that cover the main semantic and syntactic categories relevant to your task. You can use an unsupervised method like clustering to discover features you haven't thought of.

Pitfall 2: Confusing Correlation with Causation

Activation patching is designed to measure causal influence, but it is easy to misinterpret results if you don't control for confounding features. For instance, if feature A and feature B are both caused by a third feature C, patching A might still affect B because A and B are correlated, not because A causes B. To mitigate this, always include a control condition where you patch a random feature or a feature known to be unrelated. Also, use double-patching experiments where you ablate multiple features simultaneously to isolate direct effects. Statistical tests like permutation tests can help establish significance.
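A permutation test for a patching effect can be sketched as follows; the two simulated samples stand in for baseline and patched readings of a downstream feature (the sample sizes and distributions are illustrative):

```python
import numpy as np

def permutation_test(baseline, patched, n_perm=2000, rng=None):
    """Permutation test for a difference in mean downstream activation.
    Returns the fraction of label shuffles producing a gap at least as
    large as the observed one (an empirical p-value)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    observed = abs(baseline.mean() - patched.mean())
    pooled = np.concatenate([baseline, patched])
    n = len(baseline)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled[:n].mean() - pooled[n:].mean()) >= observed:
            count += 1
    return count / n_perm

rng = np.random.default_rng(5)
# A real ablation effect vs. a null comparison with no effect.
real_effect = permutation_test(rng.normal(1.0, 0.2, 100), rng.normal(0.7, 0.2, 100))
no_effect = permutation_test(rng.normal(1.0, 0.2, 100), rng.normal(1.0, 0.2, 100))
```
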

Pitfall 3: Layer Mismatch and Activation Alignment

Features may not be perfectly aligned with the layer boundaries you choose. An important feature might be computed across two adjacent layers, and probing at a single layer might capture only part of it. Moreover, the activation space at different layers has different orientations, making it difficult to compare feature directions across layers. To handle this, we recommend probing at multiple layers (e.g., every other layer) and then aligning the feature directions using a technique like canonical correlation analysis (CCA). This allows you to track how a feature transforms from layer to layer.

Pitfall 4: Computational Cost

Activation patching for all pairs of features across all layers can be prohibitively expensive for large models. For a 12-layer model with 30 features, you might need 30*30*12 = 10,800 patching experiments, each requiring a forward pass. To reduce cost, you can limit the analysis to a subset of layers that are known to be important (e.g., the last few layers) or use a greedy algorithm to prune edges early. Some teams use a random sampling of patches and interpolate the results. Always start with a small pilot study to gauge the cost before scaling up.


Interpreting the Dynaxx Map: From Graph to Insight

Once you have constructed the circuit graph, the next step is to interpret its topology. The structure of the graph reveals how the model organizes features into functional modules. We discuss several patterns you might find and what they mean for model behavior.

Pattern 1: Feedforward Chains

A simple pattern is a linear chain of features, where feature A influences B, which influences C, and so on. This indicates a sequential computation, such as parsing a sentence from syntax to semantics. For example, you might see: 'subject' → 'subject number' → 'verb agreement'. If the chain is broken (missing edge), it suggests a weakness in the model's reasoning. Chains are easy to interpret and often correspond to intuitive steps. We can validate them by checking if the model's output changes when we ablate the first feature in the chain.

Pattern 2: Convergent and Divergent Hubs

Some features (hubs) receive input from many other features (convergent) or send output to many features (divergent). A convergent hub might represent an abstract concept that integrates multiple lower-level features. For instance, a 'sentiment' feature might receive input from 'positive word', 'negation', and 'intensifier'. A divergent hub might be a feature that modulates many downstream features, like a 'task switch' feature. Identifying hubs can help you understand which features are critical for the model's overall computation. You can test the importance of a hub by ablating it and measuring the impact on performance across multiple tasks.
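Hub detection is just degree counting on the circuit graph. A minimal sketch with hypothetical feature names:

```python
def find_hubs(edges, k=2):
    """edges: iterable of (src, dst) pairs. Flag convergent features
    (in-degree >= k) and divergent features (out-degree >= k)."""
    in_deg, out_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    convergent = [f for f, d in in_deg.items() if d >= k]
    divergent = [f for f, d in out_deg.items() if d >= k]
    return convergent, divergent

# Illustrative edges: 'sentiment' integrates several inputs, while
# 'task_switch' modulates several downstream features.
edges = [("positive_word", "sentiment"), ("negation", "sentiment"),
         ("intensifier", "sentiment"), ("task_switch", "tense"),
         ("task_switch", "formality"), ("task_switch", "sentiment")]
convergent, divergent = find_hubs(edges)
```
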

Pattern 3: Cycles and Feedback Loops

Cycles in the graph indicate recurrent processing, where a feature can influence itself indirectly through other features. This is common in models with residual connections, where information from early layers can be re-injected later. For example, a 'coreference resolution' feature might receive input from an earlier 'entity mention' feature and then feed back into the same 'entity mention' feature in a later layer. Cycles can make the model's computation more robust but also harder to analyze. When you find a cycle, try to determine the minimal set of features that can break it by ablating each in turn.
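Cycles can be found with a standard depth-first search over the circuit's adjacency map. A minimal sketch (feature names are illustrative); the returned path repeats the starting feature at the end:

```python
def find_cycle(graph):
    """Return one cycle (as a list of features) in a {src: [dst, ...]}
    adjacency map, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph.get(node, []):
            if color.get(nxt, WHITE) == GRAY:     # back edge: cycle found
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None

# Entity-mention feedback loop through coreference (names illustrative).
circuit = {
    "entity_mention": ["coreference"],
    "coreference": ["entity_mention", "answer"],
    "answer": [],
}
cycle = find_cycle(circuit)
```

Ablating each feature on the returned path in turn, as suggested above, identifies the minimal set needed to break the loop.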
