How Steering Works
A comprehensive guide to understanding SAE-based model steering from the mathematics to implementation.
Overview
Steering allows you to control model behavior by manipulating Sparse Autoencoder (SAE) features during inference. When you drag a slider in the interface to steer on a feature, you're modifying high-level semantic concepts while the SAE decoder handles the low-level tensor operations to achieve that change.
Key Concepts
- Feature
- A learned representation in the SAE that corresponds to a specific concept, pattern, or behavior (e.g., "pirate speech", "politeness", "technical jargon"). Features are interpretable units that the model uses internally.
- Feature Index
- The position of a feature in the SAE (0 to 131,071 for a 131k-feature SAE). This is the identifier you use in the API to specify which feature to steer.
- Activation
- The strength at which a feature fires when processing text. Higher activation values mean the feature is more present in that context. Natural activations are typically between 0 and 10, with most features at 0 (inactive).
- Similarity
- A score (0 to 1) indicating how semantically related a feature is to a search query. Used when searching for features by description. Higher similarity means the feature better matches your search.
- Steering Strength
- The amount to add to (or subtract from) a feature's natural activation. Positive values amplify a concept, negative values suppress it. The UI allows -1 to +1, but the API accepts any value for stronger effects.
- Intervention
- The act of modifying a feature's activation during model inference. An intervention consists of a feature index, steering strength, and mode (add or clamp).
- SAE (Sparse Autoencoder)
- A neural network trained to decompose the model's internal representations into interpretable features. It encodes hidden states into sparse feature activations, then decodes them back to reconstruct the original representation.
- Hidden States
- The internal vector representations at each layer of the model. For Llama 3.1 8B, these are 8,192-dimensional vectors that encode the model's understanding of each token in context.
Running Example
Concrete Example
Throughout this guide, we'll follow a concrete example:
- Feature Index: 99 - "Pirate speech patterns and vocabulary"
- Steering: Add +0.2 (natural 0.1 → 0.3)
- Model: Llama 3.1 8B Instruct
- Prompt: "Tell me about the ocean"
The Complete Flow
1. Send API Request
When you drag the slider for feature 99 to a steering strength of 0.2, the web interface sends:
POST /v1/chat/completions
{
"messages": [{"role": "user", "content": "Tell me about the ocean"}],
"interventions": [
{
"index_in_sae": 99,
"strength": 0.2
}
]
}
2. Process Through Steering Layer
During the forward pass at layer 19 (the steering layer for Llama 3.1 8B), the following operations transform the activations:
Extract Hidden States for Input Sequence
Get the hidden states tensor h for the input sequence from layer 19 of the Llama model:
h.shape = (38, 8192)
- 38 tokens in the prompt "Tell me about the ocean"
- 8192 dimensions per token (model hidden size)
What are hidden states? Each token in the prompt is represented as a vector of 8192 numbers. These vectors encode the model's internal understanding of each word in context - capturing meaning, syntax, and relationships to other words. Think of them as the model's "thoughts" about each token at this layer.
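One common way to capture hidden states at a particular layer is a forward hook. The toy PyTorch sketch below shows the mechanism only; the stacked linear layers, sizes, and layer index are illustrative stand-ins, not the Llama architecture:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy "model": 3 linear layers of size 16 stand in for the transformer stack.
layers = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16), nn.Linear(16, 16))

captured = {}

def hook(module, inputs, output):
    # Save the hidden states leaving this layer: shape (seq_len, hidden_size).
    captured["h"] = output.detach()

# Hook the middle layer (analogous to layer 19 in the real model).
handle = layers[1].register_forward_hook(hook)

tokens = torch.randn(5, 16)     # 5 "tokens", each a 16-dimensional vector
_ = layers(tokens)
handle.remove()

print(captured["h"].shape)      # torch.Size([5, 16])
```

In the real system the hook would fire at layer 19 and hand a (38, 8192) tensor to the SAE.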
Encode to Feature Space
Apply the SAE encoder to map hidden states to sparse features. The formula is written for a single token's 8192-dimensional hidden vector; during inference it is applied to every row of h:
features = ReLU(Wenc @ h + benc)
Where:
- Wenc is the SAE encoder weight matrix (131072, 8192) for 16× expansion
- benc is the SAE encoder bias vector
- @ means matrix multiplication
- ReLU is the activation function (explained below)
What is ReLU? ReLU (Rectified Linear Unit) is a simple function:
ReLU(x) = max(0, x)
It keeps positive values unchanged and zeros out negative values. This creates sparsity - most features end up being exactly 0, with only a few active (positive) at any given time.
features.shape = (38, 131072)
At token position 15 (the word "ocean"), feature 99 naturally activates to 0.1:
features[15, 99] = 0.1
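The encode step can be sketched in NumPy at toy dimensions (8-dimensional hidden states and 32 features standing in for 8192 and 131,072; the weights are random stand-ins, not a trained SAE):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: the real SAE maps 8192 -> 131072; these are stand-ins.
seq_len, d_model, n_features = 4, 8, 32

h = rng.normal(size=(seq_len, d_model))            # hidden states, one row per token
W_enc = 0.1 * rng.normal(size=(n_features, d_model))
b_enc = np.full(n_features, -0.5)                  # negative bias pushes most features to 0

# ReLU(Wenc @ h + benc), applied to each token's row of h.
features = np.maximum(0.0, h @ W_enc.T + b_enc)

print(features.shape)               # (4, 32)
print((features == 0.0).mean())     # most entries are exactly zero (sparsity)
```

The negative encoder bias illustrates why most features sit at exactly 0: any pre-activation below the bias threshold is zeroed by the ReLU.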
Decode Features Back to Hidden States
Before applying steering, decode the features to see what the SAE can reconstruct:
reconstructed = Wdec @ features
Where:
- Wdec is the SAE decoder weight matrix (8192, 131072)
What is the decoder matrix? Wdec is part of the SAE (Sparse Autoencoder), not the base LLM. Each column represents a learned "direction" in the 8192-dimensional hidden space. For example, column 99 represents the direction for "pirate speech." The decoder translates sparse feature activations back into the dense hidden state representation that the LLM uses.
reconstructed.shape = (38, 8192)
Calculate Reconstruction Error
Compute what the SAE failed to capture:
error = h - reconstructed
This error contains information the SAE cannot represent. We'll add it back later to preserve all information.
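The decode and error steps follow directly; this toy NumPy sketch (random stand-in weights, small dimensions) confirms that adding the error back recovers h exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_features = 4, 8, 32           # toy stand-ins for 38/8192/131072

h = rng.normal(size=(seq_len, d_model))
W_enc = 0.1 * rng.normal(size=(n_features, d_model))
b_enc = np.full(n_features, -0.5)
W_dec = 0.1 * rng.normal(size=(d_model, n_features))

features = np.maximum(0.0, h @ W_enc.T + b_enc)   # sparse SAE features
reconstructed = features @ W_dec.T                # Wdec @ features, per token
error = h - reconstructed                         # what the SAE failed to capture

# Adding the error back recovers the original hidden states exactly.
print(np.allclose(reconstructed + error, h))      # True
```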
Modify Target Feature
Apply the steering intervention:
# Create intervention tensor (all zeros except at target feature)
add_tensor = zeros(38, 131072)
add_tensor[:, 99] = 0.2
# Add intervention to natural activations
features = features + add_tensor
Now at token position 15, feature 99 has been increased:
features[15, 99] = 0.1 + 0.2 = 0.3
Decode Steered Features and Restore Error
Transform the modified features back to hidden states and add back the reconstruction error:
h' = Wdec @ features + error
These steered hidden states h' continue through the rest of the transformer, influencing the generated text to include more pirate speech patterns.
The Mathematical Formula
Complete Formula
Combining all the steps above, the full steering operation is:
h' = Wdec @ (ReLU(Wenc @ h + benc) + δ) + (h - Wdec @ ReLU(Wenc @ h + benc))
Where:
- h = original hidden states (38, 8192)
- δ = intervention vector, all zeros except δ[:, 99] = 0.2
- Wenc = SAE encoder weights (131072, 8192)
- Wdec = SAE decoder weights (8192, 131072)
- benc = SAE encoder bias
- h' = steered hidden states (38, 8192)
Simplified Form
Expanding the formula, the reconstruction term Wdec @ ReLU(Wenc @ h + benc) appears once with each sign and cancels. This leaves:
h' = h + Wdec @ δ
For our example with feature 99:
h' = h + Wdec[:, 99] × 0.2
This means we're adding 0.2 times the decoder direction for feature 99 to every token's hidden state. The decoder direction is the vector in the 8192-dimensional hidden space that the SAE has learned to associate with "pirate speech patterns."
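Because the intervention δ is added after the ReLU, the cancellation is exact and can be checked numerically. A NumPy sketch at toy dimensions with random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, n_features = 4, 8, 32   # toy stand-ins for 38/8192/131072
feature_idx, strength = 5, 0.2            # stand-ins for feature 99 and +0.2

h = rng.normal(size=(seq_len, d_model))
W_enc = 0.1 * rng.normal(size=(n_features, d_model))
b_enc = np.full(n_features, -0.5)
W_dec = 0.1 * rng.normal(size=(d_model, n_features))

# Full pipeline: encode, add the intervention after the ReLU, decode, restore error.
features = np.maximum(0.0, h @ W_enc.T + b_enc)
error = h - features @ W_dec.T
delta = np.zeros_like(features)
delta[:, feature_idx] = strength
h_steered = (features + delta) @ W_dec.T + error

# Simplified form: h' = h + strength * Wdec[:, feature_idx]
h_simple = h + strength * W_dec[:, feature_idx]

print(np.allclose(h_steered, h_simple))   # True
```

The equivalence holds for any strength because the pipeline is linear in δ; only the natural features pass through the ReLU.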
Concrete Numerical Example
The decoder direction for feature 99 (pirate speech patterns) is a vector with 8192 numbers. Here are the first few values:
Wdec[:, 99] = [0.23, -0.11, 0.45, 0.08, ...] (8192 values total)
When we multiply this by our steering strength of 0.2 and add it to each token:
# For every token position in the sequence:
h'[pos, 0] = h[pos, 0] + (0.23 × 0.2) = h[pos, 0] + 0.046
h'[pos, 1] = h[pos, 1] + (-0.11 × 0.2) = h[pos, 1] - 0.022
h'[pos, 2] = h[pos, 2] + (0.45 × 0.2) = h[pos, 2] + 0.090
h'[pos, 3] = h[pos, 3] + (0.08 × 0.2) = h[pos, 3] + 0.016
...and so on for all 8192 dimensions
Example: Token 15 ("ocean")
Original: h[15] = [2.5, 1.2, -0.8, 3.1, ...]
Change: [0.046, -0.022, 0.090, 0.016, ...]
Steered: h'[15] = [2.546, 1.178, -0.710, 3.116, ...]
Steering Modes
Add Mode (Default)
features[:, index_in_sae] += strength
Increases or decreases the natural activation of a feature.
Example: If feature 99 naturally activates to 0.1 at a token, adding 0.2 gives a final activation of 0.3. Use positive values to amplify a concept, negative values to suppress it.
Clamp Mode
features[:, index_in_sae] = strength
Sets the feature to an exact activation level, ignoring the natural value.
Example: Whether feature 99 naturally activates to 0.05 or 5.0, clamping to 0.8 forces it to exactly 0.8. Useful when you want precise control regardless of context.
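The two modes differ only in whether the strength is added to or overwrites the natural activation. A toy NumPy sketch (made-up activation values, three features, feature 1 as the target):

```python
import numpy as np

# Toy activations for two tokens across three features; feature 1 is the target.
features = np.array([[0.0, 0.1, 0.0],
                     [0.0, 5.0, 0.0]])

# Add mode: shift the natural activation by the steering strength.
added = features.copy()
added[:, 1] += 0.2          # 0.1 -> 0.3, 5.0 -> 5.2

# Clamp mode: overwrite the natural activation entirely.
clamped = features.copy()
clamped[:, 1] = 0.8         # both tokens end up at exactly 0.8

print(added[:, 1], clamped[:, 1])
```

Add mode preserves context sensitivity (tokens where the feature fires strongly stay stronger), while clamp mode erases it.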
What Actually Happens
When you steer on feature 99 with strength 0.2 while asking about the ocean:
Without steering:
"The ocean covers about 71% of Earth's surface and contains 97% of the planet's water. It's divided into five major basins and plays a crucial role in regulating climate and supporting marine life..."
With steering (feature 99, strength 0.2):
"Arr, the ocean be coverin' about 71% of Earth's surface, matey! These vast waters hold 97% of the planet's water, divided into five great basins. The briny deep plays a crucial role in regulatin' the climate and supportin' all manner of marine life, from the tiniest plankton to the mightiest whales sailin' the seven seas..."
The model now generates text with pirate speech patterns because we increased the activation of feature 99 throughout the generation process.
Summary
Key Takeaways
Steering modifies model behavior by manipulating SAE feature activations:
- You specify a feature index and steering strength through the API
- At the steering layer, the model extracts hidden states and encodes them to sparse features
- Your intervention increases or sets the target feature's activation
- The modified features are decoded back to hidden states
- These steered activations change how the model generates text
You control the semantic concept (e.g., "pirate speech patterns") by adjusting a single number. The SAE decoder automatically translates this into the right pattern of changes across all 8192 hidden dimensions.
Additional Resources
- API Documentation - Interactive OpenAPI documentation with all endpoints and schemas
- vLLM SDK - Official Python SDK for steering and feature manipulation
- Towards Monosemanticity (Anthropic) - Foundational paper on Sparse Autoencoders and interpretable features
- Try it out - Sign up and start steering