How Steering Works

A comprehensive guide to understanding SAE-based model steering from the mathematics to implementation.

Overview

Steering allows you to control model behavior by manipulating Sparse Autoencoder (SAE) features during inference. When you drag a slider in the interface to steer on a feature, you're modifying high-level semantic concepts while the SAE decoder handles the low-level tensor operations to achieve that change.

Key Concepts

Feature
A learned representation in the SAE that corresponds to a specific concept, pattern, or behavior (e.g., "pirate speech", "politeness", "technical jargon"). Features are interpretable units that the model uses internally.
Feature Index
The position of a feature in the SAE (0 to 131,071 for a 131k-feature SAE). This is the identifier you use in the API to specify which feature to steer.
Activation
The strength at which a feature fires when processing text. Higher activation values mean the feature is more present in that context. Natural activations are typically between 0 and 10, with most features at 0 (inactive).
Similarity
A score (0 to 1) indicating how semantically related a feature is to a search query. Used when searching for features by description. Higher similarity means the feature better matches your search.
Steering Strength
The amount to add to (or subtract from) a feature's natural activation. Positive values amplify a concept, negative values suppress it. The UI allows -1 to +1, but the API accepts any value for stronger effects.
Intervention
The act of modifying a feature's activation during model inference. An intervention consists of a feature index, steering strength, and mode (add or clamp).
SAE (Sparse Autoencoder)
A neural network trained to decompose the model's internal representations into interpretable features. It encodes hidden states into sparse feature activations, then decodes them back to reconstruct the original representation.
Hidden States
The internal vector representations at each layer of the model. For Llama 3.1 8B, these are 8,192-dimensional vectors that encode the model's understanding of each token in context.

Running Example

Concrete Example

Throughout this guide, we'll follow a concrete example:

  • Feature Index: 99 - "Pirate speech patterns and vocabulary"
  • Steering: Add +0.2 (natural 0.1 → 0.3)
  • Model: Llama 3.1 8B Instruct
  • Prompt: "Tell me about the ocean"

The Complete Flow

1. Send API Request

When you drag the slider for feature 99 to a steering strength of +0.2, the web interface sends:

POST /v1/chat/completions
{
  "messages": [{"role": "user", "content": "Tell me about the ocean"}],
  "interventions": [
    {
      "index_in_sae": 99,
      "strength": 0.2
    }
  ]
}
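
For reference, the same request can be sent from a script. This is a minimal sketch using the Python requests library; the server URL is a placeholder for wherever the API is hosted, and the raw JSON response is printed rather than assuming a particular response schema.

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder URL for your deployment
    json={
        "messages": [{"role": "user", "content": "Tell me about the ocean"}],
        "interventions": [{"index_in_sae": 99, "strength": 0.2}],
    },
)
print(response.json())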

2. Process Through Steering Layer

During the forward pass at layer 19 (the steering layer for Llama 3.1 8B), the following operations transform the activations:

Extract Hidden States for Input Sequence

Get the hidden states tensor h for the input sequence from layer 19 of the Llama model:

h.shape = (38, 8192)

  • 38 tokens in the chat-formatted input sequence for "Tell me about the ocean" (the chat template adds tokens around the raw user message)
  • 8192 dimensions per token (model hidden size)

What are hidden states? Each token in the prompt is represented as a vector of 8192 numbers. These vectors encode the model's internal understanding of each word in context - capturing meaning, syntax, and relationships to other words. Think of them as the model's "thoughts" about each token at this layer.
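
If you want to inspect these hidden states yourself, the sketch below shows one way to capture them with a PyTorch forward hook, assuming the Hugging Face transformers implementation of Llama 3.1 8B Instruct (the module path model.model.layers[19] is the standard LlamaForCausalLM layout; running it requires the model weights locally).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

captured = {}

def capture_hidden(module, inputs, output):
    # Decoder layers may return a tuple; the hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["h"] = hidden.detach()

hook = model.model.layers[19].register_forward_hook(capture_hidden)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Tell me about the ocean"}],
    add_generation_prompt=True,
    return_tensors="pt",
)
with torch.no_grad():
    model(input_ids)
hook.remove()

print(captured["h"].shape)  # (1, num_tokens, 8192)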

Encode to Feature Space

Apply the SAE encoder to map hidden states to sparse features:

features = ReLU(h @ Wenc.T + benc)

Where:

  • Wenc is the SAE encoder weight matrix (131072, 8192) for 16× expansion
  • benc is the SAE encoder bias vector
  • @ means matrix multiplication and .T denotes the transpose
  • ReLU is the activation function (explained below)

What is ReLU? ReLU (Rectified Linear Unit) is a simple function:

ReLU(x) = max(0, x)

It keeps positive values unchanged and zeros out negative values. This creates sparsity - most features end up being exactly 0, with only a few active (positive) at any given time.
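For example, ReLU([-2.0, 0.5, 0.0, 3.1]) = [0.0, 0.5, 0.0, 3.1]: the negative entry is zeroed and the rest pass through unchanged.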

features.shape = (38, 131072)

At token position 15 (the word "ocean"), feature 99 naturally activates to 0.1:

features[15, 99] = 0.1
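
As a concrete (if toy) illustration of the encode step, the sketch below uses small random stand-in tensors; a real SAE would use d_model = 8192, n_features = 131072, and trained weights loaded from a checkpoint.

import torch

n_tokens, d_model, n_features = 38, 8, 32      # toy sizes for illustration
h = torch.randn(n_tokens, d_model)             # stand-in for layer-19 hidden states
W_enc = torch.randn(n_features, d_model)       # stand-in encoder weights
b_enc = torch.randn(n_features)                # stand-in encoder bias

features = torch.relu(h @ W_enc.T + b_enc)     # shape (n_tokens, n_features)
print(features.shape)                          # torch.Size([38, 32])
print((features == 0).float().mean())          # fraction of features ReLU zeroed out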

Decode Features Back to Hidden States

Before applying steering, decode the features to see what the SAE can reconstruct:

reconstructed = features @ Wdec.T

Where:

  • Wdec is the SAE decoder weight matrix (8192, 131072)

What is the decoder matrix? Wdec is part of the SAE (Sparse Autoencoder), not the base LLM. Each column represents a learned "direction" in the 8192-dimensional hidden space. For example, column 99 represents the direction for "pirate speech." The decoder translates sparse feature activations back into the dense hidden state representation that the LLM uses.

reconstructed.shape = (38, 8192)

Calculate Reconstruction Error

Compute what the SAE failed to capture:

error = h - reconstructed

This error contains information the SAE cannot represent. We'll add it back later to preserve all information.
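
Continuing in the same toy setup (random stand-in weights, small dimensions), the decode and error steps look like this; with a trained SAE the error would be small but rarely zero.

import torch

n_tokens, d_model, n_features = 38, 8, 32
h = torch.randn(n_tokens, d_model)
W_enc, b_enc = torch.randn(n_features, d_model), torch.randn(n_features)
W_dec = torch.randn(d_model, n_features)

features = torch.relu(h @ W_enc.T + b_enc)       # encode
reconstructed = features @ W_dec.T               # decode, shape (n_tokens, d_model)
error = h - reconstructed                        # what the SAE failed to capture
print(torch.allclose(reconstructed + error, h))  # True: nothing is lost once error is kept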

Modify Target Feature

Apply the steering intervention:

# Create an intervention tensor (all zeros except at the target feature)
add_tensor = torch.zeros(38, 131072)
add_tensor[:, 99] = 0.2

# Add the intervention to the natural activations
features = features + add_tensor

Now at token position 15, feature 99 has been increased:

features[15, 99] = 0.1 + 0.2 = 0.3

Decode Steered Features and Restore Error

Transform the modified features back to hidden states and add back the reconstruction error:

h' = features @ Wdec.T + error

These steered hidden states h' continue through the rest of the transformer, influencing the generated text to include more pirate speech patterns.
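
Putting the whole layer-19 operation together, here is a self-contained sketch of the steering step in add mode. It again uses toy dimensions and random stand-in weights; feature index 5 plays the role of feature 99, and the real operation would use the trained SAE weights with d_model = 8192 and n_features = 131072.

import torch

def steer(h, W_enc, b_enc, W_dec, feature_idx, strength):
    """Return steered hidden states for one layer (add mode)."""
    features = torch.relu(h @ W_enc.T + b_enc)   # encode to sparse features
    error = h - features @ W_dec.T               # reconstruction error
    features[:, feature_idx] += strength         # intervene on the target feature
    return features @ W_dec.T + error            # decode and restore the error

n_tokens, d_model, n_features = 38, 8, 32
h = torch.randn(n_tokens, d_model)
W_enc, b_enc = torch.randn(n_features, d_model), torch.randn(n_features)
W_dec = torch.randn(d_model, n_features)

h_steered = steer(h, W_enc, b_enc, W_dec, feature_idx=5, strength=0.2)
print(h_steered.shape)                           # torch.Size([38, 8])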

The Mathematical Formula

Complete Formula

Combining all the steps above, the full steering operation is:

h' = (ReLU(h @ Wenc.T + benc) + δ) @ Wdec.T + (h - ReLU(h @ Wenc.T + benc) @ Wdec.T)

Where:

  • h = original hidden states (38, 8192)
  • δ = intervention tensor (38, 131072), all zeros except δ[:, 99] = 0.2
  • Wenc = SAE encoder weights (131072, 8192)
  • Wdec = SAE decoder weights (8192, 131072)
  • benc = SAE encoder bias
  • h' = steered hidden states (38, 8192)

Simplified Form

The error term cancels out the original reconstruction. This leaves:

h' = h + δ @ Wdec.T

For our example with feature 99:

h' = h + Wdec[:, 99] × 0.2

This means we're adding 0.2 times the decoder direction for feature 99 to every token's hidden state. The decoder direction is a vector in the 8192-dimensional hidden space that the SAE has learned to associate with "pirate speech patterns."
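
The simplification is easy to check numerically. The sketch below (toy dimensions, random stand-in weights, feature index 5 standing in for 99) confirms that the full operation (encode, intervene, decode, restore error) matches h plus strength times the decoder column.

import torch

n_tokens, d_model, n_features = 38, 8, 32
h = torch.randn(n_tokens, d_model)
W_enc, b_enc = torch.randn(n_features, d_model), torch.randn(n_features)
W_dec = torch.randn(d_model, n_features)
idx, strength = 5, 0.2

features = torch.relu(h @ W_enc.T + b_enc)
delta = torch.zeros(n_tokens, n_features)
delta[:, idx] = strength

full = (features + delta) @ W_dec.T + (h - features @ W_dec.T)
simplified = h + strength * W_dec[:, idx]
print(torch.allclose(full, simplified, atol=1e-4))  # True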

Concrete Numerical Example

The decoder direction for feature 99 (pirate speech patterns) is a vector with 8192 numbers. Here are the first few values:

Wdec[:, 99] = [0.23, -0.11, 0.45, 0.08, ...]  (8192 values total)

When we multiply this by our steering strength of 0.2 and add it to each token:

# For every token position in the sequence:
h'[pos, 0] = h[pos, 0] + (0.23 × 0.2) = h[pos, 0] + 0.046
h'[pos, 1] = h[pos, 1] + (-0.11 × 0.2) = h[pos, 1] - 0.022
h'[pos, 2] = h[pos, 2] + (0.45 × 0.2) = h[pos, 2] + 0.090
h'[pos, 3] = h[pos, 3] + (0.08 × 0.2) = h[pos, 3] + 0.016
...and so on for all 8192 dimensions

Example: Token 15 ("ocean")

Original:    h[15] = [2.5,   1.2,  -0.8,   3.1,  ...]
Change:              [0.046, -0.022, 0.090, 0.016, ...]
Steered:     h'[15] = [2.546, 1.178, -0.710, 3.116, ...]
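
The same arithmetic for token 15 can be reproduced with the four illustrative values above (these numbers come from this guide's example, not from real SAE weights):

import torch

h_15 = torch.tensor([2.5, 1.2, -0.8, 3.1])      # first 4 of 8192 dimensions
w_99 = torch.tensor([0.23, -0.11, 0.45, 0.08])  # first 4 values of Wdec[:, 99]
print(h_15 + 0.2 * w_99)                        # tensor([ 2.5460,  1.1780, -0.7100,  3.1160])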

Steering Modes

Add Mode (Default)

features[:, index_in_sae] += strength

Increases or decreases the natural activation of a feature.

Example: If feature 99 naturally activates to 0.1 at a token, adding 0.2 gives a final activation of 0.3. Use positive values to amplify a concept, negative values to suppress it.

Clamp Mode

features[:, index_in_sae] = strength

Sets the feature to an exact activation level, ignoring the natural value.

Example: Whether feature 99 naturally activates to 0.05 or 5.0, clamping to 0.8 forces it to exactly 0.8. Useful when you want precise control regardless of context.
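
A minimal sketch contrasting the two modes on a single feature column (toy activation values chosen to mirror the examples above):

import torch

features = torch.tensor([[0.05], [5.0], [0.1]])   # natural activations of one feature at 3 tokens

added = features.clone()
added[:, 0] += 0.2            # add mode: shift the natural activation
clamped = features.clone()
clamped[:, 0] = 0.8           # clamp mode: force an exact activation

print(added.squeeze(1))       # tensor([0.2500, 5.2000, 0.3000])
print(clamped.squeeze(1))     # tensor([0.8000, 0.8000, 0.8000])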

What Actually Happens

When you steer on feature 99 with strength 0.2 while asking about the ocean:

Without steering:

"The ocean covers about 71% of Earth's surface and contains 97% of the planet's water. It's divided into five major basins and plays a crucial role in regulating climate and supporting marine life..."

With steering (feature 99, strength 0.2):

"Arr, the ocean be coverin' about 71% of Earth's surface, matey! These vast waters hold 97% of the planet's water, divided into five great basins. The briny deep plays a crucial role in regulatin' the climate and supportin' all manner of marine life, from the tiniest plankton to the mightiest whales sailin' the seven seas..."

The model now generates text with pirate speech patterns because we increased the activation of feature 99 throughout the generation process.

Summary

Key Takeaways

Steering modifies model behavior by manipulating SAE feature activations:

  1. You specify a feature index and steering strength through the API
  2. At the steering layer, the model extracts hidden states and encodes them to sparse features
  3. Your intervention increases or sets the target feature's activation
  4. The modified features are decoded back to hidden states
  5. These steered activations change how the model generates text

You control the semantic concept (e.g., "pirate speech patterns") by adjusting a single number. The SAE decoder automatically translates this into the right pattern of changes across all 8192 hidden dimensions.

Additional Resources