SelfIE Labels
Understanding the SelfIE (Self-Interpretation of Embeddings) technique: how we automatically generate interpretable labels for SAE features using reflective coherence training.
What is SelfIE?
SelfIE (Self-Interpretation of Embeddings) is a neural network system that automatically generates natural language descriptions for Sparse Autoencoder (SAE) features. Instead of manually labeling thousands of features, SelfIE learns to map SAE decoder vectors directly to interpretable text descriptions.
The Core Insight
If a label accurately describes what a feature represents, then using that label to prompt the model should activate that same feature. SelfIE trains by measuring this reflective coherence between generated labels and actual feature activations.
The Problem
Modern SAEs can have 100,000+ features. Manually labeling each feature is:
- Time-consuming: Human experts would need months to label all features
- Expensive: Requires significant human effort and expertise
- Inconsistent: Different labelers may interpret features differently
- Not scalable: New models and SAEs are created frequently
SelfIE solves this by training a neural network to automatically generate high-quality labels that capture the semantic meaning of each feature.
System Architecture
Label Generator
The label generator is a neural network that maps SAE decoder vectors to natural language descriptions using a novel soft prompt approach:
Architecture Components:
1. Projection Layer: A learnable linear transformation that maps SAE vectors to soft prompt tokens
2. Soft Tokens: Continuous embedding vectors (not discrete tokens) that serve as a learned "summary" of the feature
3. Template: A hard-coded prompt template with placeholders for soft tokens
4. Base LLM: The language model (e.g., Llama 3.1) that generates the final label
    # Simplified forward pass (pseudocode)
    def generate_label(sae_decoder_vector):
        # 1. Project the SAE decoder vector to a soft token embedding
        soft_token = projection_layer(sae_decoder_vector)
        # 2. Inject the soft token into the template at the placeholder
        #    (inject_soft_token is a hypothetical helper)
        template = "This pattern activates when: <soft_token>"
        template_with_soft_token = inject_soft_token(template, soft_token)
        # 3. Generate the label using the frozen base LLM
        label = base_llm.generate(template_with_soft_token)
        return label

The projection layer is the only trainable component; the base LLM remains frozen. This makes training efficient and leverages the LLM's pre-existing language capabilities.
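The projection and injection steps can be sketched with plain arrays. This is a minimal illustration under stated assumptions, not the production implementation: the names `project` and `splice_soft_token`, the toy width of 16, and the placeholder position are all hypothetical, and random arrays stand in for the frozen template embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy width; a real model is far wider

# Trainable projection parameters (hypothetical names): s = W @ d + b
W = rng.normal(scale=0.02, size=(d_model, d_model))
b = np.zeros(d_model)

def project(sae_decoder_vector: np.ndarray) -> np.ndarray:
    """Map one SAE decoder vector to one soft token embedding."""
    return W @ sae_decoder_vector + b

def splice_soft_token(template_embeds: np.ndarray, slot: int, soft_token: np.ndarray) -> np.ndarray:
    """Return a copy of the template embeddings with row `slot` replaced by the soft token."""
    out = template_embeds.copy()
    out[slot] = soft_token
    return out

# Stand-in for the frozen embeddings of the template's tokens
template_embeds = rng.normal(size=(6, d_model))
slot = 5  # assumed position of the <soft_token> placeholder
decoder_vector = rng.normal(size=d_model)

inputs_embeds = splice_soft_token(template_embeds, slot, project(decoder_vector))
print(inputs_embeds.shape)  # (6, 16)
```

Only `W` and `b` would receive gradients in this setup; the template embeddings come from the frozen model and are never updated.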
Training Process
SelfIE training happens in two phases: pretraining and reinforcement learning.
Phase 1: Supervised Pretraining
Train on existing human-labeled features (e.g., from Neuronpedia or other SAE interpretability projects):
    Input:  SAE decoder vector for feature 58644
    Target: "Formal academic citations"
    Loss:   Cross-entropy between generated and target tokens

This gives the model a strong initialization and teaches it basic label formatting.
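As a rough illustration of the pretraining loss, the sketch below computes token-level cross-entropy from per-position logits. The vocabulary size, token ids, and logits are made up for the example; in practice the logits would come from the base LLM conditioned on the soft token.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Mean negative log-likelihood of target_ids under per-position logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-picked.mean())

vocab_size = 5
target_ids = np.array([2, 0, 3])  # illustrative token ids of the human label

# A model that knows nothing: uniform distribution at every position
uniform_logits = np.zeros((3, vocab_size))

# A model that strongly prefers the target token at every position
confident_logits = np.full((3, vocab_size), -10.0)
confident_logits[np.arange(3), target_ids] = 10.0

loss_uniform = cross_entropy(uniform_logits, target_ids)
loss_confident = cross_entropy(confident_logits, target_ids)
print(loss_uniform > loss_confident)  # True
```

The uniform case gives a loss of log(vocab_size), which is the baseline the projection layer has to beat.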
Phase 2: Reflective Coherence RL
Fine-tune using reinforcement learning based on feature activation:
1. Generate label for feature
2. Prompt model with generated label
3. Measure SAE activations
4. Reward if target feature activates
5. Update projection layer via GRPO

This ensures labels are not just plausible, but actually cause the feature to activate when used as prompts.
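The "group relative" part of step 5 can be illustrated as follows. This is a sketch of the commonly described GRPO advantage, assuming several candidate labels are sampled for the same feature and each reward is normalized against its group; the reward values are invented and the function name is hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Normalize each reward against the mean and std of its own sample group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative rewards for four candidate labels sampled for one feature
rewards = [0.2, 1.4, 0.2, 0.6]
advantages = group_relative_advantages(rewards)

# Labels that beat the group average get a positive advantage
print(np.round(advantages, 3))
```

Because advantages are computed within the group, no separate value network is needed: a label is reinforced only to the extent that it activates the feature more than its sibling samples do.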
Reflective Coherence Reward
The reward function measures how well a generated label activates its target feature:
    import torch

    def compute_reward(generated_label, target_feature_id, sae, base_model):
        # Build prompts that ask the model to write text about the label
        prompts = [
            f"Write text that heavily features {generated_label}",
            f"Generate content with lots of {generated_label}",
            f"Create text emphasizing {generated_label}",
        ]
        activations = []
        for prompt in prompts:
            # Generate text from the prompt
            text = base_model.generate(prompt)
            # Get SAE activations on the generated text
            hidden_states = base_model.get_hidden_states(text)
            feature_acts = sae.encode(hidden_states)
            # Mean activation of the target feature over all tokens
            activations.append(feature_acts[:, target_feature_id].mean())
        # Reward is the mean target-feature activation across prompts
        reward = torch.stack(activations).mean()
        return reward

High rewards indicate the label successfully captures what makes the feature activate.
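To make the reward concrete, here is a toy version that runs end to end on hand-written numbers. It assumes a standard ReLU SAE encoder (the document does not specify the encoder form), and the weights and hidden states are invented so that feature 0 fires on the "on-topic" text only.

```python
import numpy as np

def sae_encode(hidden_states, W_enc, b_enc):
    """Assumed ReLU encoder: (tokens, d_model) -> (tokens, n_features)."""
    return np.maximum(hidden_states @ W_enc + b_enc, 0.0)

def reflective_reward(hidden_states_per_prompt, W_enc, b_enc, feature_id):
    """Mean over prompts of the mean target-feature activation over tokens."""
    per_prompt = [
        sae_encode(h, W_enc, b_enc)[:, feature_id].mean()
        for h in hidden_states_per_prompt
    ]
    return float(np.mean(per_prompt))

# Toy SAE with d_model = 2 and 3 features
W_enc = np.array([[1.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
b_enc = np.zeros(3)

# Hidden states for text generated from a good label and from a bad one
on_topic = [np.array([[2.0, 0.0], [1.0, 0.0]]), np.array([[3.0, 1.0]])]
off_topic = [np.array([[-1.0, 1.0], [0.0, 2.0]])]

good = reflective_reward(on_topic, W_enc, b_enc, feature_id=0)
bad = reflective_reward(off_topic, W_enc, b_enc, feature_id=0)
print(good, bad)  # 2.25 0.0
```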
Mathematical Formulation
Projection Layer
The projection layer transforms SAE decoder vectors into soft prompt embeddings:
s = W · d + b
Where:
- d ∈ ℝ^{d_model} = SAE decoder vector (e.g., 4096-dim for Llama 3.1 8B)
- W ∈ ℝ^{d_model × d_model} = Learnable projection matrix
- b ∈ ℝ^{d_model} = Learnable bias vector (the "universal interpretation direction")
- s ∈ ℝ^{d_model} = Soft prompt token embedding
Bias Vector: Universal Interpretation
The bias vector b is particularly important: it represents a universal interpretation direction learned across all features. This is easiest to see when the projection is simplified to a scale and a bias:
s = α · d + b
(where α is a learned scale parameter)
- Feature-specific: α · d captures what makes this feature unique
- Universal: b captures common patterns across all features (e.g., "this activates when...")
- Transfer learning: Because b is shared across all features, it is a natural candidate for transfer between models and SAEs
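A quick parameter count shows how much smaller the simplified form is than the full projection. The sketch assumes α is a single scalar and uses an illustrative width of 4096; both function names are hypothetical.

```python
def full_projection_params(d_in: int, d_out: int) -> int:
    """Parameters in s = W · d + b: the matrix plus the bias."""
    return d_in * d_out + d_out

def scale_bias_params(d_out: int) -> int:
    """Parameters in s = α · d + b: one scalar scale plus the bias."""
    return 1 + d_out

d_model = 4096  # illustrative width
print(full_projection_params(d_model, d_model))  # 16781312
print(scale_bias_params(d_model))                # 4097
```

In the simplified form almost all of the learned capacity sits in b, which is why the bias vector carries the shared, feature-independent part of the interpretation.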
GRPO Training Objective
The reinforcement learning phase uses Group Relative Policy Optimization (GRPO):
Maximize: E[reward(label)] - β * KL(π || π_ref)
Where:
- reward(label) = mean activation of target feature
- π = current policy (label generator)
- π_ref = reference policy (pretrained checkpoint)
- β = KL penalty coefficient (trades off reward against drift from the reference policy)

This encourages generating labels that activate features while staying close to the pretrained distribution.
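The effect of β can be seen on a toy problem with three candidate labels. This is only an illustration of the objective as written above: the distributions and rewards are invented, and a real policy is a distribution over token sequences rather than three discrete choices.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(expected_reward, policy, reference, beta):
    """E[reward] - β · KL(π || π_ref)."""
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]  # π_ref: probabilities of three candidate labels
policy = [0.7, 0.2, 0.1]     # π: current policy after some RL updates
rewards = [1.0, 0.4, 0.1]    # illustrative mean activations per candidate

expected_reward = sum(p * r for p, r in zip(policy, rewards))
for beta in (0.0, 0.1, 1.0):
    value = regularized_objective(expected_reward, policy, reference, beta)
    print(beta, round(value, 4))
```

With β = 0 the objective is the raw expected reward; larger values of β subtract more for the same shift away from the pretrained checkpoint.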
Example Labels
Here are real examples comparing human-labeled features with SelfIE-generated labels:
| Feature ID | Human Label | SelfIE Label |
|---|---|---|
| 1 | Syntactical special characters and delimiters in programming contexts | Escape characters in programming languages |
| 2 | The Russian word состав (composition/compilation) and its variations | Russian language composition and structure |
| 5 | Technical documentation describing widespread applications and uses | Widespread adoption and acceptance of a technology or practice |
| 10 | Format conversion operators in API and task names | Converting or transforming one format or representation to another |
| 18 | Birth dates in biographical/historical contexts | Biographical information about birth dates and places |
SelfIE labels are often more concise while capturing the same semantic meaning. In many cases, they generalize better to related concepts.
Advanced Topics
Cross-Model Label Generation
SelfIE can use a larger model to generate labels for a smaller model's SAE:
- SAE Model: Llama 3.1 8B (whose features we want to label)
- Label Generator: Llama 3.3 70B (generates better descriptions)
- Projection: 4096-dim → 8192-dim (non-square matrices supported)
The larger model's superior language abilities often produce clearer, more nuanced labels.
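A non-square projection only changes the shape of W. The sketch below uses toy sizes of 4 and 8 as stand-ins for the two models' hidden sizes; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_sae, d_label = 4, 8  # toy stand-ins for the SAE model and label generator widths

# Rows index label-generator dimensions, columns index SAE-model dimensions
W = rng.normal(size=(d_label, d_sae))
b = np.zeros(d_label)

decoder_vector = rng.normal(size=d_sae)  # lives in the smaller model's space
soft_token = W @ decoder_vector + b      # lives in the larger model's space
print(soft_token.shape)  # (8,)
```

Note that the bias b lives in the label generator's embedding space, so its size follows the larger model.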
DiffMean Bias Training
Recent work trains the bias vector using steering vectors from contrastive text pairs:
1. Generate diverse conversational properties (e.g., "formal language", "asking for clarification")
2. Create contrastive text pairs with/without each property
3. Compute DiffMean steering vectors from activation differences
4. Train bias vector on (steering vector, description) dataset
This allows training SelfIE without needing SAE features at all, using pure steering vectors as the training data.
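Step 3 amounts to a difference of means. The following sketch assumes each row is one text's pooled activation at a chosen layer; the numbers are hand-written so the result can be checked by eye.

```python
import numpy as np

def diffmean(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Mean activation with the property minus mean activation without it."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

# Rows: texts, columns: hidden dimensions (toy width of 3)
acts_with = np.array([[1.0, 2.0, 0.0],
                      [3.0, 2.0, 0.0]])
acts_without = np.array([[0.0, 1.0, 0.0],
                         [2.0, 1.0, 0.0]])

steering_vector = diffmean(acts_with, acts_without)
print(steering_vector)  # [1. 1. 0.]
```

Each resulting vector is then paired with the property's description, giving the (steering vector, description) examples used in step 4.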
Log-Scale Parameterization
The scale parameter α uses logarithmic parameterization for better training dynamics:
    # Instead of: scale = α                (direct parameterization)
    # Use:        scale = exp(log_scale)   (log parameterization)
    # A gradient step on log_scale changes scale by approximately:
    #   Δscale ≈ -scale × learning_rate × ∂L/∂log_scale
    # so updates are proportional to the current scale, and the
    # same learning rate works across scale ranges from 0.1 to 100+.

This is particularly important when training across multiple layers, as different layers may need vastly different optimal scales.
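The difference between the two parameterizations shows up in the relative size of one update. This sketch applies a plain gradient step with the same learning rate and the same gradient (taken with respect to whichever parameter is being optimized) at three scales; the step functions are hypothetical and the numbers are illustrative.

```python
import math

def step_direct(scale: float, grad: float, lr: float) -> float:
    """Gradient step on scale itself."""
    return scale - lr * grad

def step_log(scale: float, grad: float, lr: float) -> float:
    """Gradient step on log_scale, mapped back through exp."""
    return math.exp(math.log(scale) - lr * grad)

lr, grad = 0.01, 1.0
scales = (0.1, 1.0, 100.0)

# Relative change (new - old) / old for each parameterization
rel_direct = [(step_direct(s, grad, lr) - s) / s for s in scales]
rel_log = [(step_log(s, grad, lr) - s) / s for s in scales]

for s, rd, rl in zip(scales, rel_direct, rel_log):
    print(f"scale={s:>6}: direct {rd:+.4%}, log {rl:+.4%}")
```

Under the direct parameterization the same step is a large relative change at scale 0.1 and a negligible one at scale 100, while the log parameterization gives the same relative change at every scale.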
Benefits & Applications
Benefits
- ✓ Scalable: Label 100k+ features in hours, not months
- ✓ Consistent: Deterministic labeling with unified style
- ✓ Transferable: Bias vectors work across models
- ✓ Quality: RL ensures labels reflect actual behavior
- ✓ Cost-effective: No manual labeling required
Applications
- → Feature search and discovery
- → Mechanistic interpretability research
- → Model debugging and analysis
- → Automated documentation generation
- → Cross-model feature comparison
Current Status
Production Ready
SelfIE is actively used in production to generate labels for the features you explore in SteeringAPI. Our current implementation:
- ✅ Trained on Llama 3.1 8B SAE features with human-curated baseline labels
- ✅ Uses reflective coherence RL to ensure label quality
- ✅ Generates labels for all 131,072 features in the 8B model
- 🔄 Ongoing experiments with cross-model training and bias vector transfer
- 🔄 Exploring multi-layer training to understand feature hierarchies
Learn More
- How Steering Works - Understand the mathematical foundations
- API Reference - Interactive OpenAPI documentation
- vLLM SDK - Use SelfIE labels in your own applications
- Try it out - Sign up and start exploring features