SelfIE Labels

Understanding the SelfIE (Self-Interpretation of Embeddings) technique: how we automatically generate interpretable labels for SAE features using reflective coherence training.

What is SelfIE?

SelfIE (Self-Interpretation of Embeddings) is a neural network system that automatically generates natural language descriptions for Sparse Autoencoder (SAE) features. Instead of manually labeling thousands of features, SelfIE learns to map SAE decoder vectors directly to interpretable text descriptions.

The Core Insight

If a label accurately describes what a feature represents, then using that label to prompt the model should activate that same feature. SelfIE trains by measuring this reflective coherence between generated labels and actual feature activations.

The Problem

Modern SAEs can have 100,000+ features. Manually labeling each feature is:

  • Time-consuming: Human experts would need months to label all features
  • Expensive: Requires significant human effort and expertise
  • Inconsistent: Different labelers may interpret features differently
  • Not scalable: New models and SAEs are created frequently

SelfIE solves this by training a neural network to automatically generate high-quality labels that capture the semantic meaning of each feature.

System Architecture

Label Generator

The label generator is a neural network that maps SAE decoder vectors to natural language descriptions using a novel soft prompt approach:

Architecture Components:

  1. Projection Layer: A learnable linear transformation that maps SAE vectors to soft prompt tokens
  2. Soft Tokens: Continuous embedding vectors (not discrete tokens) that serve as a learned "summary" of the feature
  3. Template: A hard-coded prompt template with placeholders for soft tokens
  4. Base LLM: The language model (e.g., Llama 3.1) that generates the final label
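
# Simplified forward pass (pseudocode: projection_layer, base_llm, and
# inject_soft_token stand in for the real components)
def generate_label(sae_decoder_vector):
    # 1. Project the SAE decoder vector to a soft token embedding
    soft_token = projection_layer(sae_decoder_vector)

    # 2. Embed the template and substitute the soft token at the placeholder
    #    (inject_soft_token is a hypothetical helper name)
    template = "This pattern activates when: <soft_token>"
    template_with_soft_token = inject_soft_token(template, soft_token)

    # 3. Generate the label with the frozen base LLM
    label = base_llm.generate(template_with_soft_token)

    return label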

The projection layer is the only trainable component; the base LLM remains frozen. This keeps the number of trained parameters small and reuses the LLM's existing language capabilities.
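
One way to picture the injection step: the template is embedded as usual, and the row at the placeholder position is overwritten with the projected soft token. The sketch below uses toy 3-dimensional embeddings and plain Python lists. The function name `splice_soft_token` and the placeholder index are illustrative assumptions, not names from the SelfIE implementation.

```python
from typing import List

Vector = List[float]


def splice_soft_token(
    template_embeds: List[Vector], placeholder_index: int, soft_token: Vector
) -> List[Vector]:
    """Return a copy of the template embeddings with one row replaced by the soft token."""
    if len(soft_token) != len(template_embeds[placeholder_index]):
        raise ValueError("soft token must match the embedding width")
    # Copy every row so the original template embeddings are left untouched.
    spliced = [row[:] for row in template_embeds]
    spliced[placeholder_index] = list(soft_token)
    return spliced


# Toy embeddings for a 4-token template; the last position is the placeholder.
template_embeds = [[0.1, 0.0, 0.2], [0.3, 0.1, 0.0], [0.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
soft_token = [0.7, -0.2, 0.4]
inputs_embeds = splice_soft_token(template_embeds, placeholder_index=3, soft_token=soft_token)
```

In a real setup the resulting sequence would presumably be passed to the language model as input embeddings rather than as token ids, which is what lets a continuous vector sit where a discrete token normally would.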

Training Process

SelfIE training happens in two phases: pretraining and reinforcement learning.

Phase 1: Supervised Pretraining

Train on existing human-labeled features (e.g., from Neuronpedia or other SAE interpretability projects):

Input: SAE decoder vector for feature 58644
Target: "Formal academic citations"
Loss: Cross-entropy between generated and target tokens

This gives the model a strong initialization and teaches it basic label formatting.
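
As a rough illustration of the pretraining loss, the sketch below computes the mean token-level cross-entropy between per-position logits and the target label's token ids. The tiny 3-token vocabulary and the logit values are made up for the example; a real run would take logits from the base LLM.

```python
import math
from typing import List


def token_cross_entropy(logits: List[List[float]], target_ids: List[int]) -> float:
    """Mean negative log-likelihood of the target tokens under per-position logits."""
    total = 0.0
    for position_logits, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[target]
    return total / len(target_ids)


# Toy vocabulary of 3 tokens, target label of 2 tokens.
confident = token_cross_entropy([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [0, 1])
uniform = token_cross_entropy([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 1])
```

Here `uniform` equals log(3), the loss of a model that spreads probability evenly, while `confident` is much lower because the logits favor the target tokens.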

Phase 2: Reflective Coherence RL

Fine-tune using reinforcement learning based on feature activation:

1. Generate label for feature
2. Prompt model with generated label
3. Measure SAE activations
4. Reward if target feature activates
5. Update projection layer via GRPO

This ensures labels are not just plausible, but actually cause the feature to activate when used as prompts.

Reflective Coherence Reward

The reward function measures how well a generated label activates its target feature:

import torch


def compute_reward(generated_label, target_feature_id, sae, base_model):
    """Mean activation of the target feature on text prompted by the label."""
    # Build several prompts that ask the model to write about the label
    prompts = [
        f"Write text that heavily features {generated_label}",
        f"Generate content with lots of {generated_label}",
        f"Create text emphasizing {generated_label}"
    ]

    activations = []
    for prompt in prompts:
        # Generate text from the prompt
        text = base_model.generate(prompt)

        # Get SAE activations on the generated text
        hidden_states = base_model.get_hidden_states(text)
        feature_acts = sae.encode(hidden_states)

        # Average the target feature's activation over token positions
        activations.append(feature_acts[:, target_feature_id].mean())

    # Reward is the mean activation of the target feature across prompts
    reward = torch.stack(activations).mean()
    return reward

High rewards indicate the label successfully captures what makes the feature activate.

Mathematical Formulation

Projection Layer

The projection layer transforms SAE decoder vectors into soft prompt embeddings:

s = W · d + b

Where:

  • d ∈ ℝ^{d_model} = SAE decoder vector (e.g., 4096-dim for Llama 3.1 8B)
  • W ∈ ℝ^{d_model × d_model} = Learnable projection matrix
  • b ∈ ℝ^{d_model} = Learnable bias vector (the "universal interpretation direction")
  • s ∈ ℝ^{d_model} = Soft prompt token embedding
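
The projection can be worked through by hand on a small case. The sketch below evaluates s = W · d + b with plain Python lists and d_model = 2; the numbers are arbitrary and `project` is an illustrative helper name.

```python
from typing import List


def project(W: List[List[float]], d: List[float], b: List[float]) -> List[float]:
    """Compute s = W · d + b for a row-major matrix W."""
    if any(len(row) != len(d) for row in W) or len(W) != len(b):
        raise ValueError("shape mismatch between W, d, and b")
    return [sum(w * x for w, x in zip(row, d)) + bias for row, bias in zip(W, b)]


# Toy example with d_model = 2 (real models use thousands of dimensions).
W = [[1.0, 0.0], [0.5, 2.0]]
d = [2.0, 1.0]
b = [0.25, -1.0]
s = project(W, d, b)  # [1*2 + 0*1 + 0.25, 0.5*2 + 2*1 - 1.0] = [2.25, 2.0]
```

Because the shape check only requires that each row of W match the length of d, the same function also covers a non-square W whose output width differs from its input width.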

Bias Vector: Universal Interpretation

The bias vector b is particularly important: it represents a universal interpretation direction learned across all features.

This is easiest to see when the projection matrix is simplified to a single learned scale:

s = α · d + b

(where α is a learned scale parameter). The two terms then play distinct roles:

  • Feature-specific: α · d captures what makes this feature unique
  • Universal: b captures common patterns across all features (e.g., "this activates when...")
  • Transfer learning: The bias vector can transfer between models and SAEs
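
To make the difference between the two parameterizations concrete, the sketch below counts trainable parameters for the full projection s = W · d + b and for the simplified form s = α · d + b, then evaluates the simplified form on a toy vector. The width 4096 is used purely as an example value, and the helper names are illustrative.

```python
from typing import List


def full_projection_params(d_model: int) -> int:
    """Parameters in s = W · d + b: a d_model x d_model matrix plus a bias."""
    return d_model * d_model + d_model


def scaled_projection_params(d_model: int) -> int:
    """Parameters in s = α · d + b: one scalar plus a bias."""
    return 1 + d_model


def scaled_projection(alpha: float, d: List[float], b: List[float]) -> List[float]:
    """Compute s = α · d + b elementwise."""
    return [alpha * x + bias for x, bias in zip(d, b)]


d_model = 4096  # example width only
full = full_projection_params(d_model)      # 16_781_312
scaled = scaled_projection_params(d_model)  # 4_097
s = scaled_projection(2.0, [1.0, -0.5], [0.1, 0.1])
```

In the simplified form almost all of the trainable parameters live in b, which is consistent with the bias carrying the information shared across features.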

GRPO Training Objective

The reinforcement learning phase uses Group Relative Policy Optimization (GRPO):

Maximize: E[reward(label)] - β * KL(π || π_ref)

Where:
- reward(label) = mean activation of target feature
- π = current policy (label generator)
- π_ref = reference policy (pretrained checkpoint)
- β = KL penalty coefficient (trades off reward against drift from the reference policy)

This encourages generating labels that activate features while staying close to the pretrained distribution.
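
GRPO scores each sampled output relative to the other samples in its group instead of using a learned value function. A minimal sketch of that group-relative advantage follows. The reward values are hypothetical, the use of the population standard deviation and the `eps` term are assumptions, and the KL penalty is left out for brevity.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate labels sampled for the same feature.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Labels that activate the target feature more than their siblings get a positive advantage and are reinforced; when every label in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.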

Example Labels

Here are real examples comparing human-labeled features with SelfIE-generated labels:

Feature ID | Human Label | SelfIE Label
1 | Syntactical special characters and delimiters in programming contexts | Escape characters in programming languages
2 | The Russian word состав (composition/compilation) and its variations | Russian language composition and structure
5 | Technical documentation describing widespread applications and uses | Widespread adoption and acceptance of a technology or practice
10 | Format conversion operators in API and task names | Converting or transforming one format or representation to another
18 | Birth dates in biographical/historical contexts | Biographical information about birth dates and places

SelfIE labels are often more concise while capturing the same semantic meaning. In many cases, they generalize better to related concepts.

Advanced Topics

Cross-Model Label Generation

SelfIE can use a larger model to generate labels for a smaller model's SAE:

  • SAE Model: Llama 3.1 8B (whose features we want to label)
  • Label Generator: Llama 3.3 70B (generates better descriptions)
  • Projection: 4096-dim → 8192-dim (non-square matrices supported)

The larger model's superior language abilities often produce clearer, more nuanced labels.

DiffMean Bias Training

Recent work trains the bias vector using steering vectors from contrastive text pairs:

  1. Generate diverse conversational properties (e.g., "formal language", "asking for clarification")
  2. Create contrastive text pairs with/without each property
  3. Compute DiffMean steering vectors from activation differences
  4. Train bias vector on (steering vector, description) dataset

This allows training SelfIE without needing SAE features at all, using pure steering vectors as the training data.
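
Step 3 above amounts to subtracting two mean activation vectors. The sketch below shows that computation on toy 2-dimensional activations; in practice the vectors would come from the model's hidden states on the contrastive texts, and the helper names here are illustrative.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Elementwise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def diff_mean(with_property: List[Vector], without_property: List[Vector]) -> Vector:
    """Steering vector = mean(activations with property) - mean(activations without)."""
    pos = mean_vector(with_property)
    neg = mean_vector(without_property)
    return [p - q for p, q in zip(pos, neg)]


# Toy activations for texts with and without a property such as "formal language".
formal = [[1.0, 2.0], [3.0, 2.0]]
casual = [[0.0, 1.0], [2.0, 1.0]]
steering = diff_mean(formal, casual)  # [2.0 - 1.0, 2.0 - 1.0] = [1.0, 1.0]
```

Each steering vector is then paired with the text description of its property, giving (vector, description) training examples in the same shape as (decoder vector, label) pairs.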

Log-Scale Parameterization

The scale parameter α uses logarithmic parameterization for better training dynamics:

# Instead of: scale = α (standard)
# Use: scale = exp(log_scale) (log parameterization)

Δscale ≈ -learning_rate × (∂L/∂log_scale) × scale

# This makes updates proportional to current scale
# Same LR works across scale ranges: 0.1 to 100+

This is particularly important when training across multiple layers, as different layers may need vastly different optimal scales.
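
The effect of the log parameterization can be checked numerically. In the sketch below, one gradient-descent step is taken on log_scale for two very different starting scales; the learning rate and gradient are illustrative values, and the gradient is assumed to be the same in both cases so that only the parameterization differs.

```python
import math


def step_log_scale(scale: float, grad_wrt_log_scale: float, lr: float) -> float:
    """One gradient-descent step on log_scale, returned as the new scale."""
    log_scale = math.log(scale) - lr * grad_wrt_log_scale
    return math.exp(log_scale)


lr, grad = 0.01, 1.0  # illustrative values
small = step_log_scale(0.1, grad, lr)
large = step_log_scale(100.0, grad, lr)

# Both scales change by the same factor exp(-lr * grad), roughly a 1% decrease.
relative_small = small / 0.1
relative_large = large / 100.0
```

With a direct parameterization the same step would subtract a fixed amount from the scale, which is a large relative change at 0.1 and a negligible one at 100.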

Benefits & Applications

Benefits

  • Scalable: Label 100k+ features in hours, not months
  • Consistent: Deterministic labeling with unified style
  • Transferable: Bias vectors work across models
  • Quality: RL ensures labels reflect actual behavior
  • Cost-effective: No manual labeling required

Applications

  • Feature search and discovery
  • Mechanistic interpretability research
  • Model debugging and analysis
  • Automated documentation generation
  • Cross-model feature comparison

Current Status

Production Ready

SelfIE is actively used in production to generate labels for the features you explore in SteeringAPI. Our current implementation:

  • ✅ Trained on Llama 3.1 8B SAE features with human-curated baseline labels
  • ✅ Uses reflective coherence RL to ensure label quality
  • ✅ Generates labels for all 131,072 features in the 8B model
  • 🔄 Ongoing experiments with cross-model training and bias vector transfer
  • 🔄 Exploring multi-layer training to understand feature hierarchies
