Mechanistic Interpretability API

Unlock the
Mind of AI

SteeringAPI gives you the API to inspect, understand, and steer LLM behavior at the feature level. Go beyond prompting—edit the model's internal representations directly.

See How It Works
api_example.http
# Search for a feature
POST /v1/features/search
{ "query": "pirate speech" }

# Apply steering to chat
POST /v1/chat/completions
{ "interventions": [{ "index_in_sae": 12345, "strength": 1.5 }] }

# → "Yarrr, let me spin ye a tale..."
Scroll to explore

Prompting is Broken

Prompt engineering can suggest behavior, but it's non-deterministic, brittle, and fails over long conversations. It doesn't actually change the model's internal reasoning.

Non-Deterministic

Same prompt, different results. No reliability guarantees.

Brittle

Breaks easily under edge cases or adversarial inputs.

Decays Over Time

Effectiveness fades in long conversations.

vs

SteeringAPI: Edit Activations Directly

Stable, interpretable, and reliable behavior control by modifying the model's internal feature representations.

Core Capabilities

The Complete Interpretability Toolkit

Everything you need to understand and control LLM behavior at the feature level.

Feature Search

Find SAE features that correspond to concepts, behaviors, or attributes. Query in natural language—get back interpretable feature IDs.

POST /v1/features/search
{ "query": "pirate speech" }
// → { "label": "pirate-like speech", ... }

Feature Inspection

Understand which features activate when processing any text. Monitor activations in real-time for alignment research.

POST /v1/chat_attribution/inspect
{ "messages": "Ahoy matey!" }
// → pirate_speech: 0.92, greeting: 0.88

Activation Steering

Modify outputs by boosting or suppressing features. More reliable than prompting, especially in long contexts.

POST /v1/chat/completions
{ "interventions": [{"strength": 1.5}] }
// → "Yarrr, let me tell ye..."

Safety Controls

Build feature-level safety switches. Use contrastive search to identify and control toxic vs polite behaviors.

POST /v1/chat_attribution/contrast
{ "dataset_1": "toxic", "dataset_2": "polite" }
// → toxicity: -2.0, politeness: +1.5
Code Examples

See It In Action

Integrate SteeringAPI in minutes with our simple REST API.

1

Search & Steer Features

Find features and apply steering in just a few lines

Try Pirate Steering

Search for features, steer them, and see how responses change!

1. Search "pirate" → 2. Select feature → 3. Adjust strength

Feature Control

Search for features to steer

Try: "talking like a pirate"

2

Inspect Feature Activations

See exactly which features activate for any text

Try Feature Inspection

Send a message and click on words in the response to see which features activate!

Try: "Ahoy there matey!"

Feature Inspector

Select a word

Click on any word in the response to see its activated features.

Try clicking:
ahoymateytreasure
3

Build Safety Controls

Create interpretable, feature-level safety switches

Try Safety Steering

Toggle safety mode to see how feature steering changes responses!

Try: "Your friend just humiliated you, what do you say back?"

Safety ON
Aggressive Language-2.0
Sarcasm & Mockery-1.5
Personal Attacks-2.0
Empathetic Response+1.5
De-escalation+1.5
Constructive Framing+1.0
Use Cases

Built For Researchers & Builders

Whether you're advancing AI safety research or building production applications, SteeringAPI provides the tools you need.

🔬

AI Safety Research

Monitor and study features related to deception, manipulation, or harmful behaviors. Track how interventions affect internal representations.

🎭

Character AI & Personas

Build consistent character chatbots with reliable personality traits. Use activation steering for robust persona control that doesn't fade.

🛡️

Content Moderation

Create interpretable safety layers that suppress toxicity and boost politeness at the feature level.

🔍

Interpretability Research

Access SAE features through a production-grade API. Skip the infrastructure and focus on your research.

🎨

Style & Tone Control

Fine-tune writing style, formality, humor, or creativity without retraining. Combine multiple features for nuanced control.

Production Applications

Deploy controllable LLMs with predictable behavior. API-first design makes it easy to integrate into existing systems.

Under The Hood

Powered by Sparse Autoencoders

SteeringAPI uses Sparse Autoencoders (SAEs) to decompose LLM activations into interpretable, monosemantic features. Each feature represents a single, meaningful concept.

  • Interpretable: Features have clear, human-understandable meanings
  • Steerable: Boost or suppress individual features to control behavior
  • Composable: Combine multiple feature edits for complex behaviors
  • Production-Ready: API-first design for easy integration
SAE

Sparse Autoencoder

Feature extraction layer

Feature #123
0.92
Feature #456
0.76
Feature #789
0.54
Simple Pricing

Pay Only for What You Use

Credits never expire. Choose the package that fits your needs—from experimentation to production scale.

Starter
$10
$10.00 per 1K credits
1,000
credits
  • Never expires
  • All API endpoints
  • Full documentation
Popular
Pro
$40
$8.00 per 1K credits
5,000
credits
  • Never expires
  • All API endpoints
  • Full documentation
Enterprise
$100
$6.67 per 1K credits
15,000
credits
  • Never expires
  • All API endpoints
  • Full documentation
Unlimited
$300
$6.00 per 1K credits
50,000
credits
  • Never expires
  • All API endpoints
  • Full documentation

Need custom volume pricing? Contact us

Ready to Unlock the Mind of AI?

Join researchers and developers building the future of interpretable, controllable AI.