Supported Models

Model identifiers, infrastructure details, and how to handle cold starts for serverless models.

Model Availability

SteeringAPI hosts models on two types of infrastructure:

| Model | Type | Cold Start | Recommended Timeout |
| --- | --- | --- | --- |
| meta-llama/Llama-3.3-70B-Instruct | Always-on | None | 30s |
| RedHatAI/gemma-3-27b-it-FP8-dynamic | Serverless | ~4-10 minutes | 60s per attempt + retry loop |

Always-on vs Serverless

Always-on models run on dedicated GPUs and respond immediately. Serverless models run on on-demand GPUs that scale to zero when idle. The first request after an idle period triggers a cold start (GPU allocation + model loading). A cold start typically takes 4-10 minutes, depending on GPU availability.
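To size a retry loop against these numbers, it can help to work out how many attempts are needed to span a cold start. The sketch below is only arithmetic under stated assumptions: attempts are spaced exactly one retry interval apart, request latency is ignored, and the function name is illustrative rather than part of any SDK.

```python
import math

def attempts_to_cover(cold_start_seconds, retry_interval_seconds):
    """Number of attempts needed so the last one lands at or after the
    end of the cold start, assuming attempts at t = 0, r, 2r, ...
    """
    # (n - 1) * r >= cold_start  =>  n = ceil(cold_start / r) + 1
    return math.ceil(cold_start_seconds / retry_interval_seconds) + 1

# Worst case from the table above: 10 minutes, retrying every 60s.
print(attempts_to_cover(600, 60))
# Best case: 4 minutes, retrying every 60s.
print(attempts_to_cover(240, 60))
```

Under these assumptions a full 10-minute cold start needs one more attempt than the number of 60-second waits, so a retry budget sized exactly to the expected duration leaves little headroom.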

Model Configuration Reference

| Property | Llama 3.3 70B | Gemma 3 27B |
| --- | --- | --- |
| Model identifier | meta-llama/Llama-3.3-70B-Instruct | RedHatAI/gemma-3-27b-it-FP8-dynamic |
| Infrastructure | Always-on (dedicated GPU) | Serverless (scales to zero when idle) |
| SAE features | 65,536 | 65,536 |
| Hidden dimension | 8,192 | 3,584 |
| Steering layer | 33 | 26 |
| Feature extraction layer | 50 | 40 |
| Feature labels | Goodfire SelfIE v2 | Neuronpedia auto-interpretation |
| SAE source | Goodfire | Gemma Scope 2 (Google DeepMind) |

All API endpoints, steering modes, and feature operations work identically across both models. To switch models, change the model field in your request.
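As a small illustration of switching models, the sketch below builds the same chat payload for either model and pairs it with the recommended timeout from the availability table. The short keys ("llama", "gemma") and the helper name are hypothetical conveniences, not part of the API; the payload fields mirror the ones used in the retry example on this page.

```python
# Model identifiers and recommended per-request timeouts (seconds),
# taken from the tables on this page. The short keys are illustrative.
MODELS = {
    "llama": {"id": "meta-llama/Llama-3.3-70B-Instruct", "timeout": 30},
    "gemma": {"id": "RedHatAI/gemma-3-27b-it-FP8-dynamic", "timeout": 60},
}

def build_chat_request(model_key, messages, max_completion_tokens=256):
    """Return (payload, timeout) for the chosen model.

    Only the "model" field differs between models; everything else in
    the payload is identical.
    """
    cfg = MODELS[model_key]
    payload = {
        "model": cfg["id"],
        "messages": messages,
        "max_completion_tokens": max_completion_tokens,
    }
    return payload, cfg["timeout"]

messages = [{"role": "user", "content": "Hello!"}]
payload, timeout = build_chat_request("gemma", messages)
print(payload["model"], timeout)
```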

Cold Start Behavior

When a serverless model is cold, your request will receive a 503 response with a standardized body:

{
  "detail": {
    "error": "model_warming_up",
    "model": "RedHatAI/gemma-3-27b-it-FP8-dynamic",
    "estimated_wait_seconds": 600,
    "retry_after": 60,
    "message": "Model is starting up on serverless GPU. Retry in 60s, expected ready within ~10 minutes."
  }
}

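The response also carries a Retry-After: 60 HTTP header. Most HTTP client libraries do not retry on this header by default, so either configure your client's retry policy to honor it or read the header yourself in a retry loop.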
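Because the 503 body is structured, a client can tell a cold start apart from other 503 causes before deciding to wait. The following is a minimal sketch that assumes the body always matches the shape shown above; the `WarmupInfo` class and `parse_warmup` function are illustrative names, and the fallback values are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class WarmupInfo:
    model: str
    retry_after: int
    estimated_wait_seconds: int
    message: str

def parse_warmup(status_code: int, body_text: str) -> Optional[WarmupInfo]:
    """Return WarmupInfo for a cold-start 503, or None for anything else."""
    if status_code != 503:
        return None
    try:
        detail = json.loads(body_text).get("detail")
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object.
        return None
    if not isinstance(detail, dict) or detail.get("error") != "model_warming_up":
        return None
    return WarmupInfo(
        model=detail.get("model", ""),
        # Assumed fallbacks: 60s retry, 600s total, as in the sample body.
        retry_after=int(detail.get("retry_after", 60)),
        estimated_wait_seconds=int(detail.get("estimated_wait_seconds", 600)),
        message=detail.get("message", ""),
    )
```

A caller would typically sleep for `retry_after` seconds when this returns a value, and treat a `None` result on a 503 as a different kind of outage.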

Check Model Availability

Before sending a chat request, you can check if a model is available:

curl -s -H "x-api-key: YOUR_API_KEY" \
  https://api.steeringapi.com/v1/models/status | python3 -m json.tool

The response includes available (true, false, or null) and serverless for each model. null means the backend has no data yet (fresh deploy) and will resolve within seconds.
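The three-valued available field is easy to mishandle with a plain truthiness check, since both false and null are falsy in most languages. The sketch below classifies a single model entry; it assumes you have already pulled that entry (a dict with available and serverless keys) out of the status response, whose overall layout is not specified here. The function name and the returned labels are illustrative.

```python
def classify_availability(entry):
    """Map one model's status entry to a coarse state.

    `entry` is assumed to be a dict with "available" (True, False, or
    None) and "serverless" (bool) keys.
    """
    available = entry.get("available")
    if available is True:
        return "ready"
    if available is None:
        # Backend has no data yet (fresh deploy); check again shortly.
        return "unknown"
    # available is False: a serverless model is most likely scaled to zero.
    return "cold" if entry.get("serverless") else "unavailable"

print(classify_availability({"available": None, "serverless": True}))
```

Comparing with `is True` and `is None` keeps the null case distinct from false, so a fresh deploy is not mistaken for a cold model.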

Recommended Retry Logic

For serverless models, implement a simple retry loop that respects the Retry-After header:

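import time
import requests

def chat_with_retry(api_key, messages, model, max_retries=10):
    """Send a chat request with automatic cold-start retry."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                "https://api.steeringapi.com/v1/chat/completions",
                headers={"x-api-key": api_key, "Content-Type": "application/json"},
                json={"model": model, "messages": messages, "max_completion_tokens": 256},
                timeout=60,  # recommended per-attempt timeout for serverless models
            )
        except requests.exceptions.Timeout:
            # A timed-out attempt counts against the retry budget; try again.
            print(f"Request timed out, retrying (attempt {attempt + 1})")
            continue
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 503:
            # Wait as long as the server asks; fall back to 60s if the header is missing.
            retry_after = int(resp.headers.get("Retry-After", 60))
            print(f"Model warming up, retrying in {retry_after}s (attempt {attempt + 1})")
            time.sleep(retry_after)
            continue
        # Any other status is not a cold start: surface it to the caller.
        resp.raise_for_status()
    raise TimeoutError("Model did not become available within retry window")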

# Usage
result = chat_with_retry(
    api_key="YOUR_API_KEY",
    messages=[{"role": "user", "content": "Hello!"}],
    model="RedHatAI/gemma-3-27b-it-FP8-dynamic",
)
print(result["choices"][0]["message"]["content"])
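The loop above converts Retry-After with int(), which works for the Retry-After: 60 value this API documents. The HTTP specification also allows the header to carry a date, so a more defensive client might parse both forms. The helper below is a sketch of that idea using only the standard library; the function name and the 60-second default are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, default=60.0, now=None):
    """Return the number of seconds to wait for a Retry-After value.

    Accepts delay-seconds ("60") or an HTTP-date; returns `default`
    when the header is missing or unparseable.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now".
    return max(0.0, (when - now).total_seconds())
```

It could replace the int(...) line with `retry_after = parse_retry_after(resp.headers.get("Retry-After"))`.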

Streaming vs Non-Streaming

Both streaming (stream: true) and non-streaming requests return a 503 during cold starts. For streaming requests, the 503 is returned before the SSE stream opens, so your client receives a normal JSON error response rather than an interrupted stream.
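In practice this means a streaming client can check the status code once, before it starts reading lines. The sketch below shows that ordering against any response-like object with status_code, json(), raise_for_status(), and iter_lines(), such as a requests.Response opened with stream=True. The exception class and function name are hypothetical, and the assumption that events arrive as `data:` lines follows the general SSE format rather than anything stated on this page.

```python
class ModelWarmingUp(Exception):
    """Raised when the API answers 503 before the stream opens."""

    def __init__(self, detail):
        super().__init__(detail.get("message", "model warming up"))
        self.retry_after = int(detail.get("retry_after", 60))

def iter_stream_data(resp):
    """Yield the payload of each `data:` line from a streaming response."""
    if resp.status_code == 503:
        # Cold start: the body is ordinary JSON, not an SSE stream.
        raise ModelWarmingUp(resp.json().get("detail", {}))
    resp.raise_for_status()
    for raw in resp.iter_lines():
        # requests yields bytes by default; accept str as well.
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if line.startswith("data:"):
            yield line[len("data:"):].strip()
```

Because this is a generator, the 503 check runs when iteration begins, so the `ModelWarmingUp` exception should be caught around the loop that consumes it.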

Tip

For batch or async workloads, you can fire-and-forget a request to trigger the cold start, then poll GET /v1/models/status until the model is available before sending your actual requests.
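The polling half of this tip can be sketched as a small loop with a deadline. To keep the example independent of the status response layout, it takes a `check` callable that you supply and that returns the model's available value (True, False, or None); the function name, the 15-second interval, and the 15-minute cap are all assumptions chosen for illustration.

```python
import time

def wait_until_available(check, interval=15.0, max_wait=900.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or `max_wait` seconds pass.

    Returns True if the model became available, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + max_wait
    while True:
        # Only an explicit True counts; False and None both mean "not yet".
        if check() is True:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A `check` implementation would call GET /v1/models/status with your API key and return the available field for the model you care about; once this function returns True, the batch requests can be sent with ordinary timeouts.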