SteeringAPI

Model Availability

SteeringAPI hosts models on two types of infrastructure:

Model	Type	Cold Start	Recommended Timeout
meta-llama/Llama-3.3-70B-Instruct	Always-on	None	30s
RedHatAI/gemma-3-27b-it-FP8-dynamic	Serverless	~4-10 minutes	60s per attempt + retry loop

Always-on vs Serverless

Always-on models run on dedicated GPUs and respond immediately. Serverless models run on on-demand GPUs that scale to zero when idle. The first request after an idle period triggers a cold start (GPU allocation plus model loading). Cold start duration depends on GPU availability and typically falls between 4 and 10 minutes.

Model Configuration Reference

Property	Llama 3.3 70B	Gemma 3 27B
Model identifier	meta-llama/Llama-3.3-70B-Instruct	RedHatAI/gemma-3-27b-it-FP8-dynamic
Infrastructure	Always-on (dedicated GPU)	Serverless (scales to zero when idle)
SAE features	65,536	65,536
Hidden dimension	8,192	3,584
Steering layer	33	26
Feature extraction layer	50	40
Feature labels	Goodfire SelfIE v2	Neuronpedia auto-interpretation
SAE source	Goodfire	Gemma Scope 2 (Google DeepMind)

All API endpoints, steering modes, and feature operations work identically across both models. To switch models, change the model field in your request.
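
As a minimal sketch of switching models, the helper below builds the same request body for either model and varies only the model field. The name build_chat_payload is hypothetical and not part of the API; the body fields mirror the ones used in the retry example in this document.

```python
LLAMA = "meta-llama/Llama-3.3-70B-Instruct"
GEMMA = "RedHatAI/gemma-3-27b-it-FP8-dynamic"


def build_chat_payload(model, messages, max_completion_tokens=256):
    """Hypothetical helper: return a chat request body for the given model."""
    return {
        "model": model,
        "messages": messages,
        "max_completion_tokens": max_completion_tokens,
    }


messages = [{"role": "user", "content": "Hello!"}]
llama_payload = build_chat_payload(LLAMA, messages)
gemma_payload = build_chat_payload(GEMMA, messages)

# Collect the keys whose values differ between the two request bodies.
differing = {k for k in llama_payload if llama_payload[k] != gemma_payload[k]}
print(differing)  # prints {'model'}
```

Keep in mind that the serverless model may still need the longer timeout and retry handling listed in the table at the top of this page.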

Cold Start Behavior

When a serverless model is cold, your request will receive a 503 response with a standardized body:

{
  "detail": {
    "error": "model_warming_up",
    "model": "RedHatAI/gemma-3-27b-it-FP8-dynamic",
    "estimated_wait_seconds": 600,
    "retry_after": 60,
    "message": "Model is starting up on serverless GPU. Retry in 60s, expected ready within ~10 minutes."
  }
}

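The response also includes a Retry-After: 60 HTTP header. Some HTTP clients can be configured to honor this header, but many, including requests with its default settings, simply return the 503 to the caller, so plan to handle the retry yourself.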
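
The fields above can be turned into a wait time with a small helper. This is a sketch, not an official client: cold_start_wait is a hypothetical name, and the order of preference (header first, then the body's retry_after, then a default) is an assumption about what is sensible rather than documented behavior.

```python
def cold_start_wait(status_code, headers, body, default_wait=60):
    """Return seconds to wait before retrying, or None if this is not a cold start.

    `headers` is any mapping of response headers. A plain dict is
    case-sensitive, unlike the headers object that requests provides.
    """
    if status_code != 503:
        return None
    detail = body.get("detail") if isinstance(body, dict) else None
    if not isinstance(detail, dict) or detail.get("error") != "model_warming_up":
        return None
    # Assumed preference order: header, then body field, then the default.
    for candidate in (headers.get("Retry-After"), detail.get("retry_after")):
        try:
            wait = int(candidate)
        except (TypeError, ValueError):
            continue  # missing or non-numeric (e.g. an HTTP-date)
        if wait >= 0:
            return wait
    return default_wait


sample_body = {
    "detail": {
        "error": "model_warming_up",
        "model": "RedHatAI/gemma-3-27b-it-FP8-dynamic",
        "estimated_wait_seconds": 600,
        "retry_after": 60,
        "message": "Model is starting up on serverless GPU. "
                   "Retry in 60s, expected ready within ~10 minutes.",
    }
}

print(cold_start_wait(503, {"Retry-After": "60"}, sample_body))  # prints 60
print(cold_start_wait(200, {}, {}))                              # prints None
```

Returning None for a 503 that lacks the model_warming_up error keeps unrelated outages from being mistaken for a cold start.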

Check Model Availability

Before sending a chat request, you can check if a model is available:

curl -s -H "x-api-key: YOUR_API_KEY" \
  https://api.steeringapi.com/v1/models/status | python3 -m json.tool

For each model, the response includes an available field (true, false, or null) and a serverless field. A null value means the backend has no data yet, for example right after a fresh deploy; it resolves to true or false within seconds.
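
Because available has three possible values, client code should not treat it as a plain boolean. The sketch below maps one model's status entry to a suggested action. The function name, the action labels, and the assumption that each entry is a dict holding available and serverless are all illustrative; the overall shape of the status response is not specified here.

```python
def next_action(entry):
    """Map one model's status entry (assumed dict) to a suggested client action."""
    available = entry.get("available")
    if available is True:
        return "send"
    if available is None:
        # No data yet (fresh deploy): check again in a few seconds.
        return "recheck"
    # available is False from here on.
    if entry.get("serverless"):
        # A request now would likely get the cold-start 503.
        return "expect_cold_start"
    # Assumption: an always-on model reporting False is simply unavailable.
    return "unavailable"


print(next_action({"available": True, "serverless": True}))   # prints send
print(next_action({"available": None, "serverless": True}))   # prints recheck
print(next_action({"available": False, "serverless": True}))  # prints expect_cold_start
```

Using identity checks (is True, is None) avoids the common mistake of writing `if not available`, which would lump null and false together.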

Recommended Retry Logic

For serverless models, implement a simple retry loop that respects the Retry-After header:

import time
import requests

def chat_with_retry(api_key, messages, model, max_retries=10):
    """Send a chat request with automatic cold-start retry."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                "https://api.steeringapi.com/v1/chat/completions",
                headers={"x-api-key": api_key, "Content-Type": "application/json"},
                json={"model": model, "messages": messages, "max_completion_tokens": 256},
                timeout=60,
            )
        except requests.exceptions.Timeout:
            # The attempt already waited up to 60s, so retry without sleeping.
            print(f"Request timed out, retrying (attempt {attempt + 1})")
            continue
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 503:
            # Retry-After may be missing or non-numeric; fall back to 60s.
            header = resp.headers.get("Retry-After", "60")
            retry_after = int(header) if header.isdigit() else 60
            print(f"Model warming up, retrying in {retry_after}s (attempt {attempt + 1})")
            time.sleep(retry_after)
            continue
        # Any other status is unexpected: raise for 4xx/5xx.
        resp.raise_for_status()
    raise TimeoutError("Model did not become available within retry window")

# Usage
result = chat_with_retry(
    api_key="YOUR_API_KEY",
    messages=[{"role": "user", "content": "Hello!"}],
    model="RedHatAI/gemma-3-27b-it-FP8-dynamic",
)
print(result["choices"][0]["message"]["content"])

Streaming vs Non-Streaming

Both streaming (stream: true) and non-streaming requests return a 503 during cold starts. For streaming requests, the 503 is returned before the SSE stream opens, so your client receives a normal JSON error response rather than an interrupted stream.
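
In practice this means a streaming client can check the status code before it starts reading events. The sketch below illustrates that order of operations. read_stream_or_error is a hypothetical helper, and FakeResponse is only a stand-in that mimics the few requests.Response members used here, so the example runs without network access.

```python
def read_stream_or_error(resp):
    """Check the status before touching the stream.

    Returns ("warming_up", detail_dict) for a cold-start 503, otherwise
    ("stream", iterator_over_text_lines).
    """
    if resp.status_code == 503:
        # The 503 arrives as ordinary JSON, not as SSE events.
        return "warming_up", resp.json().get("detail", {})
    resp.raise_for_status()  # surface any other HTTP error
    return "stream", resp.iter_lines(decode_unicode=True)


class FakeResponse:
    """Stand-in for requests.Response (illustration only)."""

    def __init__(self, status_code, body=None, lines=()):
        self.status_code = status_code
        self._body = body
        self._lines = list(lines)

    def json(self):
        return self._body

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

    def iter_lines(self, decode_unicode=False):
        return iter(self._lines)


cold = FakeResponse(503, body={"detail": {"error": "model_warming_up", "retry_after": 60}})
kind, detail = read_stream_or_error(cold)
print(kind, detail["retry_after"])  # prints warming_up 60
```

With requests, resp would come from a call such as requests.post(url, headers=..., json={..., "stream": True}, stream=True, timeout=60), where url and the remaining body fields match the non-streaming example.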

Tip

For batch or async workloads, you can fire-and-forget a request to trigger the cold start, then poll GET /v1/models/status until the model is available before sending your actual requests.
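
The polling half of this tip could look roughly like the sketch below. Several things are assumptions: fetch_status is a hypothetical callable that performs GET /v1/models/status and returns a dict keyed by model identifier, each entry holding an available field; the 15-second interval and 900-second ceiling are illustrative values, not recommendations from the API.

```python
import time

GEMMA = "RedHatAI/gemma-3-27b-it-FP8-dynamic"


def wait_until_available(fetch_status, model, poll_interval=15, max_wait=900,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until `model` reports available is True.

    Returns True once the model is available, or False if another poll
    would exceed max_wait. `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + max_wait
    while True:
        status = fetch_status()
        # Assumed shape: {model_id: {"available": ..., "serverless": ...}}
        if status.get(model, {}).get("available") is True:
            return True
        if clock() + poll_interval > deadline:
            return False
        sleep(poll_interval)


# Simulated status sequence: cold, no data, then ready.
responses = iter([
    {GEMMA: {"available": False, "serverless": True}},
    {GEMMA: {"available": None, "serverless": True}},
    {GEMMA: {"available": True, "serverless": True}},
])
ready = wait_until_available(lambda: next(responses), GEMMA, sleep=lambda s: None)
print(ready)  # prints True
```

Both false and null are treated as "not ready yet", so only an explicit true ends the wait.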

Supported Models