Supported Models
Model identifiers, infrastructure details, and how to handle cold starts for serverless models.
Model Availability
SteeringAPI hosts models on two types of infrastructure:
| Model | Type | Cold Start | Recommended Timeout |
|---|---|---|---|
| meta-llama/Llama-3.3-70B-Instruct | Always-on | None | 30s |
| RedHatAI/gemma-3-27b-it-FP8-dynamic | Serverless | ~4-10 minutes | 60s per attempt + retry loop |
Always-on vs Serverless
Always-on models run on dedicated GPUs and respond immediately. Serverless models run on on-demand GPUs that scale to zero when idle. The first request after an idle period triggers a cold start (GPU allocation plus model loading), which takes roughly 4-10 minutes depending on GPU availability.
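The availability table can be mirrored in a small client-side lookup so that each request uses the recommended timeout. This is a minimal sketch: `MODEL_PROFILES` and `request_settings` are hypothetical names chosen for illustration, not part of SteeringAPI.

```python
# Client-side lookup mirroring the availability table.
# MODEL_PROFILES and request_settings are illustrative names, not part of the API.
MODEL_PROFILES = {
    "meta-llama/Llama-3.3-70B-Instruct": {"serverless": False, "timeout_s": 30},
    "RedHatAI/gemma-3-27b-it-FP8-dynamic": {"serverless": True, "timeout_s": 60},
}


def request_settings(model):
    """Return the per-attempt timeout and whether a cold-start retry loop is needed."""
    profile = MODEL_PROFILES[model]  # raises KeyError for a model not listed above
    return {"timeout": profile["timeout_s"], "retry_on_503": profile["serverless"]}
```

Keeping these values in one place means the timeout and retry decision change together when you switch models.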
Model Configuration Reference
| Property | Llama 3.3 70B | Gemma 3 27B |
|---|---|---|
| Model identifier | meta-llama/Llama-3.3-70B-Instruct | RedHatAI/gemma-3-27b-it-FP8-dynamic |
| Infrastructure | Always-on (dedicated GPU) | Serverless (scales to zero when idle) |
| SAE features | 65,536 | 65,536 |
| Hidden dimension | 8,192 | 3,584 |
| Steering layer | 33 | 26 |
| Feature extraction layer | 50 | 40 |
| Feature labels | Goodfire SelfIE v2 | Neuronpedia auto-interpretation |
| SAE source | Goodfire | Gemma Scope 2 (Google DeepMind) |
All API endpoints, steering modes, and feature operations work identically across both models. To switch models, change the model field in your request.
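As a rough illustration of switching models, the sketch below builds two request bodies that differ only in the `model` field. The `build_chat_payload` helper is a hypothetical name; the body fields are the ones used in the retry example on this page.

```python
def build_chat_payload(model, messages, max_completion_tokens=256):
    """Build a chat request body; only `model` changes between the two models."""
    return {
        "model": model,
        "messages": messages,
        "max_completion_tokens": max_completion_tokens,
    }


messages = [{"role": "user", "content": "Hello!"}]
llama_payload = build_chat_payload("meta-llama/Llama-3.3-70B-Instruct", messages)
gemma_payload = build_chat_payload("RedHatAI/gemma-3-27b-it-FP8-dynamic", messages)
```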
Cold Start Behavior
When a serverless model is cold, your request will receive a 503 response with a standardized body:
{
  "detail": {
    "error": "model_warming_up",
    "model": "RedHatAI/gemma-3-27b-it-FP8-dynamic",
    "estimated_wait_seconds": 600,
    "retry_after": 60,
    "message": "Model is starting up on serverless GPU. Retry in 60s, expected ready within ~10 minutes."
  }
}

The response also includes a Retry-After: 60 HTTP header. Some HTTP clients honor this header automatically when retries are configured; otherwise, read it in your own retry loop.
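To tell a cold-start 503 apart from other failures, a client can inspect the `error` field of the body. The sketch below assumes the body shape shown above; `cold_start_wait` is a hypothetical helper, and treating any other 503 as "not a cold start" is a design choice of this example rather than documented API behavior.

```python
def cold_start_wait(status_code, headers, body):
    """Return the seconds to wait for a cold-start 503, or None for anything else."""
    if status_code != 503:
        return None
    detail = body.get("detail") if isinstance(body, dict) else None
    if not isinstance(detail, dict) or detail.get("error") != "model_warming_up":
        return None  # a 503 that is not a cold start
    # Prefer the Retry-After header when it is a plain number of seconds.
    header_value = str(headers.get("Retry-After", "")).strip()
    if header_value.isdigit():
        return int(header_value)
    # Otherwise fall back to the body's retry_after field (60 if absent).
    return int(detail.get("retry_after", 60))
```

With the `requests` library, this could be called as `cold_start_wait(resp.status_code, resp.headers, resp.json())`.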
Check Model Availability
Before sending a chat request, you can check if a model is available:
curl -s -H "x-api-key: YOUR_API_KEY" \
  https://api.steeringapi.com/v1/models/status | python3 -m json.tool

The response includes available (true, false, or null) and serverless for each model. A value of null means the backend has no data yet (fresh deploy); it resolves within seconds.
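The three possible values of `available` can be mapped to client actions as in the sketch below. The overall shape of the status response is not specified here, so the hypothetical `availability_state` helper takes a single model's entry, assumed to be a dict with an `available` key (JSON `null` becomes `None` in Python).

```python
def availability_state(entry):
    """Map one model's status entry to 'ready', 'not_ready', or 'unknown'."""
    available = entry.get("available")
    if available is True:
        return "ready"      # safe to send requests
    if available is False:
        return "not_ready"  # expect a 503 until the model is warm
    return "unknown"        # null: no data yet, check again in a few seconds
```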
Recommended Retry Logic
For serverless models, implement a simple retry loop that respects the Retry-After header:
import time
import requests

def chat_with_retry(api_key, messages, model, max_retries=10):
    """Send a chat request with automatic cold-start retry."""
    # 10 attempts with a 60s wait between them covers a ~10 minute cold start.
    for attempt in range(max_retries):
        resp = requests.post(
            "https://api.steeringapi.com/v1/chat/completions",
            headers={"x-api-key": api_key, "Content-Type": "application/json"},
            json={"model": model, "messages": messages, "max_completion_tokens": 256},
            timeout=60,
        )
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 503:
            # Model is cold: wait for the interval the server asks for, then retry.
            retry_after = int(resp.headers.get("Retry-After", 60))
            print(f"Model warming up, retrying in {retry_after}s (attempt {attempt + 1})")
            time.sleep(retry_after)
            continue
        # Any other status is a real error, so raise it instead of retrying.
        resp.raise_for_status()
    raise TimeoutError("Model did not become available within retry window")

# Usage
result = chat_with_retry(
    api_key="YOUR_API_KEY",
    messages=[{"role": "user", "content": "Hello!"}],
    model="RedHatAI/gemma-3-27b-it-FP8-dynamic",
)
print(result["choices"][0]["message"]["content"])

Streaming vs Non-Streaming
Both streaming (stream: true) and non-streaming requests return a 503 during cold starts. For streaming requests, the 503 is returned before the SSE stream opens, so your client receives a normal JSON error response rather than an interrupted stream.
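Because the 503 arrives before the stream opens, a streaming client can check the status code first and only then start reading events. The sketch below assumes a `requests`-style response created with `stream=True`; `ModelWarmingUp` and `open_stream` are hypothetical names, and the payload format inside each `data:` line is not specified here, so the payloads are returned as raw strings.

```python
class ModelWarmingUp(Exception):
    """Raised when the request was rejected with a cold-start 503."""

    def __init__(self, retry_after):
        super().__init__(f"model warming up, retry in {retry_after}s")
        self.retry_after = retry_after


def _iter_sse_data(resp):
    # Yield the payload of each SSE "data:" line, skipping blank separator lines.
    for raw in resp.iter_lines():
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


def open_stream(resp):
    """Check the status before reading; return an iterator over SSE data payloads."""
    if resp.status_code == 503:
        raise ModelWarmingUp(int(resp.headers.get("Retry-After", 60)))
    resp.raise_for_status()
    return _iter_sse_data(resp)
```

Checking the status eagerly in `open_stream`, rather than inside the generator, means the cold-start error surfaces before any iteration begins.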
Tip
For batch or async workloads, you can fire-and-forget a request to trigger the cold start, then poll GET /v1/models/status until the model is available before sending your actual requests.
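The polling half of this pattern might look like the sketch below. `wait_until_available` is a hypothetical helper; `fetch_available` is a caller-supplied function, for example one that calls GET /v1/models/status and returns the model's `available` value. The 15-second interval and 15-minute limit are illustrative values, not API recommendations.

```python
import time


def wait_until_available(fetch_available, poll_interval=15, max_wait=900,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until fetch_available() returns True; return False once max_wait elapses."""
    deadline = clock() + max_wait
    while True:
        # Only an explicit True counts as ready; False and None keep polling.
        if fetch_available() is True:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.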