# The API for AI builders.
ForgeStack is a hosted inference API platform built for teams shipping production AI features. Call fast multimodal models from one endpoint, keep p95 latency under 50ms, and scale from prototype to global traffic without babysitting GPUs.
## Quickstart
Choose your runtime, pass a prompt, and receive a streamed response from the nearest ForgeStack edge.
```python
from forgestack import ForgeStack

client = ForgeStack(api_key="fs_live_8xK_dev_preview_4f9a2")

response = client.responses.create(
    model="fs-pro-2.0",
    input=[
        {
            "role": "system",
            "content": "You are a concise build assistant for backend engineers.",
        },
        {
            "role": "user",
            "content": "Generate a Redis-backed rate limiter for a Next.js API route.",
        },
    ],
    latency_tier="edge",  # route to the nearest ForgeStack edge pool
    stream=True,          # receive tokens as they are generated
)

# Print the streamed output as text-delta events arrive.
for event in response:
    if event.type == "response.output_text.delta":
        print(event.delta, end="")
```
- Requests are routed to the nearest warm model pool, with automatic regional fallback.
- Swap models without rewriting payloads, stream handlers, retry logic, or observability.
- Trace tokens, latency, errors, and spend per route, tenant, model, or environment (see the sketch below).
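Because request payloads are model-agnostic, switching models is a one-line change, and per-request tags can feed the same dashboards. A minimal sketch building on the quickstart client; the `metadata` parameter and the `output_text` accessor are assumptions about the SDK, not documented fields:

```python
from forgestack import ForgeStack

client = ForgeStack(api_key="fs_live_8xK_dev_preview_4f9a2")

response = client.responses.create(
    model="fs-mini-1.5",  # swapped from "fs-pro-2.0"; the rest of the payload is unchanged
    input=[
        {"role": "user", "content": "Classify this ticket: 'Billing page 500s on load.'"}
    ],
    latency_tier="edge",
    # Assumption: request tags the dashboard can group traces and spend by.
    metadata={"route": "/api/triage", "tenant": "acme", "env": "staging"},
)

# Assumption: non-streaming responses expose the full text on output_text.
print(response.output_text)
```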
## Your API key
Use the development key shown in the quickstart to test inference calls locally. Rotate keys from the dashboard before going to production.
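To keep keys out of source control, load them from the environment instead of hardcoding them; `FORGESTACK_API_KEY` is a placeholder name we chose, not an SDK convention:

```python
import os

from forgestack import ForgeStack

# Placeholder variable name; use whatever your deploy tooling already sets.
client = ForgeStack(api_key=os.environ["FORGESTACK_API_KEY"])
```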
## Pricing
Start free, graduate to higher throughput, then move to dedicated capacity when inference becomes core infrastructure.
| Plan | Free | Growth | Scale |
|---|---|---|---|
| Requests/sec | 5 | 100 | 1000 |
| Tokens/month | 100K | 10M | 1B |
| Models | fs-mini-1.5 | + fs-pro-2.0 | + custom finetunes |
| Price | $0 | $99/mo | Contact us |
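If you run near the Free tier's 5 requests/sec cap, client-side backoff keeps bursts from failing hard. A minimal sketch, assuming the SDK raises a dedicated exception on HTTP 429; `RateLimitError` below is a stand-in class, so swap in the SDK's real error type:

```python
import random
import time

from forgestack import ForgeStack

client = ForgeStack(api_key="fs_live_8xK_dev_preview_4f9a2")

class RateLimitError(Exception):
    """Stand-in for the SDK's real rate-limit exception."""

def create_with_backoff(payload: dict, max_retries: int = 5):
    """Retry responses.create with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return client.responses.create(**payload)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Sleep 1s, 2s, 4s, ... plus jitter so concurrent clients
            # don't retry in lockstep against the same edge pool.
            time.sleep(2 ** attempt + random.random())
```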
## Changelog
ForgeStack releases ship weekly with latency tuning, model upgrades, SDK improvements, and dashboard telemetry.
- Streaming responses now start at the edge, with median first-token latency below 32ms.
- Pin to stable aliases like fs-pro-latest while preserving audit logs for every resolved model.
- Set hard token budgets per environment and receive webhook alerts before usage spikes (see the receiver sketch below).
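On the receiving side, a budget alert is just an HTTP POST to an endpoint you control. A standard-library sketch; the payload fields (`environment`, `tokens_used`, `budget`) are our assumptions about the event shape, so verify them against the dashboard's webhook settings:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class BudgetAlertHandler(BaseHTTPRequestHandler):
    """Receives budget alert webhooks. Payload fields are assumed, not documented."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # Assumed shape: {"environment": "...", "tokens_used": int, "budget": int}
        used, budget = event.get("tokens_used", 0), event.get("budget", 1)
        print(f"[{event.get('environment')}] {used}/{budget} tokens ({used / budget:.0%})")
        self.send_response(200)  # acknowledge quickly; do real work out of band
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BudgetAlertHandler).serve_forever()
```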
## Dashboard
Inspect live traffic, compare model performance, rotate keys, and promote inference routes across environments.
## Models
- fs-mini-1.5: fastest low-cost model for extraction, routing, classification, and short-form generation.
- fs-pro-2.0: balanced reasoning and latency for production copilots, agents, and developer workflows.
- Dedicated capacity: reserved regional capacity, custom finetunes, tenant isolation, and stricter spend controls.
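In practice you can route short extraction and classification calls to fs-mini-1.5 and reserve fs-pro-2.0 for heavier reasoning, without changing payload shape. A sketch; the task-to-model mapping is illustrative, not a ForgeStack feature:

```python
from forgestack import ForgeStack

client = ForgeStack(api_key="fs_live_8xK_dev_preview_4f9a2")

# Illustrative routing table: the cheap model for short-form tasks,
# the larger model for multi-step reasoning.
MODEL_FOR_TASK = {
    "classify": "fs-mini-1.5",
    "extract": "fs-mini-1.5",
    "copilot": "fs-pro-2.0",
}

def run(task: str, prompt: str, stream: bool = False):
    """Dispatch a prompt to the model suited to the task tier."""
    return client.responses.create(
        model=MODEL_FOR_TASK[task],
        input=[{"role": "user", "content": prompt}],
        latency_tier="edge",
        stream=stream,
    )
```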