Set "stream": true on any chat request to receive the response as Server-Sent Events (Content-Type: text/event-stream). The gateway forwards tokens to your client as they arrive from the upstream provider. Internally the gateway picks one of two streaming paths per request, based on whether your SDK’s wire format already matches the upstream provider’s. Both paths produce the same telemetry and apply the same guardrail checks — the difference is only how much work happens on the response.

Two paths

            client SDK format == upstream SSE format?
                    │                       │
                  yes                       no
            ┌───────▼───────┐      ┌────────▼──────────────────┐
            │   Fast path   │      │     Normalized path       │
            │ (passthrough) │      │ (parse → translate → emit)│
            └───────────────┘      └───────────────────────────┘

Fast path — SSE passthrough

When your SDK’s format matches the upstream provider, the gateway forwards the raw SSE bytes straight through. There is no re-serialization on the response path, which is what keeps streaming overhead low. Telemetry (token usage, model, finish reason, content) is still extracted from the stream as it passes.

Normalized path — parse, translate, re-emit

When the formats differ, each upstream event is parsed, translated into your SDK’s wire format, and re-emitted. This is what lets you stream an Anthropic model through an OpenAI client (and vice versa) without your SDK seeing an unfamiliar event shape.
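
The translate step can be pictured with a minimal sketch. It assumes the publicly documented Anthropic stream events (`content_block_delta`, `message_delta`) and the OpenAI `chat.completion.chunk` shape. The function names and the stop-reason mapping are hypothetical illustrations, not the gateway's actual implementation.

```python
import json
from typing import Optional


def anthropic_event_to_openai_chunk(event: dict, model: str) -> Optional[dict]:
    """Translate one parsed Anthropic stream event into an OpenAI-style chunk.

    Returns None for events this sketch does not translate (ping, block start/stop, ...).
    """
    kind = event.get("type")
    delta_in = event.get("delta", {})
    if kind == "content_block_delta" and delta_in.get("type") == "text_delta":
        delta, finish = {"content": delta_in["text"]}, None
    elif kind == "message_delta" and delta_in.get("stop_reason"):
        # Illustrative mapping only; the real gateway mapping is not documented here.
        reasons = {"end_turn": "stop", "max_tokens": "length"}
        delta, finish = {}, reasons.get(delta_in["stop_reason"], "stop")
    else:
        return None
    return {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish}],
    }


def to_sse(chunk: dict) -> str:
    """Serialize a chunk as a single SSE data frame."""
    return f"data: {json.dumps(chunk)}\n\n"
```

An OpenAI client reading the re-emitted frames sees only `choices[0].delta.content` updates and a final `finish_reason`, which is the property the normalized path is meant to provide.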

Path selection

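| Client SDK   | Upstream provider | Path       | Reason                                              |
|--------------|-------------------|------------|-----------------------------------------------------|
| OpenAI / xAI | OpenAI            | Fast       | Wire formats match                                  |
| OpenAI / xAI | xAI               | Fast       | xAI uses OpenAI-compatible SSE                      |
| OpenAI / xAI | Anthropic         | Normalized | Anthropic events are translated to the OpenAI shape |
| OpenAI / xAI | Cohere            | Normalized | Cohere events are translated to the OpenAI shape    |
| Anthropic    | Any               | Normalized | Anthropic ingress always uses the normalized path   |
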
Either way, the events your client receives are always in the format your SDK expects.
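
The selection rules above can be restated as a small function. This is a sketch of the decision as documented on this page, not the gateway's code; `select_path` and the lowercase identifiers are hypothetical names chosen for illustration.

```python
# SDK and provider families that share the OpenAI-compatible SSE wire format,
# according to the path selection rules on this page.
OPENAI_COMPATIBLE = {"openai", "xai"}


def select_path(client_sdk: str, upstream: str) -> str:
    """Return "fast" or "normalized" for a client SDK / upstream provider pair."""
    client, provider = client_sdk.lower(), upstream.lower()
    if client == "anthropic":
        # Anthropic ingress always uses the normalized path.
        return "normalized"
    if client in OPENAI_COMPATIBLE and provider in OPENAI_COMPATIBLE:
        return "fast"
    return "normalized"
```

For example, `select_path("openai", "xai")` returns `"fast"`, while `select_path("openai", "anthropic")` returns `"normalized"`.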

Requesting a stream

curl -N https://gw.to11.ai/v1/chat/completions \
  -H "x-to11-authorization: Bearer $TO11_API_KEY" \
  -H "x-to11-project-id: $TO11_PROJECT_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Count to 3"}],
    "stream": true
  }'
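
The same request can be made from Python. The sketch below uses only the standard library and assumes an OpenAI-shaped stream: one JSON chunk per `data:` line, ending with a `data: [DONE]` marker. The helper names `iter_content` and `stream_chat` are illustrative, and the environment variables mirror the curl example.

```python
import json
import os
import urllib.request
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines until the [DONE] marker."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(prompt: str) -> Iterator[str]:
    """Send a streaming chat request and yield content as it arrives."""
    body = json.dumps({
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        "https://gw.to11.ai/v1/chat/completions",
        data=body,
        headers={
            "x-to11-authorization": f"Bearer {os.environ['TO11_API_KEY']}",
            "x-to11-project-id": os.environ["TO11_PROJECT_ID"],
            "Content-Type": "application/json",
        },
    )
    # The response object iterates line by line, so deltas surface as they arrive.
    with urllib.request.urlopen(request) as response:
        yield from iter_content(response)
```

Calling `for text in stream_chat("Count to 3"): print(text, end="", flush=True)` prints the reply incrementally. An official SDK would normally replace the hand-written parsing.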

Guardrails during streaming

When output guardrails are enabled, the gateway accumulates response text (up to 64 KB) to inspect it. The enforcement point differs between streaming and non-streaming because streamed chunks have already left the gateway by the time the full response is known.

| Mode          | Behavior |
|---------------|----------|
| Non-streaming | The full response is inspected before delivery. A violation returns 422 Unprocessable Entity and the response is withheld. |
| Streaming     | Chunks are delivered as they arrive. After the stream completes, accumulated text is checked; because the content has already been sent, a violation is recorded for monitoring rather than blocking the response. |

The 64 KB accumulation cap bounds only the text inspected for guardrails — it never truncates the response streamed to your client, which continues unbounded.
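
The streaming behavior can be sketched as a generator that forwards every chunk while keeping a capped copy for inspection. This is an illustration of the described behavior, not the gateway's code. It assumes 64 KB means 64 × 1024 bytes of UTF-8 text; `GuardrailBuffer`, `forward_with_inspection`, and the `check` callback are hypothetical names.

```python
from typing import Callable, Iterable, Iterator

GUARDRAIL_CAP_BYTES = 64 * 1024  # assumption: "64 KB" interpreted as 65,536 bytes


class GuardrailBuffer:
    """Accumulates response text for inspection, up to a fixed byte cap."""

    def __init__(self, cap: int = GUARDRAIL_CAP_BYTES) -> None:
        self.cap = cap
        self._buf = bytearray()

    def add(self, text: str) -> None:
        room = self.cap - len(self._buf)
        if room > 0:
            self._buf += text.encode("utf-8")[:room]

    def text(self) -> str:
        # A multi-byte character cut at the cap is dropped rather than raising.
        return self._buf.decode("utf-8", errors="ignore")


def forward_with_inspection(
    chunks: Iterable[str],
    check: Callable[[str], None],
    cap: int = GUARDRAIL_CAP_BYTES,
) -> Iterator[str]:
    """Forward every chunk unchanged, then inspect the capped accumulation."""
    buffer = GuardrailBuffer(cap)
    for chunk in chunks:
        buffer.add(chunk)
        yield chunk          # delivered immediately and never truncated
    check(buffer.text())     # runs only after the last chunk has been sent
```

Because `check` runs after the final `yield`, it can only record a violation; the text has already reached the client, which matches the streaming row in the table above.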

Mid-stream errors

If the upstream provider fails after the stream has already started (HTTP 200 and the text/event-stream headers have been sent), the gateway delivers the error as an SSE error frame in your SDK’s shape and stops the stream. It does not emit a false success terminator. See Error handling for the frame format per SDK.
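
On the client side, this means a stream is complete only once the success terminator has been seen. The sketch below makes two assumptions that this page does not confirm: the client uses the OpenAI shape with a `data: [DONE]` terminator, and the error frame is a `data:` line whose JSON carries a top-level `error` object. Check the Error handling page for the real frame format. `StreamError` and `collect_stream` are hypothetical names.

```python
import json
from typing import Iterable


class StreamError(Exception):
    """Raised when a stream reports an error or ends without a terminator."""


def collect_stream(lines: Iterable[str]) -> str:
    """Return the full text of a stream, or raise StreamError if it failed."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return "".join(parts)  # success terminator seen
        frame = json.loads(payload)
        if "error" in frame:  # assumed error-frame shape, see lead-in
            err = frame["error"]
            message = err.get("message", "unknown") if isinstance(err, dict) else str(err)
            raise StreamError(f"upstream failed mid-stream: {message}")
        for choice in frame.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    # The connection closed without [DONE]: treat the partial text as incomplete.
    raise StreamError("stream ended without a success terminator")
```

Treating a missing terminator as a failure keeps partial output from being mistaken for a finished response.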

Telemetry

Streaming and non-streaming requests produce the same GenAI span — model, token counts, finish reason, and timing — and the same metrics. Streaming additionally records time-to-first-token as gen_ai.server.time_to_first_token, measured from the start of the response to the first content token. See the Telemetry reference for the full attribute and metric catalog.
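
For comparison, a client can take a similar measurement on its own side. The sketch below times the gap between starting to read a stream and the first non-empty token. It is a client-side approximation that includes network latency, so it will not equal the gateway's gen_ai.server.time_to_first_token value. `measure_ttft` is a hypothetical helper, and the injectable `clock` exists only to make it testable.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_ttft(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], List[str]]:
    """Return (seconds until the first non-empty token, all tokens received)."""
    start = clock()
    ttft: Optional[float] = None
    received: List[str] = []
    for token in tokens:
        if ttft is None and token:
            ttft = clock() - start  # the first content token has arrived
        received.append(token)
    return ttft, received
```

If the stream carries no content at all, the first element of the result is `None`.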