# Span Attributes
Every GenAI span emitted by the gateway follows the OpenTelemetry GenAI semantic conventions. The span name follows the pattern {operation} {model} (e.g. chat gpt-4o, embeddings text-embedding-3-small).
This page is the complete catalogue of every span attribute the gateway may emit. Attributes are grouped by category, mirroring the product architecture: core GenAI semantics, routing, experimentation, client context, and caching. For configuration of the telemetry pipeline itself, see Telemetry Overview.
## Conventions used in this reference

- The Type column uses OTel attribute types: string, int, float, bool.
  - string (JSON) denotes a string attribute whose value is a JSON-encoded structure.
  - string (JSON array) denotes a string attribute whose value is a JSON array.
- “When set” means the attribute is omitted entirely when the source value is absent; it is never recorded as empty or zero.
- Attributes in the gen_ai.* namespace follow the OTel GenAI semantic conventions. Attributes in the gateway.* namespace are gateway-specific extensions.
## Endpoint instrumentation matrix
Not every endpoint produces GenAI telemetry. The table below shows which endpoints emit GenAI spans and metrics versus plain HTTP spans only. Endpoints marked “No” under GenAI Span still produce standard HTTP spans via the application telemetry exporter.
| Endpoint | GenAI Span | Operation | Streaming | Content Capture | Metrics |
|---|---|---|---|---|---|
| POST /v1/chat/completions | Yes | chat or generate_content | Both | Yes | All |
| POST /v1/messages | Yes | chat or generate_content | Both | Yes | All |
| POST /v1/responses | Yes | chat or generate_content | Both | Yes | All |
| POST /v1/embeddings | Yes | embeddings | No | Input text | Duration + tokens |
| POST /v1/images/generations | Yes | image_generation | No | Prompt in, revised_prompt out | Duration |
| POST /v1/audio/speech | Yes | audio_speech | No | Input text, no output (binary) | Duration |
| POST /v1/audio/transcriptions | Yes | audio_transcription | No | No input (binary), output text | Duration |
| POST /v1/messages/count_tokens | No | — | — | — | — |
| GET /v1/models | No | — | — | — | — |
| GET /health | No | — | — | — | — |
All three chat endpoints share the same handler and produce identical telemetry. The gen_ai.operation.name is chat for standard requests, or generate_content when the upstream provider uses that convention.
Media endpoints (images/generations, audio/speech, audio/transcriptions) do not report token usage metrics. Only gen_ai.client.operation.duration is recorded for those endpoints.
## Always present
These four attributes appear on every GenAI span regardless of endpoint, provider, or request shape. They form the minimum set required for basic filtering and grouping in any trace backend.
| Attribute | Type | Description |
|---|---|---|
| gen_ai.operation.name | string | chat, generate_content, embeddings, image_generation, audio_speech, or audio_transcription |
| gen_ai.provider.name | string | Provider identity (openai, anthropic, xai) |
| gen_ai.request.model | string | Model name from the request |
| gen_ai.request.stream | bool | Whether the request uses streaming |
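The gateway records these server-side; as a minimal sketch, this is what the same four attributes would look like if set through the OpenTelemetry Python API (the tracer name and values here are illustrative, not gateway code):

```python
from opentelemetry import trace

tracer = trace.get_tracer("docs-example")  # illustrative instrumentation scope

# Span name follows the "{operation} {model}" pattern described above.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.provider.name", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.request.stream", False)
```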
## Request parameters (when set)
These attributes are recorded only when the corresponding parameter is present in the request body. Absent parameters produce no attribute rather than a zero or null value.
| Attribute | Type | Description |
|---|---|---|
| gen_ai.request.temperature | float | Sampling temperature |
| gen_ai.request.max_tokens | int | Maximum tokens to generate |
| gen_ai.request.top_p | float | Nucleus sampling threshold |
| gen_ai.request.frequency_penalty | float | Frequency penalty coefficient |
| gen_ai.request.presence_penalty | float | Presence penalty coefficient |
| gen_ai.request.seed | int | Deterministic sampling seed |
| gen_ai.request.top_k | int | Top-k sampling cutoff |
| gen_ai.request.stop_sequences | string (JSON array) | Stop sequences, e.g. ["\n\n"] |
| gen_ai.request.n | int | Number of completions requested |
Not all parameters are supported by every provider. The gateway records whichever parameters appear in the request body, regardless of whether the upstream provider honours them. Provider-specific parameter validation is not performed at the telemetry layer.
## Response attributes
Populated after the upstream provider returns a response (or after the final streaming chunk). For streaming requests, token counts and finish reasons are extracted from the terminal usage chunk when the provider includes one, or accumulated from individual deltas otherwise.
| Attribute | Type | Description |
|---|---|---|
| gen_ai.response.model | string | Model name from the provider response (may differ from the requested model) |
| gen_ai.response.id | string | Provider-assigned response identifier |
| gen_ai.usage.input_tokens | int | Prompt/input token count |
| gen_ai.usage.output_tokens | int | Completion/output token count |
| gen_ai.response.finish_reasons | string (JSON array) | Stop reasons, e.g. ["stop"] or ["tool_calls"] |
The gen_ai.response.model attribute is particularly useful when the request specifies an alias (e.g. gpt-4o) and the provider resolves it to a specific snapshot (e.g. gpt-4o-2024-08-06).
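To illustrate the streaming extraction rule, here is a hedged sketch, assuming OpenAI-style SSE chunks already parsed into dicts. The chunk shape, field names, and the one-token-per-delta fallback are assumptions for illustration, not the gateway's internals:

```python
def extract_usage(chunks: list[dict]) -> dict:
    """Prefer the terminal usage chunk; otherwise accumulate from deltas."""
    usage = {"input_tokens": None, "output_tokens": None, "finish_reasons": []}
    delta_count = 0
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("delta", {}).get("content"):
                delta_count += 1
            if choice.get("finish_reason"):
                usage["finish_reasons"].append(choice["finish_reason"])
        if chunk.get("usage"):  # terminal usage chunk: authoritative counts
            usage["input_tokens"] = chunk["usage"]["prompt_tokens"]
            usage["output_tokens"] = chunk["usage"]["completion_tokens"]
    if usage["output_tokens"] is None:
        usage["output_tokens"] = delta_count  # crude per-delta fallback
    return usage
```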
## Advanced token types
Provider-specific token breakdowns beyond standard input/output counts. These attributes supplement gen_ai.usage.input_tokens and gen_ai.usage.output_tokens with finer-grained detail where the upstream provider supports it.
| Attribute | Type | Description |
|---|---|---|
| gen_ai.usage.cache_creation.input_tokens | int | Tokens written to the Anthropic prompt cache during this request (aggregate across all TTLs) |
| gen_ai.usage.cache_creation_5m.input_tokens | int | Tokens written to the 5-minute ephemeral prompt cache (Anthropic TTL-specific breakdown) |
| gen_ai.usage.cache_creation_1h.input_tokens | int | Tokens written to the 1-hour prompt cache (Anthropic TTL-specific breakdown) |
| gen_ai.usage.cache_read.input_tokens | int | Tokens read from the Anthropic prompt cache |
| gen_ai.usage.reasoning_tokens | int | Internal reasoning tokens consumed by OpenAI reasoning models (o1, o3-mini) |
Cache token attributes are provider-specific. They appear only when the upstream provider reports them. Anthropic cache attributes require prompt caching to be enabled in the request. OpenAI reasoning tokens appear only for reasoning-capable models.
When Anthropic cache tokens are present, the relationship between token counts is:

- gen_ai.usage.input_tokens = total input tokens (includes both cached and non-cached)
- gen_ai.usage.cache_read.input_tokens = subset of input tokens served from cache
- gen_ai.usage.cache_creation.input_tokens = tokens newly written to cache (aggregate)
- gen_ai.usage.cache_creation_5m.input_tokens = 5-minute TTL portion of cache creation (when Anthropic sends a TTL breakdown)
- gen_ai.usage.cache_creation_1h.input_tokens = 1-hour TTL portion of cache creation (when Anthropic sends a TTL breakdown)
When both the aggregate and TTL-specific breakdown are present, cache_creation.input_tokens equals the sum of cache_creation_5m.input_tokens + cache_creation_1h.input_tokens. The gateway passes all fields through as-is from the provider.
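That invariant can be checked mechanically; a sketch of the cross-check one might run in a span post-processing step, using the attribute names from the table above:

```python
def cache_creation_consistent(attrs: dict) -> bool:
    """True when the aggregate matches the TTL breakdown (or there is none)."""
    agg = attrs.get("gen_ai.usage.cache_creation.input_tokens")
    ttl_5m = attrs.get("gen_ai.usage.cache_creation_5m.input_tokens")
    ttl_1h = attrs.get("gen_ai.usage.cache_creation_1h.input_tokens")
    if agg is None or (ttl_5m is None and ttl_1h is None):
        return True  # nothing to cross-check
    return agg == (ttl_5m or 0) + (ttl_1h or 0)
```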
For OpenAI reasoning models, gen_ai.usage.reasoning_tokens is separate from gen_ai.usage.output_tokens. Reasoning tokens represent internal chain-of-thought computation and are billed separately by the provider.
## Streaming-only attributes
Present exclusively on spans where gen_ai.request.stream is true. These attributes provide per-request performance observability for streaming workloads.
| Attribute | Type | Description |
|---|---|---|
| gen_ai.server.time_to_first_token | float (seconds) | Time from StreamStart to first ContentDelta |
| gen_ai.server.request.duration | float (seconds) | Total wall-clock duration of the stream |
| gen_ai.server.output_tokens_per_second | float | Output token throughput over the stream lifetime |
For non-streaming requests, none of these attributes are present. The overall request duration is captured by the span’s own start and end timestamps.
The span-level TTFT attribute is a custom extension. The OTel spec defines TTFT as a metric only. The gateway adds the span attribute so individual slow requests can be identified without querying the metrics pipeline.
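For reference, the throughput attribute is consistent with a simple ratio over the stream lifetime. Whether TTFT is excluded from the denominator is not documented, so this sketch assumes it is not:

```python
def output_tokens_per_second(output_tokens: int, duration_s: float) -> float:
    # duration_s corresponds to gen_ai.server.request.duration (seconds)
    return output_tokens / duration_s if duration_s > 0 else 0.0
```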
## Output type
Indicates the modality of the response produced by the operation. The value is determined by the endpoint, not by the response content.
| Attribute | Type | Values |
|---|---|---|
| gen_ai.output.type | string | "text", "embeddings", "image", "speech" |
The mapping from endpoint to output type is fixed:
| Endpoint group | Output type |
|---|---|
| Chat completions, messages, responses | text |
| Embeddings | embeddings |
| Image generations | image |
| Audio speech | speech |
Audio transcription endpoints produce text as their output type, since the response is transcribed text rather than audio.
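Restated as a lookup keyed by operation name, as a convenience for span post-processing (not gateway code):

```python
# Fixed operation -> gen_ai.output.type mapping, restated from the table above.
OUTPUT_TYPE = {
    "chat": "text",
    "generate_content": "text",
    "embeddings": "embeddings",
    "image_generation": "image",
    "audio_speech": "speech",
    "audio_transcription": "text",  # output is transcribed text, not audio
}
```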
## Server address
Identifies the upstream provider endpoint the gateway connected to. These follow the OTel server.* namespace conventions.
| Attribute | Type | Description |
|---|---|---|
| server.address | string | Upstream hostname (e.g. api.openai.com) |
| server.port | int | Upstream port (e.g. 443) |
## Error attributes
Present when the upstream call fails or times out. A span with error attributes always has its OTel status set accordingly.
| Attribute | Type | Description |
|---|---|---|
| error.type | string | Error category: "timeout", "connection_error", or "request_error" |
The span StatusCode is set to ERROR and StatusMessage contains the error detail string. On successful requests, StatusCode remains UNSET and no error.type attribute is recorded.
Error type values:
| Value | Meaning |
|---|---|
| timeout | The upstream provider did not respond within the configured timeout |
| connection_error | TCP connection to the upstream provider failed or was reset |
| request_error | The upstream provider returned an HTTP error status (4xx or 5xx) |
Provider-returned error bodies are not captured as span attributes. The StatusMessage contains a summary suitable for filtering and alerting.
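As an illustration of the three categories, here is a hedged sketch of how an httpx-based client could map exceptions onto them; the gateway's actual HTTP client and mapping are not documented here:

```python
import httpx

def classify(exc: Exception) -> str:
    """Map client exceptions to the documented error.type values."""
    if isinstance(exc, httpx.TimeoutException):  # check before TransportError
        return "timeout"
    if isinstance(exc, httpx.TransportError):  # connect failures, resets, etc.
        return "connection_error"
    if isinstance(exc, httpx.HTTPStatusError):  # 4xx/5xx via raise_for_status()
        return "request_error"
    raise exc  # not one of the documented categories
```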
## User, session, and conversation attributes

Populated from request headers or body fields. All attributes in this group are optional and appear only when the corresponding header or field is present. These attributes enable grouping spans by user, session, or conversation for analytics and debugging.
| Attribute | Type | Source |
|---|---|---|
| gen_ai.user.id | string | x-to11-user-id header or user field in the request body |
| gen_ai.session.id | string | x-to11-session-id header |
| gen_ai.conversation.id | string | X-Conversation-Id header |
| deployment.environment.name | string | x-to11-environment header |
When both the x-to11-user-id header and the user body field are present, the header takes precedence. The gen_ai.user.id attribute enables per-user analytics queries in ClickHouse without requiring application-level joins.
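The precedence rule in sketch form, using the header and field names from the table above:

```python
def resolve_user_id(headers: dict, body: dict):
    # The x-to11-user-id header wins over the body's user field when both are set.
    return headers.get("x-to11-user-id") or body.get("user")
```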
## Tool call child spans

When a model response includes tool calls, the gateway emits child spans nested under the GenAI operation span. Each tool call produces one child span. The parent-child relationship is visible in any trace viewer that supports nested spans.
Span name: tool_call.{function_name} (e.g. tool_call.get_weather)
| Attribute | Type | Description |
|---|---|---|
| gen_ai.tool.name | string | Function name |
| gen_ai.tool.type | string | Always "function" |
| gen_ai.tool.call.id | string | Provider-assigned call identifier |
| gen_ai.tool.call.arguments | string (JSON) | Function arguments (only with capture_content enabled) |
The gen_ai.tool.definitions attribute (string, JSON array) appears on the parent span, not on tool call child spans. It contains the tool/function definitions sent in the request when capture_content is enabled.
Multiple tool calls within a single response each produce their own child span, all parented to the same GenAI operation span. If a response contains no tool calls, no child spans are created.
### Timing behaviour
| Path | Behaviour |
|---|---|
| Streaming | Spans opened on ToolCallStart, closed on ToolCallEnd. Durations reflect actual streaming time. |
| Non-streaming | Synthetic timestamps. Generation duration evenly distributed across tool call spans. |
In the non-streaming case, the gateway cannot observe when individual tool calls were generated. The synthetic timestamp distribution ensures that tool call spans are evenly spaced within the parent span’s duration, providing a reasonable approximation for visualisation in trace viewers.
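A sketch of that even distribution, assuming contiguous equal slots within the parent span (the gateway's exact partitioning scheme is an implementation detail):

```python
def synthetic_tool_call_times(parent_start_ns: int, parent_end_ns: int, n_calls: int):
    """Return (start, end) nanosecond pairs, one per tool call, evenly spaced."""
    slot = (parent_end_ns - parent_start_ns) // n_calls
    return [
        (parent_start_ns + i * slot, parent_start_ns + (i + 1) * slot)
        for i in range(n_calls)
    ]
```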
## Gateway routing attributes
Present on spans that traverse the gateway’s managed routing layer. These attributes use the gateway.* namespace, which is specific to the to11 gateway and not part of the OTel semantic conventions. Passthrough requests carry gateway.routing.path = "passthrough" and omit the remaining routing attributes.
| Attribute | Type | Description |
|---|---|---|
| gateway.routing.path | string | "managed" or "passthrough" |
| gateway.routing.target | string | Resolved target name |
| gateway.routing.layer | string | Routing layer that resolved the request ("function" or the upstream layer) |
| gateway.routing.strategy | string | Strategy used (e.g. "weighted", "fallback") |
| gateway.routing.attempt | int | Attempt number (1-indexed; increments on fallback retries) |
| gateway.routing.upstream_model | string | Model sent to the upstream provider (may differ from the requested model) |
For fallback strategies, the gateway.routing.attempt attribute increments with each retry. The first attempt is 1. Combined with gateway.routing.target, this allows reconstruction of the full fallback chain from span data.
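A sketch of that reconstruction, assuming each attempt was recorded on its own span and the spans have been fetched as attribute dicts:

```python
def fallback_chain(spans: list[dict]) -> list[str]:
    """Targets in attempt order, e.g. ["primary", "fallback-1"]."""
    ordered = sorted(spans, key=lambda s: s["gateway.routing.attempt"])
    return [s["gateway.routing.target"] for s in ordered]
```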
## Experiment and variant attributes
Present when the request is routed through a function with multiple variants. These attributes record which variant was selected and the weight distribution at the time of the request. For functions with a single variant, the variant attributes are still emitted but gateway.experiment.total_variants is 1.
| Attribute | Type | Description |
|---|---|---|
| gateway.variant.name | string | Name of the selected variant |
| gateway.variant.weight | int | Configured weight of the selected variant |
| gateway.experiment.total_variants | int | Total number of variants defined on the function |
| gateway.function.name | string | Function name from the route resolution |
| gateway.function.endpoint | string | Resolved endpoint for the function |
Variant selection is determined at request time based on the configured weights. The gateway.variant.weight attribute records the weight of the selected variant, not the probability. To compute the selection probability, divide by the sum of all variant weights (derivable from the total variants and their individual weights across spans).
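As a sketch, once the variant weights have been collected from spans:

```python
def selection_probability(weight: int, all_weights: list[int]) -> float:
    # A variant's selection probability is its weight over the weight total.
    return weight / sum(all_weights)

# e.g. gateway.variant.weight = 20 against variants weighted [20, 80] -> 0.2
assert selection_probability(20, [20, 80]) == 0.2
```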
## Client-context operation attributes
These attributes appear on sibling spans created via x-to11-context-* headers. Each header value generates a separate span adjacent to the main GenAI operation span, sharing the same parent trace context. The span name matches the operation type (e.g. execute_tool, retrieval, invoke_agent, create_agent).
### execute_tool spans

| Attribute | Type | Description |
|---|---|---|
| gen_ai.tool.name | string | Tool name from the context header |
| gen_ai.tool.call.id | string | Tool call identifier from the context header |
### retrieval spans
| Attribute | Type | Description |
|---|---|---|
| gen_ai.retrieval.source | string | Retrieval source identifier |
| gen_ai.retrieval.document.count | string | Number of documents retrieved |
### invoke_agent / create_agent spans
| Attribute | Type | Description |
|---|---|---|
| gen_ai.agent.id | string | Agent identifier |
| gen_ai.agent.name | string | Agent name |
| gen_ai.agent.step | string | Current agent step identifier (invoke_agent only) |
| gen_ai.agent.description | string | Human-readable agent description (create_agent only) |
The invoke_agent and create_agent span types differ in their attribute sets. gen_ai.agent.step tracks the execution step within an agent invocation, while gen_ai.agent.description records the agent definition at creation time. Both share gen_ai.agent.id and gen_ai.agent.name.
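As a minimal sketch, an invoke_agent span recorded through the OpenTelemetry Python API would carry attributes like these. The values are illustrative; in practice the gateway creates these spans itself from the context headers:

```python
from opentelemetry import trace

tracer = trace.get_tracer("docs-example")

with tracer.start_as_current_span("invoke_agent") as span:
    span.set_attribute("gen_ai.agent.id", "agent-7f3a")        # illustrative value
    span.set_attribute("gen_ai.agent.name", "support-triage")  # illustrative value
    span.set_attribute("gen_ai.agent.step", "classify")        # invoke_agent only
```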
## Content capture attributes
Recorded when content capture is enabled via the capture_content configuration flag or the GATEWAY_GENAI_CAPTURE_CONTENT=true environment variable. These attributes contain the full prompt and completion text as serialised JSON. Streaming responses are accumulated up to a 64 KB limit (MAX_ACCUMULATED_CONTENT_BYTES); non-streaming responses are captured in full.
| Attribute | Type | Description |
|---|---|---|
| gen_ai.input.messages | string (JSON) | Serialised prompt messages from the request |
| gen_ai.output.messages | string (JSON) | Serialised completion choices or accumulated stream content |
| gen_ai.system_instructions | string | System prompt text |
For media endpoints, content capture behaviour varies:
| Endpoint | Input capture | Output capture |
|---|---|---|
| Image generation | prompt field | revised_prompt from DALL-E 3 responses |
| Audio speech (TTS) | input text field | Not captured (binary audio) |
| Audio transcription | Not captured (binary audio) | Transcription text field |
Content capture must be explicitly enabled. Prompt and completion text, including any sensitive data, is recorded verbatim. See Content Capture for configuration details.
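A sketch of the bounded accumulation for streaming output, assuming truncation simply stops appending at the limit; whether the gateway cuts mid-delta or drops whole deltas is not specified here:

```python
MAX_ACCUMULATED_CONTENT_BYTES = 64 * 1024  # the 64 KB limit named above

def accumulate(buffer: bytearray, delta: str) -> None:
    """Append stream content until the capture limit is reached."""
    remaining = MAX_ACCUMULATED_CONTENT_BYTES - len(buffer)
    if remaining > 0:
        buffer.extend(delta.encode("utf-8")[:remaining])
```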
## Cache status attributes
Present when the gateway’s response cache is active for the request. The skip status indicates that caching was enabled but the request was not eligible (e.g. streaming requests, or requests with temperature > 0).
| Attribute | Type | Description |
|---|---|---|
| gen_ai.cache.status | string | "hit", "miss", or "skip" |
| gen_ai.cache.age_seconds | float | Age of the cached response in seconds (present on cache hits only) |
On a cache hit, the gateway skips the upstream provider call entirely. The span still records gen_ai.response.model and token usage attributes from the cached response. The gen_ai.cache.age_seconds attribute indicates staleness, allowing cache freshness analysis across requests.
Cache status values:
| Value | Meaning |
|---|---|
| hit | Response served from cache. gen_ai.cache.age_seconds is present. |
| miss | Cache lookup performed but no matching entry found. Request forwarded to provider. |
| skip | Request was not eligible for caching (e.g. streaming, non-deterministic temperature). |
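For offline analysis, a hit rate that respects this eligibility distinction would exclude skip spans, as in this sketch over collected gen_ai.cache.status values:

```python
def cache_hit_rate(statuses: list[str]) -> float:
    """Hit rate over eligible lookups only; skip spans never hit the cache."""
    eligible = [s for s in statuses if s in ("hit", "miss")]
    return eligible.count("hit") / len(eligible) if eligible else 0.0
```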