

Telemetry Overview

The gateway uses OpenTelemetry to capture every LLM interaction as structured traces and metrics. Two independent pipelines serve different purposes: one for general application observability, the other for LLM-specific analytics. Understanding why they are separated — and how the ten operation names map onto your workloads — is the foundation for everything else in this section.

Dual-pipeline architecture

The gateway emits telemetry through two OTLP exporters that are configured, routed, and stored independently.

| Aspect | Application telemetry | GenAI telemetry |
| --- | --- | --- |
| What it captures | HTTP spans, request latency, error rates | LLM-specific spans (model, tokens, TTFT) and histogram metrics |
| Destination | Tempo (via Alloy or a general-purpose OTel Collector) | ClickHouse (via the GenAI OTel Collector) |
| Config section | `[telemetry]` | `[genai_telemetry]` |
| Typical retention | Days to weeks | Weeks to months (analytics queries over longer windows) |
| Primary consumers | SRE dashboards, on-call alerting | Product analytics, cost attribution, model comparison |

The reason for keeping these separate comes down to three practical concerns:
  1. Retention differs. Application traces are high-volume, short-lived data useful for debugging recent incidents. GenAI telemetry is lower-volume but needs longer retention for cost and quality analysis across model versions.
  2. Query patterns differ. Application traces are searched by trace ID or service name. GenAI data is aggregated by model, provider, and operation — queries that benefit from ClickHouse’s columnar storage rather than a trace-oriented store like Tempo.
  3. Storage requirements differ. Content capture (prompt and completion text) can be large. Storing it alongside lightweight HTTP spans would inflate trace storage costs without benefit.
Internally, spans emitted under the gateway::genai instrumentation target go to the GenAI exporter. Everything else goes to the application exporter.
Your App (traceparent)
    |
    v
Gateway (:4000)
    |--- GenAI spans + metrics ---> OTel Collector (:4317) ---> ClickHouse
    |--- App traces --------------> Alloy/Collector ---------> Tempo
    |
    v
LLM Provider (traceparent injected)
Both pipelines participate in the same W3C Trace Context propagation chain. A single traceparent header flows from your application through the gateway to the upstream provider, so GenAI spans and application spans share the same trace ID and can be correlated in a trace viewer that queries both backends.
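For callers without an OTel SDK, a valid `traceparent` header can be constructed by hand. A minimal Python sketch follows — the gateway address and the auth header are placeholders, not part of any documented API:

```python
import os
import re

def make_traceparent() -> str:
    """Build a W3C Trace Context traceparent header: version-traceid-spanid-flags."""
    trace_id = os.urandom(16).hex()  # 128-bit trace ID, shared across the whole chain
    span_id = os.urandom(8).hex()    # 64-bit ID of the caller's current span
    return f"00-{trace_id}-{span_id}-01"  # trace-flags 01 = sampled

TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

headers = {
    "Authorization": "Bearer <your-key>",  # placeholder
    "traceparent": make_traceparent(),     # gateway and provider reuse this trace ID
}
# e.g. requests.post("http://localhost:4000/v1/chat/completions",
#                    headers=headers, json=payload)
```

Because the gateway propagates the same trace ID downstream, spans from both pipelines can later be joined on it.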

OTel GenAI semantic conventions

The gateway follows the OpenTelemetry GenAI semantic conventions. This is a deliberate choice: by using the standard attribute names and span structure, any OTel-compatible tool — Jaeger, Grafana Tempo, Honeycomb, Datadog — can parse and visualise gateway spans without custom configuration. The span name follows the pattern {operation} {model}. For example:
  • chat gpt-4o
  • embeddings text-embedding-3-small
  • image_generation dall-e-3
This naming convention means you can filter spans by operation type using a simple prefix match, or by model using a suffix match, in any trace viewer that supports span name search.
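The prefix/suffix matching described above can be illustrated in a few lines of Python (the span names are the examples from this page):

```python
def parse_span_name(name: str) -> tuple[str, str]:
    """Split a '{operation} {model}' span name into its two parts."""
    operation, _, model = name.partition(" ")
    return operation, model

span_names = [
    "chat gpt-4o",
    "embeddings text-embedding-3-small",
    "image_generation dall-e-3",
]

# Prefix match: all spans for one operation type.
embedding_spans = [n for n in span_names if n.startswith("embeddings ")]

# Suffix match: all spans for one model, regardless of operation.
gpt4o_spans = [n for n in span_names if n.endswith(" gpt-4o")]
```

The same prefix and suffix patterns work directly in trace viewers that support span-name search.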

Supported operations

The gateway emits ten gen_ai.operation.name values across two families: gateway-native operations (derived from the endpoint and request content) and client-context operations (declared by the caller via headers).

Gateway-native operations

These six operations are determined automatically by the gateway based on which endpoint is called and what the request contains.

| Operation | Endpoint(s) | Description |
| --- | --- | --- |
| `chat` | `/v1/chat/completions`, `/v1/messages`, `/v1/responses` | Text-only LLM calls |
| `generate_content` | Same as `chat` | Multimodal requests (images detected automatically) |
| `embeddings` | `/v1/embeddings` | Vector embedding generation |
| `image_generation` | `/v1/images/generations` | Image creation (DALL-E, etc.) |
| `audio_speech` | `/v1/audio/speech` | Text-to-speech synthesis |
| `audio_transcription` | `/v1/audio/transcriptions` | Speech-to-text transcription |

All three chat endpoints (/v1/chat/completions, /v1/messages, /v1/responses) share the same handler and produce identical telemetry. The distinction between chat and generate_content is covered in the multimodal detection section below.
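The endpoint-to-operation mapping can be sketched as a lookup table. This is illustrative only — it mirrors the table above, not the gateway's actual routing code:

```python
# Endpoint -> gen_ai.operation.name, as documented for gateway-native operations.
ENDPOINT_OPERATIONS = {
    "/v1/chat/completions": "chat",
    "/v1/messages": "chat",
    "/v1/responses": "chat",
    "/v1/embeddings": "embeddings",
    "/v1/images/generations": "image_generation",
    "/v1/audio/speech": "audio_speech",
    "/v1/audio/transcriptions": "audio_transcription",
}

def classify(endpoint: str, has_image_content: bool = False) -> str:
    op = ENDPOINT_OPERATIONS[endpoint]
    # Chat requests containing image content are promoted to generate_content.
    if op == "chat" and has_image_content:
        return "generate_content"
    return op
```

Note that all three chat endpoints collapse to the same operation, which is why they produce identical telemetry.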

Client-context operations

These four operations are declared by the caller via x-to11-context-* headers. When the gateway receives these headers, it emits a sibling span alongside the main GenAI operation span, representing work that happened on the client side between LLM calls.

| Operation | Purpose |
| --- | --- |
| `execute_tool` | Client-side tool or function execution between LLM calls |
| `retrieval` | Vector search or document retrieval in RAG workflows |
| `invoke_agent` | Delegating to a sub-agent in a multi-agent system |
| `create_agent` | Agent lifecycle creation (recommended as client-emitted OTLP; gateway fallback available) |

The reason these exist as gateway-emitted spans rather than requiring a full client-side OTel SDK is pragmatic: many applications — especially those using lightweight HTTP clients or serverless functions — do not have an OTel SDK configured. The context headers let these applications participate in structured GenAI tracing with zero additional dependencies.
See the Context Propagation how-to for the full header specification and examples.
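A small helper can validate the operation name before attaching context headers. Only the `x-to11-context-*` prefix is documented on this page, so the specific header keys below are hypothetical placeholders — consult the Context Propagation how-to for the real names:

```python
CLIENT_CONTEXT_OPERATIONS = {"execute_tool", "retrieval", "invoke_agent", "create_agent"}

def context_headers(operation: str, name: str) -> dict[str, str]:
    """Build x-to11-context-* headers for a client-context span (keys are hypothetical)."""
    if operation not in CLIENT_CONTEXT_OPERATIONS:
        raise ValueError(f"unknown client-context operation: {operation}")
    return {
        "x-to11-context-operation": operation,  # hypothetical key
        "x-to11-context-name": name,            # hypothetical key
    }
```

These headers would be sent alongside the next LLM request, letting the gateway emit the sibling span on the caller's behalf.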

Multimodal detection

When the gateway detects image_url content blocks in a chat request, it automatically promotes the operation from chat to generate_content. This distinction exists purely for telemetry: multimodal requests tend to have different latency profiles, token costs, and failure modes, so separating them in the data makes analytics more meaningful. Callers can override this automatic detection with the x-to11-operation header, which accepts two values: chat and generate_content. This is useful when your application knows the semantic intent of the request better than content inspection can determine — for example, a request that contains base64 image data but is primarily a text classification task.
Multimodal detection is purely a telemetry concern. Routing, credentials, and adapters behave identically regardless of whether the operation is classified as chat or generate_content.

What’s next

Span Attributes

Every GenAI span attribute the gateway emits, grouped by lifecycle stage.

Metrics Reference

Histogram metrics, dimensions, and ClickHouse query examples.

Distributed Tracing

Group multiple LLM calls and agent chains under a single trace.

Context Propagation

Emit client-context spans via headers without an OTel SDK.

Direct Ingestion

Send OTLP spans directly to the gateway’s collector endpoint.

Content Capture

Record prompt and completion text for debugging and evaluation.