Telemetry Overview
The gateway uses OpenTelemetry to capture every LLM interaction as structured traces and metrics. Two independent pipelines serve different purposes: one for general application observability, the other for LLM-specific analytics. Understanding why they are separated — and how the ten operation names map onto your workloads — is the foundation for everything else in this section.
Dual-pipeline architecture
The gateway emits telemetry through two OTLP exporters that are configured, routed, and stored independently.
| Aspect | Application telemetry | GenAI telemetry |
|---|---|---|
| What it captures | HTTP spans, request latency, error rates | LLM-specific spans (model, tokens, TTFT) and histogram metrics |
| Destination | Tempo (via Alloy or a general-purpose OTel Collector) | ClickHouse (via the GenAI OTel Collector) |
| Config section | [telemetry] | [genai_telemetry] |
| Typical retention | Days to weeks | Weeks to months (analytics queries over longer windows) |
| Primary consumers | SRE dashboards, on-call alerting | Product analytics, cost attribution, model comparison |
- Retention differs. Application traces are high-volume, short-lived data useful for debugging recent incidents. GenAI telemetry is lower-volume but needs longer retention for cost and quality analysis across model versions.
- Query patterns differ. Application traces are searched by trace ID or service name. GenAI data is aggregated by model, provider, and operation — queries that benefit from ClickHouse’s columnar storage rather than a trace-oriented store like Tempo.
- Storage requirements differ. Content capture (prompt and completion text) can be large. Storing it alongside lightweight HTTP spans would inflate trace storage costs without benefit.
Spans recorded under the gateway::genai instrumentation target go to the GenAI exporter; everything else goes to the application exporter.
The W3C traceparent header flows from your application through the gateway to the upstream provider, so GenAI spans and application spans share the same trace ID and can be correlated in a trace viewer that queries both backends.
OTel GenAI semantic conventions
The gateway follows the OpenTelemetry GenAI semantic conventions. This is a deliberate choice: by using the standard attribute names and span structure, any OTel-compatible tool — Jaeger, Grafana Tempo, Honeycomb, Datadog — can parse and visualise gateway spans without custom configuration. The span name follows the pattern {operation} {model}. For example:
- chat gpt-4o
- embeddings text-embedding-3-small
- image_generation dall-e-3
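The naming rule is simple enough to state as a one-line sketch (illustrative only, not the gateway's actual code):

```python
def genai_span_name(operation: str, model: str) -> str:
    # Span name per the OTel GenAI semantic conventions: "{operation} {model}".
    return f"{operation} {model}"
```

Any tool that groups spans by name will therefore bucket traffic per operation-and-model pair automatically.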
Supported operations
The gateway emits ten gen_ai.operation.name values across two families: gateway-native operations (derived from the endpoint and request content) and client-context operations (declared by the caller via headers).
Gateway-native operations
These six operations are determined automatically by the gateway based on which endpoint is called and what the request contains.
| Operation | Endpoint(s) | Description |
|---|---|---|
| chat | /v1/chat/completions, /v1/messages, /v1/responses | Text-only LLM calls |
| generate_content | Same as chat | Multimodal requests (images detected automatically) |
| embeddings | /v1/embeddings | Vector embedding generation |
| image_generation | /v1/images/generations | Image creation (DALL-E, etc.) |
| audio_speech | /v1/audio/speech | Text-to-speech synthesis |
| audio_transcription | /v1/audio/transcriptions | Speech-to-text transcription |
All three chat endpoints (/v1/chat/completions, /v1/messages, /v1/responses) share the same handler and produce identical telemetry. The distinction between chat and generate_content is covered in the multimodal detection section below.
Client-context operations
These four operations are declared by the caller via x-to11-context-* headers. When the gateway receives these headers, it emits a sibling span alongside the main GenAI operation span, representing work that happened on the client side between LLM calls.
| Operation | Purpose |
|---|---|
| execute_tool | Client-side tool or function execution between LLM calls |
| retrieval | Vector search or document retrieval in RAG workflows |
| invoke_agent | Delegating to a sub-agent in a multi-agent system |
| create_agent | Agent lifecycle creation (recommended as client-emitted OTLP; gateway fallback available) |
See the Context Propagation how-to for the full header specification and examples.
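To give a feel for the mechanism, here is a hypothetical helper that builds such headers. Only the x-to11-context-* prefix is documented here; the exact suffixes (operation, name, duration-ms) are assumptions for illustration, so defer to the Context Propagation how-to for the real specification:

```python
CLIENT_CONTEXT_OPERATIONS = {"execute_tool", "retrieval", "invoke_agent", "create_agent"}

def context_headers(operation: str, name: str, duration_ms: int) -> dict[str, str]:
    """Build hypothetical x-to11-context-* headers for a client-side step.

    The gateway turns these into a sibling span next to the main GenAI
    operation span. Header suffixes below are illustrative assumptions.
    """
    if operation not in CLIENT_CONTEXT_OPERATIONS:
        raise ValueError(f"unknown client-context operation: {operation}")
    return {
        "x-to11-context-operation": operation,
        "x-to11-context-name": name,
        "x-to11-context-duration-ms": str(duration_ms),
    }
```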
Multimodal detection
When the gateway detects image_url content blocks in a chat request, it automatically promotes the operation from chat to generate_content. This distinction exists purely for telemetry: multimodal requests tend to have different latency profiles, token costs, and failure modes, so separating them in the data makes analytics more meaningful.
Callers can override this automatic detection with the x-to11-operation header, which accepts two values: chat and generate_content. This is useful when your application knows the semantic intent of the request better than content inspection can determine — for example, a request that contains base64 image data but is primarily a text classification task.
Multimodal detection is purely a telemetry concern. Routing, credentials, and adapters behave identically regardless of whether the operation is classified as chat or generate_content.
What’s next
Span Attributes
Every GenAI span attribute the gateway emits, grouped by lifecycle stage.
Metrics Reference
Histogram metrics, dimensions, and ClickHouse query examples.
Distributed Tracing
Group multiple LLM calls and agent chains under a single trace.
Context Propagation
Emit client-context spans via headers without an OTel SDK.
Direct Ingestion
Send OTLP spans directly to the gateway’s collector endpoint.
Content Capture
Record prompt and completion text for debugging and evaluation.