Telemetry Overview
The gateway uses OpenTelemetry to capture every LLM interaction as structured traces and metrics. Two independent pipelines serve different purposes: one for general application observability, the other for LLM-specific analytics. Understanding why they are separated — and how the ten operation names map onto your workloads — is the foundation for everything else in this section.
Dual-pipeline architecture
The gateway emits telemetry through two OTLP exporters that are configured, routed, and stored independently.
| Aspect | Application telemetry | GenAI telemetry |
|---|---|---|
| What it captures | HTTP spans, request latency, error rates | LLM-specific spans (model, tokens, TTFT) and histogram metrics |
| Destination | Tempo (via Alloy or a general-purpose OTel Collector) | ClickHouse (via the GenAI OTel Collector) |
| Config section | [telemetry] | [genai_telemetry] |
| Typical retention | Days to weeks | Weeks to months (analytics queries over longer windows) |
| Primary consumers | SRE dashboards, on-call alerting | Product analytics, cost attribution, model comparison |
- Retention differs. Application traces are high-volume, short-lived data useful for debugging recent incidents. GenAI telemetry is lower-volume but needs longer retention for cost and quality analysis across model versions.
- Query patterns differ. Application traces are searched by trace ID or service name. GenAI data is aggregated by model, provider, and operation — queries that benefit from ClickHouse’s columnar storage rather than a trace-oriented store like Tempo.
- Storage requirements differ. Content capture (prompt and completion text) can be large. Storing it alongside lightweight HTTP spans would inflate trace storage costs without benefit.
Spans recorded under the gateway::genai instrumentation target go to the GenAI exporter; everything else goes to the application exporter.
The W3C traceparent header flows from your application through the gateway to the upstream provider, so GenAI spans and application spans share the same trace ID and can be correlated in a trace viewer that queries both backends.
OTel GenAI semantic conventions
The gateway follows the OpenTelemetry GenAI semantic conventions. This is a deliberate choice: by using the standard attribute names and span structure, any OTel-compatible tool — Jaeger, Grafana Tempo, Honeycomb, Datadog — can parse and visualise gateway spans without custom configuration. The span name follows the pattern {operation} {model}. For example:
- chat gpt-4o
- embeddings text-embedding-3-small
- image_generation dall-e-3
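The naming rule is simple enough to state as a one-line sketch (illustrative only, not the gateway's actual code):

```python
def genai_span_name(operation: str, model: str) -> str:
    # Span name per the OTel GenAI semantic conventions: "{operation} {model}".
    return f"{operation} {model}"
```

Any tool that groups spans by name will therefore bucket traffic per operation-and-model pair automatically.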
Supported operations
The gateway emits ten gen_ai.operation.name values across two families: gateway-native operations (derived from the endpoint and request content) and client-context operations (declared by the caller via headers).
Gateway-native operations
These six operations are determined automatically by the gateway based on which endpoint is called and what the request contains.
| Operation | Endpoint(s) | Description |
|---|---|---|
| chat | /v1/chat/completions, /v1/messages, /v1/responses | Text-only LLM calls |
| generate_content | Same as chat | Multimodal requests (images detected automatically) |
| embeddings | /v1/embeddings | Vector embedding generation |
| image_generation | /v1/images/generations | Image creation (DALL-E, etc.) |
| audio_speech | /v1/audio/speech | Text-to-speech synthesis |
| audio_transcription | /v1/audio/transcriptions | Speech-to-text transcription |
All three chat endpoints (/v1/chat/completions, /v1/messages, /v1/responses) share the same handler and produce identical telemetry. The distinction between chat and generate_content is covered in the multimodal detection section below.
Client-context operations
These four operations are declared by the caller via x-to11-context-* headers. When the gateway receives these headers, it emits a sibling span alongside the main GenAI operation span, representing work that happened on the client side between LLM calls.
| Operation | Purpose |
|---|---|
| execute_tool | Client-side tool or function execution between LLM calls |
| retrieval | Vector search or document retrieval in RAG workflows |
| invoke_agent | Delegating to a sub-agent in a multi-agent system |
| create_agent | Agent lifecycle creation (recommended as client-emitted OTLP; gateway fallback available) |
See the Context Propagation how-to for the full header specification and examples.
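To give a feel for the mechanism, here is a hypothetical helper that builds such headers. Only the x-to11-context-* prefix is documented here; the exact suffixes (operation, name, duration-ms) are assumptions for illustration, so defer to the Context Propagation how-to for the real specification:

```python
CLIENT_CONTEXT_OPERATIONS = {"execute_tool", "retrieval", "invoke_agent", "create_agent"}

def context_headers(operation: str, name: str, duration_ms: int) -> dict[str, str]:
    """Build hypothetical x-to11-context-* headers for a client-side step.

    The gateway turns these into a sibling span next to the main GenAI
    operation span. Header suffixes below are illustrative assumptions.
    """
    if operation not in CLIENT_CONTEXT_OPERATIONS:
        raise ValueError(f"unknown client-context operation: {operation}")
    return {
        "x-to11-context-operation": operation,
        "x-to11-context-name": name,
        "x-to11-context-duration-ms": str(duration_ms),
    }
```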
Multimodal detection
When the gateway detects image_url content blocks in a chat request, it automatically promotes the operation from chat to generate_content. This distinction exists purely for telemetry: multimodal requests tend to have different latency profiles, token costs, and failure modes, so separating them in the data makes analytics more meaningful.
Callers can override this automatic detection with the x-to11-operation header, which accepts two values: chat and generate_content. This is useful when your application knows the semantic intent of the request better than content inspection can determine — for example, a request that contains base64 image data but is primarily a text classification task.
Multimodal detection is purely a telemetry concern. Routing, credentials, and adapters behave identically regardless of whether the operation is classified as chat or generate_content.
What’s next
Span Attributes
Every GenAI span attribute the gateway emits, grouped by lifecycle stage.
Metrics Reference
Histogram metrics, dimensions, and ClickHouse query examples.
Distributed Tracing
Group multiple LLM calls and agent chains under a single trace.
Context Propagation
Emit client-context spans via headers without an OTel SDK.
Direct Ingestion
Send OTLP spans directly to the gateway’s collector endpoint.
Content Capture
Record prompt and completion text for debugging and evaluation.