Self-Hosted Observability
The Docker Compose stack ships a complete observability backend. This guide shows you how to navigate it.
Architecture overview
to11 runs two independent telemetry pipelines. Each serves a different purpose and lands in a different backend.
```
Gateway / API
│
├─ [telemetry] App traces & logs ──▶ Alloy ──▶ Tempo (traces)
│                                          ├──▶ Loki (logs)
│                                          └──▶ Mimir (metrics)
│
└─ [genai_telemetry] LLM spans ───▶ OTel GenAI Collector ──▶ ClickHouse
```
| Pipeline | What it captures | Storage | Query UI |
|---|---|---|---|
| Application | HTTP request spans, middleware timing, error rates, container logs | Tempo + Loki + Mimir | Grafana (:3001) |
| GenAI | LLM call spans, token usage, content capture, tool calls | ClickHouse | Web app (:3000) or ClickHouse directly |
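To confirm that both pipelines are landing data, you can query each backend directly. A minimal sketch, assuming the default host ports from the Services and ports table below and the default otel/otel ClickHouse credentials:

```bash
# Application pipeline: ask Tempo for the five most recent traces.
curl -s "http://localhost:3200/api/search?limit=5"

# GenAI pipeline: count the spans stored in ClickHouse.
curl -s "http://localhost:8123/?user=otel&password=otel" \
  --data-binary "SELECT count() FROM otel_traces"
```

If the first call returns an empty trace list or the second returns 0, see Troubleshooting below.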
Services and ports
After running docker compose -f docker-compose.production.yml --profile observability up -d (prefix the command with IMAGE_TAG=sha-abc1234 to pin a specific build), the following observability services are available:
| Service | URL | Purpose |
|---|---|---|
| Grafana | localhost:3001 | Dashboards, trace explorer, log viewer |
| Tempo | localhost:3200 | Trace storage (queried via Grafana) |
| Loki | localhost:3101 | Log aggregation (queried via Grafana) |
| Mimir | localhost:9009 | Metrics storage / Prometheus-compatible |
| Alloy | localhost:12345 | OTel collector + log shipper |
| ClickHouse | localhost:8123 | GenAI telemetry storage |
| OTel GenAI Collector | localhost:4317 / localhost:4318 | Receives GenAI OTLP, writes to ClickHouse |
Grafana ships with pre-configured data sources for Tempo, Loki, and Mimir. No setup required.
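To verify that everything came up, you can hit each service's health endpoint from the host. A minimal sketch using the components' standard readiness paths and the host ports from the table above (13133 is the GenAI collector's health-check port used in Troubleshooting); adjust if you changed the compose mappings:

```bash
# Liveness check for each observability service.
curl -fsS http://localhost:3001/api/health && echo " grafana ok"
curl -fsS http://localhost:3200/ready      && echo " tempo ok"
curl -fsS http://localhost:3101/ready      && echo " loki ok"
curl -fsS http://localhost:9009/ready      && echo " mimir ok"
curl -fsS http://localhost:12345/-/ready   && echo " alloy ok"
curl -fsS http://localhost:8123/ping       && echo " clickhouse ok"
curl -fsS http://localhost:13133/          && echo " otel genai collector ok"
```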
Open the Trace Explorer
- Open localhost:3001 (anonymous admin access is enabled)
- Click Explore in the left sidebar
- Select Tempo as the data source
Search by service
Use the Search tab to filter traces:
- Service Name: gateway or to11-api
- Span Name: filter by operation (e.g. POST /v1/chat/completions)
- Duration: find slow requests
- Status: filter for errors
TraceQL queries
Switch to the TraceQL tab for more powerful queries:
```
// All gateway traces
{ resource.service.name = "gateway" }

// Traces with errors
{ resource.service.name = "gateway" && status = error }

// Slow requests (over 2 seconds)
{ resource.service.name = "gateway" } | duration > 2s

// Specific endpoint
{ resource.service.name = "gateway" && name =~ "POST.*chat" }

// API service traces
{ resource.service.name = "to11-api" }
```
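The same TraceQL can also be run from the command line against Tempo's search API, which is handy for scripting. A sketch, assuming Tempo is exposed on localhost:3200 as in the ports table:

```bash
# Find up to 10 recent gateway traces that ended in an error.
curl -sG "http://localhost:3200/api/search" \
  --data-urlencode 'q={ resource.service.name = "gateway" && status = error }' \
  --data-urlencode 'limit=10'
```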
Reading a trace waterfall
Click any trace to open the waterfall view. You’ll see:
- Root span — the inbound HTTP request to the gateway
- Child spans — middleware processing, upstream provider call, response handling
- Span attributes — HTTP method, status code, URL path, timing details
- Events — errors, warnings, or other logged events within a span
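The same span data can be pulled outside Grafana by fetching a trace by ID from Tempo's API. A sketch (the trace ID is a placeholder; copy a real one from the search results):

```bash
# Fetch the raw spans for a single trace; replace <trace_id> with a real ID.
curl -s "http://localhost:3200/api/traces/<trace_id>"
```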
Traces to logs
The Grafana Tempo data source is configured with Traces-to-Logs linking. When viewing a trace:
- Click the Logs for this span button in the trace detail panel
- This jumps to Loki with the trace_id pre-filled, showing all log lines emitted during that trace
Exploring logs in Grafana
- In Explore, select the Loki data source
- Use LogQL to query:
```
// All gateway container logs
{service="gateway"}

// API logs containing "error"
{service="api"} |= "error"

// Logs correlated to a specific trace
{service="gateway"} | json | trace_id = "abc123..."
```
Container logs from all Docker Compose services are automatically collected by Alloy and shipped to Loki.
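The same LogQL queries can be run against Loki's HTTP API without going through Grafana. A minimal sketch, assuming Loki on localhost:3101 as in the ports table:

```bash
# Last 20 gateway log lines containing "error" (over a recent default time window).
curl -sG "http://localhost:3101/loki/api/v1/query_range" \
  --data-urlencode 'query={service="gateway"} |= "error"' \
  --data-urlencode 'limit=20'
```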
Exploring metrics in Grafana
- In Explore, select the Mimir data source
- Use PromQL to query span-derived metrics:
```
// Request rate by service
rate(traces_spanmetrics_calls_total[5m])

// P95 latency for gateway spans
histogram_quantile(0.95, rate(traces_spanmetrics_duration_seconds_bucket{service="gateway"}[5m]))

// OTel Collector health (scraped from GenAI collector :8888)
otelcol_exporter_sent_spans_total
```
Tempo’s metrics generator automatically derives RED metrics (Rate, Errors, Duration) from ingested traces and pushes them to Mimir.
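These metrics can also be queried from the command line through Mimir's Prometheus-compatible API, which Mimir serves under the /prometheus prefix by default. A sketch, assuming Mimir on localhost:9009 as in the ports table:

```bash
# Instant query for the per-service request rate derived from spans.
# Add -H "X-Scope-OrgID: <tenant>" if multi-tenancy is enabled in your Mimir config.
curl -sG "http://localhost:9009/prometheus/api/v1/query" \
  --data-urlencode 'query=rate(traces_spanmetrics_calls_total[5m])'
```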
Querying GenAI telemetry in ClickHouse
GenAI spans (LLM calls, token counts, content capture) go to ClickHouse via the OTel GenAI Collector.
Using the Web UI
The to11 web app at localhost:3000 provides a traces page per project. Navigate to a project and open the Traces tab to see:
- Trace list with provider, model, tokens, cost, and status
- Trace detail view with span waterfall and attributes
Querying ClickHouse directly
Connect to ClickHouse at localhost:8123 (user: otel, password: otel):
```bash
# Quick check for data
curl "http://localhost:8123/?user=otel&password=otel" \
  --data-binary "SELECT count() FROM otel_traces"

# Recent GenAI spans
curl "http://localhost:8123/?user=otel&password=otel" \
  --data-binary "
    SELECT
      Timestamp,
      SpanName,
      SpanAttributes['gen_ai.request.model'] AS model,
      SpanAttributes['gen_ai.provider.name'] AS provider,
      SpanAttributes['gen_ai.usage.input_tokens'] AS input_tokens,
      SpanAttributes['gen_ai.usage.output_tokens'] AS output_tokens
    FROM otel_traces
    ORDER BY Timestamp DESC
    LIMIT 10
    FORMAT Pretty
  "
```
See the Metrics reference for more query examples.
Host-machine development (no Docker gateway)
When running the Rust gateway on your host machine instead of in Docker (for
example cargo run -p gateway), the infra services still run in Docker. The
port mappings differ slightly:
| Signal | Docker-to-Docker endpoint | Native (host) endpoint |
|---|---|---|
| App traces (Alloy) | alloy:4317 | localhost:14317 |
| App telemetry HTTP | alloy:4318 | localhost:14318 |
| GenAI OTLP (Collector) | otel-genai-collector:4317 | localhost:4317 |
Use docker/gateway/config.dev.toml, which has these host endpoints pre-configured:
```bash
source .env.local
GATEWAY_CONFIG=docker/gateway/config.dev.toml cargo run -p gateway
```
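Before starting the gateway natively, it can be worth confirming that the host-mapped telemetry ports are actually reachable. A quick sketch using nc, with the ports from the table above:

```bash
# Check that the host-mapped OTLP endpoints are listening.
for port in 14317 14318 4317; do
  nc -z localhost "$port" && echo "port $port reachable" || echo "port $port NOT reachable"
done
```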
Troubleshooting
No traces appearing in Tempo
- Verify Alloy is running: curl http://localhost:12345 should return the Alloy UI
- Check that [telemetry].enabled = true in the gateway config
- Look at Alloy logs: docker compose -f docker-compose.production.yml logs alloy
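You can also ask Tempo itself whether it is ready and has ingested anything. A sketch; the exact metric names can differ between Tempo versions:

```bash
# Tempo readiness and span-ingestion counters.
curl -s http://localhost:3200/ready
curl -s http://localhost:3200/metrics | grep tempo_distributor_spans_received_total
```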
No GenAI spans in ClickHouse
- Verify the OTel GenAI Collector is healthy: curl http://localhost:13133/
- Check collector logs: docker compose -f docker-compose.production.yml logs otel-genai-collector
- Confirm ClickHouse is accepting writes: docker compose -f docker-compose.production.yml logs clickhouse
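The collector's own Prometheus metrics (the :8888 endpoint referenced in the metrics section) show whether it is exporting spans at all. A sketch, assuming port 8888 is published to the host:

```bash
# Exporter counters from the GenAI collector's internal metrics endpoint.
curl -s http://localhost:8888/metrics | grep otelcol_exporter
```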
ClickHouse shows unhealthy
ClickHouse can take 30-60 seconds to fully initialize. If it stays unhealthy:
```bash
# Check ClickHouse logs
docker compose -f docker-compose.production.yml logs clickhouse

# Verify connectivity
curl "http://localhost:8123/?user=otel&password=otel" \
  --data-binary "SELECT 1"
```
Grafana data source errors
If Grafana shows “Bad Gateway” for a data source, the backend service may still be starting. Wait 30 seconds and refresh — Grafana connects to Tempo, Loki, and Mimir via internal Docker networking.