
Experiment Routing

Run controlled experiments to compare models, providers, and parameter configurations for the same task. On each request the gateway picks a variant by weighted random selection, tags telemetry with the variant name, and returns the selected variant's name in the X-Gateway-Variant response header.

Use case: Compare models for summarization

You want to know whether GPT-4o-mini or Claude gives the better trade-off between summary quality and cost. Instead of splitting traffic at the application level, configure an experiment function:
[providers.openai]
base_url = "https://api.openai.com/v1"
credential = "env::OPENAI_API_KEY"
models = ["gpt-4o", "gpt-4o-mini"]

[providers.anthropic]
base_url = "https://api.anthropic.com/v1"
credential = "env::ANTHROPIC_API_KEY"
models = ["claude-sonnet-4-6"]

[functions.summarize]
endpoint = "chat"
strategy = "experiment"

[functions.summarize.variants.fast]
model = "gpt-4o-mini"
weight = 50
temperature = 0.2
max_tokens = 500

[functions.summarize.variants.quality]
model = "claude-sonnet-4-6"
weight = 50
temperature = 0.7
max_tokens = 1500
Send requests using the function name:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000/v1",
  apiKey: "anything", // gateway owns credentials
});

const response = await client.chat.completions.create({
  model: "function::summarize",
  messages: [{ role: "user", content: "Summarize this quarterly report..." }],
});
The gateway selects fast or quality by weight on each request. The caller’s code never changes — you adjust weights or swap models in config.

How variant selection works

  1. Request arrives with model: "function::summarize"
  2. Gateway looks up the function and finds strategy = "experiment"
  3. Weighted random selects a variant (50/50 in this case)
  4. The variant’s model and params are applied to the upstream request
  5. The OTel span is tagged with gateway.variant.name and gateway.variant.weight
  6. Response includes X-Gateway-Variant: fast (or quality) header
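
The weighted draw in step 3 can be sketched as a walk over cumulative weights. This is a minimal illustration, not the gateway's actual implementation: the Variant type, the pickVariant function, and the injectable rand parameter are hypothetical names, and the sketch assumes weights are relative, so they need not sum to 100.

```typescript
// Hypothetical type: the gateway's internal representation is not documented here.
interface Variant {
  name: string;
  weight: number; // must be > 0
}

// Pick a variant with probability weight / totalWeight.
// `rand` must return a number in [0, 1); it is injectable so the draw can be
// made deterministic in tests.
function pickVariant(variants: Variant[], rand: () => number = Math.random): Variant {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  let threshold = rand() * total;
  let last: Variant | undefined;
  for (const v of variants) {
    last = v;
    threshold -= v.weight;
    if (threshold < 0) return v;
  }
  // An empty list never enters the loop; otherwise this point is only reachable
  // through floating-point rounding, so fall back to the last variant.
  if (last === undefined) throw new Error("no variants configured");
  return last;
}

// The 50/50 split from the summarize example.
const summarize: Variant[] = [
  { name: "fast", weight: 50 },
  { name: "quality", weight: 50 },
];
const chosen = pickVariant(summarize);
```

Passing a fixed rand makes the draw reproducible: with the 50/50 split, a draw below 0.5 lands on fast and a draw of 0.5 or higher lands on quality.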

Variant params (explicit-wins)

Each variant can override inference parameters. If the variant sets a param, it overrides the caller’s value. If the variant omits a param, the caller’s value passes through.
[functions.summarize.variants.conservative]
model = "gpt-4o-mini"
weight = 50
temperature = 0.2       # overrides caller's temperature
max_tokens = 500        # overrides caller's max_tokens
# top_p not set → caller's value passes through

[functions.summarize.variants.creative]
model = "gpt-4o"
weight = 50
temperature = 0.9       # different temperature for this variant
max_tokens = 2000
Params are validated per endpoint type. A known param that does not apply to the endpoint is a config error: for example, temperature on an embeddings function. Params the gateway does not recognize pass through, with a warning logged at startup.
Protected fields (model, messages, input, file, prompt, stream) cannot be set as variant params. They are routing or content fields managed by the gateway.
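
The explicit-wins rule and the protected-field check can be sketched as a small merge function. This is an illustration under stated assumptions rather than the gateway's code: applyVariantParams and PROTECTED_FIELDS are hypothetical names, and the sketch covers only variant params (the variant's model is applied separately by the router).

```typescript
type Params = Record<string, unknown>;

// Protected fields named in this section; variants may not set them as params.
const PROTECTED_FIELDS: string[] = ["model", "messages", "input", "file", "prompt", "stream"];

// Merge variant params over the caller's request body.
// Throws if a variant tries to set a protected field.
function applyVariantParams(callerBody: Params, variantParams: Params): Params {
  for (const key of Object.keys(variantParams)) {
    if (PROTECTED_FIELDS.indexOf(key) !== -1) {
      throw new Error(`variant param "${key}" is a protected field`);
    }
  }
  // Later spread wins: a variant value replaces the caller's value for the same
  // key, and keys the variant omits keep the caller's value.
  return { ...callerBody, ...variantParams };
}

// Caller sends temperature and top_p; the conservative variant sets
// temperature and max_tokens.
const merged = applyVariantParams(
  { temperature: 1.0, top_p: 0.9 },
  { temperature: 0.2, max_tokens: 500 },
);
```

Here merged["temperature"] is 0.2 (the variant wins), merged["max_tokens"] is 500, and merged["top_p"] stays at the caller's 0.9.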

Use case: A/B test embedding models

Experiments work on all endpoint types, not just chat:
[functions.embed]
endpoint = "embeddings"
strategy = "experiment"

[functions.embed.variants.small]
model = "text-embedding-3-small"
weight = 50
dimensions = 1536

[functions.embed.variants.large]
model = "text-embedding-3-large"
weight = 50
dimensions = 3072
const embedding = await client.embeddings.create({
  model: "function::embed",
  input: "The quick brown fox",
});
// X-Gateway-Variant header tells you which model was used
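
To act on the header in application code, you need access to the raw HTTP response. The sketch below is a hedged example: variantOf, tally, and stubResponse are hypothetical helpers, and HasHeaders is a minimal stand-in for a fetch-style response so the snippet runs without a gateway.

```typescript
// Minimal shape of a fetch-style response: only the headers accessor is needed.
interface HasHeaders {
  headers: { get(name: string): string | null };
}

// Returns the variant the gateway reported, or null if the header is absent.
function variantOf(response: HasHeaders): string | null {
  return response.headers.get("x-gateway-variant");
}

// Count responses per variant, e.g. to sanity-check the observed traffic split.
function tally(counts: Record<string, number>, response: HasHeaders): Record<string, number> {
  const variant = variantOf(response) ?? "unknown";
  counts[variant] = (counts[variant] ?? 0) + 1;
  return counts;
}

// Stand-in for a real response, used here for illustration only.
// Real fetch Headers match names case-insensitively; the stub mimics that.
function stubResponse(variant: string | null): HasHeaders {
  return {
    headers: {
      get: (name: string) => (name.toLowerCase() === "x-gateway-variant" ? variant : null),
    },
  };
}

// With the openai Node SDK (v4+), the raw response is available via .withResponse():
//   const { data, response } = await client.embeddings.create({ ... }).withResponse();
//   tally(counts, response);
const counts: Record<string, number> = {};
tally(counts, stubResponse("small"));
tally(counts, stubResponse("large"));
tally(counts, stubResponse("small"));
```

A client-side tally like this is only a quick check; the OTel attributes remain the authoritative record of which variant served each request.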

Use case: Cross-provider A/B test

Compare the same task across providers:
[functions.classify]
endpoint = "chat"
strategy = "experiment"

[functions.classify.variants.openai]
model = "gpt-4o-mini"
weight = 50

[functions.classify.variants.anthropic]
model = "claude-haiku-4-5-20251001"
weight = 50
No params needed — this is a pure model comparison. The gateway handles credential routing automatically.

Use case: Audio speech voice comparison

Variants on an audio_speech function can set voice and speed alongside the model:

[functions.speak]
endpoint = "audio_speech"
strategy = "experiment"

[functions.speak.variants.alloy]
model = "tts-1"
weight = 50
voice = "alloy"
speed = 1.0

[functions.speak.variants.nova]
model = "tts-1-hd"
weight = 50
voice = "nova"
speed = 0.9

Analyzing experiments

Query your OTel data (ClickHouse, Grafana, etc.) to compare variant performance:
SELECT
    SpanAttributes['gateway.variant.name'] AS variant,
    count() AS requests,
    -- Duration is assumed to be stored in nanoseconds, hence / 1e6 for milliseconds
    avg(Duration) / 1e6 AS avg_latency_ms,
    quantile(0.95)(Duration) / 1e6 AS p95_latency_ms,
    -- toInt64OrNull skips spans that lack the attribute instead of failing the cast
    avg(toInt64OrNull(SpanAttributes['gen_ai.usage.input_tokens'])) AS avg_input_tokens,
    avg(toInt64OrNull(SpanAttributes['gen_ai.usage.output_tokens'])) AS avg_output_tokens
FROM otel_traces
WHERE SpanAttributes['gateway.function.name'] = 'summarize'
  AND Timestamp > now() - INTERVAL 24 HOUR
GROUP BY variant
ORDER BY variant

Telemetry attributes

| Attribute | Type | Description |
| --- | --- | --- |
| gateway.variant.name | string | Selected variant name |
| gateway.variant.weight | int | Variant’s configured weight |
| gateway.experiment.total_variants | int | Number of variants in the function |
| gateway.function.name | string | Function name |
| gateway.function.endpoint | string | Endpoint type (chat, embeddings, etc.) |

Requirements

  • At least 2 variants per experiment function
  • All variant weights must be > 0
  • Each variant specifies either model (inline) or target (named reference), not both
  • strategy = "experiment" is only valid on functions, not routes
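
The first three requirements can be expressed as a small config check. This is a sketch for illustration, not the gateway's validator: VariantConfig and validateExperiment are hypothetical names, the error messages are invented, and the function-versus-route rule is out of scope here.

```typescript
// Hypothetical shape of one parsed variant; field names mirror the TOML keys above.
interface VariantConfig {
  name: string;
  weight: number;
  model?: string; // inline model
  target?: string; // named reference
}

// Returns a list of human-readable problems; an empty list means the variants
// satisfy the variant-level requirements listed above.
function validateExperiment(variants: VariantConfig[]): string[] {
  const errors: string[] = [];
  if (variants.length < 2) {
    errors.push("experiment functions need at least 2 variants");
  }
  for (const v of variants) {
    // Written as a negation so that NaN is rejected as well.
    if (!(v.weight > 0)) {
      errors.push(`${v.name}: weight must be > 0`);
    }
    const hasModel = v.model !== undefined;
    const hasTarget = v.target !== undefined;
    // Both set or neither set is invalid: exactly one is required.
    if (hasModel === hasTarget) {
      errors.push(`${v.name}: set exactly one of model or target`);
    }
  }
  return errors;
}

// The summarize example passes with no errors.
const problems = validateExperiment([
  { name: "fast", weight: 50, model: "gpt-4o-mini" },
  { name: "quality", weight: 50, model: "claude-sonnet-4-6" },
]);
```

Collecting every problem in one pass, rather than stopping at the first, lets a config author fix all issues in a single edit.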

What experiments don’t do

  • No adaptive optimization — weights are static. Automatic adjustment based on metrics is planned.
  • No cross-variant fallback — if a variant’s endpoint fails, the request fails. Use strategy = "fallback" for reliability.
  • No prompt management — variants don’t own prompts. The caller sends messages; the gateway routes.