Experiment Routing
Run controlled experiments to compare models, providers, and parameter configurations for the same task. On each request the gateway picks a variant by weighted random selection, tags telemetry with the variant name, and returns the selection in the X-Gateway-Variant response header.
Use case: Compare models for summarization
You want to know whether GPT-4o-mini or Claude produces better summaries at lower cost. Instead of splitting traffic at the application level, configure an experiment function:
[providers.openai]
base_url = "https://api.openai.com/v1"
credential = "env::OPENAI_API_KEY"
models = ["gpt-4o", "gpt-4o-mini"]
[providers.anthropic]
base_url = "https://api.anthropic.com/v1"
credential = "env::ANTHROPIC_API_KEY"
models = ["claude-sonnet-4-6"]
[functions.summarize]
endpoint = "chat"
strategy = "experiment"
[functions.summarize.variants.fast]
model = "gpt-4o-mini"
weight = 50
temperature = 0.2
max_tokens = 500
[functions.summarize.variants.quality]
model = "claude-sonnet-4-6"
weight = 50
temperature = 0.7
max_tokens = 1500
Send requests using the function name:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:4000/v1",
apiKey: "anything", // gateway owns credentials
});
const response = await client.chat.completions.create({
model: "function::summarize",
messages: [{ role: "user", content: "Summarize this quarterly report..." }],
});
The gateway selects fast or quality by weight on each request. The caller’s code never changes — you adjust weights or swap models in config.
How variant selection works
- Request arrives with model: "function::summarize"
- Gateway looks up the function and finds strategy = "experiment"
- Weighted random selects a variant (50/50 in this case)
- The variant’s model and params are applied to the upstream request
- OTel span includes gateway.variant.name, gateway.variant.weight
- Response includes X-Gateway-Variant: fast (or quality) header
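The weighted selection step can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: the Variant type, the selectVariant function, and the injectable rand parameter are all hypothetical names introduced here.

```typescript
// Hypothetical shape of a variant; field names mirror the config keys.
interface Variant {
  name: string;
  weight: number;
}

// Pick a variant with probability weight / totalWeight.
// `rand` is injectable so the selection can be checked deterministically.
function selectVariant(
  variants: Variant[],
  rand: () => number = Math.random,
): Variant {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  if (variants.length === 0 || total <= 0) {
    throw new Error("experiment needs variants with positive weights");
  }
  // Scale the draw from [0, 1) to [0, total), then walk the cumulative weights.
  let threshold = rand() * total;
  for (const v of variants) {
    threshold -= v.weight;
    if (threshold < 0) return v;
  }
  // Floating-point safety net; not reached when rand() < 1.
  return variants[variants.length - 1];
}

const summarizeVariants: Variant[] = [
  { name: "fast", weight: 50 },
  { name: "quality", weight: 50 },
];

console.log(selectVariant(summarizeVariants).name);
```

With equal weights, draws below 0.5 map to fast and the rest to quality. Changing the weights changes the split without touching caller code.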
Variant params (explicit-wins)
Each variant can override inference parameters. If the variant sets a param, it overrides the caller’s value. If the variant omits a param, the caller’s value passes through.
[functions.summarize.variants.conservative]
model = "gpt-4o-mini"
weight = 50
temperature = 0.2 # overrides caller's temperature
max_tokens = 500 # overrides caller's max_tokens
# top_p not set → caller's value passes through
[functions.summarize.variants.creative]
model = "gpt-4o"
weight = 50
temperature = 0.9 # different temperature for this variant
max_tokens = 2000
Params are validated per endpoint type. Setting temperature on an embedding function is a config error. Unknown params pass through with a startup warning.
Protected fields (model, messages, input, file, prompt, stream) cannot be set as variant params. They are routing or content fields managed by the gateway.
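The explicit-wins rule and the protected-field check can be sketched as a simple object merge. This is only an illustration of the behavior described above, under the assumption that params are flat key/value pairs; applyVariant and assertNoProtectedFields are hypothetical helper names, not gateway APIs.

```typescript
type Params = Record<string, unknown>;

// Fields the text above lists as protected; they may not be variant params.
const PROTECTED_FIELDS = ["model", "messages", "input", "file", "prompt", "stream"];

// Hypothetical config-time check: reject variant params that touch protected fields.
function assertNoProtectedFields(variantParams: Params): void {
  for (const key of Object.keys(variantParams)) {
    if (PROTECTED_FIELDS.includes(key)) {
      throw new Error(`"${key}" is a protected field and cannot be a variant param`);
    }
  }
}

// Explicit-wins: variant params override the caller's values;
// anything the variant omits passes through from the caller unchanged.
function applyVariant(
  callerBody: Params,
  variantModel: string,
  variantParams: Params,
): Params {
  assertNoProtectedFields(variantParams);
  return { ...callerBody, ...variantParams, model: variantModel };
}

const upstream = applyVariant(
  { model: "function::summarize", temperature: 1.0, top_p: 0.8, messages: [] },
  "gpt-4o-mini",
  { temperature: 0.2, max_tokens: 500 },
);

// temperature comes from the variant, top_p from the caller.
console.log(upstream["temperature"], upstream["top_p"]);
```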
Use case: A/B test embedding models
Experiments work on all endpoint types, not just chat:
[functions.embed]
endpoint = "embeddings"
strategy = "experiment"
[functions.embed.variants.small]
model = "text-embedding-3-small"
weight = 50
dimensions = 1536
[functions.embed.variants.large]
model = "text-embedding-3-large"
weight = 50
dimensions = 3072
const embedding = await client.embeddings.create({
model: "function::embed",
input: "The quick brown fox",
});
// X-Gateway-Variant header tells you which model was used
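To read the selected variant on the client, you need access to the raw HTTP response headers. The sketch below assumes the OpenAI Node SDK's withResponse() helper for the commented usage; variantFromHeaders is a hypothetical helper name, and the stand-alone check uses a hand-built Headers object rather than a live gateway.

```typescript
// Extract the variant name from response headers.
// Returns null when the header is absent. Header lookup is case-insensitive.
function variantFromHeaders(headers: Headers): string | null {
  return headers.get("X-Gateway-Variant");
}

// Usage with an OpenAI SDK client pointed at the gateway (not executed here):
//
//   const { data, response } = await client.embeddings
//     .create({ model: "function::embed", input: "The quick brown fox" })
//     .withResponse();
//   const variant = variantFromHeaders(response.headers);

// Stand-alone check with a hand-built Headers object.
const sampleHeaders = new Headers({ "x-gateway-variant": "small" });
console.log(variantFromHeaders(sampleHeaders));
```

Logging the variant next to your own quality scores lets you join application-side evaluations with the gateway's telemetry.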
Use case: Cross-provider A/B test
Compare the same task across providers:
[functions.classify]
endpoint = "chat"
strategy = "experiment"
[functions.classify.variants.openai]
model = "gpt-4o-mini"
weight = 50
[functions.classify.variants.anthropic]
model = "claude-haiku-4-5-20251001"
weight = 50
No params needed — this is a pure model comparison. The gateway handles credential routing automatically.
Use case: Audio speech voice comparison
[functions.speak]
endpoint = "audio_speech"
strategy = "experiment"
[functions.speak.variants.alloy]
model = "tts-1"
weight = 50
voice = "alloy"
speed = 1.0
[functions.speak.variants.nova]
model = "tts-1-hd"
weight = 50
voice = "nova"
speed = 0.9
Analyzing experiments
Query your OTel data (ClickHouse, Grafana, etc.) to compare variant performance:
SELECT
SpanAttributes['gateway.variant.name'] AS variant,
count() AS requests,
avg(Duration) / 1e6 AS avg_latency_ms,
quantile(0.95)(Duration) / 1e6 AS p95_latency_ms,
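-- toInt64OrNull skips spans that lack usage attributes instead of failing the cast
avg(toInt64OrNull(SpanAttributes['gen_ai.usage.input_tokens'])) AS avg_input_tokens,
avg(toInt64OrNull(SpanAttributes['gen_ai.usage.output_tokens'])) AS avg_output_tokens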
FROM otel_traces
WHERE SpanAttributes['gateway.function.name'] = 'summarize'
AND Timestamp > now() - INTERVAL 24 HOUR
GROUP BY variant
ORDER BY variant
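When reading the requests column, it helps to know what split the configured weights should produce. The following sketch computes each variant's expected traffic share as weight divided by total weight; expectedShares is a hypothetical helper, not part of the gateway.

```typescript
// Expected fraction of traffic per variant: weight / sum of all weights.
function expectedShares(weights: Record<string, number>): Record<string, number> {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  if (total <= 0) throw new Error("total weight must be > 0");
  const shares: Record<string, number> = {};
  for (const [name, weight] of Object.entries(weights)) {
    shares[name] = weight / total;
  }
  return shares;
}

const shares = expectedShares({ fast: 50, quality: 50 });
console.log(shares);
```

A large gap between the observed request counts and these shares over a long window suggests the query is filtering out spans for one variant, for example failed requests.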
Telemetry attributes
| Attribute | Type | Description |
|---|---|---|
| gateway.variant.name | string | Selected variant name |
| gateway.variant.weight | int | Variant’s configured weight |
| gateway.experiment.total_variants | int | Number of variants in the function |
| gateway.function.name | string | Function name |
| gateway.function.endpoint | string | Endpoint type (chat, embeddings, etc.) |
Requirements
- At least 2 variants per experiment function
- All variant weights must be > 0
- Each variant specifies either model (inline) or target (named reference), not both
- strategy = "experiment" is only valid on functions, not routes
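The per-variant requirements above can be expressed as a small validator. This is a sketch of the rules as listed, not the gateway's startup check: VariantConfig and validateExperiment are hypothetical names, and the functions-versus-routes rule is left out because it depends on config structure not shown here.

```typescript
// Hypothetical parsed form of one [functions.<name>.variants.<variant>] table.
interface VariantConfig {
  weight: number;
  model?: string;  // inline model
  target?: string; // named reference
}

// Returns a list of human-readable errors; an empty list means the config passes.
function validateExperiment(variants: Record<string, VariantConfig>): string[] {
  const errors: string[] = [];
  const entries = Object.entries(variants);
  if (entries.length < 2) {
    errors.push("experiment needs at least 2 variants");
  }
  for (const [name, v] of entries) {
    if (!(v.weight > 0)) {
      errors.push(`${name}: weight must be > 0`);
    }
    const hasModel = v.model !== undefined;
    const hasTarget = v.target !== undefined;
    // Exactly one of model or target must be set.
    if (hasModel === hasTarget) {
      errors.push(`${name}: set exactly one of model or target`);
    }
  }
  return errors;
}

const validationErrors = validateExperiment({
  fast: { weight: 50, model: "gpt-4o-mini" },
  quality: { weight: 50, model: "claude-sonnet-4-6" },
});
console.log(validationErrors.length);
```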
What experiments don’t do
- No adaptive optimization: weights are static. Automatic adjustment based on metrics is planned.
- No cross-variant fallback: if a variant’s endpoint fails, the request fails. Use strategy = "fallback" for reliability.
- No prompt management: variants don’t own prompts. The caller sends messages; the gateway routes.