Evaluate - to11

Evaluate is not yet available. This page describes where the section is headed; until it ships, the surfaces below cover most of the same ground.

Your traces already carry latency and cost. Quality is the harder signal — the part of an LLM application that never shows up in a status code. Evaluate is the section that will close that gap: a place to score outputs, compare versions against each other, and watch quality move over time, right next to the latency and cost you already see in Observe.

What to use today

Until Evaluate ships, two surfaces that are already live cover most of the ground:

Observe — every LLM call lands as a trace carrying its prompt, completion, latency, and cost. Reading those traces back is how you judge output quality by hand today.
Experiment routing — split traffic across models, providers, or parameter variants for the same task, with each request tagged by the variant that served it. Compare the variants’ traces in Observe to see which one wins.
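As a rough illustration of comparing variants by hand, the sketch below groups traces by the variant that served them and averages latency and cost per variant. It assumes traces have been exported as a list of dictionaries; the field names (`variant`, `latency_ms`, `cost_usd`, `completion`) and the sample values are hypothetical and are not the product's actual trace schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported traces. The field names and values are assumptions
# made for illustration, not the real trace schema.
traces = [
    {"variant": "model-a", "latency_ms": 800, "cost_usd": 0.002, "completion": "Paris."},
    {"variant": "model-a", "latency_ms": 1000, "cost_usd": 0.004, "completion": "It is Paris."},
    {"variant": "model-b", "latency_ms": 400, "cost_usd": 0.001, "completion": "Paris"},
]


def summarize_by_variant(traces):
    """Group traces by variant tag and average latency and cost per group."""
    groups = defaultdict(list)
    for trace in traces:
        groups[trace["variant"]].append(trace)

    summary = {}
    for variant, items in groups.items():
        summary[variant] = {
            "count": len(items),
            "mean_latency_ms": mean(t["latency_ms"] for t in items),
            "mean_cost_usd": mean(t["cost_usd"] for t in items),
            # Completions are kept so a person can read them side by side.
            "completions": [t["completion"] for t in items],
        }
    return summary


if __name__ == "__main__":
    for variant, stats in sorted(summarize_by_variant(traces).items()):
        print(variant, stats["count"], stats["mean_latency_ms"], stats["mean_cost_usd"])
        for completion in stats["completions"]:
            print("  ", completion)
```

Latency and cost can be averaged mechanically, but which variant "wins" on quality still depends on reading the completions, which is the gap Evaluate is meant to close.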

Next steps

Observe

Read your application’s traces, latency, and cost.

Experiments

A/B test models and parameters for the same task.

