Evaluation is not yet available. This page describes where the section is headed; until it ships, the surfaces below cover most of the same ground today.
What to use today
Until Evaluate ships, two surfaces that are already live cover most of the ground:
- Observe — every LLM call lands as a trace carrying its prompt, completion, latency, and cost. Reading those traces back is how you judge output quality by hand today.
- Experiment routing — split traffic across models, providers, or parameter variants for the same task, with each request tagged by the variant that served it. Compare the variants’ traces in Observe to see which one wins.
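As a rough illustration of comparing variants by hand, the sketch below groups trace records by the variant that served them and averages latency and cost per variant. It assumes the traces have already been copied out of Observe into plain dictionaries; the field names (`variant`, `latency_ms`, `cost_usd`), the sample values, and the `summarize_by_variant` helper are all hypothetical and do not reflect the product's actual export format or API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records. The field names and values are assumptions
# made for this example, not the real trace schema.
traces = [
    {"variant": "model-a", "latency_ms": 820, "cost_usd": 0.0021},
    {"variant": "model-a", "latency_ms": 760, "cost_usd": 0.0019},
    {"variant": "model-b", "latency_ms": 1310, "cost_usd": 0.0008},
    {"variant": "model-b", "latency_ms": 1190, "cost_usd": 0.0009},
]


def summarize_by_variant(records):
    """Group records by variant tag and average latency and cost per group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["variant"]].append(record)

    return {
        variant: {
            "requests": len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "mean_cost_usd": mean(r["cost_usd"] for r in rows),
        }
        for variant, rows in grouped.items()
    }


summary = summarize_by_variant(traces)
for variant, stats in sorted(summary.items()):
    print(variant, stats)
```

Averages like these only cover latency and cost. Judging which variant produces better output still means reading the completions in the traces themselves.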
Next steps
Observe
Read your application’s traces, latency, and cost.
Experiments
A/B test models and parameters for the same task.