Mini eval runner

A toy eval loop you can poke. One fixed demo agent (customer-support triage), five test cases, one editable LLM-judge rubric. Edit the rubric and re-run to see how judge behaviour moves.

LLM-judge rubric

Edit this and re-run. Scores will move with the rubric.

Demo agent

Haiku-backed customer-support triage agent for a fictional payments SaaS. Five test cases: happy, edge, adversarial, ambiguous, handoff.

Click Run eval to see five cases graded.