Mini eval runner
A toy eval loop you can poke. One fixed demo agent (customer-support triage), five test cases, one editable LLM-judge rubric. Edit the rubric and re-run to see how judge behaviour moves.
LLM-judge rubric
Edit this and re-run. Scores will move with the rubric.
Demo agent
Haiku-backed customer-support triage agent for a fictional payments SaaS. Five test cases: happy, edge, adversarial, ambiguous, handoff.
Click Run eval to see five cases graded.