# Scoring Tests

## On Task

Evaluates the agent's ability to stay focused on the customer's topic and to give clear, specific answers.
### What it measures

This score measures whether the agent stays on the customer's topic and gives clear, specific answers, rather than drifting off-topic or padding replies with generic "fluff". It rewards relevance and specificity.
### What "good" looks like

- Directly answers the question asked.
- Avoids long stretches of generic text.
- Keeps the conversation moving step by step.
### Common reasons for lower scores
- Generic "I'm here to help" responses without substance.
- Off-topic explanations or irrelevant questions.
- Repeated clarification loops without progress.
### Examples

- High (9–10): Customer asks how to register a product; the agent gives the exact steps and link, without unnecessary filler.
- Mid (6–7): Agent is mostly helpful but sometimes rambles or asks too many unrelated questions.
- Low (1–3): Agent repeatedly gives generic replies ("I'm here to help!") without actionable details.
### How to read the scale
| Score | Description |
|---|---|
| 10 | Always focused and specific; every response moves things forward. |
| 9 | Nearly perfect focus; tiny bit of extra filler. |
| 8 | Strong focus; minor drift quickly corrected. |
| 7 | Good; a few vague moments but still helpful. |
| 6 | Some drift/vagueness slows progress. |
| 5 | Mixed; several responses feel generic. |
| 4 | Frequently vague/off-topic. |
| 3 | Mostly generic; user has to push for specifics. |
| 2 | Almost entirely fluff or irrelevant. |
| 1 | Off-task to the point of being unusable. |