Evals
Test whether an agent produces correct outputs and takes the right actions across single-turn or multi-turn conversations. Eval files live alongside the code they test (*.eval.md), just like unit tests.
How it works
You provide an eval file describing tests and any supporting skill directories. The eval runner does the rest — extracting tests, running conversations, grading, and writing a report.
```bash
node eval.js --eval api.eval.md --skill skills/debugging --runs 3
```
Here's what happens:
1. Extract tests
The runner sends your eval file to an LLM and gets back structured test definitions. No format constraints — the LLM reads whatever you wrote.
```
eval.js → LLM: "Read this eval file, extract every test as JSON"
        ← [{ id: "1.1", openingPrompt: "...", criteria: [...], ... }]
```
2. Set up isolation
The runner creates a temp directory with only the skills you specified:
```
/tmp/eval-abc123/
  .claude/skills/
    debugging → /your/repo/skills/debugging (symlink)
```
The agent under test runs here. It sees only these skills; your personal skills from ~/.claude/ are blocked.
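The directory layout above can be reproduced with a few standard-library calls. This is a minimal sketch, not the runner's actual code: `createIsolatedWorkspace` is a hypothetical name, and it covers only the temp directory and symlinks, not how personal skills from ~/.claude/ are blocked.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: builds a workspace that exposes only the given skills.
function createIsolatedWorkspace(skillDirs) {
  // mkdtempSync appends six random characters to the prefix, e.g. /tmp/eval-abc123
  const workspace = fs.mkdtempSync(path.join(os.tmpdir(), "eval-"));
  const skillsRoot = path.join(workspace, ".claude", "skills");
  fs.mkdirSync(skillsRoot, { recursive: true });

  for (const dir of skillDirs) {
    const target = path.resolve(dir);
    // The link is named after the skill directory, e.g. "debugging".
    fs.symlinkSync(target, path.join(skillsRoot, path.basename(target)), "dir");
  }
  return workspace;
}
```

Symlinking rather than copying keeps the workspace cheap to create and means the agent always sees the current contents of each skill directory.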
3. Run sessions
For each test, the runner starts N sessions in parallel (default 3). Each session is an independent conversation:
```
┌─────────────────────────────────┐
│             eval.js             │
└──────┬──────────┬──────────┬────┘
       │          │          │
     Run 1      Run 2      Run 3
       │          │          │
       ▼          ▼          ▼
```
Within each run, the conversation alternates between the agent under test and a simulated user:
```
Turn 1: eval.js sends the opening prompt to the AGENT
        ← agent responds
Turn 2: eval.js sends the transcript so far to the SIMULATED USER
        ← "Can you check the connection pool logs?"
        eval.js sends that message to the AGENT (resuming the session)
        ← agent responds
Turn 3: eval.js sends updated transcript to the SIMULATED USER
        ← "I ran that command, here's what I see: ..."
        eval.js sends that to the AGENT
        ← agent responds
...until max turns or the simulated user responds [DONE]...
```
The agent and simulated user are fully isolated: separate processes, separate sessions, no shared state. The simulated user knows the conversation only because eval.js passes the transcript as text in its prompt.
For single-turn tests (no user briefing), the runner sends one prompt and collects one response. No simulated user involved.
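The turn-taking described in this step can be sketched as a small loop. This is an illustration under stated assumptions, not the runner's implementation: `agent`, `simulatedUser`, and `makeAgent` are hypothetical async functions supplied by the caller, `userBriefing` and `maxTurns` are assumed field names, and the fallback of 10 turns is not a documented default.

```javascript
// One run of one test. Each injected function is assumed to wrap its own
// process/session, so the two sides share no state.
async function runSession(test, agent, simulatedUser) {
  const transcript = [];

  // Turn 1: the opening prompt goes to the agent.
  let message = test.openingPrompt;
  transcript.push({ role: "user", text: message });
  transcript.push({ role: "agent", text: await agent(message) });

  // Single-turn test: no briefing, so no simulated user is involved.
  if (!test.userBriefing) return transcript;

  const maxTurns = test.maxTurns ?? 10; // assumed fallback, not a documented default
  for (let turn = 2; turn <= maxTurns; turn++) {
    // The simulated user sees the conversation only as the transcript passed in.
    message = await simulatedUser(test.userBriefing, transcript);
    if (message.trim() === "[DONE]") break;

    transcript.push({ role: "user", text: message });
    transcript.push({ role: "agent", text: await agent(message) });
  }
  return transcript;
}

// N independent runs of the same test, started in parallel.
function runAllSessions(test, runs, makeAgent, simulatedUser) {
  return Promise.all(
    Array.from({ length: runs }, () => runSession(test, makeAgent(), simulatedUser))
  );
}
```

Passing a factory (`makeAgent`) rather than a single agent gives every run a fresh session, which is what keeps the runs independent.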
4. Grade
For each run, the runner sends the full transcript plus acceptance criteria to a grader LLM:
```
eval.js → LLM: "Grade this transcript against these criteria"
        ← [{ criterion: "Asks about the error", grade: "pass" },
           { criterion: "Identifies root cause", grade: "fail" }]
```
Per-run result: all pass → PASS, mixed → PARTIAL, all fail → FAIL.
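The per-run rule can be written as a short function. A minimal sketch: `gradeRun` is a hypothetical name, the input shape follows the grader output shown above, and rejecting an empty list is an assumption.

```javascript
// Collapse per-criterion grades ([{ criterion, grade }]) into one per-run result.
function gradeRun(grades) {
  if (grades.length === 0) throw new Error("no criteria graded"); // assumed behaviour

  const passed = grades.filter((g) => g.grade === "pass").length;
  const failed = grades.filter((g) => g.grade === "fail").length;

  if (passed === grades.length) return "PASS";
  if (failed === grades.length) return "FAIL";
  return "PARTIAL"; // anything mixed
}
```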
5. Verdict
Across all runs: 3/3 pass → PASS, 2/3 → FLAKY, 0–1/3 → FAIL.
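The verdict rule above is defined for three runs. The sketch below reproduces it and, as an assumption, generalises it to other run counts by treating "a majority but not all" as FLAKY. `verdict` is a hypothetical name. Only PASS counts as a passed run, so a PARTIAL run lowers the tally.

```javascript
// Combine per-run results ("PASS" | "PARTIAL" | "FAIL") into a test verdict.
function verdict(runResults) {
  if (runResults.length === 0) throw new Error("no runs to judge"); // assumed behaviour

  const passes = runResults.filter((r) => r === "PASS").length;
  if (passes === runResults.length) return "PASS";
  // Assumption: "majority but not all" generalises the documented 2/3 rule.
  if (passes * 2 > runResults.length) return "FLAKY";
  return "FAIL";
}
```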
6. Report
The runner writes a report alongside the eval file and a JSON file with full transcripts and grades:
```
api.eval.2026-03-03T14-22-01Z.md    ← human-readable report
api.eval.2026-03-03T14-22-01Z.json  ← full transcripts + grades
```
Data flow
```
eval file ──→ EXTRACTOR LLM ──→ test definitions
                                       │
                         ┌─────────────┼─────────────┐
                         ▼             ▼             ▼
                       Run 1         Run 2         Run 3
                         │             │             │
                                 ┌─────┴─────┐
                                 ▼           ▼
                               AGENT  SIMULATED USER
                            (isolated)  (stateless)
                                       │
                                       ▼
                                  transcript
                                       │
                                       ▼
                                  GRADER LLM ──→ per-criterion grades
                                       │
                                       ▼
                             PASS / PARTIAL / FAIL
                                       │
                      ┌────────────────┼────────────────┐
                      ▼                                 ▼
                  report.md                       results.json
```
The simulated user
For multi-turn tests, instead of scripting exact user messages, you define a user briefing — the user's goals, knowledge, and personality — and the runner generates natural responses that adapt to the conversation.
The briefing fixes the user's intent while letting the expression vary:
```
User briefing: You have a Node.js API that returns 500 errors intermittently.
You've noticed it happens more under load. When the agent asks about your
setup, tell them it's Express with a PostgreSQL connection pool (max 5).
Follow their diagnostic steps one at a time.
```
The simulated user reads the full conversation history plus this briefing and generates the next message (1–3 sentences). When the user's goals are fully achieved, it responds with [DONE] to end the conversation early.
This produces realistic conversational flows where later turns depend on earlier ones — the agent's actual responses shape where the conversation goes, not a fixed script.
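Because the simulated user is stateless, everything it knows has to arrive in a single prompt. The sketch below shows one way that prompt could be assembled. `buildSimulatedUserPrompt` is a hypothetical name and the instruction wording is illustrative; only the inputs (briefing plus transcript), the 1–3 sentence limit, and the [DONE] signal come from the description above.

```javascript
// Assemble the text prompt for the simulated user from the briefing and the
// transcript so far. transcript: [{ role: "user" | "agent", text }] (assumed shape).
function buildSimulatedUserPrompt(briefing, transcript) {
  const history = transcript
    .map((turn) => `${turn.role.toUpperCase()}: ${turn.text}`)
    .join("\n\n");

  return [
    "You are playing the user in this conversation.",
    `Briefing:\n${briefing}`,
    `Conversation so far:\n${history}`,
    "Reply with the user's next message in 1-3 sentences.",
    "If the user's goals are fully achieved, reply with exactly [DONE].",
  ].join("\n\n");
}
```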
User briefing tips
- State the user's goal clearly ("You have an API returning 500 errors under load and want to find the root cause")
- Include knowledge the user has ("Your setup is Express with PostgreSQL, connection pool max 5")
- Describe how to react to the agent ("Answer questions one at a time; when they suggest a diagnostic step, report the result")
- Keep it focused — the simulated user performs better with clear, concrete instructions
Defining tests
Write your eval file however you want — the runner uses an LLM to extract test definitions, so there's no rigid format. That said, each test needs at minimum:
- An opening prompt — the first message sent to the agent
- Acceptance criteria — what the conversation must achieve
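Since extraction is done by an LLM, it can be worth checking that each extracted test has both required parts before running it. A minimal sketch, assuming the `openingPrompt` and `criteria` field names from the extractor output shown earlier; `validateTest` is a hypothetical helper, not part of the runner.

```javascript
// Return a list of problems for one extracted test definition (empty if valid).
function validateTest(test) {
  const problems = [];
  if (typeof test.openingPrompt !== "string" || test.openingPrompt.trim() === "") {
    problems.push("missing opening prompt");
  }
  if (!Array.isArray(test.criteria) || test.criteria.length === 0) {
    problems.push("missing acceptance criteria");
  }
  return problems;
}
```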
For multi-turn tests, also include a user briefing and optionally a max turns limit. For reference, here's one way to write them:
```markdown
### Test 1.1 — Multi-turn debugging session

**What this tests:** Can the agent diagnose an intermittent server error through
a structured conversation?

**User briefing:** You have a Node.js Express API that returns 500 errors
intermittently. Your setup: Express 4, PostgreSQL with a connection pool (max 5
connections). When the agent asks questions, answer one at a time. When they
suggest a diagnostic step, report the result: the connection pool is frequently
exhausted.

**Opening prompt:** My API keeps returning 500 errors intermittently. It seems
to happen more under load. Can you help me figure out what's going on?

**Max turns:** 10

**Acceptance criteria:**
- [ ] Asks clarifying questions before jumping to solutions
- [ ] Asks about the stack trace or error message
- [ ] Identifies pool exhaustion as the likely root cause
- [ ] Recommends increasing pool size or adding connection timeout handling

**Ground truth:**
- Root cause: connection pool too small for request volume
- Fix: increase pool size, add connection timeout, or add retry logic
```
Single-turn tests
When a test only needs one exchange, omit the user briefing and max turns:
```markdown
### Test 2.1 — Error message explanation

**What this tests:** Can the agent explain a common error clearly?

**Opening prompt:** What does "ECONNREFUSED 127.0.0.1:5432" mean?

**Acceptance criteria:**
- [ ] Explains that the connection to PostgreSQL was refused
- [ ] Suggests checking if the database server is running
- [ ] Does NOT suggest unrelated causes
```
File organization
Eval files live alongside the code they test:
```
project/
  agents/
    support-bot/
      config.md
      config.eval.md                        # evals for support bot
      config.eval.2026-03-01T05-34-25Z.md   # eval report
    code-reviewer/
      config.md
      config.eval.md                        # evals for code reviewer
  project.eval.md                           # project-level evals
```
Find all evals: glob **/*.eval.md
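Where a glob tool isn't available, the same search can be done with a short recursive walk. This is a sketch using only the Node standard library. `findEvalFiles` is a hypothetical helper, and skipping `node_modules` and `.git` is an added assumption. Timestamped reports such as config.eval.2026-03-01T05-34-25Z.md do not end in `.eval.md`, so they are not picked up.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Recursively collect every file whose name ends in ".eval.md".
function findEvalFiles(root) {
  const found = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      // Assumed exclusions; adjust for your repository.
      if (entry.name === "node_modules" || entry.name === ".git") continue;
      found.push(...findEvalFiles(full));
    } else if (entry.name.endsWith(".eval.md")) {
      found.push(full);
    }
  }
  return found.sort();
}
```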
Running evals
```bash
# Run all tests in an eval file
node eval.js --eval path/to/file.eval.md --skill path/to/skill-dir

# Run a single test by ID
node eval.js --eval path/to/file.eval.md --test "1.1" --skill path/to/skill-dir

# Override defaults
node eval.js --eval path/to/file.eval.md --runs 5 --skill path/to/skill-dir
```
CLI reference
| Flag | Purpose | Default |
|---|---|---|
| --eval PATH | Eval definition file | (required) |
| --test ID | Run only this test | all tests |
| --skill PATH | Skill directory (repeatable) | none |
| --runs N | Parallel sessions per test | 3 |
| --output PATH | JSON transcript + grades file | {name}.eval.{ts}.json |
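For readers wiring up a similar command line, the flags in the table map naturally onto Node's built-in `util.parseArgs` (available in Node 18.3 and later). This is a sketch of one possible parser, not how eval.js itself reads its arguments. `parseCli` and the returned property names are hypothetical.

```javascript
const { parseArgs } = require("node:util");

// Parse the flags from the table above; argv excludes "node" and the script path.
function parseCli(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      eval: { type: "string" },
      test: { type: "string" },
      skill: { type: "string", multiple: true }, // repeatable
      runs: { type: "string" },
      output: { type: "string" },
    },
  });

  if (!values.eval) throw new Error("--eval PATH is required");

  const runs = Number.parseInt(values.runs ?? "3", 10); // default 3
  if (!Number.isInteger(runs) || runs < 1) {
    throw new Error("--runs must be a positive integer");
  }

  return {
    evalPath: values.eval,
    testId: values.test ?? null, // null means all tests
    skills: values.skill ?? [],
    runs,
    output: values.output ?? null, // null means the default report name
  };
}
```

Called as `parseCli(process.argv.slice(2))`, it rejects unknown flags because `parseArgs` is strict by default.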
Grading
For each acceptance criterion, the grader judges whether the conversation met it. Criteria are session-level — they apply to the full transcript, not individual turns.
| Per-run grade | Meaning |
|---|---|
| PASS | All criteria met |
| PARTIAL | Some criteria met, some not |
| FAIL | No criteria met |
Run every test 3 times — the minimum for a majority signal.
| Verdict | Meaning |
|---|---|
| PASS | 3/3 passed |
| FLAKY | 2/3 passed (needs investigation) |
| FAIL | 0–1/3 passed |
Reports
Each eval run produces a report file: {name}.eval.{timestamp}.md alongside the eval definition.
```markdown
# Eval Run — {name} — {timestamp}

**Commit:** {short hash}
**Uncommitted changes:** {description or "none"}

---

### 1.1 — Multi-turn debugging session: PASS

- [x] Asks clarifying questions before jumping to solutions
- [x] Asks about the stack trace or error message
- [x] Identifies pool exhaustion as the likely root cause
- [x] Recommends increasing pool size or adding connection timeout handling

Run 1: PASS | Run 2: PASS | Run 3: PASS

---

## Summary

| #   | Test                      | Run 1 | Run 2 | Run 3   | Verdict |
| --- | ------------------------- | ----- | ----- | ------- | ------- |
| 1.1 | Multi-turn debugging      | PASS  | PASS  | PASS    | PASS    |
| 2.1 | Error message explanation | PASS  | PASS  | PARTIAL | FLAKY   |

2 tests: 1 passed, 1 flaky
```
Report conventions
- Record the git commit hash and any uncommitted changes at the top
- Grade each criterion with `[x]` (pass), `[ ]` (fail), or `[~]` (partial)
- Show per-run results on one line: `Run 1: PASS | Run 2: PARTIAL | Run 3: PASS`
- End with a summary table and totals
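A few of these conventions are mechanical enough to sketch in code. The helpers below are hypothetical and only illustrate the documented formats: the timestamped report name, the checkbox line per criterion, and the one-line run summary. Treating the timestamp as UTC is an assumption based on the trailing `Z`.

```javascript
// Timestamp in the form used in report names, e.g. 2026-03-03T14-22-01Z.
function reportTimestamp(date) {
  // toISOString() yields 2026-03-03T14:22:01.000Z; drop the milliseconds and
  // swap colons for hyphens so the value is safe in a file name.
  return date.toISOString().replace(/\.\d{3}Z$/, "Z").replace(/:/g, "-");
}

// "api.eval.md" becomes "api.eval.2026-03-03T14-22-01Z.md".
function reportFileName(evalFile, date) {
  return evalFile.replace(/\.md$/, `.${reportTimestamp(date)}.md`);
}

// One checklist line per criterion: [x] pass, [ ] fail, [~] partial.
function criterionLine(criterion, grade) {
  const box = { pass: "[x]", fail: "[ ]", partial: "[~]" }[grade];
  if (!box) throw new Error(`unknown grade: ${grade}`);
  return `- ${box} ${criterion}`;
}

// Per-run results on a single line.
function runLine(runResults) {
  return runResults.map((r, i) => `Run ${i + 1}: ${r}`).join(" | ");
}
```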