Evals

Test whether an agent produces correct outputs and takes the right actions across single-turn or multi-turn conversations. Eval files live alongside the code they test (*.eval.md), just like unit tests.

How it works

You provide an eval file describing tests and any supporting skill directories. The eval runner does the rest — extracting tests, running conversations, grading, and writing a report.

node eval.js --eval api.eval.md --skill skills/debugging --runs 3

Here's what happens:

1. Extract tests

The runner sends your eval file to an LLM and gets back structured test definitions. No format constraints — the LLM reads whatever you wrote.

eval.js → LLM: "Read this eval file, extract every test as JSON"
       ← [{ id: "1.1", openingPrompt: "...", criteria: [...], ... }]
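
Because the test definitions come from an LLM, it can help to check their shape before running anything. The sketch below is one way to do that. Only `id`, `openingPrompt`, and `criteria` appear in the example above; the function name `validateTests` and the optional `userBriefing` field are assumptions made for illustration, not the runner's actual schema.

```javascript
// Hypothetical shape check for extractor output. Only `id`, `openingPrompt`,
// and `criteria` come from the documented example; `userBriefing` is an
// assumed name for the optional multi-turn field.
function validateTests(tests) {
  if (!Array.isArray(tests)) {
    throw new TypeError("extractor output must be an array of tests");
  }
  return tests.map((test, index) => {
    const label = test && test.id ? `test ${test.id}` : `test at index ${index}`;
    if (!test || typeof test.id !== "string") {
      throw new Error(`${label}: missing string id`);
    }
    if (typeof test.openingPrompt !== "string" || test.openingPrompt.trim() === "") {
      throw new Error(`${label}: missing opening prompt`);
    }
    if (!Array.isArray(test.criteria) || test.criteria.length === 0) {
      throw new Error(`${label}: needs at least one acceptance criterion`);
    }
    // A test with a user briefing is multi-turn; without one it is single-turn.
    return { ...test, multiTurn: typeof test.userBriefing === "string" };
  });
}
```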

2. Set up isolation

The runner creates a temp directory with only the skills you specified:

/tmp/eval-abc123/
  .claude/skills/
    debugging → /your/repo/skills/debugging   (symlink)

The agent under test runs here. It sees only these skills — your personal skills from ~/.claude/ are blocked.
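
The directory layout above can be reproduced with a few filesystem calls. This is a minimal sketch under assumptions: `createIsolatedWorkdir` is a hypothetical name, and the sketch covers only the temp directory and symlinks. How the runner blocks skills from ~/.claude/ is not described here, so that part is left out.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Create a fresh working directory that exposes only the given skills.
// Mirrors the documented layout: <tmp>/.claude/skills/<name> -> <skill dir>.
function createIsolatedWorkdir(skillDirs) {
  const workdir = fs.mkdtempSync(path.join(os.tmpdir(), "eval-"));
  const skillsRoot = path.join(workdir, ".claude", "skills");
  fs.mkdirSync(skillsRoot, { recursive: true });

  for (const skillDir of skillDirs) {
    const target = path.resolve(skillDir);
    const link = path.join(skillsRoot, path.basename(target));
    fs.symlinkSync(target, link, "dir");
  }
  return workdir;
}
```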

3. Run sessions

For each test, the runner starts N sessions in parallel (default 3). Each session is an independent conversation:

                    ┌─────────────────────────────────┐
                    │            eval.js              │
                    └──────┬──────────┬──────────┬────┘
                           │          │          │
                         Run 1      Run 2      Run 3
                           │          │          │
                           ▼          ▼          ▼
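
The fan-out in the diagram could look roughly like the sketch below. `runTest` and the `runSession` callback are hypothetical names; using `Promise.allSettled` so that one crashed session does not discard the others is a design assumption, not documented runner behavior.

```javascript
// Start `runs` independent sessions for one test and wait for all of them.
// `runSession` is a hypothetical async function that runs one conversation
// and resolves with its transcript.
async function runTest(test, runSession, runs = 3) {
  const sessions = Array.from({ length: runs }, (_, i) => runSession(test, i + 1));
  // allSettled keeps the other runs' results if one session throws.
  const settled = await Promise.allSettled(sessions);
  return settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? { run: i + 1, transcript: outcome.value }
      : { run: i + 1, error: String(outcome.reason) }
  );
}
```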

Within each run, the conversation alternates between the agent under test and a simulated user:

Turn 1:  eval.js sends the opening prompt to the AGENT
         ← agent responds

Turn 2:  eval.js sends the transcript so far to the SIMULATED USER
         ← "Can you check the connection pool logs?"

         eval.js sends that message to the AGENT (resuming the session)
         ← agent responds

Turn 3:  eval.js sends updated transcript to the SIMULATED USER
         ← "I ran that command, here's what I see: ..."

         eval.js sends that to the AGENT
         ← agent responds

...until max turns or the simulated user responds [DONE]...

The agent and simulated user are fully isolated — separate processes, separate sessions, no shared state. The simulated user only knows the conversation because eval.js passes the transcript as text in its prompt.

For single-turn tests (no user briefing), the runner sends one prompt and collects one response. No simulated user involved.
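
The turn loop described in this step can be sketched as follows. `runConversation`, the `agent` and `simulatedUser` callbacks, and the default of 10 turns are assumptions for illustration; the real runner talks to separate processes rather than in-process functions.

```javascript
// Alternate between the agent and the simulated user until the user
// replies [DONE] or the turn limit is reached. `agent` and `simulatedUser`
// are hypothetical async callbacks supplied by the caller.
async function runConversation({ openingPrompt, agent, simulatedUser, maxTurns = 10 }) {
  const transcript = [];
  let message = openingPrompt;

  for (let turn = 1; turn <= maxTurns; turn++) {
    transcript.push({ role: "user", text: message });
    const reply = await agent(message);
    transcript.push({ role: "agent", text: reply });

    // Single-turn tests have no simulated user: one prompt, one response.
    if (!simulatedUser || turn === maxTurns) break;

    // The simulated user is stateless; it sees the conversation only as data
    // passed in on every call.
    const next = await simulatedUser(transcript);
    if (next.trim() === "[DONE]") break;
    message = next;
  }
  return transcript;
}
```

Passing the whole transcript on every call keeps the simulated user free of hidden state, which matches the isolation described above.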

4. Grade

For each run, the runner sends the full transcript plus acceptance criteria to a grader LLM:

eval.js → LLM: "Grade this transcript against these criteria"
       ← [{ criterion: "Asks about the error", grade: "pass" },
           { criterion: "Identifies root cause", grade: "fail" }]

Per-run result: all pass → PASS, mixed → PARTIAL, all fail → FAIL.
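
As a sketch, the per-run rule reduces to counting passes. `gradeRun` is a hypothetical name, and rejecting an empty grade list is an assumption, since the behavior for a test with no criteria is not documented.

```javascript
// Collapse per-criterion grades into one per-run result:
// all pass -> PASS, all fail -> FAIL, anything in between -> PARTIAL.
function gradeRun(grades) {
  if (grades.length === 0) throw new Error("no criteria were graded");
  const passed = grades.filter((g) => g.grade === "pass").length;
  if (passed === grades.length) return "PASS";
  if (passed === 0) return "FAIL";
  return "PARTIAL";
}
```

For the grader output shown above (one pass, one fail), this returns PARTIAL.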

5. Verdict

Across all runs (3 by default): 3/3 pass → PASS, 2/3 → FLAKY, 0–1/3 → FAIL. Only a per-run result of PASS counts as a pass; a PARTIAL run does not.
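
A sketch of the verdict rule for three runs follows. `verdictFor` is a hypothetical name. The thresholds are documented only for 3 runs, so the sketch refuses other run counts instead of guessing how they would scale.

```javascript
// Combine three per-run results into a verdict.
// Only "PASS" counts as a pass; "PARTIAL" and "FAIL" do not.
function verdictFor(runResults) {
  if (runResults.length !== 3) {
    throw new Error("verdict thresholds are documented for 3 runs only");
  }
  const passes = runResults.filter((result) => result === "PASS").length;
  if (passes === 3) return "PASS";
  if (passes === 2) return "FLAKY";
  return "FAIL";
}
```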

6. Report

The runner writes a report alongside the eval file and a JSON file with full transcripts and grades:

api.eval.2026-03-03T14-22-01Z.md     ← human-readable report
api.eval.2026-03-03T14-22-01Z.json   ← full transcripts + grades
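
The file names above look like the eval file's base name plus a UTC timestamp with the colons replaced by hyphens. One way to derive them is sketched below; `reportPaths` and `timestampForFilename` are hypothetical helpers, and the timestamp format is inferred from the example names rather than from the runner's source.

```javascript
const path = require("node:path");

// 2026-03-03T14:22:01.000Z -> 2026-03-03T14-22-01Z (safe in file names).
function timestampForFilename(date) {
  return date.toISOString().replace(/\.\d{3}Z$/, "Z").replace(/:/g, "-");
}

// Build the report and JSON paths next to the eval file.
function reportPaths(evalPath, date = new Date()) {
  const dir = path.dirname(evalPath);
  const base = path.basename(evalPath, ".md"); // "api.eval.md" -> "api.eval"
  const ts = timestampForFilename(date);
  return {
    report: path.join(dir, `${base}.${ts}.md`),
    json: path.join(dir, `${base}.${ts}.json`),
  };
}
```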

Data flow

eval file ──→ EXTRACTOR LLM ──→ test definitions
                                      │
                        ┌─────────────┼─────────────┐
                        ▼             ▼             ▼
                      Run 1         Run 2         Run 3
                        │             │             │
                  ┌─────┴─────┐
                  ▼           ▼
               AGENT    SIMULATED USER
             (isolated)  (stateless)
                  └─────┬─────┘
                        ▼
                   transcript
                        │
                        ▼
                   GRADER LLM ──→ per-criterion grades
                                           │
                                           ▼
                                 PASS / PARTIAL / FAIL
                                           │
                          ┌────────────────┴────────────────┐
                          ▼                                 ▼
                      report.md                        results.json

The simulated user

For multi-turn tests, instead of scripting exact user messages, you define a user briefing — the user's goals, knowledge, and personality — and the runner generates natural responses that adapt to the conversation.

The briefing fixes the user's intent while letting the expression vary:

User briefing: You have a Node.js API that returns 500 errors intermittently.
  You've noticed it happens more under load. When the agent asks about your
  setup, tell them it's Express with a PostgreSQL connection pool (max 5).
  Follow their diagnostic steps one at a time.

The simulated user reads the full conversation history plus this briefing and generates the next message (1–3 sentences). When the user's goals are fully achieved, it responds with [DONE] to end the conversation early.

This produces realistic conversational flows where later turns depend on earlier ones — the agent's actual responses shape where the conversation goes, not a fixed script.
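
Because the simulated user is stateless, everything it knows has to be packed into a single prompt. The sketch below shows one way to assemble that prompt. The wording and the name `buildSimulatedUserPrompt` are illustrative assumptions; the runner's real prompt is not shown in this document.

```javascript
// Build the text prompt for the stateless simulated user from the briefing
// and the conversation so far. The wording is illustrative only.
function buildSimulatedUserPrompt(briefing, transcript) {
  const history = transcript
    .map((turn) => `${turn.role === "agent" ? "Agent" : "User"}: ${turn.text}`)
    .join("\n\n");
  return [
    "You are playing the user in this conversation.",
    `Briefing: ${briefing}`,
    "Conversation so far:",
    history,
    "Reply with the user's next message in 1-3 sentences.",
    "If the user's goals are fully achieved, reply with exactly [DONE].",
  ].join("\n\n");
}
```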

User briefing tips

  • State the user's goal clearly ("You have an API returning 500 errors under load and want to find the root cause")
  • Include knowledge the user has ("Your setup is Express with PostgreSQL, connection pool max 5")
  • Describe how to react to the agent ("Answer questions one at a time; when they suggest a diagnostic step, report the result")
  • Keep it focused — the simulated user performs better with clear, concrete instructions

Defining tests

Write your eval file however you want — the runner uses an LLM to extract test definitions, so there's no rigid format. That said, each test needs at minimum:

  • An opening prompt — the first message sent to the agent
  • Acceptance criteria — what the conversation must achieve

For multi-turn tests, also include a user briefing and optionally a max turns limit. For reference, here's one way to write them:

```markdown
### Test 1.1 — Multi-turn debugging session

**What this tests:** Can the agent diagnose an intermittent server error through
a structured conversation?

**User briefing:** You have a Node.js Express API that returns 500 errors
intermittently. Your setup: Express 4, PostgreSQL with a connection pool (max 5
connections). When the agent asks questions, answer one at a time. When they
suggest a diagnostic step, report the result: the connection pool is frequently
exhausted.

**Opening prompt:** My API keeps returning 500 errors intermittently. It seems
to happen more under load. Can you help me figure out what's going on?

**Max turns:** 10

**Acceptance criteria:**
- [ ] Asks clarifying questions before jumping to solutions
- [ ] Asks about the stack trace or error message
- [ ] Identifies pool exhaustion as the likely root cause
- [ ] Recommends increasing pool size or adding connection timeout handling

**Ground truth:**
- Root cause: connection pool too small for request volume
- Fix: increase pool size, add connection timeout, or add retry logic
```

Single-turn tests

When a test only needs one exchange, omit the user briefing and max turns:

```markdown
### Test 2.1 — Error message explanation

**What this tests:** Can the agent explain a common error clearly?

**Opening prompt:** What does "ECONNREFUSED 127.0.0.1:5432" mean?

**Acceptance criteria:**
- [ ] Explains that the connection to PostgreSQL was refused
- [ ] Suggests checking if the database server is running
- [ ] Does NOT suggest unrelated causes
```

File organization

Eval files live alongside the code they test:

project/
  agents/
    support-bot/
      config.md
      config.eval.md                          # evals for support bot
      config.eval.2026-03-01T05-34-25Z.md     # eval report
    code-reviewer/
      config.md
      config.eval.md                          # evals for code reviewer
  project.eval.md                             # project-level evals

Find all evals: glob **/*.eval.md
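
If a glob tool is not at hand, the same search can be done with a short recursive walk. This is a sketch: `findEvalFiles` is a hypothetical helper, and skipping `node_modules` and `.git` is an assumption added to keep the walk fast.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Recursively collect files matching **/*.eval.md under `root`.
// Timestamped reports (name.eval.<timestamp>.md) do not end in ".eval.md",
// so they are skipped automatically.
function findEvalFiles(root) {
  const found = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      if (entry.name === "node_modules" || entry.name === ".git") continue;
      found.push(...findEvalFiles(full));
    } else if (entry.isFile() && entry.name.endsWith(".eval.md")) {
      found.push(full);
    }
  }
  return found.sort();
}
```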

Running evals

```bash
# Run all tests in an eval file
node eval.js --eval path/to/file.eval.md --skill path/to/skill-dir

# Run a single test by ID
node eval.js --eval path/to/file.eval.md --test "1.1" --skill path/to/skill-dir

# Override defaults
node eval.js --eval path/to/file.eval.md --runs 5 --skill path/to/skill-dir
```

CLI reference

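| Flag          | Purpose                       | Default               |
| ------------- | ----------------------------- | --------------------- |
| --eval PATH   | Eval definition file          | (required)            |
| --test ID     | Run only this test            | all tests             |
| --skill PATH  | Skill directory (repeatable)  | none                  |
| --runs N      | Parallel sessions per test    | 3                     |
| --output PATH | JSON transcript + grades file | {name}.eval.{ts}.json |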
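
For readers wiring these flags into their own tooling, the sketch below parses them with `parseArgs` from Node's built-in `node:util` module. It is an illustration of the documented flags and defaults, not the runner's actual parser; `parseCliArgs` and the returned property names are hypothetical.

```javascript
const { parseArgs } = require("node:util");

// Parse the documented flags and apply the documented defaults.
function parseCliArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      eval: { type: "string" },
      test: { type: "string" },
      skill: { type: "string", multiple: true }, // repeatable
      runs: { type: "string" },
      output: { type: "string" },
    },
  });
  if (!values.eval) throw new Error("--eval PATH is required");

  const runs = values.runs === undefined ? 3 : Number(values.runs);
  if (!Number.isInteger(runs) || runs < 1) {
    throw new Error("--runs must be a positive integer");
  }
  return {
    evalPath: values.eval,
    testId: values.test ?? null, // null means "all tests"
    skills: values.skill ?? [], // empty means no skills
    runs,
    outputPath: values.output ?? null, // null means "use the default name"
  };
}
```

A script would typically call it as `parseCliArgs(process.argv.slice(2))`.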

Grading

For each acceptance criterion, the grader judges whether the conversation met it. Criteria are session-level — they apply to the full transcript, not individual turns.

| Per-run grade | Meaning                     |
| ------------- | --------------------------- |
| PASS          | All criteria met            |
| PARTIAL       | Some criteria met, some not |
| FAIL          | No criteria met             |

Run every test 3 times — the minimum for a majority signal.

| Verdict | Meaning                          |
| ------- | -------------------------------- |
| PASS    | 3/3 passed                       |
| FLAKY   | 2/3 passed (needs investigation) |
| FAIL    | 0–1/3 passed                     |

Reports

Each eval run produces a report file named {name}.eval.{timestamp}.md, written alongside the eval definition.

```markdown
# Eval Run — {name} — {timestamp}

**Commit:** {short hash}
**Uncommitted changes:** {description or "none"}

---

### 1.1 — Multi-turn debugging session: PASS

- [x] Asks clarifying questions before jumping to solutions
- [x] Asks about the stack trace or error message
- [x] Identifies pool exhaustion as the likely root cause
- [x] Recommends increasing pool size or adding connection timeout handling

Run 1: PASS | Run 2: PASS | Run 3: PASS

---

## Summary

| #   | Test                       | Run 1 | Run 2 | Run 3   | Verdict |
| --- | -------------------------- | ----- | ----- | ------- | ------- |
| 1.1 | Multi-turn debugging       | PASS  | PASS  | PASS    | PASS    |
| 2.1 | Error message explanation  | PASS  | PASS  | PARTIAL | FLAKY   |

2 tests: 1 passed, 1 flaky
```

Report conventions

  • Record the git commit hash and any uncommitted changes at the top
  • Grade each criterion with [x] (pass), [ ] (fail), or [~] (partial)
  • Show per-run results on one line: Run 1: PASS | Run 2: PARTIAL | Run 3: PASS
  • End with a summary table and totals
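
The summary table and totals line could be generated along these lines. `renderSummary` and its row shape are hypothetical. The label "failed" in the totals is an assumption, since the sample report shows only "passed" and "flaky".

```javascript
// Render the closing summary table and totals line of a report.
// Each row: { id, name, runs: ["PASS", ...], verdict: "PASS" | "FLAKY" | "FAIL" }.
function renderSummary(rows) {
  const runCount = Math.max(...rows.map((r) => r.runs.length));
  const runHeaders = Array.from({ length: runCount }, (_, i) => `Run ${i + 1}`);
  const lines = [
    `| # | Test | ${runHeaders.join(" | ")} | Verdict |`,
    `| --- | --- | ${runHeaders.map(() => "---").join(" | ")} | --- |`,
    ...rows.map((r) => `| ${r.id} | ${r.name} | ${r.runs.join(" | ")} | ${r.verdict} |`),
  ];

  // Totals such as "2 tests: 1 passed, 1 flaky"; zero counts are omitted.
  const labels = { PASS: "passed", FLAKY: "flaky", FAIL: "failed" };
  const counts = Object.keys(labels)
    .map((v) => [rows.filter((r) => r.verdict === v).length, labels[v]])
    .filter(([n]) => n > 0)
    .map(([n, label]) => `${n} ${label}`);
  const noun = rows.length === 1 ? "test" : "tests";
  return `${lines.join("\n")}\n\n${rows.length} ${noun}: ${counts.join(", ")}`;
}
```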