# Benchmarking Skep
This is a maintainer / evaluator tool, not a CLI verb.

Skep deliberately does NOT ship a `skep benchmark` top-level command. Benchmarks run through a script from a checked-out copy of the skep repo; they are not part of the daily user surface. The tool’s always-on observability (`tokens_used` on every task, visible in `skep task show` and `skep workspace watch`) covers the “what did this cost me?” question for everyday users. The harness in `benchmarks/` is for producing reproducible A/B numbers — the kind you put in a blog post or README table.
The harness measures the real token cost of running a task through
Skep vs running the same task with plain `claude -p` and no Skep MCP.
## The three-tier observability policy

| Layer | What | Where it lives | Who runs it |
|---|---|---|---|
| 1. Always-on | `tokens_used` + `tools_used_json` on every task row | Built into the executor (`internal/agent/session_usage.go`) | Runs automatically on every `skep task run`. No flag, no opt-in. |
| 2. A/B harness | `bench.sh` + `aggregate.go` | `benchmarks/` in the skep repo | Maintainers and evaluators, manually, once per numbers-producing session |
| 3. No CLI verb | — | — | — |
There is no `skep benchmark` command, no `--benchmark` flag on
`skep task create`, and no `--benchmark` flag on `create_remote_task`.
That is intentional and explained at the bottom of this page.
## What is measured

Three numbers per (repo × task):
- Classifier+plan — one-shot LLM call with a pre-built context from the index. Skep-only; no baseline comparison exists because without skep there is no classifier at all. Captured as wall clock + exit code.
- Executor baseline — plain `claude -p "<task>" --allowedTools Edit,Write,Bash`, no MCP, no pre-built plan. Claude explores the repo with `Read`/`Grep`/`Glob` as needed. This is what you’d pay to run the task without skep.
- Executor with MCP — same `claude -p "<task>"` but with `--mcp-config` pointing at skep’s stdio MCP server. Claude can call `search_symbols`, `get_file_context`, `get_call_graph`, etc. instead of reading files cold.
The comparison that matters: (classifier + executor_with_MCP) vs
executor_baseline. Skep wins when the sum is smaller.
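That decision rule is simple enough to sketch directly. In the Go snippet below, the function and variable names are illustrative, not taken from the harness:

```go
package main

import "fmt"

// tokenSavings computes the comparison the harness cares about:
// baseline minus (classifier + with-MCP). A positive result means
// Skep saved tokens on this task; a negative one means it cost more.
func tokenSavings(classifier, withMCP, baseline int) int {
	return baseline - (classifier + withMCP)
}

func main() {
	// Hypothetical per-task token totals for one (repo × task) cell.
	fmt.Println(tokenSavings(3000, 40000, 55000)) // prints 12000
}
```

The aggregator reports the per-cell numbers; the sign of this difference is what the README-table claim ultimately rests on.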
## Running

```sh
# 1. Build the measurement tool
(cd benchmarks/measure && go build -o ../measure.bin .)

# 2. Run the matrix. Pass one or more already-initialized repo paths.
./benchmarks/bench.sh ~/code/backend ~/code/mobile

# 3. Aggregate the results into a markdown table.
go run ./benchmarks/aggregate.go benchmarks/results/<timestamp>/ \
  > benchmarks/results/<timestamp>/results.md
```

Each task runs three times per repo (classify → baseline → with-MCP), each on a throwaway branch that is deleted after measurement. Your working tree is not touched.
## Prerequisites

- `skep` installed on PATH (`make install`)
- `claude` (Claude Code CLI) on PATH
- Each target repo already through `skep init` with a valid LLM configured (the classify step uses the configured classify model)
- Go 1.21+ for the measure tool and aggregator
## Dedup benchmark

A second, smaller harness lives at `benchmarks/dedup/`. It evaluates
the four cheap dedup layers (keyword → trigram → tf-idf → minhash)
against a labeled fixture of task-description pairs and reports
per-layer precision / recall / F1 plus per-category recall.
```sh
cd benchmarks/dedup
go run .
```

Use this to tune the `SKEP_DEDUP_*_THRESHOLD` env vars for your repo’s
real task history. The shipped fixture has 38 hand-labeled pairs across
ten categories (exact, morphological, reorder, paraphrase, synonym,
near-miss, direction-reversed, shared-rare-word, same-surface-different-fix,
report-vs-task). See `benchmarks/dedup/README.md` in the repo for
full usage and fixture format.
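The per-layer scores are the standard definitions over the labeled pairs. As a sketch of the arithmetic (the type and function here are illustrative, not the harness’s own code):

```go
package main

import "fmt"

// counts holds the confusion-matrix tallies for one dedup layer
// scored against the labeled fixture: true positives, false
// positives, and false negatives. Names are illustrative only.
type counts struct{ tp, fp, fn int }

// prf1 returns precision, recall, and F1 from the raw counts.
func prf1(c counts) (p, r, f1 float64) {
	p = float64(c.tp) / float64(c.tp+c.fp)
	r = float64(c.tp) / float64(c.tp+c.fn)
	f1 = 2 * p * r / (p + r)
	return
}

func main() {
	// Example: a layer that found 8 of 10 true duplicates with 2 false alarms.
	p, r, f1 := prf1(counts{tp: 8, fp: 2, fn: 2})
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f1)
}
```

Per-category recall is the same recall computation restricted to pairs in one of the ten fixture categories.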
## Why per-turn prompt cache matters

Claude Code heavily uses Anthropic’s prompt cache — the first turn pays
full price, subsequent turns read ~90% of the context from cache at 10%
of the cost. The measure tool emits both the raw
`cache_read_input_tokens` and an `effective_input_billed` which
approximates the real billed amount as:

```
effective_input_billed = input_tokens + 0.1 × cache_read_input_tokens
```

The aggregated markdown table reports `effective_input_billed` because
that’s what your Anthropic invoice actually reflects. Raw `input_tokens`
overstate cost once you’ve been in a session for a few turns.
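As a concrete instance of the formula (a sketch, not the measure tool’s actual code):

```go
package main

import "fmt"

// effectiveInputBilled approximates what the invoice charges for one
// turn: uncached input at full price, cache reads at roughly 10% of
// the input rate. This mirrors the formula above.
func effectiveInputBilled(inputTokens, cacheReadTokens float64) float64 {
	return inputTokens + 0.1*cacheReadTokens
}

func main() {
	// A later turn in a session: 2k fresh tokens, 80k read from cache.
	// Raw input would suggest 82k tokens; the billed figure is far lower.
	fmt.Printf("%.0f\n", effectiveInputBilled(2000, 80000))
}
```

Here the raw count overstates the billed amount by roughly 8×, which is why the table column uses the effective figure.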
## Fixing the task set

Tasks live in `benchmarks/tasks/tasks.yaml`. The defaults aim for the
five common archetypes (bug / feature / refactor / test / docs) and are
sized to complete in under 5 minutes on a medium repo. Add or replace
them if you want to benchmark a specific workload.
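A hypothetical entry, to show the kind of task that fits the sizing guideline. The field names below are assumptions for illustration — check `benchmarks/tasks/tasks.yaml` in the repo for the real schema:

```yaml
# Hypothetical shape; see benchmarks/tasks/tasks.yaml for the actual format.
- id: fix-null-deref
  archetype: bug
  description: >
    Fix the nil-pointer dereference in the session cleanup path and
    add a regression test.
```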
## Caveats

- The harness measures classifier wall clock but not classifier token cost directly. Classify shells out to `claude -p` and does not emit a session JSONL the way an interactive Claude Code run does, so the per-classify token count is not recoverable from the harness.
- The “baseline” variant runs plain `claude -p` with no MCP. If you want to compare against a different tool you’ll need to adapt the harness.
- Wall-clock numbers reflect LLM latency, not harness overhead.
## Why this isn’t a CLI verb

A `skep benchmark` subcommand would make benchmarking feel like an
everyday operation. It isn’t. Producing a reproducible A/B number:
- Doubles the LLM cost of the task being measured (once with MCP, once without)
- Doubles the wall clock
- Only makes sense on a curated set of canonical tasks (the ones in `benchmarks/tasks/tasks.yaml`), not on whatever real work the user happens to have in their queue
- Produces artifacts (per-variant JSON rows, aggregated markdown table) that are intended for a blog post, not for the user’s daily flow
Putting this in the main CLI help menu implies “run this sometimes” and invites the question “when should I benchmark?” — a question with no good answer. The right answer is “never, unless you’re writing a blog post about skep.”
Look at the neighbors for the established pattern:
| Tool | Has `<tool> benchmark`? | Where benchmarks live |
|---|---|---|
| git | no | outside |
| kubectl | no | kube-burner, k6-kubernetes |
| helm | no | outside |
| docker | no | outside |
| aider (the closest peer) | no | separate harness, maintainers only |
Only cargo bench is a counterexample, and it measures user code
performance — not cargo’s own.
## Reproducing a published Skep benchmark result

If you see a token-savings number in the Skep README, blog post, or
docs site, you can reproduce it yourself by cloning the skep repo and
running `bench.sh` against a few of your own initialized repos. The
script is the source of truth; the numbers in marketing materials are
derived from it.