eval-analysis
Install
mkdir -p .claude/skills/eval-analysis && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15878" && unzip -o skill.zip -d .claude/skills/eval-analysis && rm skill.zip
Installs to .claude/skills/eval-analysis
Activation
Comprehensive guide for analyzing SABER evaluation results: model comparison, agent architecture comparison, domain-specific analysis, and cross-domain aggregate analysis. Use this when asked to analyze eval results, compare models, generate visualizations, interpret scores, investigate cost-efficiency trade-offs, or extend analysis to new domains.
About this skill
Eval Analysis Skill
Overview
SABER evaluation analysis is a structured data science framework for comparing AI agent performance across cybersecurity benchmark domains. The analysis pipeline transforms .eval log files into actionable insights through 14 standardized core experiments plus domain-specific analyses.
Getting Started: Sample Eval Data
No evals to analyze yet? The repo ships with pre-run sample .eval files and can auto-download more from HuggingFace.
Option 1: Use bundled sample evals (fastest)
The eval_samples/ directory contains pre-run evaluation results for all 3 domains across 5 models and 3 agent architectures:
eval_samples/
├── excytin/ # 20 eval files (5 models × default + reasoning baselines + 3 agents)
├── cybench/ # 17 eval files
└── cti_realm/ # 11 eval files
Models included: Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4, GPT-5.4-mini
Agent architectures included: React, GH Copilot, Claude Code (for Sonnet 4.6)
Baselines: No-reasoning/no-thinking variants for extended thinking comparison
The notebooks are pre-configured to load from eval_samples/. Just run any notebook — no configuration needed.
Option 2: Auto-download from HuggingFace
If .eval files are missing locally, ensure_eval_files() automatically downloads them from the AcesEvals HuggingFace dataset:
from saber_analysis import ensure_eval_files
ensure_eval_files(
    eval_logs=EVAL_LOGS,   # {display_name: filename.eval}
    log_dir=LOG_DIR,       # local directory
    domain='excytin',      # HuggingFace subfolder
    fallback_dirs=[        # check these before downloading
        str(REPO_ROOT / 'eval_samples' / 'excytin'),
        str(REPO_ROOT / 'latest_experiments' / 'excytin'),
    ],
)
Resolution order: local log_dir → fallback_dirs → HuggingFace download
HuggingFace repo: anandmudgerikar/AcesEvals (dataset type)
HuggingFace path: latest_experiment_samples/{domain}/{filename}.eval
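The resolution order above can be sketched as a small lookup function. This is a minimal illustration, not the actual body of ensure_eval_files(): the name resolve_eval_file and the injectable download hook are hypothetical, and the hook stands in for whatever HuggingFace client call the library really uses.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def resolve_eval_file(
    filename: str,
    log_dir: Path,
    fallback_dirs: Iterable[Path],
    download: Optional[Callable[[str, Path], Path]] = None,
) -> Path:
    """Return a path to `filename`, trying local dir, then fallbacks, then download."""
    # 1. Prefer a copy that is already in the local log directory.
    local = Path(log_dir) / filename
    if local.exists():
        return local

    # 2. Otherwise take the first fallback directory that contains the file.
    for directory in fallback_dirs:
        candidate = Path(directory) / filename
        if candidate.exists():
            return candidate

    # 3. Last resort: delegate to a caller-supplied downloader (hypothetical hook).
    if download is None:
        raise FileNotFoundError(f"{filename} not found locally and no downloader given")
    return download(filename, Path(log_dir))
```

Passing the downloader in as a parameter keeps the lookup testable offline; in a notebook it would wrap the real HuggingFace download.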
Option 3: Run your own evaluations
If you need results for a model or agent architecture not in the sample evals:
# Run for a new model
uv run inspect eval domains/<domain> --model <api/model-id> --display plain
# Run for a specific agent architecture
uv run inspect eval domains/<domain> --model <api/model-id> -T agent=react --display plain
uv run inspect eval domains/<domain> --model <api/model-id> -T agent=copilot --display plain
# Run with extended reasoning disabled (for baseline comparison)
# Set INSPECT_EVAL_MODEL_ARGS=reasoning_effort=none or configure in .env
Then copy the .eval file to eval_samples/<domain>/ or latest_experiments/<domain>/ and add it to the notebook's EVAL_LOGS configuration.
What's available vs. what you need to run
| Domain | Models (sample evals) | Agent Architectures (sample evals) | What to run yourself |
|---|---|---|---|
| Excytin | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | React, Copilot, Claude Code (Sonnet) | New models, new agents on non-Sonnet models |
| CyBench | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | React, Copilot, Claude Code (Sonnet) | New models, new agents, more challenges |
| CTI Realm | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | Copilot, Claude Code (Sonnet) | React for non-Sonnet models, new models |
Notebook Inventory
| Notebook | Scope | When to use |
|---|---|---|
| notebooks/eval_analysis.ipynb | Domain-agnostic template | Analyzing any single domain's model results; starting point for new domains |
| notebooks/excytin_analysis.ipynb | Excytin-specific (self-contained) | Full analysis of Excytin incident response domain |
| notebooks/cybench_analysis.ipynb | CyBench-specific (self-contained) | Full analysis of CyBench CTF challenges |
| notebooks/cti_realm_analysis.ipynb | CTI Realm-specific (self-contained) | Full analysis of CTI Realm threat intel domain |
| notebooks/agent_architecture_analysis.ipynb | Agent comparison template | Comparing agent architectures (React, Copilot, Claude Code) on any domain |
| notebooks/excytin_agent_architecture_analysis.ipynb | Excytin agent comparison | Agent architecture analysis specific to Excytin |
| notebooks/cybench_agent_architecture_analysis.ipynb | CyBench agent comparison | Agent architecture analysis specific to CyBench |
| notebooks/cti_realm_agent_architecture_analysis.ipynb | CTI Realm agent comparison | Agent architecture analysis specific to CTI Realm |
| notebooks/aggregate_model_analysis.ipynb | Cross-domain model comparison | Aggregate model analysis across all domains (normalized) |
| notebooks/aggregate_agent_architecture_analysis.ipynb | Cross-domain agent comparison | Aggregate agent architecture analysis across all domains |
| notebooks/model_safety_filters_analysis.ipynb | Safety filter investigation | Analyzing model safety refusals and guardrail triggers |
| notebooks/excytin_basic_analysis.ipynb | Quick Excytin analysis | Lightweight Excytin analysis for rapid iteration |
Shared Analysis Library
The notebooks/saber_analysis/ package provides reusable utilities:
| Module | Functions | Purpose |
|---|---|---|
| data_loader.py | ensure_eval_files(), load_eval_logs(), load_baseline_logs(), load_trajectory_data() | Standardized .eval parsing; HuggingFace fallback downloads; typed inspect_ai log API |
| cost.py | calc_cost(), extract_cost_rows() | Token→dollar conversion using per-model pricing |
| plots.py | setup_plotting(), make_legend_patches(), classify_score_type() | Consistent matplotlib styling across all notebooks |
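To make the token→dollar conversion concrete, here is a rough sketch of how a per-model pricing entry could be applied to token counts. It is not the real calc_cost(): the function name sketch_calc_cost, the model ID, and the prices are made up, and it assumes prices are quoted in USD per one million tokens.

```python
# Hypothetical pricing table in the {api_model_id: {input, output, cache_read, cache_write}} shape.
# Assumption: prices are USD per one million tokens; the numbers are invented.
PRICING = {
    "example/model-a": {"input": 2.0, "output": 8.0, "cache_read": 0.5, "cache_write": 4.0},
}

TOKENS_PER_UNIT = 1_000_000


def sketch_calc_cost(model_id: str, usage: dict, pricing: dict = PRICING) -> float:
    """Sum tokens * price over every token category in the model's pricing entry."""
    rates = pricing[model_id]
    return sum(
        usage.get(category, 0) * rate / TOKENS_PER_UNIT
        for category, rate in rates.items()
    )


usage = {"input": 500_000, "output": 250_000, "cache_read": 1_000_000, "cache_write": 0}
total = sketch_calc_cost("example/model-a", usage)
print(f"${total:.2f}")
```

How reasoning tokens are billed is not covered here; the sketch only handles the four categories named in the pricing entry.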
Data Pipeline
.eval ZIP archives
│
├── header.json → run metadata, overall saber_overall scores
├── samples/*.json → per-sample messages, scores, timing, tool calls
└── summaries.json → aggregated score summaries
│
▼
inspect_ai.log API
├── read_eval_log(path, header_only=True) → EvalLog with results.scores
├── read_eval_log_sample_summaries(path) → per-sample scores & timing
└── read_eval_log_sample(path, id=...) → full messages & tool calls
│
▼
saber_analysis library
├── ensure_eval_files() → locates locally or downloads from HuggingFace
├── load_eval_logs() → parses into DataFrames (overall_df, samples_df, subtasks_df)
├── load_trajectory_data() → extracts tool calls, steps, timing per sample
└── extract_cost_rows() → builds cost DataFrame from token usage + pricing
│
▼
Notebook Configuration
├── EVAL_LOGS: dict[str, str] → {display_name: filename.eval}
├── COLORS: dict[str, str] → {display_name: hex_color}
├── PRICING: dict[str, dict] → {api_model_id: {input, output, cache_read, cache_write}}
├── GROUP_FN: Callable → domain-specific sample ID → group label
└── NO_THINKING_LOGS (optional) → baseline without extended reasoning
│
▼
Analysis Cells (Experiments 1–14+)
│
▼
Artifacts saved to notebooks/artifacts/{domain}/*.png
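Because a .eval file is a ZIP archive, its layout can be checked with the standard library alone before handing it to inspect_ai. The sketch below is illustrative only: summarize_eval_archive is a hypothetical helper, and the archive it inspects is a tiny in-memory stand-in whose member names and contents are invented to mirror the header.json / samples/*.json / summaries.json layout.

```python
import io
import json
import zipfile


def summarize_eval_archive(source) -> dict:
    """Report the top-level layout of a .eval ZIP archive without parsing samples."""
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
        header = json.loads(archive.read("header.json")) if "header.json" in names else None
        sample_files = [n for n in names if n.startswith("samples/") and n.endswith(".json")]
    return {
        "has_header": header is not None,
        "header_keys": sorted(header) if header else [],
        "n_samples": len(sample_files),
        "has_summaries": "summaries.json" in names,
    }


# Build a tiny stand-in archive in memory so the sketch runs without real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("header.json", json.dumps({"status": "success"}))
    archive.writestr("samples/sample_1_epoch_1.json", json.dumps({"id": "sample_1"}))
    archive.writestr("summaries.json", json.dumps([]))
buffer.seek(0)

info = summarize_eval_archive(buffer)
print(info)
```

For real analysis the inspect_ai.log readers listed in the diagram remain the better entry point, since they return typed objects rather than raw JSON.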
Key DataFrames
| DataFrame | Columns | Source |
|---|---|---|
| overall_df | model, mean, stderr | load_eval_logs() — header-level saber_overall |
| samples_df | model, sample_id, group, score | load_eval_logs() — per-sample saber_overall |
| subtasks_df | model, sample_id, group, score_type, score | load_eval_logs() — submission/checkpoint/aggregate breakdown |
| traj_df | model, sample_id, n_tool_calls, n_steps, total_time, tool_counts | load_trajectory_data() — agent behavior data |
| cost_df | model, score, input_tokens, output_tokens, cache_read, cache_write, reasoning_tokens, total_cost, cost_per_sample | extract_cost_rows() |
Domain-Agnostic Analysis Types (Experiments 1–14)
These 14 core analyses apply to any SABER domain. They appear in eval_analysis.ipynb and every domain-specific notebook.
Experiment 1: Overall Reward Distribution
- Metric: Mean saber_overall score per model
- Visualization: Bar chart with error bars (stderr) + optional hatched overlay for reasoning delta
- Interpretation:
- Higher bars = better performance
- Overlapping error bars = the difference may not be statistically significant (stderr bars are a rough guide, not a formal test)
- Reasoning delta (hatched) shows impact of extended thinking (positive = reasoning helps)
- Code pattern:
overall_df.plot.bar(x='model', y='mean', yerr='stderr')
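For intuition about what the bars and error bars represent, the following sketch computes a mean and standard error from per-sample scores and checks whether two models' mean ± stderr intervals overlap. The scores and model names are invented, and it assumes stderr is the sample standard deviation divided by √n; the header-level stderr stored in the .eval files may be computed differently.

```python
import math
import statistics

# Hypothetical per-sample saber_overall scores for two made-up models.
scores = {
    "model-a": [0.8, 0.6, 0.7, 0.9],
    "model-b": [0.5, 0.7, 0.6, 0.6],
}


def mean_and_stderr(values):
    """Return (mean, standard error of the mean) using the sample standard deviation."""
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


def bars_overlap(a, b):
    """True when the mean +/- stderr intervals of two models intersect."""
    (mean_a, se_a), (mean_b, se_b) = a, b
    return abs(mean_a - mean_b) <= se_a + se_b


summary = {model: mean_and_stderr(vals) for model, vals in scores.items()}
for model, (mean, se) in summary.items():
    print(f"{model}: mean={mean:.3f} stderr={se:.3f}")
print("overlap:", bars_overlap(summary["model-a"], summary["model-b"]))
```

An overlap check like this is only a quick visual heuristic; a paired test over shared samples would be a stronger basis for claiming one model beats another.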
Experiment 2: Per-Group Breakdown (Domain-Specific Grouping)
- Metric: Scores segmented by GROUP_FN dimension
- Visualization: Clustered bar chart (models × groups)
- Group functions by domain:
- Excytin: '_'.join(id.split('_')[:2]) → 8 security incidents
- CyBench: '_'.join(id.split('_')[:-2]) → challenge types
- CTI Realm: id.split('_')[0] → 3 platforms (Linux, AKS, Cloud)
- Interpretation:
- Large within-group variance = model-dependent difficulty
- Shifting model rankings across groups = no single dominant model
- Uniform clusters = inherently hard/easy groups
- Requires: Setting GROUP_FN in configuration; set GROUP_FN = None to skip
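The grouping expressions listed above can be tried out on a few IDs before wiring them into a notebook. In this sketch the three functions wrap those expressions directly, while mean_score_by_group and every sample ID are hypothetical; real IDs in the .eval files may follow a different pattern.

```python
from collections import defaultdict
from statistics import fmean


# The three grouping expressions from this section, wrapped as named functions.
def excytin_group(sample_id: str) -> str:
    return "_".join(sample_id.split("_")[:2])


def cybench_group(sample_id: str) -> str:
    return "_".join(sample_id.split("_")[:-2])


def cti_realm_group(sample_id: str) -> str:
    return sample_id.split("_")[0]


def mean_score_by_group(rows, group_fn):
    """rows: iterable of (sample_id, score). Returns {group: mean score}."""
    buckets = defaultdict(list)
    for sample_id, score in rows:
        buckets[group_fn(sample_id)].append(score)
    return {group: fmean(values) for group, values in buckets.items()}


# Hypothetical sample IDs and scores, for illustration only.
rows = [("incident_5_q1", 1.0), ("incident_5_q2", 0.0), ("incident_38_q1", 0.5)]
group_means = mean_score_by_group(rows, excytin_group)
print(group_means)
```

Running the grouping function over a handful of real IDs first is a cheap way to confirm that the number of distinct groups matches expectations (for example, 8 incidents for Excytin).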
Experiment 3: Cost Analysis
- Metrics: Total cost ($), cost per sample, token breakdown (input/output/cache/reasoning)
- Visualizations:
- Total cost bar chart
- Cost per sample bar chart
- Pareto frontier (Score vs. Cost — upper-left corner = best value)
- Interpretation:
- Pareto-dominant models: no other model is both cheaper AND better
- Arrows on Pareto chart show reasoning cost delta
- Diminishing returns visible from frontier plateau
- Code pattern:
cost_df = pd.DataFrame(extract_cost_rows(EVAL_LOGS
Content truncated.
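The Pareto-dominance rule from the Experiment 3 interpretation (no other model is both cheaper and better) can be sketched without any plotting code. The function pareto_frontier and the cost/score pairs below are hypothetical and are not taken from the notebooks.

```python
def pareto_frontier(points):
    """points: {name: (cost, score)}. Keep names not dominated by any other point.

    A point is dominated when another point is no more expensive and no worse,
    and strictly better on at least one of the two axes.
    """
    frontier = []
    for name, (cost, score) in points.items():
        dominated = any(
            other_cost <= cost
            and other_score >= score
            and (other_cost < cost or other_score > score)
            for other, (other_cost, other_score) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Sort cheapest first so the frontier reads left to right on a Score vs. Cost plot.
    return sorted(frontier, key=lambda n: points[n][0])


# Hypothetical (total cost in $, score) pairs for made-up models.
points = {
    "model-a": (1.0, 0.40),
    "model-b": (3.0, 0.70),
    "model-c": (4.0, 0.65),  # costs more than model-b and scores lower
    "model-d": (9.0, 0.75),
}
frontier = pareto_frontier(points)
print(frontier)
```

The pairwise comparison is quadratic in the number of models, which is fine for the handful of models compared in these notebooks.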