
eval-analysis

Install

mkdir -p .claude/skills/eval-analysis && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15878" && unzip -o skill.zip -d .claude/skills/eval-analysis && rm skill.zip

Installs to .claude/skills/eval-analysis

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Comprehensive guide for analyzing SABER evaluation results — model comparison, agent architecture comparison, domain-specific analysis, and cross-domain aggregate analysis. Use this when asked to analyze eval results, compare models, generate visualizations, interpret scores, investigate cost-efficiency trade-offs, or extend analysis to new domains.

About this skill

Eval Analysis Skill

Overview

SABER evaluation analysis is a structured data science framework for comparing AI agent performance across cybersecurity benchmark domains. The analysis pipeline transforms .eval log files into actionable insights through 14 standardized core experiments plus domain-specific analyses.

Getting Started: Sample Eval Data

No evals to analyze yet? The repo ships with pre-run sample .eval files and can auto-download more from HuggingFace.

Option 1: Use bundled sample evals (fastest)

The eval_samples/ directory contains pre-run evaluation results for all 3 domains across 5 models and 3 agent architectures:

eval_samples/
├── excytin/          # 20 eval files (5 models × default + reasoning baselines + 3 agents)
├── cybench/          # 17 eval files
└── cti_realm/        # 11 eval files

Models included: Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4, GPT-5.4-mini
Agent architectures included: React, GH Copilot, Claude Code (for Sonnet 4.6)
Baselines: No-reasoning/no-thinking variants for extended thinking comparison

The notebooks are pre-configured to load from eval_samples/. Just run any notebook — no configuration needed.

Option 2: Auto-download from HuggingFace

If .eval files are missing locally, ensure_eval_files() automatically downloads them from the AcesEvals HuggingFace dataset:

from saber_analysis import ensure_eval_files

ensure_eval_files(
    eval_logs=EVAL_LOGS,       # {display_name: filename.eval}
    log_dir=LOG_DIR,           # local directory
    domain='excytin',          # HuggingFace subfolder
    fallback_dirs=[            # check these before downloading
        str(REPO_ROOT / 'eval_samples' / 'excytin'),
        str(REPO_ROOT / 'latest_experiments' / 'excytin'),
    ],
)

Resolution order: local log_dir → fallback_dirs → HuggingFace download
HuggingFace repo: anandmudgerikar/AcesEvals (dataset type)
HuggingFace path: latest_experiment_samples/{domain}/{filename}.eval
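
The resolution order above can be sketched as a small lookup helper. This is an illustration only, not the body of ensure_eval_files(): the function name find_local_eval and its return convention are assumptions, and the download step is left to the caller.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_local_eval(
    filename: str, log_dir: str, fallback_dirs: Iterable[str] = ()
) -> Optional[Path]:
    """Return the first existing copy of `filename`.

    Hypothetical helper: checks log_dir first, then each fallback directory
    in the order given. Returns None when no local copy exists, which is the
    point where a HuggingFace download would take over.
    """
    for directory in [log_dir, *fallback_dirs]:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

A None result is the signal to fall through to the remote dataset. Whether the real function copies a fallback hit into log_dir or reads it in place is not stated in this guide.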

Option 3: Run your own evaluations

If you need results for a model or agent architecture not in the sample evals:

# Run for a new model
uv run inspect eval domains/<domain> --model <api/model-id> --display plain

# Run for a specific agent architecture
uv run inspect eval domains/<domain> --model <api/model-id> -T agent=react --display plain
uv run inspect eval domains/<domain> --model <api/model-id> -T agent=copilot --display plain

# Run with extended reasoning disabled (for baseline comparison)
# Set INSPECT_EVAL_MODEL_ARGS=reasoning_effort=none or configure in .env

Then copy the .eval file to eval_samples/<domain>/ or latest_experiments/<domain>/ and add it to the notebook's EVAL_LOGS configuration.
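
The copy-and-register step could look roughly like the sketch below. The helper name register_eval and its parameters are hypothetical; only the eval_samples/<domain>/ layout and the {display_name: filename.eval} shape of EVAL_LOGS come from this guide.

```python
import shutil
from pathlib import Path


def register_eval(eval_path, domain, display_name, eval_logs, samples_root="eval_samples"):
    """Copy a finished .eval file into <samples_root>/<domain>/ and record it.

    Hypothetical helper: `eval_logs` is the notebook's EVAL_LOGS dict and is
    updated in place with {display_name: filename.eval}.
    """
    src = Path(eval_path)
    dest_dir = Path(samples_root) / domain
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # keeps the original file in place
    eval_logs[display_name] = src.name
    return dest
```

In a notebook the same effect is usually achieved by copying the file by hand and adding one line to the EVAL_LOGS cell; the helper only makes the two steps explicit.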

What's available vs. what you need to run

| Domain | Models (sample evals) | Agent Architectures (sample evals) | What to run yourself |
| --- | --- | --- | --- |
| Excytin | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | React, Copilot, Claude Code (Sonnet) | New models, new agents on non-Sonnet models |
| CyBench | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | React, Copilot, Claude Code (Sonnet) | New models, new agents, more challenges |
| CTI Realm | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | Copilot, Claude Code (Sonnet) | React for non-Sonnet models, new models |

Notebook Inventory

| Notebook | Scope | When to use |
| --- | --- | --- |
| notebooks/eval_analysis.ipynb | Domain-agnostic template | Analyzing any single domain's model results; starting point for new domains |
| notebooks/excytin_analysis.ipynb | Excytin-specific (self-contained) | Full analysis of Excytin incident response domain |
| notebooks/cybench_analysis.ipynb | CyBench-specific (self-contained) | Full analysis of CyBench CTF challenges |
| notebooks/cti_realm_analysis.ipynb | CTI Realm-specific (self-contained) | Full analysis of CTI Realm threat intel domain |
| notebooks/agent_architecture_analysis.ipynb | Agent comparison template | Comparing agent architectures (React, Copilot, Claude Code) on any domain |
| notebooks/excytin_agent_architecture_analysis.ipynb | Excytin agent comparison | Agent architecture analysis specific to Excytin |
| notebooks/cybench_agent_architecture_analysis.ipynb | CyBench agent comparison | Agent architecture analysis specific to CyBench |
| notebooks/cti_realm_agent_architecture_analysis.ipynb | CTI Realm agent comparison | Agent architecture analysis specific to CTI Realm |
| notebooks/aggregate_model_analysis.ipynb | Cross-domain model comparison | Aggregate model analysis across all domains (normalized) |
| notebooks/aggregate_agent_architecture_analysis.ipynb | Cross-domain agent comparison | Aggregate agent architecture analysis across all domains |
| notebooks/model_safety_filters_analysis.ipynb | Safety filter investigation | Analyzing model safety refusals and guardrail triggers |
| notebooks/excytin_basic_analysis.ipynb | Quick Excytin analysis | Lightweight Excytin analysis for rapid iteration |

Shared Analysis Library

The notebooks/saber_analysis/ package provides reusable utilities:

| Module | Functions | Purpose |
| --- | --- | --- |
| data_loader.py | ensure_eval_files(), load_eval_logs(), load_baseline_logs(), load_trajectory_data() | Standardized .eval parsing; HuggingFace fallback downloads; typed inspect_ai log API |
| cost.py | calc_cost(), extract_cost_rows() | Token→dollar conversion using per-model pricing |
| plots.py | setup_plotting(), make_legend_patches(), classify_score_type() | Consistent matplotlib styling across all notebooks |
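
To make the token→dollar conversion concrete, here is a minimal sketch. It does not reproduce calc_cost(): the function name estimate_cost, the made-up rates, and the assumption that prices are quoted in USD per one million tokens are all illustrative. The four pricing keys follow the PRICING layout used in the notebook configuration.

```python
# Hypothetical pricing entry. The model ID and every rate are invented, and
# the unit (USD per 1,000,000 tokens) is an assumption.
EXAMPLE_PRICING = {
    "example/model-a": {
        "input": 3.0,
        "output": 15.0,
        "cache_read": 0.3,
        "cache_write": 3.75,
    },
}


def estimate_cost(usage: dict, prices: dict) -> float:
    """Multiply each token count by its rate and sum the results.

    Assumes the token categories are disjoint (cached tokens are not also
    counted as input tokens). Missing categories count as zero tokens.
    """
    return sum(usage.get(kind, 0) * rate / 1_000_000 for kind, rate in prices.items())


usage = {"input": 2_000_000, "output": 100_000, "cache_read": 1_000_000, "cache_write": 0}
cost = estimate_cost(usage, EXAMPLE_PRICING["example/model-a"])
print(f"estimated cost: ${cost:.2f}")
```

Whether reasoning tokens are billed at the output rate or tracked separately depends on the provider, so that category is deliberately left out of the sketch.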

Data Pipeline

.eval ZIP archives
    │
    ├── header.json          → run metadata, overall saber_overall scores
    ├── samples/*.json       → per-sample messages, scores, timing, tool calls
    └── summaries.json       → aggregated score summaries
    │
    ▼
inspect_ai.log API
    ├── read_eval_log(path, header_only=True)       → EvalLog with results.scores
    ├── read_eval_log_sample_summaries(path)         → per-sample scores & timing
    └── read_eval_log_sample(path, id=...)           → full messages & tool calls
    │
    ▼
saber_analysis library
    ├── ensure_eval_files()  → locates locally or downloads from HuggingFace
    ├── load_eval_logs()     → parses into DataFrames (overall_df, samples_df, subtasks_df)
    ├── load_trajectory_data() → extracts tool calls, steps, timing per sample
    └── extract_cost_rows()  → builds cost DataFrame from token usage + pricing
    │
    ▼
Notebook Configuration
    ├── EVAL_LOGS: dict[str, str]      → {display_name: filename.eval}
    ├── COLORS: dict[str, str]         → {display_name: hex_color}
    ├── PRICING: dict[str, dict]       → {api_model_id: {input, output, cache_read, cache_write}}
    ├── GROUP_FN: Callable             → domain-specific sample ID → group label
    └── NO_THINKING_LOGS (optional)    → baseline without extended reasoning
    │
    ▼
Analysis Cells (Experiments 1–14+)
    │
    ▼
Artifacts saved to notebooks/artifacts/{domain}/*.png
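
Because .eval files are ZIP archives, the first stage of the pipeline can be inspected with the standard library alone. The sketch below is for orientation and debugging only; the helper name peek_eval_archive is hypothetical, the member names are taken from the diagram above, and the typed inspect_ai.log API remains the route the notebooks use.

```python
import json
import zipfile


def peek_eval_archive(path):
    """Return (header, sample_members) from a .eval ZIP archive.

    `header` is the parsed header.json (or None if the member is absent) and
    `sample_members` is the sorted list of samples/*.json member names.
    """
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        header = json.loads(archive.read("header.json")) if "header.json" in names else None
        samples = sorted(
            name for name in names if name.startswith("samples/") and name.endswith(".json")
        )
    return header, samples
```

This is handy for checking that a downloaded file is intact before handing it to load_eval_logs(), without parsing every sample.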

Key DataFrames

| DataFrame | Columns | Source |
| --- | --- | --- |
| overall_df | model, mean, stderr | load_eval_logs() (header-level saber_overall) |
| samples_df | model, sample_id, group, score | load_eval_logs() (per-sample saber_overall) |
| subtasks_df | model, sample_id, group, score_type, score | load_eval_logs() (submission/checkpoint/aggregate breakdown) |
| traj_df | model, sample_id, n_tool_calls, n_steps, total_time, tool_counts | load_trajectory_data() (agent behavior data) |
| cost_df | model, score, input_tokens, output_tokens, cache_read, cache_write, reasoning_tokens, total_cost, cost_per_sample | extract_cost_rows() |
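
The mean and stderr columns of overall_df are read from the log header, but it can help to see how they relate to per-sample rows shaped like samples_df. The following is a sketch under assumptions: summarize_scores is a hypothetical name, rows are plain dicts rather than a DataFrame, and stderr is taken to be the sample standard deviation divided by the square root of n, which may differ from how the header value was produced.

```python
import math
import statistics
from collections import defaultdict


def summarize_scores(sample_rows):
    """Collapse per-sample rows into one {model, mean, stderr} row per model.

    Each input row needs a "model" and a "score" key. A model with a single
    sample gets stderr 0.0 here; treating it as undefined would also be
    reasonable.
    """
    by_model = defaultdict(list)
    for row in sample_rows:
        by_model[row["model"]].append(row["score"])

    summary = []
    for model, scores in by_model.items():
        n = len(scores)
        stderr = statistics.stdev(scores) / math.sqrt(n) if n > 1 else 0.0
        summary.append({"model": model, "mean": statistics.fmean(scores), "stderr": stderr})
    return summary
```

Comparing this recomputed summary with overall_df is a quick sanity check that no samples were dropped while loading.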

Domain-Agnostic Analysis Types (Experiments 1–14)

These 14 core analyses apply to any SABER domain. They appear in eval_analysis.ipynb and every domain-specific notebook.

Experiment 1: Overall Reward Distribution

  • Metric: Mean saber_overall score per model
  • Visualization: Bar chart with error bars (stderr) + optional hatched overlay for reasoning delta
  • Interpretation:
    • Higher bars = better performance
    • Overlapping error bars = the difference may not be statistically significant (a visual heuristic, not a formal test)
    • Reasoning delta (hatched) shows impact of extended thinking (positive = reasoning helps)
  • Code pattern:
    overall_df.plot.bar(x='model', y='mean', yerr='stderr')
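
The reasoning delta and the error-bar heuristic from Experiment 1 can be written out as two small functions. Both names are hypothetical, and the inputs are plain dicts and floats rather than the notebook's DataFrames.

```python
def reasoning_delta(with_reasoning, without_reasoning):
    """Per-model score change from extended thinking (positive = reasoning helps).

    Both arguments map display name -> mean score. Models missing from the
    no-reasoning baseline are skipped.
    """
    return {
        model: with_reasoning[model] - without_reasoning[model]
        for model in with_reasoning
        if model in without_reasoning
    }


def error_bars_overlap(mean_a, stderr_a, mean_b, stderr_b):
    """True when the mean ± stderr intervals of two models intersect."""
    return abs(mean_a - mean_b) <= stderr_a + stderr_b
```

Overlap of one-stderr bars is a rough visual check only. A paired comparison on per-sample scores gives a firmer answer when two models are close.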
    

Experiment 2: Per-Group Breakdown (Domain-Specific Grouping)

  • Metric: Scores segmented by GROUP_FN dimension
  • Visualization: Clustered bar chart (models × groups)
  • Group functions by domain:
    • Excytin: '_'.join(id.split('_')[:2]) → 8 security incidents
    • CyBench: '_'.join(id.split('_')[:-2]) → challenge types
    • CTI Realm: id.split('_')[0] → 3 platforms (Linux, AKS, Cloud)
  • Interpretation:
    • Large within-group variance = model-dependent difficulty
    • Shifting model rankings across groups = no single dominant model
    • Uniform clusters = inherently hard/easy groups
  • Requires: Setting GROUP_FN in configuration; set GROUP_FN = None to skip
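
Written out as named functions, the three group expressions above behave as follows. The sample IDs in the usage lines are invented for illustration; real IDs in each domain may be shaped differently.

```python
def excytin_group(sample_id: str) -> str:
    """Keep the first two underscore-separated parts."""
    return "_".join(sample_id.split("_")[:2])


def cybench_group(sample_id: str) -> str:
    """Drop the last two underscore-separated parts."""
    return "_".join(sample_id.split("_")[:-2])


def cti_realm_group(sample_id: str) -> str:
    """Keep only the part before the first underscore."""
    return sample_id.split("_")[0]


# Hypothetical sample IDs, chosen only to show the slicing.
print(excytin_group("incident_5_q12"))     # incident_5
print(cybench_group("crypto_rsa_easy_1"))  # crypto_rsa
print(cti_realm_group("linux_0042"))       # linux
```

Note that the CyBench expression returns an empty string for IDs with fewer than three parts, so a new domain's GROUP_FN is worth checking against a few real sample IDs before plotting.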

Experiment 3: Cost Analysis

  • Metrics: Total cost ($), cost per sample, token breakdown (input/output/cache/reasoning)
  • Visualizations:
    1. Total cost bar chart
    2. Cost per sample bar chart
    3. Pareto frontier (Score vs. Cost — upper-left corner = best value)
  • Interpretation:
    • Pareto-dominant models: no other model is both cheaper AND better
    • Arrows on Pareto chart show reasoning cost delta
    • Diminishing returns visible from frontier plateau
  • Code pattern:
    cost_df = pd.DataFrame(extract_cost_rows(EVAL_LOGS
    

Content truncated.
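
The Pareto frontier from Experiment 3 can be computed with a short dominance check. This sketch is not taken from the notebooks: pareto_frontier is a hypothetical name, the input is a list of (model, cost, score) tuples, and it uses the standard definition in which a model is dominated when another is at least as cheap and at least as good, and strictly better on one of the two.

```python
def pareto_frontier(points):
    """Return the names of non-dominated models, in input order.

    `points` is a list of (name, cost, score) tuples. Lower cost and higher
    score are both better.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for other, c, s in points
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Invented numbers: C costs more than B but scores lower, and D costs more
# than A but scores lower, so only A and B remain on the frontier.
example = [("A", 1.0, 0.5), ("B", 2.0, 0.7), ("C", 3.0, 0.6), ("D", 1.5, 0.4)]
print(pareto_frontier(example))
```

Sorting the frontier by cost before plotting gives the step line whose flattening indicates diminishing returns.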
