eval-analysis

Name: eval-analysis
Author: microsoft

Install

mkdir -p .claude/skills/eval-analysis && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15878" && unzip -o skill.zip -d .claude/skills/eval-analysis && rm skill.zip

Installs to .claude/skills/eval-analysis

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Comprehensive guide for analyzing SABER evaluation results — model comparison, agent architecture comparison, domain-specific analysis, and cross-domain aggregate analysis. Use this when asked to analyze eval results, compare models, generate visualizations, interpret scores, investigate cost-efficiency trade-offs, or extend analysis to new domains.


About this skill

Eval Analysis Skill

Overview

SABER evaluation analysis is a structured data science framework for comparing AI agent performance across cybersecurity benchmark domains. The analysis pipeline turns .eval log files into actionable insights through 14 standardized experiments plus domain-specific analyses.

Getting Started: Sample Eval Data

No evals to analyze yet? The repo ships with pre-run sample .eval files and can auto-download more from HuggingFace.

Option 1: Use bundled sample evals (fastest)

The eval_samples/ directory contains pre-run evaluation results for all 3 domains across 5 models and 3 agent architectures:

eval_samples/
├── excytin/          # 20 eval files (5 models × default + reasoning baselines + 3 agents)
├── cybench/          # 17 eval files
└── cti_realm/        # 11 eval files

Models included: Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4, GPT-5.4-mini
Agent architectures included: React, GitHub Copilot, Claude Code (for Sonnet 4.6)
Baselines: No-reasoning/no-thinking variants for extended thinking comparison

The notebooks are pre-configured to load from eval_samples/. Just run any notebook — no configuration needed.

Option 2: Auto-download from HuggingFace

If .eval files are missing locally, ensure_eval_files() automatically downloads them from the AcesEvals HuggingFace dataset:

from saber_analysis import ensure_eval_files

ensure_eval_files(
    eval_logs=EVAL_LOGS,       # {display_name: filename.eval}
    log_dir=LOG_DIR,           # local directory
    domain='excytin',          # HuggingFace subfolder
    fallback_dirs=[            # check these before downloading
        str(REPO_ROOT / 'eval_samples' / 'excytin'),
        str(REPO_ROOT / 'latest_experiments' / 'excytin'),
    ],
)

Resolution order: local log_dir → fallback_dirs → HuggingFace download
HuggingFace repo: anandmudgerikar/AcesEvals (dataset type)
HuggingFace path: latest_experiment_samples/{domain}/{filename}.eval
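
The resolution order above can be sketched with the standard library. This is an illustrative stand-in, not the actual ensure_eval_files() implementation: resolve_eval_file is a hypothetical helper, and the download step is represented only by returning None.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_eval_file(
    filename: str,
    log_dir: str,
    fallback_dirs: Iterable[str] = (),
) -> Optional[Path]:
    """Return the first existing copy of `filename`, or None if a download is needed.

    Hypothetical helper that mirrors the documented order:
    log_dir first, then each fallback directory in the order given.
    """
    for directory in [log_dir, *fallback_dirs]:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    # Nothing found locally: a real implementation would download from HuggingFace here.
    return None
```

A caller would treat a None result as the signal to fetch latest_experiment_samples/{domain}/{filename} from the dataset repo.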

Option 3: Run your own evaluations

If you need results for a model or agent architecture not in the sample evals:

# Run for a new model
uv run inspect eval domains/<domain> --model <api/model-id> --display plain

# Run for a specific agent architecture
uv run inspect eval domains/<domain> --model <api/model-id> -T agent=react --display plain
uv run inspect eval domains/<domain> --model <api/model-id> -T agent=copilot --display plain

# Run with extended reasoning disabled (for baseline comparison)
# Set INSPECT_EVAL_MODEL_ARGS=reasoning_effort=none or configure in .env

Then copy the .eval file to eval_samples/<domain>/ or latest_experiments/<domain>/ and add it to the notebook's EVAL_LOGS configuration.

What's available vs. what you need to run

| Domain | Models (sample evals) | Agent architectures (sample evals) | What to run yourself |
| --- | --- | --- | --- |
| Excytin | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | React, Copilot, Claude Code (Sonnet) | New models, new agents on non-Sonnet models |
| CyBench | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | React, Copilot, Claude Code (Sonnet) | New models, new agents, more challenges |
| CTI Realm | Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.4-mini | Copilot, Claude Code (Sonnet) | React for non-Sonnet models, new models |

Notebook Inventory

| Notebook | Scope | When to use |
| --- | --- | --- |
| `notebooks/eval_analysis.ipynb` | Domain-agnostic template | Analyzing any single domain's model results; starting point for new domains |
| `notebooks/excytin_analysis.ipynb` | Excytin-specific (self-contained) | Full analysis of Excytin incident response domain |
| `notebooks/cybench_analysis.ipynb` | CyBench-specific (self-contained) | Full analysis of CyBench CTF challenges |
| `notebooks/cti_realm_analysis.ipynb` | CTI Realm-specific (self-contained) | Full analysis of CTI Realm threat intel domain |
| `notebooks/agent_architecture_analysis.ipynb` | Agent comparison template | Comparing agent architectures (React, Copilot, Claude Code) on any domain |
| `notebooks/excytin_agent_architecture_analysis.ipynb` | Excytin agent comparison | Agent architecture analysis specific to Excytin |
| `notebooks/cybench_agent_architecture_analysis.ipynb` | CyBench agent comparison | Agent architecture analysis specific to CyBench |
| `notebooks/cti_realm_agent_architecture_analysis.ipynb` | CTI Realm agent comparison | Agent architecture analysis specific to CTI Realm |
| `notebooks/aggregate_model_analysis.ipynb` | Cross-domain model comparison | Aggregate model analysis across all domains (normalized) |
| `notebooks/aggregate_agent_architecture_analysis.ipynb` | Cross-domain agent comparison | Aggregate agent architecture analysis across all domains |
| `notebooks/model_safety_filters_analysis.ipynb` | Safety filter investigation | Analyzing model safety refusals and guardrail triggers |
| `notebooks/excytin_basic_analysis.ipynb` | Quick Excytin analysis | Lightweight Excytin analysis for rapid iteration |

Shared Analysis Library

The notebooks/saber_analysis/ package provides reusable utilities:

| Module | Functions | Purpose |
| --- | --- | --- |
| `data_loader.py` | `ensure_eval_files()`, `load_eval_logs()`, `load_baseline_logs()`, `load_trajectory_data()` | Standardized .eval parsing; HuggingFace fallback downloads; typed inspect_ai log API |
| `cost.py` | `calc_cost()`, `extract_cost_rows()` | Token→dollar conversion using per-model pricing |
| `plots.py` | `setup_plotting()`, `make_legend_patches()`, `classify_score_type()` | Consistent matplotlib styling across all notebooks |
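
To make the token→dollar conversion concrete, here is a minimal sketch. It is not the actual calc_cost() from cost.py: estimate_cost is a hypothetical name, the prices are assumed to be USD per million tokens, and the price and usage numbers are made up. Only the four key names (input, output, cache_read, cache_write) come from the PRICING description in the notebook configuration.

```python
from typing import Mapping

TOKENS_PER_MILLION = 1_000_000
TOKEN_KINDS = ("input", "output", "cache_read", "cache_write")


def estimate_cost(usage: Mapping[str, int], prices: Mapping[str, float]) -> float:
    """Dollar cost of one run, assuming `prices` are USD per million tokens.

    Token kinds missing from either mapping contribute nothing.
    """
    return sum(
        usage.get(kind, 0) * prices.get(kind, 0.0) / TOKENS_PER_MILLION
        for kind in TOKEN_KINDS
    )


# Made-up prices and token counts, for illustration only.
example_prices = {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75}
example_usage = {"input": 2_000_000, "output": 100_000, "cache_read": 1_000_000}
print(f"${estimate_cost(example_usage, example_prices):.2f}")
```

Dividing the total by the number of samples would give a per-sample figure comparable to the cost_per_sample column.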

Data Pipeline

.eval ZIP archives
    │
    ├── header.json          → run metadata, overall saber_overall scores
    ├── samples/*.json       → per-sample messages, scores, timing, tool calls
    └── summaries.json       → aggregated score summaries
    │
    ▼
inspect_ai.log API
    ├── read_eval_log(path, header_only=True)       → EvalLog with results.scores
    ├── read_eval_log_sample_summaries(path)         → per-sample scores & timing
    └── read_eval_log_sample(path, id=...)           → full messages & tool calls
    │
    ▼
saber_analysis library
    ├── ensure_eval_files()  → locates locally or downloads from HuggingFace
    ├── load_eval_logs()     → parses into DataFrames (overall_df, samples_df, subtasks_df)
    ├── load_trajectory_data() → extracts tool calls, steps, timing per sample
    └── extract_cost_rows()  → builds cost DataFrame from token usage + pricing
    │
    ▼
Notebook Configuration
    ├── EVAL_LOGS: dict[str, str]      → {display_name: filename.eval}
    ├── COLORS: dict[str, str]         → {display_name: hex_color}
    ├── PRICING: dict[str, dict]       → {api_model_id: {input, output, cache_read, cache_write}}
    ├── GROUP_FN: Callable             → domain-specific sample ID → group label
    └── NO_THINKING_LOGS (optional)    → baseline without extended reasoning
    │
    ▼
Analysis Cells (Experiments 1–14+)
    │
    ▼
Artifacts saved to notebooks/artifacts/{domain}/*.png
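
Because a .eval file is a ZIP archive, its layout can be checked with the standard library before any analysis code runs. The sketch below assumes the member names shown in the diagram (header.json, summaries.json, samples/*.json). list_eval_members is a hypothetical helper, not part of saber_analysis, and the inspect_ai.log readers remain the better route for real parsing.

```python
import zipfile
from typing import Dict, List


def list_eval_members(path: str) -> Dict[str, List[str]]:
    """Group the members of a .eval ZIP archive into sample files and everything else."""
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    samples = [n for n in names if n.startswith("samples/") and n.endswith(".json")]
    other = [n for n in names if n not in samples]
    return {"samples": sorted(samples), "other": sorted(other)}
```

An empty "samples" list is a quick hint that a run was interrupted or that the file is not a complete log.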

Key DataFrames

| DataFrame | Columns | Source |
| --- | --- | --- |
| `overall_df` | model, mean, stderr | `load_eval_logs()` (header-level saber_overall) |
| `samples_df` | model, sample_id, group, score | `load_eval_logs()` (per-sample saber_overall) |
| `subtasks_df` | model, sample_id, group, score_type, score | `load_eval_logs()` (submission/checkpoint/aggregate breakdown) |
| `traj_df` | model, sample_id, n_tool_calls, n_steps, total_time, tool_counts | `load_trajectory_data()` (agent behavior data) |
| `cost_df` | model, score, input_tokens, output_tokens, cache_read, cache_write, reasoning_tokens, total_cost, cost_per_sample | `extract_cost_rows()` |
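
The mean and stderr columns of overall_df relate to the per-sample scores in samples_df roughly as sketched below. This is an assumption-laden illustration: it takes stderr to be the sample standard deviation divided by the square root of the sample count, whereas the real values are read from the eval header and may be computed differently. summarize_scores and the toy scores are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Tuple


def summarize_scores(scores_by_model: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each model to (mean, standard error of the mean) of its per-sample scores."""
    summary = {}
    for model, scores in scores_by_model.items():
        mean = statistics.fmean(scores)
        # A standard error needs at least two samples; report 0.0 otherwise.
        if len(scores) > 1:
            stderr = statistics.stdev(scores) / math.sqrt(len(scores))
        else:
            stderr = 0.0
        summary[model] = (mean, stderr)
    return summary


# Toy scores, not real results.
toy_scores = {"model-a": [1.0, 0.0, 1.0, 1.0], "model-b": [0.5, 0.5, 0.5, 0.5]}
for model, (mean, stderr) in summarize_scores(toy_scores).items():
    print(f"{model}: mean={mean:.3f} stderr={stderr:.3f}")
```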

Domain-Agnostic Analysis Types (Experiments 1–14)

These 14 core analyses apply to any SABER domain. They appear in eval_analysis.ipynb and every domain-specific notebook.

Experiment 1: Overall Reward Distribution

Metric: Mean saber_overall score per model
Visualization: Bar chart with error bars (stderr) + optional hatched overlay for reasoning delta
Interpretation:
- Higher bars = better performance
- Overlapping error bars = the difference is probably not statistically significant (a rough visual check, not a formal test)
- Reasoning delta (hatched) shows impact of extended thinking (positive = reasoning helps)

Code pattern:

overall_df.plot.bar(x='model', y='mean', yerr='stderr')
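
The two quantities read off this chart, the reasoning delta and the error-bar overlap, can be sketched as plain functions. Both helpers are hypothetical and operate on per-model means and standard errors such as those in overall_df. The overlap check is the same visual heuristic as on the chart and does not replace a statistical test.

```python
from typing import Dict


def reasoning_delta(with_thinking: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Mean score with extended thinking minus the no-thinking baseline, per model.

    A positive value means reasoning helped. Models missing from either
    mapping are skipped.
    """
    return {
        model: with_thinking[model] - baseline[model]
        for model in with_thinking
        if model in baseline
    }


def error_bars_overlap(mean_a: float, stderr_a: float, mean_b: float, stderr_b: float) -> bool:
    """True if the intervals mean ± stderr of two models overlap."""
    return abs(mean_a - mean_b) <= stderr_a + stderr_b
```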

Experiment 2: Per-Group Breakdown (Domain-Specific Grouping)

Metric: Scores segmented by GROUP_FN dimension
Visualization: Clustered bar chart (models × groups)
Group functions by domain:
- Excytin: '_'.join(id.split('_')[:2]) → 8 security incidents
- CyBench: '_'.join(id.split('_')[:-2]) → challenge types
- CTI Realm: id.split('_')[0] → 3 platforms (Linux, AKS, Cloud)
Interpretation:
- Large within-group variance = model-dependent difficulty
- Shifting model rankings across groups = no single dominant model
- Uniform clusters = inherently hard/easy groups
Requires: Setting GROUP_FN in configuration; set GROUP_FN = None to skip
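
Written out as functions, the three grouping expressions above look like this. The function names and the sample IDs used here are hypothetical, and real sample IDs may follow a different pattern. The split logic itself is copied from the expressions listed for each domain.

```python
from typing import Callable, Dict


def excytin_group(sample_id: str) -> str:
    """Keep the first two underscore-separated parts of the ID."""
    return "_".join(sample_id.split("_")[:2])


def cybench_group(sample_id: str) -> str:
    """Drop the last two underscore-separated parts of the ID."""
    return "_".join(sample_id.split("_")[:-2])


def cti_realm_group(sample_id: str) -> str:
    """Keep only the first underscore-separated part of the ID."""
    return sample_id.split("_")[0]


# Lookup table so a notebook could pick GROUP_FN by domain name.
GROUP_FNS: Dict[str, Callable[[str], str]] = {
    "excytin": excytin_group,
    "cybench": cybench_group,
    "cti_realm": cti_realm_group,
}

# Hypothetical sample IDs, for illustration only.
for domain, sample_id in [
    ("excytin", "incident_5_q3"),
    ("cybench", "crypto_rsa_easy_1"),
    ("cti_realm", "linux_0042"),
]:
    print(domain, "->", GROUP_FNS[domain](sample_id))
```

Note that the CyBench expression returns an empty string for IDs with fewer than three underscore-separated parts, so short IDs would all land in one unnamed group.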

Experiment 3: Cost Analysis

Metrics: Total cost ($), cost per sample, token breakdown (input/output/cache/reasoning)
Visualizations:
1. Total cost bar chart
2. Cost per sample bar chart
3. Pareto frontier (Score vs. Cost — upper-left corner = best value)
Interpretation:
- Pareto-optimal (non-dominated) models: no other model is both cheaper AND better
- Arrows on Pareto chart show reasoning cost delta
- Diminishing returns visible from frontier plateau
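
The frontier in the third chart can be computed directly from (cost, score) pairs. The sketch below is a generic Pareto filter, not code from the notebooks. pareto_frontier is a hypothetical helper and the toy numbers are invented.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the models not dominated on (cost, score), sorted by cost.

    A model is dominated if some other model is at least as cheap and at
    least as good, and strictly better on one of the two.
    """
    frontier = []
    for name, (cost, score) in points.items():
        dominated = any(
            o_cost <= cost and o_score >= score and (o_cost < cost or o_score > score)
            for other, (o_cost, o_score) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


# Toy (total cost in dollars, mean score) pairs, not real results.
toy_points = {
    "small": (1.0, 0.40),
    "medium": (4.0, 0.55),
    "large": (10.0, 0.60),
    "pricey": (12.0, 0.50),
}
print(pareto_frontier(toy_points))
```

In the toy data, "pricey" drops out because "medium" is both cheaper and better. The small score gain from "medium" to "large" at more than twice the cost is the kind of plateau the diminishing-returns note refers to.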

Code pattern:

cost_df = pd.DataFrame(extract_cost_rows(EVAL_LOGS

Content truncated.


Safety

Review before install

Runs shell / code
Reads credentials

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

3mo ago

License

MIT

Loads

~7,484 tokens
