agentskills.codes

llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Install

mkdir -p .claude/skills/llm-evaluation-h4d3zs && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13370" && unzip -o skill.zip -d .claude/skills/llm-evaluation-h4d3zs && rm skill.zip

Installs to .claude/skills/llm-evaluation-h4d3zs

About this skill

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

Do not use this skill when

  • The task is unrelated to LLM evaluation
  • The task requires a domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

  • Measuring LLM application performance systematically
  • Comparing different models or prompts
  • Detecting performance regressions before deployment
  • Validating improvements from prompt changes
  • Building confidence in production systems
  • Establishing baselines and tracking progress over time
  • Debugging unexpected model behavior
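
As an illustration of detecting regressions before deployment, the following is a minimal sketch, not part of the skill itself. It assumes each evaluation run yields a list of per-example scores in [0, 1]; the function name `detect_regression` and the `max_drop` tolerance are hypothetical choices made for this example.

```python
from statistics import mean


def detect_regression(baseline_scores, candidate_scores, max_drop=0.05):
    """Compare mean scores of two evaluation runs.

    Returns (regressed, drop), where drop is baseline mean minus candidate
    mean. max_drop is a hypothetical tolerance; tune it to your metric.
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    drop = mean(baseline_scores) - mean(candidate_scores)
    return drop > max_drop, drop


# Made-up per-example scores for a baseline prompt and a changed prompt.
baseline = [0.80, 0.90, 0.70, 0.80]
candidate = [0.60, 0.70, 0.70, 0.60]

regressed, drop = detect_regression(baseline, candidate)
print(f"regressed={regressed}, drop={drop:.2f}")
```

A mean comparison with a fixed tolerance is the simplest possible gate. With small evaluation sets, a statistical test on the paired per-example differences is usually a safer basis for blocking a deployment.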

Core Evaluation Types

🧠 Knowledge Modules (Fractal Skills)

1. Automated Metrics

2. Human Evaluation

3. LLM-as-Judge

4. BLEU Score

5. ROUGE Score

6. BERTScore

7. Custom Metrics

8. Single Output Evaluation

9. Pairwise Comparison

10. Annotation Guidelines

11. Inter-Rater Agreement

12. Statistical Testing Framework

13. Regression Detection

14. Running Benchmarks
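
To make one of the modules listed above concrete, here is a minimal sketch of inter-rater agreement using Cohen's kappa for two annotators. It is an illustration under stated assumptions rather than the skill's own implementation: the function name `cohens_kappa` and the sample labels are invented for this example, and both raters are assumed to have labeled the same items in the same order.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up annotations of four model outputs.
rater_a = ["good", "good", "bad", "bad"]
rater_b = ["good", "bad", "bad", "bad"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa={kappa:.2f}")
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one label dominates. Here the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is 0.5.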
