llm-evaluation

Name: llm-evaluation
Author: H4D3ZS

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Install

mkdir -p .claude/skills/llm-evaluation-h4d3zs && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13370" && unzip -o skill.zip -d .claude/skills/llm-evaluation-h4d3zs && rm skill.zip

Installs to .claude/skills/llm-evaluation-h4d3zs

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

About this skill

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

Do not use this skill when

The task is unrelated to LLM evaluation
You need a different domain or tool outside this scope

Instructions

Clarify goals, constraints, and required inputs.
Apply relevant best practices and validate outcomes.
Provide actionable steps and verification.
If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

Measuring LLM application performance systematically
Comparing different models or prompts
Detecting performance regressions before deployment
Validating improvements from prompt changes
Building confidence in production systems
Establishing baselines and tracking progress over time
Debugging unexpected model behavior
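As an illustration of the "detecting performance regressions before deployment" use case above, here is a minimal sketch that compares per-example scores from a baseline and a candidate run on the same test set. The function name `detect_regression` and the `tolerance` default are assumptions made for this example; they are not taken from the skill itself.

```python
from statistics import mean


def detect_regression(baseline_scores, candidate_scores, tolerance=0.02):
    """Compare per-example scores for two versions on the same test set.

    Returns (regressed, delta), where delta = mean(candidate) - mean(baseline).
    `tolerance` is a hypothetical allowed drop, not a value from the skill.
    """
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must cover the same examples")
    if not baseline_scores:
        raise ValueError("need at least one example")
    delta = mean(candidate_scores) - mean(baseline_scores)
    # Flag a regression only when the drop exceeds the allowed tolerance.
    return delta < -tolerance, delta


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    baseline = [0.80, 0.75, 0.90, 0.85]
    candidate = [0.70, 0.72, 0.88, 0.80]
    regressed, delta = detect_regression(baseline, candidate)
    print(f"delta={delta:+.3f} regressed={regressed}")
```

A mean-difference check like this is only a first gate. On small test sets a significance test over the paired scores would be a more reliable signal than a fixed threshold.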

Core Evaluation Types

🧠 Knowledge Modules (Fractal Skills)

1. Automated Metrics

2. Human Evaluation

3. LLM-as-Judge

4. BLEU Score

5. ROUGE Score

6. BERTScore

7. Custom Metrics

8. Single Output Evaluation

9. Pairwise Comparison

10. Annotation Guidelines

11. Inter-Rater Agreement

12. Statistical Testing Framework

13. Regression Detection

14. Running Benchmarks
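Two of the modules listed above can be sketched with the standard library alone: a simplified ROUGE-1 F1 (unigram overlap) for automated metrics, and Cohen's kappa for inter-rater agreement. This is a rough sketch under stated assumptions, not the skill's own implementation: the function names are hypothetical, and tokenization is plain lowercased whitespace splitting, with none of the stemming or normalization that full ROUGE implementations usually apply.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between lowercased, whitespace-tokenized strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up inputs, purely for illustration.
    print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
    print(cohens_kappa(["good", "good", "bad", "bad"],
                       ["good", "bad", "bad", "bad"]))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over simple percent agreement when annotators use skewed label distributions.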

Safety

No risk patterns found

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated: 1mo ago

Loads: ~462 tokens

Author

H4D3ZS

5 skills published
