agentskills.codes

langsmith-evaluator

INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run v

Install

mkdir -p .claude/skills/langsmith-evaluator && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15369" && unzip -o skill.zip -d .claude/skills/langsmith-evaluator && rm skill.zip

Installs to .claude/skills/langsmith-evaluator

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run v
300 chars · catalog description · ✓ has a “when” trigger · longer than Claude Code's old 250-char listing cap (fine on current versions)

About this skill

<oneliner>
Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories for evaluation; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators. Python and TypeScript examples included.
</oneliner>

<setup>

Environment Variables

LANGSMITH_API_KEY=<your_api_key_here>               # Required
LANGSMITH_PROJECT=your-project-name                   # Check this to know which project has traces
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge

IMPORTANT: Always check the environment variables or .env file for LANGSMITH_PROJECT before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one.
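
A minimal sketch of that check, using only the standard library. The helper name `resolve_langsmith_project` and the simple `KEY=value` parsing are assumptions made for illustration; `python-dotenv` handles quoting and interpolation more completely.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_langsmith_project(env_path: str = ".env") -> Optional[str]:
    """Return LANGSMITH_PROJECT from the environment, else from a .env file."""
    value = os.environ.get("LANGSMITH_PROJECT")
    if value:
        return value

    path = Path(env_path)
    if not path.is_file():
        return None

    for raw_line in path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "LANGSMITH_PROJECT":
            # Drop a trailing inline comment and any surrounding quotes.
            val = val.split(" #", 1)[0].strip().strip("'\"")
            return val or None
    return None
```

If the helper returns `None`, fall back to your best judgement about which project holds the traces, as described above.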

Python Dependencies

pip install langsmith langchain-openai python-dotenv

CLI Tool (for uploading evaluators)

pip install langsmith

JavaScript Dependencies

npm install langsmith openai
</setup>

<crucial_requirement>

Golden Rule: Inspect Before You Implement

CRITICAL: Before writing ANY evaluator or extraction logic, you MUST:

  1. Run your agent on sample inputs and capture the actual output
  2. Inspect the output - print it, query LangSmith traces, understand the exact structure
  3. Only then write code that processes that output
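
Step 2 can be sketched with a small helper that summarizes the structure of an output without dumping every value. `describe_shape` and the sample output are hypothetical, shown only to illustrate the inspection step; real agent outputs will look different.

```python
from typing import Any


def describe_shape(value: Any, max_depth: int = 3, _depth: int = 0) -> Any:
    """Summarize nested structure as type names, without dumping full values."""
    if _depth >= max_depth:
        return type(value).__name__
    if isinstance(value, dict):
        return {k: describe_shape(v, max_depth, _depth + 1) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Describe only the first element; lists are usually homogeneous.
        inner = describe_shape(value[0], max_depth, _depth + 1) if value else "empty"
        return [inner, f"len={len(value)}"]
    return type(value).__name__


# Hypothetical agent output, for illustration only.
sample_output = {
    "messages": [{"role": "assistant", "content": "Paris"}],
    "steps": 2,
}
print(describe_shape(sample_output))
```

Printing the shape first makes it clear which keys an evaluator can rely on before any extraction logic is written.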

Output structures vary significantly by framework, agent type, and configuration. Never assume the shape; always verify first. When the outputs lack the data you need, query the LangSmith traces to see where it appears in the execution and how to extract it.
</crucial_requirement>

<evaluator_format>

Offline vs Online Evaluators

Offline Evaluators (attached to datasets):

  • Function signature: (run, example) - receives both run outputs and dataset example
  • Use case: Comparing agent outputs to expected values in a dataset
  • Upload with: --dataset "Dataset Name"

Online Evaluators (attached to projects):

  • Function signature: (run) - receives only run outputs, NO example parameter
  • Use case: Real-time quality checks on production runs (no reference data)
  • Upload with: --project "Project Name"

CRITICAL - Return Format:

  • Each evaluator returns ONE metric only. For multiple metrics, create multiple evaluator functions.
  • Do NOT return {"metric_name": value} or lists of metrics - this will error.

CRITICAL - Local vs Uploaded Differences:

| | Local `evaluate()` | Uploaded to LangSmith |
|---|---|---|
| Column name | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key`; it creates a duplicate column |
| Python run type | `RunTree` object → `run.outputs` (attribute) | `dict` → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` |
| TypeScript run type | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` |
| Python return | `{"score": value, "comment": "..."}` | `{"score": value, "comment": "..."}` |
| TypeScript return | `{ key: "name", score: value, comment: "..." }` | `{ score: value, comment: "..." }` |
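
The rules in this section can be sketched together in Python: one metric per function, the `(run, example)` signature for offline use, the `(run)` signature for online use, and output access that works for both objects and dicts. The `answer` field, the 200-character threshold, and the helper `_outputs` are assumptions for illustration, not part of LangSmith.

```python
def _outputs(obj) -> dict:
    """Read .outputs from an object or ["outputs"] from a dict; never return None."""
    raw = obj.outputs if hasattr(obj, "outputs") else obj.get("outputs")
    return raw or {}


def correctness(run, example):
    """Offline: one metric, comparing the run to the dataset example."""
    matched = _outputs(run).get("answer") == _outputs(example).get("answer")
    return {"score": 1 if matched else 0, "comment": "match" if matched else "mismatch"}


def conciseness(run, example):
    """Offline: a second metric lives in its own function, not in the same dict."""
    text = str(_outputs(run).get("answer", ""))
    return {"score": 1 if len(text) <= 200 else 0, "comment": f"{len(text)} chars"}


def has_answer(run):
    """Online: no example parameter, so only the run itself is checked."""
    text = str(_outputs(run).get("answer", "")).strip()
    return {"score": 1 if text else 0, "comment": "non-empty" if text else "empty answer"}
```

Keeping the object-or-dict handling in one helper lets the same evaluator body run locally and after upload without edits.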
</evaluator_format>

<evaluator_types>

  • LLM as Judge - Uses an LLM to grade outputs. Best for subjective quality (accuracy, helpfulness, relevance).
  • Custom Code - Deterministic logic. Best for objective checks (exact match, trajectory validation, format compliance).
</evaluator_types>

<llm_judge>

LLM as Judge Evaluators

NOTE: LLM-as-Judge upload is currently not supported by the CLI — only code evaluators are supported. For evaluations against a dataset, STRONGLY PREFER defining local evaluators to use with evaluate(evaluators=[...]).

<python>
```python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class Grade(TypedDict):
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Grade, method="json_schema", strict=True)

async def accuracy_evaluator(run, example):
    # Accept both RunTree-style objects and plain dicts; fall back to {} if outputs is None
    run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    example_outputs = (example.outputs if hasattr(example, "outputs") else example.get("outputs")) or {}
    grade = await judge.ainvoke([{"role": "user", "content": f"Expected: {example_outputs}\nActual: {run_outputs}\nIs this accurate?"}])
    return {"score": 1 if grade["is_accurate"] else 0, "comment": grade["reasoning"]}
```
</python>

<typescript>
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function accuracyEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};

    const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        temperature: 0,
        response_format: { type: "json_object" },
        messages: [
            { role: "system", content: 'Respond with JSON: {"is_accurate": boolean, "reasoning": string}' },
            { role: "user", content: `Expected: ${JSON.stringify(exampleOutputs)}\nActual: ${JSON.stringify(runOutputs)}\nIs this accurate?` }
        ]
    });

    const grade = JSON.parse(response.choices[0].message.content || "{}");
    return { score: grade.is_accurate ? 1 : 0, comment: grade.reasoning };
}
```
</typescript>
</llm_judge>

<code_evaluators>

Custom Code Evaluators

Before writing an evaluator:

  1. Inspect your dataset to understand expected field names (see Golden Rule above)
  2. Test your run function and verify its output structure matches the dataset schema
  3. Query LangSmith traces to debug any mismatches

<python>
```python
def trajectory_evaluator(run, example):
    # Accept both RunTree-style objects and plain dicts; fall back to {} if outputs is None
    run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    example_outputs = (example.outputs if hasattr(example, "outputs") else example.get("outputs")) or {}
    # IMPORTANT: Replace these placeholders with your actual field names
    # 1. Query your LangSmith trace to see what fields exist in run outputs
    # 2. Check your dataset schema for expected field names
    # Note: Trajectory data may not appear in default output - verify against trace!
    actual = run_outputs.get("YOUR_TRAJECTORY_FIELD", [])
    expected = example_outputs.get("YOUR_EXPECTED_FIELD", [])
    return {"score": 1 if actual == expected else 0, "comment": f"Expected {expected}, got {actual}"}
```
</python>

<typescript>
```javascript
function trajectoryEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};
    // IMPORTANT: Replace these placeholders with your actual field names
    // 1. Query your LangSmith trace to see what fields exist in run outputs
    // 2. Check your dataset schema for expected field names
    const actual = runOutputs.YOUR_TRAJECTORY_FIELD ?? [];
    const expected = exampleOutputs.YOUR_EXPECTED_FIELD ?? [];
    const match = JSON.stringify(actual) === JSON.stringify(expected);
    return { score: match ? 1 : 0, comment: `Expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}` };
}
```
</typescript>
</code_evaluators>

<run_functions>

Defining Run Functions

Run functions execute your agent and return outputs for evaluation.

CRITICAL - Test Your Run Function First: Before writing evaluators, you MUST test your run function and inspect the actual output structure. Output shapes vary by framework, agent type, and configuration.

Debugging workflow:

  1. Run your agent once on sample input
  2. Query the trace to see the execution structure
  3. Print the raw output and verify against the trace to ensure the output contains the right data
  4. Adjust the run function as needed
  5. Verify your output matches your dataset schema
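
Step 5 can be sketched as a key-level comparison between what the run function returns and what a dataset example stores. `schema_diff` and both sample dicts are hypothetical; the check only looks at top-level keys and Python types, so nested mismatches still need a manual look.

```python
def schema_diff(run_output: dict, example_outputs: dict) -> dict:
    """Compare top-level keys of a run function's output with a dataset example."""
    produced, expected = set(run_output), set(example_outputs)
    return {
        # Keys the dataset has but the run function did not return.
        "missing": sorted(expected - produced),
        # Keys the run function returned that the dataset does not know.
        "unexpected": sorted(produced - expected),
        # Shared keys whose values have different Python types.
        "type_mismatch": sorted(
            k for k in produced & expected
            if type(run_output[k]) is not type(example_outputs[k])
        ),
    }


# Hypothetical shapes, for illustration only.
run_output = {"output": "Paris", "trajectory": ["search", "answer"]}
example_outputs = {"answer": "Paris", "trajectory": "search,answer"}
print(schema_diff(run_output, example_outputs))
```

An empty result in all three lists suggests that a plain equality evaluator will work; anything else points at a field to rename or reshape in the run function.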

Try your hardest to match your run function output to your dataset schema. This makes evaluators simple and reusable. If matching isn't possible, your evaluator must know how to extract and compare the right fields from each side.

<python>
```python
def run_agent(inputs: dict) -> dict:
    result = your_agent.run(inputs)
    # ALWAYS inspect output shape first - run this, check the print, query traces
    print(f"DEBUG - type: {type(result)}, keys: {result.keys() if hasattr(result, 'keys') else 'N/A'}")
    print(f"DEBUG - value: {result}")
    return {"output": result}  # Adjust to match your dataset schema
```
</python>

<typescript>
```javascript
async function runAgent(inputs) {
    const result = await yourAgent.invoke(inputs);
    // ALWAYS inspect output shape first
    console.log("DEBUG - type:", typeof result, "keys:", Object.keys(result));
    console.log("DEBUG - value:", result);
    return { output: result };  // Adjust to match your dataset schema
}
```
</typescript>
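
Once the shape is known, "adjust to match your dataset schema" might look like the sketch below. Everything here is an assumption for illustration: `fake_agent` stands in for a real agent, the message-list result shape is invented, and the `answer` and `trajectory` keys are hypothetical dataset fields. Verify the real shape against a trace before adapting it.

```python
def fake_agent(inputs: dict) -> dict:
    """Stand-in for a real agent; returns a message-list style result."""
    return {
        "messages": [
            {"role": "user", "content": inputs["question"]},
            {"role": "assistant", "content": "", "tool_calls": [{"name": "search"}]},
            {"role": "assistant", "content": "Paris"},
        ]
    }


def run_agent(inputs: dict) -> dict:
    """Return outputs keyed the way the (assumed) dataset stores them."""
    result = fake_agent(inputs)
    messages = result.get("messages", [])
    # Collect tool names in call order; messages without tool calls contribute nothing.
    tool_names = [
        call["name"]
        for message in messages
        for call in message.get("tool_calls", [])
    ]
    final_answer = messages[-1]["content"] if messages else ""
    return {"answer": final_answer, "trajectory": tool_names}
```

With the run function returning the dataset's own field names, the evaluator can compare fields directly instead of carrying extraction logic.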

Capturing Trajectories

For trajectory evaluation, your run function must capture tool calls during execution.

CRITICAL: Run output formats vary significantly by framework and agent type. You MUST inspect before implementing:

LangGraph agents (LangChain OSS): Use stream_mode="debug" with subgraphs=True to capture nested subagent tool calls.

import uuid

def run_agent_with_trajectory(agent, inputs: dict) -> dict:
    config = {"configurable": {"thread_id": f"eval-{uuid.uuid4()}"}}
    trajectory = []
    final_result = None

    for chunk in agent.stream(inputs, config=config, s

---

*Content truncated.*
