langsmith-evaluator

Name: langsmith-evaluator
Author: dhar174

INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run v

Install

mkdir -p .claude/skills/langsmith-evaluator && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15369" && unzip -o skill.zip -d .claude/skills/langsmith-evaluator && rm skill.zip

Installs to .claude/skills/langsmith-evaluator

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run v


About this skill

<oneliner>
Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories for evaluation; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators. Python and TypeScript examples included.
</oneliner>

<setup>

Environment Variables

LANGSMITH_API_KEY=<your_api_key_here>               # Required
LANGSMITH_PROJECT=your-project-name                   # Check this to know which project has traces
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge

IMPORTANT: Always check the environment variables or .env file for LANGSMITH_PROJECT before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one.
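
The project check described above can be made explicit with a small lookup that prefers the process environment and falls back to a `.env` file. This is a minimal sketch: `resolve_langsmith_project` and its simple `KEY=value` parsing are illustrative assumptions, not part of the LangSmith SDK, and `python-dotenv` handles more `.env` syntax than this does.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_langsmith_project(
    env: Optional[Mapping[str, str]] = None,
    dotenv_path: str = ".env",
) -> Optional[str]:
    """Return LANGSMITH_PROJECT from the environment, else from a .env file."""
    env = os.environ if env is None else env
    project = env.get("LANGSMITH_PROJECT", "").strip()
    if project:
        return project

    path = Path(dotenv_path)
    if not path.is_file():
        return None
    for line in path.read_text().splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "LANGSMITH_PROJECT":
            # Drop an inline comment and surrounding quotes, if any.
            value = value.split("#", 1)[0].strip().strip("'\"")
            return value or None
    return None
```

A `None` result is the cue to fall back on judgement, for example by listing the available projects and picking the most plausible one.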

Python Dependencies

pip install langsmith langchain-openai python-dotenv

CLI Tool (for uploading evaluators)

pip install langsmith

JavaScript Dependencies

npm install langsmith openai

</setup>

<crucial_requirement>

Golden Rule: Inspect Before You Implement

CRITICAL: Before writing ANY evaluator or extraction logic, you MUST:

Run your agent on sample inputs and capture the actual output
Inspect the output - print it, query LangSmith traces, understand the exact structure
Only then write code that processes that output
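
For the inspection step, printing a type-level summary is often easier to read than the raw output. The helper below is a minimal sketch; `describe_shape` is a hypothetical name, and the sample output is made up for illustration rather than taken from any particular framework.

```python
from typing import Any


def describe_shape(value: Any, max_depth: int = 3) -> Any:
    """Summarize a nested structure as type names instead of full values."""
    if max_depth == 0:
        return type(value).__name__
    if isinstance(value, dict):
        return {str(k): describe_shape(v, max_depth - 1) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # The first element is usually enough to see what a list holds.
        inner = describe_shape(value[0], max_depth - 1) if value else "empty"
        return [inner]
    return type(value).__name__


# Made-up agent output, used only to show the summary format.
sample = {"messages": [{"role": "assistant", "content": "hi"}], "steps": 2}
print(describe_shape(sample))
# {'messages': [{'role': 'str', 'content': 'str'}], 'steps': 'int'}
```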

Output structures vary significantly by framework, agent type, and configuration. Never assume the shape; always verify it first. When the output does not contain the data you need, query LangSmith traces to see how to extract it from the execution.
</crucial_requirement>

<evaluator_format>

Offline vs Online Evaluators

Offline Evaluators (attached to datasets):

Function signature: (run, example) - receives both run outputs and dataset example
Use case: Comparing agent outputs to expected values in a dataset
Upload with: --dataset "Dataset Name"

Online Evaluators (attached to projects):

Function signature: (run) - receives only run outputs, NO example parameter
Use case: Real-time quality checks on production runs (no reference data)
Upload with: --project "Project Name"
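
The two signatures can be compared side by side. This sketch assumes dict-shaped runs and examples (as uploaded evaluators receive them) and a hypothetical `answer` output field; replace it with whatever field your traces actually contain.

```python
def exact_match(run, example):
    """Offline: compare the run output to the dataset example."""
    actual = (run.get("outputs") or {}).get("answer")
    expected = (example.get("outputs") or {}).get("answer")
    return {
        "score": 1 if actual == expected else 0,
        "comment": f"expected {expected!r}, got {actual!r}",
    }


def non_empty_answer(run):
    """Online: there is no example, so only the run itself can be checked."""
    answer = (run.get("outputs") or {}).get("answer") or ""
    present = bool(answer.strip())
    return {
        "score": 1 if present else 0,
        "comment": "answer present" if present else "answer missing",
    }
```

The online evaluator can only judge properties of the output itself (presence, format, length), because there is no reference value to compare against.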

CRITICAL - Return Format:

Each evaluator returns ONE metric only. For multiple metrics, create multiple evaluator functions.
Do NOT return {"metric_name": value} or lists of metrics - this will error.
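
In practice, the one-metric rule means splitting a combined check into separate functions. A minimal sketch, assuming a hypothetical `answer` field and an arbitrary 500-character limit:

```python
MAX_CHARS = 500  # arbitrary limit chosen for illustration


def _answer(run) -> str:
    # Accept both attribute-style and dict-style runs.
    outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    return str(outputs.get("answer", ""))


def within_length(run, example):
    """Metric 1: the answer stays under the length limit."""
    n = len(_answer(run))
    return {"score": 1 if n <= MAX_CHARS else 0, "comment": f"{n} chars"}


def cites_source(run, example):
    """Metric 2: the answer contains at least one link."""
    found = "http" in _answer(run)
    return {"score": 1 if found else 0, "comment": "link found" if found else "no link"}


# Each function yields exactly one score; pass both to get two metrics.
evaluators = [within_length, cites_source]
```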

CRITICAL - Local vs Uploaded Differences:

| | Local `evaluate()` | Uploaded to LangSmith |
|---|---|---|
| Column name | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key`; it creates a duplicate column |
| Python `run` type | `RunTree` object → `run.outputs` (attribute) | `dict` → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` |
| TypeScript `run` type | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` |
| Python return | `{"score": value, "comment": "..."}` | `{"score": value, "comment": "..."}` |
| TypeScript return | `{ key: "name", score: value, comment: "..." }` | `{ score: value, comment: "..." }` |
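
The "handle both" advice for Python can be factored into one helper so every evaluator reads outputs the same way. This is a sketch: `get_outputs` is a hypothetical name, and `SimpleNamespace` merely stands in for an attribute-style run object.

```python
from types import SimpleNamespace
from typing import Any


def get_outputs(obj: Any) -> dict:
    """Return outputs from an attribute-style or dict-style run; {} if missing or None."""
    if hasattr(obj, "outputs"):
        outputs = obj.outputs
    elif isinstance(obj, dict):
        outputs = obj.get("outputs")
    else:
        outputs = None
    return outputs or {}


# Attribute access (local evaluate) and subscript access (uploaded) both work.
print(get_outputs(SimpleNamespace(outputs={"answer": "42"})))
print(get_outputs({"outputs": None}))
```

Normalizing `None` to `{}` keeps later `.get(...)` calls from raising when a run failed before producing outputs.
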
</evaluator_format>

<evaluator_types>

LLM as Judge - Uses an LLM to grade outputs. Best for subjective quality (accuracy, helpfulness, relevance).
Custom Code - Deterministic logic. Best for objective checks (exact match, trajectory validation, format compliance). </evaluator_types>

<llm_judge>

LLM as Judge Evaluators

NOTE: LLM-as-Judge upload is currently not supported by the CLI — only code evaluators are supported. For evaluations against a dataset, STRONGLY PREFER defining local evaluators to use with evaluate(evaluators=[...]).

<python>

```python
from typing import Annotated, TypedDict

from langchain_openai import ChatOpenAI


class Grade(TypedDict):
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]


judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    Grade, method="json_schema", strict=True
)


async def accuracy_evaluator(run, example):
    # Parenthesized so that a None `outputs` also falls back to {}.
    run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    example_outputs = (example.outputs if hasattr(example, "outputs") else example.get("outputs")) or {}
    grade = await judge.ainvoke([
        {"role": "user", "content": f"Expected: {example_outputs}\nActual: {run_outputs}\nIs this accurate?"}
    ])
    return {"score": 1 if grade["is_accurate"] else 0, "comment": grade["reasoning"]}
```

</python>

<typescript>
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function accuracyEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};

    const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        temperature: 0,
        response_format: { type: "json_object" },
        messages: [
            { role: "system", content: 'Respond with JSON: {"is_accurate": boolean, "reasoning": string}' },
            { role: "user", content: `Expected: ${JSON.stringify(exampleOutputs)}\nActual: ${JSON.stringify(runOutputs)}\nIs this accurate?` }
        ]
    });

    const grade = JSON.parse(response.choices[0].message.content || "{}");
    return { score: grade.is_accurate ? 1 : 0, comment: grade.reasoning };
}
```

</typescript>
</llm_judge>

<code_evaluators>

Custom Code Evaluators

Before writing an evaluator:

Inspect your dataset to understand expected field names (see Golden Rule above)
Test your run function and verify its output structure matches the dataset schema
Query LangSmith traces to debug any mismatches

<python>

```python
def trajectory_evaluator(run, example):
    # Parenthesized so that a None `outputs` also falls back to {}.
    run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    example_outputs = (example.outputs if hasattr(example, "outputs") else example.get("outputs")) or {}
    # IMPORTANT: Replace these placeholders with your actual field names
    # 1. Query your LangSmith trace to see what fields exist in run outputs
    # 2. Check your dataset schema for expected field names
    # Note: Trajectory data may not appear in default output - verify against trace!
    actual = run_outputs.get("YOUR_TRAJECTORY_FIELD", [])
    expected = example_outputs.get("YOUR_EXPECTED_FIELD", [])
    return {"score": 1 if actual == expected else 0, "comment": f"Expected {expected}, got {actual}"}
```

</python>

<typescript>

```javascript
function trajectoryEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};
    // IMPORTANT: Replace these placeholders with your actual field names
    // 1. Query your LangSmith trace to see what fields exist in run outputs
    // 2. Check your dataset schema for expected field names
    const actual = runOutputs.YOUR_TRAJECTORY_FIELD ?? [];
    const expected = exampleOutputs.YOUR_EXPECTED_FIELD ?? [];
    const match = JSON.stringify(actual) === JSON.stringify(expected);
    return { score: match ? 1 : 0, comment: `Expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}` };
}
```

</typescript>
</code_evaluators>

<run_functions>

Defining Run Functions

Run functions execute your agent and return outputs for evaluation.

CRITICAL - Test Your Run Function First: Before writing evaluators, you MUST test your run function and inspect the actual output structure. Output shapes vary by framework, agent type, and configuration.

Debugging workflow:

Run your agent once on sample input
Query the trace to see the execution structure
Print the raw output and verify against the trace to ensure the output contains the right data
Adjust the run function as needed
Verify your output matches your dataset schema

Try your hardest to match your run function output to your dataset schema. This makes evaluators simple and reusable. If matching isn't possible, your evaluator must know how to extract and compare the right fields from each side.
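
One way to keep the run function aligned with the dataset is to declare, per dataset key, how to pull that value out of the agent result, then check that nothing is missing. The sketch below uses hypothetical names throughout (`make_run_function`, `missing_keys`, `fake_agent`, and the `answer`/`trajectory` keys); substitute your real agent call and your dataset's actual output keys.

```python
from typing import Any, Callable, Dict, Iterable


def make_run_function(
    agent_call: Callable[[dict], Any],
    extract: Dict[str, Callable[[Any], Any]],
) -> Callable[[dict], dict]:
    """Build a run function whose output keys mirror the dataset's output keys."""

    def run_agent(inputs: dict) -> dict:
        result = agent_call(inputs)
        return {key: getter(result) for key, getter in extract.items()}

    return run_agent


def missing_keys(run_output: dict, dataset_output_keys: Iterable[str]) -> list:
    """List dataset output keys that the run function did not produce."""
    return sorted(set(dataset_output_keys) - set(run_output))


# Stand-in agent for illustration; replace with your real agent call.
def fake_agent(inputs: dict) -> dict:
    return {
        "messages": [{"content": f"echo: {inputs['question']}"}],
        "tools_used": ["search"],
    }


run_agent = make_run_function(
    fake_agent,
    extract={
        "answer": lambda r: r["messages"][-1]["content"],
        "trajectory": lambda r: r["tools_used"],
    },
)
output = run_agent({"question": "hi"})
print(output, missing_keys(output, ["answer", "trajectory"]))
```

With the keys aligned this way, an evaluator can compare `run_outputs["answer"]` to `example_outputs["answer"]` directly, without framework-specific extraction.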

<python>

```python
def run_agent(inputs: dict) -> dict:
    result = your_agent.run(inputs)
    # ALWAYS inspect output shape first - run this, check the print, query traces
    print(f"DEBUG - type: {type(result)}, keys: {result.keys() if hasattr(result, 'keys') else 'N/A'}")
    print(f"DEBUG - value: {result}")
    return {"output": result}  # Adjust to match your dataset schema
```

</python>

<typescript>

```javascript
async function runAgent(inputs) {
    const result = await yourAgent.invoke(inputs);
    // ALWAYS inspect output shape first
    console.log("DEBUG - type:", typeof result, "keys:", Object.keys(result));
    console.log("DEBUG - value:", result);
    return { output: result }; // Adjust to match your dataset schema
}
```

</typescript>

Capturing Trajectories

For trajectory evaluation, your run function must capture tool calls during execution.
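
Independently of any framework, the idea of capturing tool calls can be sketched by wrapping each tool so that it records its own name when invoked. This is a stand-in technique for illustration, not the framework-specific streaming approach; `record_calls`, the two tools, and the hard-coded agent logic are all hypothetical.

```python
import functools
from typing import Any, Callable, Dict, List


def record_calls(tool: Callable[..., Any], trajectory: List[str]) -> Callable[..., Any]:
    """Wrap a tool so each call appends the tool's name to `trajectory`."""

    @functools.wraps(tool)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        trajectory.append(tool.__name__)
        return tool(*args, **kwargs)

    return wrapper


# Hypothetical tools, for illustration only.
def search(query: str) -> str:
    return f"results for {query}"


def summarize(text: str) -> str:
    return text[:20]


def run_agent_with_trajectory(inputs: Dict[str, str]) -> Dict[str, Any]:
    trajectory: List[str] = []  # fresh list per run, so runs do not share state
    tools = {fn.__name__: record_calls(fn, trajectory) for fn in (search, summarize)}
    # A real agent would decide which tools to call; this one is hard-coded.
    answer = tools["summarize"](tools["search"](inputs["question"]))
    return {"answer": answer, "trajectory": trajectory}


print(run_agent_with_trajectory({"question": "langsmith"})["trajectory"])
# ['search', 'summarize']
```

Returning the trajectory as an ordered list of tool names lets a trajectory evaluator compare it to the expected list with a plain equality check.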

CRITICAL: Run output formats vary significantly by framework and agent type. You MUST inspect before implementing:

LangGraph agents (LangChain OSS): Use stream_mode="debug" with subgraphs=True to capture nested subagent tool calls.

import uuid

def run_agent_with_trajectory(agent, inputs: dict) -> dict:
    config = {"configurable": {"thread_id": f"eval-{uuid.uuid4()}"}}
    trajectory = []
    final_result = None

    for chunk in agent.stream(inputs, config=config, s

---

*Content truncated.*


Safety

Review before install

Runs shell / code
Reads credentials

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated: 3mo ago
Loads: ~3,946 tokens

