langsmith-evaluator
INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run v
Install
mkdir -p .claude/skills/langsmith-evaluator && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15369" && unzip -o skill.zip -d .claude/skills/langsmith-evaluator && rm skill.zip
Installs to .claude/skills/langsmith-evaluator
<setup>
LANGSMITH_API_KEY=<your_api_key_here> # Required
LANGSMITH_PROJECT=your-project-name # Check this to know which project has traces
LANGSMITH_WORKSPACE_ID=your-workspace-id # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key # For LLM as Judge
IMPORTANT: Always check the environment variables or .env file for LANGSMITH_PROJECT before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one.
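One way to perform that check before touching LangSmith is sketched below. It is a minimal illustration using only the standard library. The helper names `read_env_file` and `resolve_langsmith_project` are hypothetical, and the parser handles only plain `KEY=value` lines; python-dotenv (installed below) covers the full .env syntax.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop trailing inline comments, then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def resolve_langsmith_project(env_path=".env"):
    """Prefer the process environment, fall back to the .env file, else None."""
    from_env = os.environ.get("LANGSMITH_PROJECT")
    return from_env or read_env_file(env_path).get("LANGSMITH_PROJECT")
```

If the function returns `None`, neither source names a project, and the fallback described above (using judgement to identify the project) applies.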
Python Dependencies
pip install langsmith langchain-openai python-dotenv
CLI Tool (for uploading evaluators)
pip install langsmith
JavaScript Dependencies
npm install langsmith openai
</setup>
<crucial_requirement>
Golden Rule: Inspect Before You Implement
CRITICAL: Before writing ANY evaluator or extraction logic, you MUST:
- Run your agent on sample inputs and capture the actual output
- Inspect the output - print it, query LangSmith traces, understand the exact structure
- Only then write code that processes that output
Output structures vary significantly by framework, agent type, and configuration. Never assume the shape; always verify it first. When an output does not contain the data you need, query LangSmith traces to see how to extract it from the execution.
</crucial_requirement>
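The inspection step can be made less noisy with a small shape summarizer. The sketch below is an assumption-laden helper, not part of LangSmith: `describe_shape` is a hypothetical name, and it samples only the first element of each list, so it assumes sequences are homogeneous.

```python
def describe_shape(value, max_depth=3, _depth=0):
    """Return a nested summary of keys and type names, without the values."""
    if _depth >= max_depth:
        return type(value).__name__
    if isinstance(value, dict):
        return {k: describe_shape(v, max_depth, _depth + 1) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Summarize only the first element; assumes homogeneous sequences.
        inner = describe_shape(value[0], max_depth, _depth + 1) if value else "empty"
        return [inner, f"len={len(value)}"]
    return type(value).__name__


# Hypothetical agent output, used only to show the summary format.
sample = {"messages": [{"role": "assistant", "content": "hi"}], "steps": 2}
print(describe_shape(sample))
```

Printing the summary next to the raw value makes it easier to spot which key actually holds the final answer before any extraction code is written.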
<evaluator_format>
Offline vs Online Evaluators
Offline Evaluators (attached to datasets):
- Function signature: `(run, example)` - receives both run outputs and dataset example
- Use case: Comparing agent outputs to expected values in a dataset
- Upload with: `--dataset "Dataset Name"`
Online Evaluators (attached to projects):
- Function signature: `(run)` - receives only run outputs, NO example parameter
- Use case: Real-time quality checks on production runs (no reference data)
- Upload with: `--project "Project Name"`
CRITICAL - Return Format:
- Each evaluator returns ONE metric only. For multiple metrics, create multiple evaluator functions.
- Do NOT return `{"metric_name": value}` or lists of metrics - this will error.
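The two signatures and the one-metric rule can be sketched together as follows. This is a minimal illustration, not a definitive implementation. It assumes runs and examples arrive as plain dicts (what this document describes for uploaded evaluators) and a hypothetical `answer` output field. Two metrics are expressed as two separate functions.

```python
def exact_match(run, example):
    """Offline evaluator: compares the run's answer to the dataset's expected answer."""
    actual = (run.get("outputs") or {}).get("answer")
    expected = (example.get("outputs") or {}).get("answer")
    if expected is None:
        return {"score": 0, "comment": "Example has no 'answer' field"}
    matched = actual == expected
    return {
        "score": 1 if matched else 0,
        "comment": f"expected={expected!r}, actual={actual!r}",
    }


def has_answer(run):
    """Online evaluator: no reference data, so it checks only the run itself."""
    actual = (run.get("outputs") or {}).get("answer")
    present = isinstance(actual, str) and bool(actual.strip())
    return {
        "score": 1 if present else 0,
        "comment": "answer present" if present else "answer missing or empty",
    }
```

Each function returns a single `score` with an explanatory `comment`, so neither needs to name its metric in the return value.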
CRITICAL - Local vs Uploaded Differences:
| | Local evaluate() | Uploaded to LangSmith |
|---|---|---|
| Column name | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key` - it creates a duplicate column |
| Python run type | RunTree object → `run.outputs` (attribute) | dict → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` |
| TypeScript run type | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` |
| Python return | `{"score": value, "comment": "..."}` | `{"score": value, "comment": "..."}` |
| TypeScript return | `{ key: "name", score: value, comment: "..." }` | `{ score: value, comment: "..." }` |
</evaluator_format>
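For Python evaluators that must work both locally and after upload, the dual access pattern can be isolated in one helper. The sketch below is hypothetical: `get_outputs` and `non_empty_output` are illustrative names, and `SimpleNamespace` only stands in for a RunTree object so the example runs without the LangSmith SDK.

```python
from types import SimpleNamespace


def get_outputs(obj):
    """Return the outputs mapping from a RunTree-like object or a plain dict."""
    if isinstance(obj, dict):
        outputs = obj.get("outputs")
    else:
        outputs = getattr(obj, "outputs", None)
    # Normalize a missing or None value to an empty dict.
    return outputs or {}


def non_empty_output(run):
    """Example evaluator built on the helper; returns one metric."""
    outputs = get_outputs(run)
    return {"score": 1 if outputs else 0, "comment": f"{len(outputs)} output field(s)"}


local_style = SimpleNamespace(outputs={"answer": "42"})  # stand-in for a RunTree
uploaded_style = {"outputs": {"answer": "42"}}           # dict form
print(non_empty_output(local_style), non_empty_output(uploaded_style))
```

Normalizing `None` to `{}` inside the helper keeps every later `.get(...)` call safe, whichever form the run takes.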
<evaluator_types>
- LLM as Judge - Uses an LLM to grade outputs. Best for subjective quality (accuracy, helpfulness, relevance).
- Custom Code - Deterministic logic. Best for objective checks (exact match, trajectory validation, format compliance).
</evaluator_types>
<llm_judge>
LLM as Judge Evaluators
NOTE: LLM-as-Judge upload is currently not supported by the CLI — only code evaluators are supported. For evaluations against a dataset, STRONGLY PREFER defining local evaluators to use with evaluate(evaluators=[...]).
<python>
```python
from typing_extensions import Annotated, TypedDict

from langchain_openai import ChatOpenAI


class Grade(TypedDict):
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]


judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    Grade, method="json_schema", strict=True
)


async def accuracy_evaluator(run, example):
    # Local runs are RunTree objects (attribute access); uploaded runs are dicts.
    # The parentheses make the `or {}` fallback apply to both branches.
    run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    example_outputs = (example.outputs if hasattr(example, "outputs") else example.get("outputs")) or {}
    grade = await judge.ainvoke([{"role": "user", "content": f"Expected: {example_outputs}\nActual: {run_outputs}\nIs this accurate?"}])
    return {"score": 1 if grade["is_accurate"] else 0, "comment": grade["reasoning"]}
```
</python>
<typescript>
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function accuracyEvaluator(run, example) {
  const runOutputs = run.outputs ?? {};
  const exampleOutputs = example.outputs ?? {};
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: 'Respond with JSON: {"is_accurate": boolean, "reasoning": string}' },
      { role: "user", content: `Expected: ${JSON.stringify(exampleOutputs)}\nActual: ${JSON.stringify(runOutputs)}\nIs this accurate?` }
    ]
  });
  const grade = JSON.parse(response.choices[0].message.content || "{}");
  return { score: grade.is_accurate ? 1 : 0, comment: grade.reasoning };
}
```
</typescript>
</llm_judge>
<code_evaluators>
Custom Code Evaluators
Before writing an evaluator:
- Inspect your dataset to understand expected field names (see Golden Rule above)
- Test your run function and verify its output structure matches the dataset schema
- Query LangSmith traces to debug any mismatches
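As an illustration of a deterministic check, the sketch below scores format compliance: it tests whether the run's output is valid JSON containing a set of required keys. The `output` field name and the `REQUIRED_KEYS` set are assumptions chosen for the example and should be replaced with whatever inspection of the real run output reveals.

```python
import json

REQUIRED_KEYS = {"title", "summary"}  # hypothetical schema for this example


def json_format_evaluator(run, example):
    """Score 1 if the run's 'output' is a JSON object with all required keys.

    `example` is unused here; drop the parameter for an online evaluator.
    """
    outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {})
    raw = (outputs or {}).get("output")
    if not isinstance(raw, str):
        return {"score": 0, "comment": f"'output' is {type(raw).__name__}, expected str"}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"score": 0, "comment": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"score": 0, "comment": "JSON is not an object"}
    missing = sorted(REQUIRED_KEYS - set(parsed))
    if missing:
        return {"score": 0, "comment": f"missing keys: {missing}"}
    return {"score": 1, "comment": "valid JSON with all required keys"}
```

Each failure path returns a distinct comment, so a low score in LangSmith points directly at the mismatch to debug.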
<run_functions>
Defining Run Functions
Run functions execute your agent and return outputs for evaluation.
CRITICAL - Test Your Run Function First: Before writing evaluators, you MUST test your run function and inspect the actual output structure. Output shapes vary by framework, agent type, and configuration.
Debugging workflow:
- Run your agent once on sample input
- Query the trace to see the execution structure
- Print the raw output and verify against the trace to ensure the output contains the right data
- Adjust the run function as needed
- Verify your output matches your dataset schema
Try your hardest to match your run function output to your dataset schema. This makes evaluators simple and reusable. If matching isn't possible, your evaluator must know how to extract and compare the right fields from each side.
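The final verification step can be partly automated. The sketch below compares the top-level keys and value types of one run output against one dataset example's expected output. `schema_diff` is a hypothetical helper, and it checks only the top level, which is an assumption that suits flat schemas.

```python
def schema_diff(run_output: dict, expected_output: dict) -> dict:
    """Report top-level key and type differences between run output and dataset output."""
    run_keys, expected_keys = set(run_output), set(expected_output)
    return {
        # Keys the dataset expects but the run function did not return.
        "missing": sorted(expected_keys - run_keys),
        # Keys the run function returned that the dataset does not have.
        "extra": sorted(run_keys - expected_keys),
        # Shared keys whose values have different Python types.
        "type_mismatch": sorted(
            key
            for key in run_keys & expected_keys
            if type(run_output[key]) is not type(expected_output[key])
        ),
    }


# Hypothetical values: the run returns "output" while the dataset expects "answer".
print(schema_diff({"output": "x", "steps": 3}, {"answer": "x", "steps": "3"}))
```

An all-empty result suggests the evaluator can compare fields directly; anything else indicates that either the run function or the evaluator's extraction logic needs adjusting.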
<python>
```python
def run_agent(inputs: dict) -> dict:
    result = your_agent.run(inputs)
    # ALWAYS inspect output shape first - run this, check the print, query traces
    print(f"DEBUG - type: {type(result)}, keys: {result.keys() if hasattr(result, 'keys') else 'N/A'}")
    print(f"DEBUG - value: {result}")
    return {"output": result}  # Adjust to match your dataset schema
```
</python>
<typescript>
```javascript
async function runAgent(inputs) {
  const result = await yourAgent.invoke(inputs);
  // ALWAYS inspect output shape first
  console.log("DEBUG - type:", typeof result, "keys:", Object.keys(result));
  console.log("DEBUG - value:", result);
  return { output: result }; // Adjust to match your dataset schema
}
```
</typescript>
Capturing Trajectories
For trajectory evaluation, your run function must capture tool calls during execution.
CRITICAL: Run output formats vary significantly by framework and agent type. You MUST inspect before implementing:
LangGraph agents (LangChain OSS): Use stream_mode="debug" with subgraphs=True to capture nested subagent tool calls.
import uuid

def run_agent_with_trajectory(agent, inputs: dict) -> dict:
    config = {"configurable": {"thread_id": f"eval-{uuid.uuid4()}"}}
    trajectory = []
    final_result = None
    for chunk in agent.stream(inputs, config=config, s
---
*Content truncated.*