genai-platform-eval

Name: genai-platform-eval
Author: tsemana

Install

mkdir -p .claude/skills/genai-platform-eval && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16298" && unzip -o skill.zip -d .claude/skills/genai-platform-eval && rm skill.zip

Installs to .claude/skills/genai-platform-eval

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Use when evaluating, comparing, researching, or preparing vendor questions for Generative AI platforms, AI services, orchestration layers, agent frameworks, or AI tooling adoption. Provides a structured scope-research-synthesis workflow covering governance, integrations, tenancy, portability, compliance, pricing, accuracy controls, DevOps maturity, and vendor viability across internal enterprise, B2B SaaS embedding, consumer-facing, and hybrid deployment models.

About this skill

GenAI Platform Evaluation Framework

Overview

This skill guides a structured evaluation of Generative AI platforms, orchestration layers, agent frameworks, or AI services. Use it to turn a broad adoption question into a defensible evaluation artifact: scope, research findings, vendor questions, architectural risks, and a prioritized gap list.

The workflow has three phases:

Scope the evaluation — clarify the deployment model, data stack, target platform, stakeholders, and decision being supported.
Research and document — work through the evaluation domains in references/evaluation-domains.md, separating confirmed evidence from inferred or absent capabilities.
Synthesize and deliver — produce an internal briefing, vendor-call question package, buy-vs-build analysis, or docx-ready report.

The goal is not a generic feature checklist. The goal is to determine whether the platform fits the organization's actual architecture, risk model, operating model, commercial model, and deployment path.

When to Use

Use this skill when the user asks to:

Evaluate, compare, assess, or research an AI platform, AI vendor, agent framework, orchestration tool, LLM gateway, AI governance layer, or GenAI service.
Prepare for a vendor meeting or diligence call with specific questions.
Create a technology assessment for AI tooling adoption.
Build a buy-vs-build analysis for an AI platform capability.
Assess fit for internal enterprise use, B2B SaaS embedding, consumer-facing AI features, or hybrid deployment models.
Identify unresolved risks before a pilot, procurement decision, architecture review, or executive briefing.

Do not use this skill for:

Narrow model-quality evaluations where the platform, governance, tenancy, pricing, and operating model are out of scope.
Pure benchmark work focused only on model metrics.
General AI market research with no adoption or architecture decision attached.

Phase 1: Scope the Evaluation

Before researching anything, establish the evaluation context. These inputs determine which domains are primary, secondary, or skippable.

1. Identify the deployment model

Ask which deployment model best describes the situation:

Internal enterprise — AI capabilities used by employees, analysts, operations, or internal teams. Governance, identity, compliance, development experience, and data integration are usually primary.
B2B SaaS embedding — AI capabilities embedded inside a product sold to many external customers or tenants. Multi-tenancy, data isolation, API-first architecture, OEM pricing, white-labeling, auditability, and downstream customer assurances become primary. This is usually the most architecturally demanding model.
Consumer-facing product — AI features exposed directly to end users. Latency, accuracy, scale, content safety, cost, and rollback/change management are primary.
Hybrid — Multiple deployment models are in play. Evaluate against the most demanding model first, then check for gaps in the others.

If the user is unsure, proceed with a stated assumption and mark deployment-model uncertainty as an open question.
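The deployment-model guidance above can be sketched as a small lookup. This is an illustrative sketch only: the names `DeploymentModel`, `PRIMARY_CONCERNS`, `evaluation_order`, and `hybrid_concerns` are hypothetical, and the concern lists are copied from the descriptions above. The only ordering assumed is the one the text states, that B2B SaaS embedding is usually the most demanding model.

```python
from enum import Enum


class DeploymentModel(Enum):
    INTERNAL_ENTERPRISE = "internal enterprise"
    B2B_SAAS_EMBEDDING = "b2b saas embedding"
    CONSUMER_FACING = "consumer-facing product"


# Primary concerns per model, taken from the descriptions in this section.
PRIMARY_CONCERNS = {
    DeploymentModel.INTERNAL_ENTERPRISE: [
        "governance", "identity", "compliance",
        "development experience", "data integration",
    ],
    DeploymentModel.B2B_SAAS_EMBEDDING: [
        "multi-tenancy", "data isolation", "API-first architecture",
        "OEM pricing", "white-labeling", "auditability",
        "downstream customer assurances",
    ],
    DeploymentModel.CONSUMER_FACING: [
        "latency", "accuracy", "scale", "content safety",
        "cost", "rollback/change management",
    ],
}


def evaluation_order(models):
    """Order the models of a hybrid evaluation, most demanding first.

    Assumption: only B2B SaaS embedding is moved to the front; sorted() is
    stable, so the remaining models keep the order the caller gave them.
    """
    return sorted(models, key=lambda m: m is not DeploymentModel.B2B_SAAS_EMBEDDING)


def hybrid_concerns(models):
    """Merge primary concerns across models without duplicates."""
    merged = []
    for model in evaluation_order(models):
        for concern in PRIMARY_CONCERNS[model]:
            if concern not in merged:
                merged.append(concern)
    return merged
```

For a hybrid of internal enterprise and B2B SaaS embedding, `hybrid_concerns` lists the tenancy and isolation concerns first and then appends the internal-enterprise concerns that were not already covered.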

2. Map the data stack

Capture the current or expected data infrastructure. Include:

Databases and operational systems.
Data warehouses and data lakes.
Transformation tools.
ETL/ELT or ingestion pipelines.
Event streaming systems.
Existing AI/ML tools, vector stores, model gateways, or governance layers.
Identity providers, secrets managers, audit/SIEM systems, and observability tooling.

For each component, note whether it sits at the storage, transformation, movement, streaming, governance, observability, or application layer. This inventory determines which integration and identity questions to prioritize.
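One way to keep this inventory is a flat list of components tagged with their layer. The sketch below is a minimal illustration, not part of the skill itself: the `Component` type, the `group_by_layer` helper, and every component name are hypothetical. The layer names are the seven listed above.

```python
from collections import defaultdict
from dataclasses import dataclass

# The seven layers named in this section.
LAYERS = {
    "storage", "transformation", "movement", "streaming",
    "governance", "observability", "application",
}


@dataclass(frozen=True)
class Component:
    name: str
    layer: str

    def __post_init__(self):
        # Reject typos early so the inventory stays comparable across evaluations.
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")


def group_by_layer(components):
    """Return {layer: [component names]} in insertion order."""
    grouped = defaultdict(list)
    for component in components:
        grouped[component.layer].append(component.name)
    return dict(grouped)


# Hypothetical inventory; the names are placeholders.
inventory = [
    Component("orders-db", "storage"),
    Component("nightly-elt", "movement"),
    Component("corp-idp", "governance"),
    Component("analytics-warehouse", "storage"),
]
by_layer = group_by_layer(inventory)

# Layers with no recorded component are worth raising as scoping questions.
missing_layers = sorted(LAYERS - set(by_layer))
```

Layers that end up in `missing_layers` are not necessarily gaps in the stack; they may simply be out of scope, which is worth stating explicitly.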

3. Identify the evaluation target

Gather:

Vendor name.
Product or platform name.
Public documentation URLs.
Pricing pages, security pages, trust center, API docs, SDK docs, and architecture docs if available.
Whether the evaluation is single-vendor or comparative.

For multi-vendor comparisons, apply the framework independently to each platform, then synthesize a comparison table only after researching each one on its own terms.
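As a rough sketch of that last step, the comparison table can be assembled mechanically once each vendor has its own set of findings. The `comparison_table` function, the vendor names, and the "Not researched" placeholder are all hypothetical; the evidence labels are the ones the skill defines in Phase 2.

```python
def comparison_table(results):
    """Build a markdown table from {vendor: {domain: evidence label}}.

    Each vendor's findings are assumed to have been researched independently;
    this function only lines them up side by side.
    """
    vendors = list(results)

    # Collect domains in first-seen order across all vendors.
    domains = []
    for findings in results.values():
        for domain in findings:
            if domain not in domains:
                domains.append(domain)

    lines = [
        "| Domain | " + " | ".join(vendors) + " |",
        "|" + " --- |" * (len(vendors) + 1),
    ]
    for domain in domains:
        # "Not researched" is a placeholder label assumed for this sketch.
        cells = [results[v].get(domain, "Not researched") for v in vendors]
        lines.append(f"| {domain} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical findings for two placeholder vendors.
table = comparison_table({
    "Vendor A": {"Governance": "Confirmed", "Multi-tenancy": "Not found"},
    "Vendor B": {"Governance": "Not confirmed"},
})
```

A domain researched for one vendor but not another shows up as an explicit cell rather than a silent blank, which keeps uneven research visible in the comparison.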

4. Define stakeholder context

Clarify:

Who requested the evaluation.
Who will read the output.
What decision it supports: buy/no-buy, pilot scope, architecture review, procurement negotiation, risk acceptance, partner diligence, or build-vs-buy.
Whether the intended output is a short briefing, a detailed report, a vendor question list, a comparison matrix, or a docx-ready artifact.

Phase 2: Research and Document

Read references/evaluation-domains.md before starting research. It contains the full domain taxonomy, what to look for in each domain, and deployment-specific priority guidance.

For each relevant domain:

1. Research public materials: Use public documentation, security pages, pricing pages, changelogs, blog posts, case studies, analyst coverage, marketplace listings, and credible third-party coverage.
2. Document findings: Summarize what the evidence shows. Name features, exact product terms, pricing units, certification names, version numbers, API capabilities, or published limits where possible.
3. Separate confirmed, inferred, and absent, using clear labels:
   - Confirmed: directly supported by cited public materials.
   - Not confirmed: plausible from adjacent evidence, but not explicitly documented.
   - Not found: no public evidence found after reasonable search.
4. Draft follow-up questions: Target the gap between vendor claims and the organization's requirements.
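The three evidence labels lend themselves to a small record type. The following is a minimal sketch under stated assumptions: `Evidence`, `Finding`, and `needs_follow_up` are hypothetical names, and the rule that a Confirmed finding must carry a source is one reading of "directly supported by cited public materials".

```python
from dataclasses import dataclass, field
from enum import Enum


class Evidence(Enum):
    CONFIRMED = "Confirmed"          # directly supported by cited public materials
    NOT_CONFIRMED = "Not confirmed"  # plausible, but not explicitly documented
    NOT_FOUND = "Not found"          # no public evidence after reasonable search


@dataclass
class Finding:
    domain: str
    claim: str
    evidence: Evidence
    sources: list = field(default_factory=list)

    def __post_init__(self):
        # Assumption: "Confirmed" is only valid when a source is cited.
        if self.evidence is Evidence.CONFIRMED and not self.sources:
            raise ValueError("a Confirmed finding needs at least one cited source")


def needs_follow_up(findings):
    """Findings that should turn into vendor questions."""
    return [f for f in findings if f.evidence is not Evidence.CONFIRMED]
```

Anything returned by `needs_follow_up` is a candidate for the follow-up questions step, since those are the places where vendor claims and documented evidence diverge.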

Research discipline

Distinguish claims from architecture. Vendor language like “enterprise-grade” or “secure by design” is not enough. Look for where policies are enforced, how identity flows, how audit logs are structured, and which controls are configurable.
Note what is absent. Silence in documentation is itself a finding, especially for tenancy, data training, audit export, pricing, rollback, and exit strategy.
Attribute specifics. Dates, versions, connector counts, pricing tiers, certification names, marketplace listings, and published limits make the assessment useful later.
Prefer primary sources. Use vendor docs and trust-center materials first; use blogs, analyst notes, and third-party coverage as supporting context.
Avoid overclaiming. Do not infer production readiness from demo videos, marketing screenshots, or isolated case studies.

When to Read the Reference Files

references/evaluation-domains.md — Read at the start of Phase 2. Contains the full taxonomy of evaluation domains, what to research in each, and which domains matter most for each deployment model.
references/question-bank.md — Read when drafting follow-up questions. Contains proven question patterns organized by domain, drawn from real enterprise AI evaluations.

Evaluation Domains

Use references/evaluation-domains.md as the source of truth for domain details. At minimum, consider these categories:

Tier 1: Always Evaluate

Platform identity and market position.
Development experience.
Integration and connector ecosystem.
Governance, policy enforcement, and identity.
Vendor lock-in, portability, and exit strategy.
Compliance, data residency, and data training.

Tier 2: Deployment-Dependent

Multi-tenancy and data isolation.
Pricing and commercial model.
API-first architecture and product embedding.

Tier 3: Augmented Concerns

Agent accuracy and hallucination controls.
Rollback and change management.
Latency and performance at scale.
Company maturity and viability.
Build vs. buy framing.

Phase 3: Synthesize and Deliver

1. Prioritize gaps

After completing domain research, rank unresolved questions by decision impact. The top items should be potential blockers if answered unfavorably. Lower-priority items may still matter, but should not distract from deal-shaping or architecture-shaping gaps.

Weight domains by deployment model using the priority guidance in references/evaluation-domains.md. For example:

B2B SaaS embedding should heavily weight multi-tenancy, data isolation, OEM pricing, API control, white-labeling, audit evidence, and downstream customer assurances.
Internal enterprise should heavily weight governance, identity, data integration, compliance, development experience, and portability.
Consumer-facing products should heavily weight latency, accuracy, scale, rollback, safety, and cost behavior.
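The ranking described above can be sketched as a two-key sort: potential blockers first, then gaps in the domains the deployment model weights heavily. This is an assumed scheme for illustration, not the skill's own scoring method; `Gap`, `HEAVY_DOMAINS`, `rank_gaps`, and the sample questions are hypothetical, while the domain lists are copied from the examples above.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    question: str
    domain: str
    blocker: bool  # True if an unfavorable answer could block the decision


# Heavily weighted domains per deployment model, from the examples above.
HEAVY_DOMAINS = {
    "b2b_saas_embedding": {
        "multi-tenancy", "data isolation", "OEM pricing", "API control",
        "white-labeling", "audit evidence", "downstream customer assurances",
    },
    "internal_enterprise": {
        "governance", "identity", "data integration", "compliance",
        "development experience", "portability",
    },
    "consumer_facing": {
        "latency", "accuracy", "scale", "rollback", "safety", "cost behavior",
    },
}


def rank_gaps(gaps, deployment_model):
    """Sort gaps: blockers first, then heavily weighted domains.

    sorted() is stable, so ties keep the order in which gaps were recorded.
    """
    heavy = HEAVY_DOMAINS[deployment_model]
    return sorted(gaps, key=lambda g: (not g.blocker, g.domain not in heavy))


# Hypothetical open questions.
gaps = [
    Gap("Is SSO supported for admin users?", "identity", False),
    Gap("Is data isolated per tenant?", "multi-tenancy", True),
    Gap("What is the published p95 latency?", "latency", False),
]
ranked_for_consumer = rank_gaps(gaps, "consumer_facing")
```

With the sample data, the tenancy question stays on top because it is marked as a blocker, and the latency question moves ahead of the identity question only when the consumer-facing weighting is applied.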

2. Build the evaluation artifact

Produce the requested output format. If the user does not specify a format, default to a structured markdown report that is easy to convert into a document.

Recommended structure:

Title: [Platform Name] Evaluation
Subtitle: [Purpose] | [Organization Name] | [Date]

1. Executive Summary
   - Recommendation or current read
   - Top strengths
   - Top unresolved risks
   - Decision the report supports

2. Platform Overview
   - What the platform is
   - Key components
   - Market positioning
   - Deployment model fit

3. Prepared Questions & Research
   - One section per evaluated domain
   - Each section includes: What We Found, Evidence, Questions for the Vendor

4. Strategic & Architectural Concerns
   - Deeper analysis of domains that intersect with the organization's deployment model and data stack

5. Commercial and Operating Model Considerations
   - Pricing, scale assumptions, support model, procurement concerns, rollout/rollback, ownership

6. Summary: Prioritized Gaps to Resolve
   - Ranked list with impact and why it matters

7. Appendix
   - Source list
   - Assumptions
   - Skipped domains and rationale
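
If the default markdown report is generated programmatically, the structure above could be scaffolded roughly as follows. This is a hedged sketch: `report_skeleton` and its parameters are hypothetical, the section titles are copied from the recommended structure, and the use of `#` and `##` headings is an assumption about how the markdown would be laid out.

```python
from datetime import date

# Section titles copied from the recommended structure.
SECTIONS = [
    "Executive Summary",
    "Platform Overview",
    "Prepared Questions & Research",
    "Strategic & Architectural Concerns",
    "Commercial and Operating Model Considerations",
    "Summary: Prioritized Gaps to Resolve",
    "Appendix",
]


def report_skeleton(platform, purpose, organization, on=None):
    """Return an empty markdown report with title, subtitle, and headings."""
    on = on or date.today()
    lines = [
        f"# {platform} Evaluation",
        "",
        f"{purpose} | {organization} | {on.isoformat()}",
        "",
    ]
    for number, title in enumerate(SECTIONS, start=1):
        lines += [f"## {number}. {title}", ""]
    return "\n".join(lines)


# Placeholder values for illustration only.
skeleton = report_skeleton(
    "ExamplePlatform", "Pilot scoping", "Example Org", date(2025, 1, 15)
)
```

The skeleton only fixes the outline; the per-domain subsections under "Prepared Questions & Research" still depend on which domains were evaluated.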

3. Voice and audience

Write as an internal report prepared by an evaluator for colleagues. Use first-person plural when referring to t

Content truncated.

Safety

No risk patterns found

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated: 1mo ago
License: MIT
Loads: ~3,744 tokens

