genai-platform-eval
Install
mkdir -p .claude/skills/genai-platform-eval && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16298" && unzip -o skill.zip -d .claude/skills/genai-platform-eval && rm skill.zip
Installs to .claude/skills/genai-platform-eval
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Use when evaluating, comparing, researching, or preparing vendor questions for Generative AI platforms, AI services, orchestration layers, agent frameworks, or AI tooling adoption. Provides a structured scope-research-synthesis workflow covering governance, integrations, tenancy, portability, compliance, pricing, accuracy controls, DevOps maturity, and vendor viability across internal enterprise, B2B SaaS embedding, consumer-facing, and hybrid deployment models.
About this skill
GenAI Platform Evaluation Framework
Overview
This skill guides a structured evaluation of Generative AI platforms, orchestration layers, agent frameworks, or AI services. Use it to turn a broad adoption question into a defensible evaluation artifact: scope, research findings, vendor questions, architectural risks, and a prioritized gap list.
The workflow has three phases:
- Scope the evaluation: clarify the deployment model, data stack, target platform, stakeholders, and decision being supported.
- Research and document: work through the evaluation domains in references/evaluation-domains.md, separating confirmed evidence from inferred or absent capabilities.
- Synthesize and deliver: produce an internal briefing, vendor-call question package, buy-vs-build analysis, or docx-ready report.
The goal is not a generic feature checklist. The goal is to determine whether the platform fits the organization's actual architecture, risk model, operating model, commercial model, and deployment path.
When to Use
Use this skill when the user asks to:
- Evaluate, compare, assess, or research an AI platform, AI vendor, agent framework, orchestration tool, LLM gateway, AI governance layer, or GenAI service.
- Prepare for a vendor meeting or diligence call with specific questions.
- Create a technology assessment for AI tooling adoption.
- Build a buy-vs-build analysis for an AI platform capability.
- Assess fit for internal enterprise use, B2B SaaS embedding, consumer-facing AI features, or hybrid deployment models.
- Identify unresolved risks before a pilot, procurement decision, architecture review, or executive briefing.
Do not use this skill for:
- Narrow model-quality evaluations where the platform, governance, tenancy, pricing, and operating model are out of scope.
- Pure benchmark work focused only on model metrics.
- General AI market research with no adoption or architecture decision attached.
Phase 1: Scope the Evaluation
Before researching anything, establish the evaluation context. These inputs determine which domains are primary, secondary, or skippable.
1. Identify the deployment model
Ask which deployment model best describes the situation:
- Internal enterprise — AI capabilities used by employees, analysts, operations, or internal teams. Governance, identity, compliance, development experience, and data integration are usually primary.
- B2B SaaS embedding — AI capabilities embedded inside a product sold to many external customers or tenants. Multi-tenancy, data isolation, API-first architecture, OEM pricing, white-labeling, auditability, and downstream customer assurances become primary. This is usually the most architecturally demanding model.
- Consumer-facing product — AI features exposed directly to end users. Latency, accuracy, scale, content safety, cost, and rollback/change management are primary.
- Hybrid — Multiple deployment models are in play. Evaluate against the most demanding model first, then check for gaps in the others.
If the user is unsure, proceed with a stated assumption and mark deployment-model uncertainty as an open question.
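The mapping from deployment model to primary concerns can be written down as a small lookup. The sketch below is illustrative only: the names `DeploymentModel`, `PRIMARY_CONCERNS`, `DEMAND_ORDER`, and `primary_concerns` are hypothetical, and the demand ordering beyond "B2B SaaS embedding first" is an assumption, since the text only says that model is usually the most demanding.

```python
from enum import Enum


class DeploymentModel(Enum):
    INTERNAL_ENTERPRISE = "internal enterprise"
    B2B_SAAS_EMBEDDING = "B2B SaaS embedding"
    CONSUMER_FACING = "consumer-facing product"


# Primary concerns per model, copied from the descriptions above.
PRIMARY_CONCERNS = {
    DeploymentModel.INTERNAL_ENTERPRISE: [
        "governance", "identity", "compliance",
        "development experience", "data integration",
    ],
    DeploymentModel.B2B_SAAS_EMBEDDING: [
        "multi-tenancy", "data isolation", "API-first architecture",
        "OEM pricing", "white-labeling", "auditability",
        "downstream customer assurances",
    ],
    DeploymentModel.CONSUMER_FACING: [
        "latency", "accuracy", "scale", "content safety",
        "cost", "rollback/change management",
    ],
}

# ASSUMPTION: an illustrative "most demanding first" order for hybrid cases.
DEMAND_ORDER = [
    DeploymentModel.B2B_SAAS_EMBEDDING,
    DeploymentModel.CONSUMER_FACING,
    DeploymentModel.INTERNAL_ENTERPRISE,
]


def primary_concerns(models):
    """Return de-duplicated concerns, most demanding model first."""
    ordered = sorted(set(models), key=DEMAND_ORDER.index)
    concerns = []
    for model in ordered:
        for concern in PRIMARY_CONCERNS[model]:
            if concern not in concerns:
                concerns.append(concern)
    return concerns


# A hybrid evaluation: internal use plus embedding in a product.
hybrid = primary_concerns([
    DeploymentModel.INTERNAL_ENTERPRISE,
    DeploymentModel.B2B_SAAS_EMBEDDING,
])
```

A hybrid case is passed as a list of models; the embedding concerns come out first, followed by any internal-enterprise concerns not already covered.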
2. Map the data stack
Capture the current or expected data infrastructure. Include:
- Databases and operational systems.
- Data warehouses and data lakes.
- Transformation tools.
- ETL/ELT or ingestion pipelines.
- Event streaming systems.
- Existing AI/ML tools, vector stores, model gateways, or governance layers.
- Identity providers, secrets managers, audit/SIEM systems, and observability tooling.
For each component, note whether it sits at the storage, transformation, movement, streaming, governance, observability, or application layer. This inventory determines which integration and identity questions to prioritize.
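One way to record the inventory is a flat list of components tagged with a layer, grouped afterwards. This is a minimal sketch: `StackComponent` and `group_by_layer` are hypothetical names, and the example components and their layer assignments are placeholders rather than a recommended classification.

```python
from collections import defaultdict
from dataclasses import dataclass

# Layer names taken from the paragraph above.
LAYERS = (
    "storage", "transformation", "movement", "streaming",
    "governance", "observability", "application",
)


@dataclass(frozen=True)
class StackComponent:
    name: str
    layer: str

    def __post_init__(self):
        # Reject layers that are not in the agreed vocabulary.
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")


def group_by_layer(components):
    """Group component names by layer, keeping the LAYERS order."""
    grouped = defaultdict(list)
    for component in components:
        grouped[component.layer].append(component.name)
    return {layer: grouped[layer] for layer in LAYERS if layer in grouped}


# Placeholder inventory for illustration only.
inventory = [
    StackComponent("orders database", "storage"),
    StackComponent("nightly ELT job", "movement"),
    StackComponent("identity provider", "governance"),
    StackComponent("analytics warehouse", "storage"),
]
by_layer = group_by_layer(inventory)
```

Layers with no components are omitted from the result, which makes empty layers easy to spot when deciding which integration and identity questions to ask first.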
3. Identify the evaluation target
Gather:
- Vendor name.
- Product or platform name.
- Public documentation URLs.
- Pricing pages, security pages, trust center, API docs, SDK docs, and architecture docs if available.
- Whether the evaluation is single-vendor or comparative.
For multi-vendor comparisons, apply the framework independently to each platform, then synthesize a comparison table only after researching each one on its own terms.
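A comparison table can be assembled mechanically once each platform has been researched on its own. The sketch below assumes per-vendor results are kept as a mapping from domain to a short status string; the function name `comparison_table`, the "Not researched" filler, and the vendor names are hypothetical.

```python
def comparison_table(results):
    """Build a markdown table from {vendor: {domain: status}}."""
    vendors = list(results)
    # Collect domains in first-seen order across all vendors.
    domains = []
    for per_vendor in results.values():
        for domain in per_vendor:
            if domain not in domains:
                domains.append(domain)
    header = "| Domain | " + " | ".join(vendors) + " |"
    divider = "|" + " --- |" * (len(vendors) + 1)
    rows = [
        "| " + domain + " | "
        + " | ".join(results[v].get(domain, "Not researched") for v in vendors)
        + " |"
        for domain in domains
    ]
    return "\n".join([header, divider, *rows])


# Placeholder results for two hypothetical vendors.
results = {
    "Vendor A": {"Governance": "Confirmed", "Pricing": "Not found"},
    "Vendor B": {"Governance": "Not confirmed"},
}
table = comparison_table(results)
```

Filling missing cells explicitly keeps an unresearched domain from being read as a negative finding.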
4. Define stakeholder context
Clarify:
- Who requested the evaluation.
- Who will read the output.
- What decision it supports: buy/no-buy, pilot scope, architecture review, procurement negotiation, risk acceptance, partner diligence, or build-vs-buy.
- Whether the intended output is a short briefing, a detailed report, a vendor question list, a comparison matrix, or a docx-ready artifact.
Phase 2: Research and Document
Read references/evaluation-domains.md before starting research. It contains the full domain taxonomy, what to look for in each domain, and deployment-specific priority guidance.
For each relevant domain:
- Research public materials: use public documentation, security pages, pricing pages, changelogs, blog posts, case studies, analyst coverage, marketplace listings, and credible third-party coverage.
- Document findings: summarize what the evidence shows. Name features, exact product terms, pricing units, certification names, version numbers, API capabilities, or published limits where possible.
- Separate confirmed, inferred, and absent, using clear labels:
  - Confirmed: directly supported by cited public materials.
  - Not confirmed: plausible from adjacent evidence, but not explicitly documented.
  - Not found: no public evidence found after reasonable search.
- Draft follow-up questions: target the gap between vendor claims and the organization's requirements.
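The three evidence labels lend themselves to a small record type. The following is a sketch under stated assumptions: `Evidence`, `Finding`, and `needs_follow_up` are hypothetical names, the rule that a Confirmed finding must carry a source is one reading of "directly supported by cited public materials", and the example findings are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    CONFIRMED = "Confirmed"
    NOT_CONFIRMED = "Not confirmed"
    NOT_FOUND = "Not found"


@dataclass
class Finding:
    domain: str
    claim: str
    status: Evidence
    source: str = ""  # citation for the public material, if any

    def __post_init__(self):
        # ASSUMPTION: "Confirmed" is only valid with a cited source.
        if self.status is Evidence.CONFIRMED and not self.source:
            raise ValueError("a Confirmed finding needs a cited source")


def needs_follow_up(findings):
    """Return findings that should become vendor questions."""
    return [f for f in findings if f.status is not Evidence.CONFIRMED]


# Placeholder findings for illustration only.
findings = [
    Finding("Compliance", "Certification is listed", Evidence.CONFIRMED,
            source="vendor trust center page"),
    Finding("Multi-tenancy", "Per-tenant isolation", Evidence.NOT_CONFIRMED),
    Finding("Exit strategy", "Bulk export of audit logs", Evidence.NOT_FOUND),
]
open_items = needs_follow_up(findings)
```

Everything that is not Confirmed ends up in `open_items`, which is the natural input for drafting follow-up questions.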
Research discipline
- Distinguish claims from architecture. Vendor language like “enterprise-grade” or “secure by design” is not enough. Look for where policies are enforced, how identity flows, how audit logs are structured, and which controls are configurable.
- Note what is absent. Silence in documentation is itself a finding, especially for tenancy, data training, audit export, pricing, rollback, and exit strategy.
- Attribute specifics. Dates, versions, connector counts, pricing tiers, certification names, marketplace listings, and published limits make the assessment useful later.
- Prefer primary sources. Use vendor docs and trust-center materials first; use blogs, analyst notes, and third-party coverage as supporting context.
- Avoid overclaiming. Do not infer production readiness from demo videos, marketing screenshots, or isolated case studies.
When to Read the Reference Files
- references/evaluation-domains.md: Read at the start of Phase 2. Contains the full taxonomy of evaluation domains, what to research in each, and which domains matter most for each deployment model.
- references/question-bank.md: Read when drafting follow-up questions. Contains proven question patterns organized by domain, drawn from real enterprise AI evaluations.
Evaluation Domains
Use references/evaluation-domains.md as the source of truth for domain details. At minimum, consider these categories:
Tier 1: Always Evaluate
- Platform identity and market position.
- Development experience.
- Integration and connector ecosystem.
- Governance, policy enforcement, and identity.
- Vendor lock-in, portability, and exit strategy.
- Compliance, data residency, and data training.
Tier 2: Deployment-Dependent
- Multi-tenancy and data isolation.
- Pricing and commercial model.
- API-first architecture and product embedding.
Tier 3: Augmented Concerns
- Agent accuracy and hallucination controls.
- Rollback and change management.
- Latency and performance at scale.
- Company maturity and viability.
- Build vs. buy framing.
Phase 3: Synthesize and Deliver
1. Prioritize gaps
After completing domain research, rank unresolved questions by decision impact. The top items should be potential blockers if answered unfavorably. Lower-priority items may still matter, but should not distract from deal-shaping or architecture-shaping gaps.
Weight domains by deployment model using the priority guidance in references/evaluation-domains.md. For example:
- B2B SaaS embedding should heavily weight multi-tenancy, data isolation, OEM pricing, API control, white-labeling, audit evidence, and downstream customer assurances.
- Internal enterprise should heavily weight governance, identity, data integration, compliance, development experience, and portability.
- Consumer-facing products should heavily weight latency, accuracy, scale, rollback, safety, and cost behavior.
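Ranking by decision impact can be made explicit with a simple score. The sketch below is one possible scheme, not the skill's prescribed method: `Gap`, `rank_gaps`, the 1 to 5 impact scale, and the numeric weights are all assumptions, and real weights would come from the priority guidance in references/evaluation-domains.md, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    question: str
    domain: str
    impact: int  # ASSUMPTION: 1 (minor) to 5 (potential blocker)


# Illustrative weights for a B2B SaaS embedding evaluation.
B2B_WEIGHTS = {"multi-tenancy": 3, "data isolation": 3, "OEM pricing": 2}


def rank_gaps(gaps, weights, default_weight=1):
    """Sort gaps by impact times domain weight, highest first."""
    return sorted(
        gaps,
        key=lambda g: g.impact * weights.get(g.domain, default_weight),
        reverse=True,
    )


# Placeholder gaps for illustration only.
gaps = [
    Gap("Is tenant data logically or physically isolated?", "multi-tenancy", 3),
    Gap("What is the p95 latency under load?", "latency", 5),
    Gap("Is there an OEM or resale pricing tier?", "OEM pricing", 4),
]
ranked = rank_gaps(gaps, B2B_WEIGHTS)
```

With these weights the tenancy question (score 9) outranks the pricing question (8) and the latency question (5), even though latency has the highest raw impact. A different deployment model would use a different weight table.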
2. Build the evaluation artifact
Produce the requested output format. If the user does not specify a format, default to a structured markdown report that is easy to convert into a document.
Recommended structure:
Title: [Platform Name] Evaluation
Subtitle: [Purpose] | [Organization Name] | [Date]
1. Executive Summary
- Recommendation or current read
- Top strengths
- Top unresolved risks
- Decision the report supports
2. Platform Overview
- What the platform is
- Key components
- Market positioning
- Deployment model fit
3. Prepared Questions & Research
- One section per evaluated domain
- Each section includes: What We Found, Evidence, Questions for the Vendor
4. Strategic & Architectural Concerns
- Deeper analysis of domains that intersect with the organization's deployment model and data stack
5. Commercial and Operating Model Considerations
- Pricing, scale assumptions, support model, procurement concerns, rollout/rollback, ownership
6. Summary: Prioritized Gaps to Resolve
- Ranked list with impact and why it matters
7. Appendix
- Source list
- Assumptions
- Skipped domains and rationale
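Because the default output is a structured markdown report, the outline above can be generated as an empty skeleton and filled in during synthesis. This is a minimal sketch: `SECTIONS` and `render_skeleton` are hypothetical names, the bullet prompts are abbreviated from the outline, and the heading levels are an assumption about how the report would be laid out.

```python
from datetime import date

# Section titles and abbreviated prompts from the recommended structure.
SECTIONS = [
    ("Executive Summary", ["Recommendation or current read", "Top strengths",
                           "Top unresolved risks", "Decision the report supports"]),
    ("Platform Overview", ["What the platform is", "Key components",
                           "Market positioning", "Deployment model fit"]),
    ("Prepared Questions & Research", ["One section per evaluated domain"]),
    ("Strategic & Architectural Concerns", []),
    ("Commercial and Operating Model Considerations", []),
    ("Summary: Prioritized Gaps to Resolve", ["Ranked list with impact"]),
    ("Appendix", ["Source list", "Assumptions", "Skipped domains and rationale"]),
]


def render_skeleton(platform, purpose, organization, on=None):
    """Return an empty markdown report with numbered section headings."""
    on = on or date.today()
    lines = [
        f"# {platform} Evaluation",
        f"{purpose} | {organization} | {on.isoformat()}",
        "",
    ]
    for number, (title, prompts) in enumerate(SECTIONS, start=1):
        lines.append(f"## {number}. {title}")
        lines.extend(f"- {prompt}" for prompt in prompts)
        lines.append("")
    return "\n".join(lines)


# Placeholder names for illustration only.
skeleton = render_skeleton(
    "Example Platform", "Pilot scoping", "Example Org", on=date(2025, 1, 15)
)
```

Passing the date explicitly keeps the output reproducible; omitting it falls back to the current date.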
3. Voice and audience
Write as an internal report prepared by an evaluator for colleagues. Use first-person plural when referring to t
Content truncated.