web-scraper
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.
Install
mkdir -p .claude/skills/web-scraper-kevanpatira && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14096" && unzip -o skill.zip -d .claude/skills/web-scraper-kevanpatira && rm skill.zip
Installs to .claude/skills/web-scraper-kevanpatira
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.
About this skill
Web Scraper
Overview
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.
When to Use This Skill
- When the user mentions "scraper" or related topics
- When the user mentions "scraping" or related topics
- When the user mentions "extrair dados web" (Portuguese for "extract web data") or related topics
- When the user mentions "web scraping" or related topics
- When the user mentions "raspar dados" (Portuguese for "scrape data") or related topics
- When the user mentions "coletar dados site" (Portuguese for "collect site data") or related topics
Do Not Use This Skill When
- The task is unrelated to web scraping
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise
How It Works
Execute phases in strict order. Each phase feeds the next.
1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT
Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.
Capabilities
- Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
- Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
- Output formats: Markdown tables (default), JSON, CSV
- Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
- Multi-URL: extract same structure across sources with comparison and diff
- Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
- Auto-escalation: WebFetch fails silently -> automatic Browser fallback
- Data transforms: cleaning, normalization, deduplication, enrichment
- Differential mode: detect changes between scraping runs
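As an illustration of the differential mode listed above, the sketch below compares two scraping runs and reports added, removed, and changed rows. It assumes each row is a dict and that one field uniquely identifies an item; the function name `diff_runs` and the `key` parameter are hypothetical and not part of the skill.

```python
def diff_runs(previous: list[dict], current: list[dict], key: str) -> dict:
    """Compare two runs, matching rows on the (assumed unique) `key` field."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    return {
        # Present now, absent in the previous run
        "added": [curr[k] for k in curr if k not in prev],
        # Present before, absent now
        "removed": [prev[k] for k in prev if k not in curr],
        # Same key, different field values: (old_row, new_row) pairs
        "changed": [(prev[k], curr[k]) for k in curr if k in prev and prev[k] != curr[k]],
    }


if __name__ == "__main__":
    old_run = [{"sku": "A", "price": 10}, {"sku": "B", "price": 20}]
    new_run = [{"sku": "A", "price": 12}, {"sku": "C", "price": 5}]
    print(diff_runs(old_run, new_run, key="sku"))
```

Matching on a key rather than on row position keeps the comparison stable when the source page reorders its items between runs.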
Web Scraper
Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.
Phase 1: Clarify
Establish extraction parameters before touching any URL.
Required Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |
Optional Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
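The two parameter tables can be pictured as a single record with defaults. The following is a minimal sketch only: the class name `ExtractionRequest`, its field names, and the `missing_required` helper are hypothetical and chosen for illustration, while the default values mirror the tables above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionRequest:
    """Hypothetical container for the Phase 1 parameters and their defaults."""
    target_urls: list[str]                # required
    data_target: str                      # required
    output_format: str = "markdown"       # Markdown table by default
    scope: str = "single"                 # single page, paginated, or multi-URL
    follow_pagination: bool = False       # default: no, 1 page
    max_pages: int = 1
    max_items: Optional[int] = None       # None means unlimited
    filters: Optional[str] = None         # None means no filters
    sort_order: str = "source"            # keep source order
    save_path: Optional[str] = None       # None means display only
    language: Optional[str] = None        # None means use the user's language
    diff_mode: bool = False

    def missing_required(self) -> list[str]:
        """Return the names of required parameters that are still unresolved."""
        missing = []
        if not self.target_urls:
            missing.append("target_urls")
        if not self.data_target.strip():
            missing.append("data_target")
        return missing


if __name__ == "__main__":
    request = ExtractionRequest(target_urls=["https://example.com/pricing"], data_target="")
    print(request.missing_required())
```

An empty result from `missing_required` corresponds to the case where the user gave a URL and a clear data target, so no clarifying question is needed.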
Clarification Rules
- If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
- If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
- Default to Markdown table output. Mention alternatives only if relevant.
- Accept requests in any language. Always respond in the user's language.
- If user says "everything" or "all data", perform recon first, then present what's available and let user choose.
Discovery Mode
When user has a topic but no specific URL:
- Use WebSearch to find the most relevant pages
- Present top 3-5 URLs with descriptions
- Let user choose which to scrape, or scrape all
- Proceed to Phase 2 with selected URL(s)
Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract
Phase 2: Reconnaissance
Analyze the target page before extraction.
Step 2.1: Initial Fetch
Use WebFetch to retrieve and analyze the page structure:
WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)
Step 2.2: Evaluate Fetch Quality
| Signal | Interpretation | Action |
|---|---|---|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
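Part of the table above can be approximated with simple checks on a raw HTTP response. This is a rough sketch under stated assumptions: the function name `choose_action`, the 200-character threshold, and the keyword checks are illustrative heuristics, not rules defined by the skill. The "poorly structured" and "download links" rows need inspection of the page content and are not covered here.

```python
MIN_TEXT_CHARS = 200  # assumed cutoff for "minimal text"; tune per site


def choose_action(status_code: int, content_type: str, body: str) -> str:
    """Map a fetch result to an action from the fetch-quality table (heuristic)."""
    text = body.strip().lower()
    content_type = content_type.lower()

    # Login wall, CAPTCHA, or 401/403: blocked, so report to the user
    if status_code in (401, 403) or "captcha" in text:
        return "report_to_user"

    # JSON or XML response body: treat as an API endpoint (Strategy C)
    is_html = "html" in content_type
    if not is_html and ("json" in content_type or "xml" in content_type):
        return "strategy_c"

    # Near-empty body or a loading placeholder: likely JS-rendered (Strategy B)
    if len(text) < MIN_TEXT_CHARS or "loading..." in text:
        return "strategy_b"

    # Otherwise assume a static page with visible data (Strategy A)
    return "strategy_a"


if __name__ == "__main__":
    print(choose_action(200, "text/html", "<div id='root'></div>"))
```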
Step 2.3: Content Classification
Classify into an extraction mode:
| Mode | Indicators | Examples |
|---|---|---|
| table | HTML `<table>`, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User-specified CSS selectors or fields | Anything not matching above |
Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).
If user asked for "everything", present the available fields and let them choose.
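Recon also records whether embedded structured data such as JSON-LD is present. The sketch below shows one way to collect JSON-LD blocks from raw HTML using only the Python standard library. The class name `JsonLdParser` is hypothetical, and the skill itself performs recon through WebFetch rather than through code like this.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole recon


if __name__ == "__main__":
    html = (
        '<html><head><script type="application/ld+json">'
        '{"@type": "Product", "name": "Widget"}'
        "</script></head><body><script>var x = 1;</script></body></html>"
    )
    parser = JsonLdParser()
    parser.feed(html)
    print(parser.blocks)
```

If the collected blocks already contain the requested fields, the page is a candidate for direct structured-data extraction rather than HTML parsing.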
Phase 3: Strategy Selection
Choose the extraction approach based on recon results.
Decision Tree
Structured data (JSON-LD, microdata) has what we need?
|
+-- YES --> STRATEGY E: Extract structured data directly
|
+-- NO: Content fully visible in WebFetch?
    |
    +-- YES: Need precise element targeting?
    |   |
    |   +-- NO  --> STRATEGY A: WebFetch + AI extraction
    |   +-- YES --> STRATEGY B: Browser automation
    |
    +-- NO: JavaScript rendering detected?
        |
        +-- YES --> STRATEGY B: Browser automation
        +-- NO: API/JSON/XML endpoint or download link?
            |
            +-- YES --> STRATEGY C: Bash (curl + jq)
            +-- NO  --> Report access issue to user
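The decision tree above can be written as a small function. In this sketch the `ReconResult` fields and the name `select_strategy` are hypothetical; they stand in for the facts recorded at the end of Phase 2, and the returned letters match the strategy labels in the tree.

```python
from dataclasses import dataclass


@dataclass
class ReconResult:
    """Hypothetical summary of the Phase 2 findings."""
    structured_data_sufficient: bool  # JSON-LD or microdata has what we need
    content_visible: bool             # content fully visible in WebFetch
    needs_precise_targeting: bool     # precise element targeting required
    js_rendering: bool                # JavaScript rendering detected
    api_or_download: bool             # API/JSON/XML endpoint or download link


def select_strategy(recon: ReconResult) -> str:
    """Walk the decision tree top to bottom and return a strategy label."""
    if recon.structured_data_sufficient:
        return "E"  # extract structured data directly
    if recon.content_visible:
        # Browser automation only when precise targeting is needed
        return "B" if recon.needs_precise_targeting else "A"
    if recon.js_rendering:
        return "B"
    if recon.api_or_download:
        return "C"  # Bash (curl + jq)
    return "report"  # report the access issue to the user


if __name__ == "__main__":
    static_page = ReconResult(False, True, False, False, False)
    print(select_strategy(static_page))
```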
Strategy A: WebFetch with AI Extraction
Best for: Static pages, articles, simple tables, well-structured HTML.
Use WebFetch with a targeted extraction prompt tailored to the mode:
WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)
Auto-escalation: If WebFetch returns suspiciously few items (fewer than 50% of the count expected from recon) or mostly empty fields, automatically escalate to Strategy B without asking the user. Log the escalation in the notes.
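The auto-escalation rule can be sketched as a simple check. The 50% item threshold comes from the rule itself; reading "mostly empty" as more than half of all field values being blank or 'N/A' is an assumption, as are the function name `should_escalate` and its parameters.

```python
EMPTY_VALUES = (None, "", "N/A")


def should_escalate(items: list[dict], expected_count: int,
                    empty_ratio_threshold: float = 0.5) -> bool:
    """Return True when a Strategy A result looks too thin to trust."""
    # Fewer than 50% of the items that recon led us to expect
    if expected_count > 0 and len(items) < 0.5 * expected_count:
        return True

    total_fields = sum(len(item) for item in items)
    if total_fields == 0:
        return True  # nothing usable was extracted at all

    # "Mostly empty fields": assumed to mean more than half are blank or 'N/A'
    empty_fields = sum(
        1 for item in items for value in item.values() if value in EMPTY_VALUES
    )
    return empty_fields / total_fields > empty_ratio_threshold


if __name__ == "__main__":
    rows = [{"name": "Plan A", "price": "N/A"}, {"name": "", "price": "N/A"}]
    print(should_escalate(rows, expected_count=2))
```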
Strategy B: Browser Automation
Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.
Sequence:
- Get tab context: tabs_context_mcp(createIfEmpty=true) -> get tabId
- Navigate to URL: navigate(url=TARGET_URL, tabId=TAB)
- Wait for content to load: computer(action="wait", duration=3, tabId=TAB)
- Check for cookie/consent banners: find(query="cookie consent or accept button", tabId=TAB)
- If
Content truncated.