agentskills.codes
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.

Install

mkdir -p .claude/skills/web-scraper-milind-ranjan && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14099" && unzip -o skill.zip -d .claude/skills/web-scraper-milind-ranjan && rm skill.zip

Installs to .claude/skills/web-scraper-milind-ranjan

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.

About this skill

Web Scraper

Overview

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.

When to Use This Skill

  • When the user mentions "scraper" or related topics
  • When the user mentions "scraping" or related topics
  • When the user mentions "extrair dados web" (Portuguese: "extract web data") or related topics
  • When the user mentions "web scraping" or related topics
  • When the user mentions "raspar dados" (Portuguese: "scrape data") or related topics
  • When the user mentions "coletar dados site" (Portuguese: "collect site data") or related topics

Do Not Use This Skill When

  • The task is unrelated to web scraping
  • A simpler, more specific tool can handle the request
  • The user needs general-purpose assistance without domain expertise

How It Works

Execute phases in strict order. Each phase feeds the next.

1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.

Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.
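
The fast-path condition above can be sketched as a small predicate. This is a minimal illustration, not part of the skill itself: the `ScrapeRequest` class, its field names, and `is_fast_path` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapeRequest:
    """Hypothetical container for what the user supplied."""
    url: Optional[str] = None
    data_target: Optional[str] = None
    scope: str = "single page"   # or "paginated", "multi-URL"
    data_types: int = 1          # number of distinct data types requested


def is_fast_path(req: ScrapeRequest) -> bool:
    """Return True when Phases 1-3 may be compressed into one WebFetch call."""
    has_inputs = bool(req.url) and bool(req.data_target)
    is_simple = req.scope == "single page" and req.data_types == 1
    return has_inputs and is_simple
```

Even when this returns True, the validate and format phases still run.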


Capabilities

  • Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
  • Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
  • Output formats: Markdown tables (default), JSON, CSV
  • Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
  • Multi-URL: extract same structure across sources with comparison and diff
  • Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
  • Auto-escalation: WebFetch fails silently -> automatic Browser fallback
  • Data transforms: cleaning, normalization, deduplication, enrichment
  • Differential mode: detect changes between scraping runs
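
As an illustration of differential mode, the sketch below compares two scraping runs. It assumes each run is a list of dictionaries and that one field (here called `key`) uniquely identifies an item; both the function name `diff_runs` and that data shape are assumptions made for this example, not something the skill specifies.

```python
def diff_runs(previous, current, key):
    """Compare two runs (lists of dicts) by a unique key field."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    return {
        # Items present only in the new run
        "added": [curr[k] for k in curr if k not in prev],
        # Items present only in the old run
        "removed": [prev[k] for k in prev if k not in curr],
        # Items present in both runs whose fields differ: (old, new) pairs
        "changed": [(prev[k], curr[k]) for k in curr
                    if k in prev and prev[k] != curr[k]],
    }


old_run = [{"plan": "Basic", "price": "$10"}, {"plan": "Pro", "price": "$30"}]
new_run = [{"plan": "Pro", "price": "$35"}, {"plan": "Team", "price": "$80"}]
changes = diff_runs(old_run, new_run, key="plan")
```

Here `changes["added"]` holds the Team plan, `changes["removed"]` the Basic plan, and `changes["changed"]` the old and new Pro rows.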

Web Scraper

Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.

Phase 1: Clarify

Establish extraction parameters before touching any URL.

Required Parameters

| Parameter | Resolve | Default |
|---|---|---|
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |

Optional Parameters

| Parameter | Resolve | Default |
|---|---|---|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
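
The required and optional parameters, with their defaults, could be captured in one structure. The sketch below is only one possible encoding: the class name `ExtractionParams`, the field names, and the choice of `None` to mean "unlimited", "display only", or "user's language" are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractionParams:
    # Required parameters (no defaults)
    urls: List[str]
    data_target: str
    # Parameters with the documented defaults
    output_format: str = "markdown table"
    scope: str = "single page"
    follow_pagination: bool = False
    max_pages: int = 1
    max_items: Optional[int] = None   # None = unlimited
    filters: Optional[str] = None     # None = no filters
    sort_order: str = "source order"
    save_path: Optional[str] = None   # None = display only
    language: Optional[str] = None    # None = respond in the user's language
    diff_mode: bool = False


params = ExtractionParams(urls=["https://example.com/pricing"],
                          data_target="plan names and prices")
```

Only the two required values need to be supplied; everything else falls back to the defaults listed in the tables.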

Clarification Rules

  • If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
  • If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
  • Default to Markdown table output. Mention alternatives only if relevant.
  • Accept requests in any language. Always respond in the user's language.
  • If user says "everything" or "all data", perform recon first, then present what's available and let user choose.

Discovery Mode

When user has a topic but no specific URL:

  1. Use WebSearch to find the most relevant pages
  2. Present top 3-5 URLs with descriptions
  3. Let user choose which to scrape, or scrape all
  4. Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract


Phase 2: Reconnaissance

Analyze the target page before extraction.

Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)
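
Item 8 of the prompt asks about embedded structured data. For cases where the raw HTML is available, one way to check for JSON-LD is sketched below using only the Python standard library. This is an illustrative assumption about how such a check might look, not the skill's own mechanism; `JsonLdParser` and `extract_json_ld` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD blocks instead of failing


def extract_json_ld(html):
    """Return a list of decoded JSON-LD objects found in the HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    return parser.blocks


sample = ('<html><head><script type="application/ld+json">'
          '{"@type": "Product", "name": "Widget"}</script></head></html>')
found = extract_json_ld(sample)
```

If `found` already contains the fields the user asked for, the structured-data route in Phase 3 becomes an option.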

Step 2.2: Evaluate Fetch Quality

| Signal | Interpretation | Action |
|---|---|---|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
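
Part of this table can be approximated mechanically. The sketch below maps a few observable signals to the actions listed above. It rests on several assumptions that the skill does not state: the 500-character threshold for "minimal text", the order in which the checks run, and the function and parameter names are all hypothetical. The "poorly structured" row needs human or model judgment and is not covered.

```python
API_MEDIA_TYPES = {"application/json", "application/xml", "text/xml"}


def evaluate_fetch(status, content_type, visible_text, download_links=()):
    """Map fetch signals to an action from the table (heuristic sketch)."""
    media_type = content_type.split(";")[0].strip().lower()
    text = visible_text.strip().lower()

    if status in (401, 403) or "captcha" in text:
        return "Report to user"
    if media_type in API_MEDIA_TYPES or media_type.endswith("+json"):
        return "Strategy C (Bash/curl)"
    if download_links:
        return "Strategy C (download)"
    # Assumed threshold: under 500 characters of visible text counts as "minimal"
    if "loading..." in text or len(text) < 500:
        return "Strategy B (Browser)"
    return "Strategy A (WebFetch)"
```

For example, `evaluate_fetch(200, "text/html", "Loading...")` selects the browser strategy, while a 403 response is reported to the user regardless of its body.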

Step 2.3: Content Classification

Classify into an extraction mode:

| Mode | Indicators | Examples |
|---|---|---|
| table | HTML <table>, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User-specified CSS selectors or fields | Anything not matching above |

Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).

If user asked for "everything", present the available fields and let them choose.


Phase 3: Strategy Selection

Choose the extraction approach based on recon results.

Decision Tree

Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user
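
The decision tree translates directly into nested conditionals. In the sketch below, the function name, the keyword argument names, and the short return codes are illustrative choices; the branching order follows the tree as written.

```python
def select_strategy(*, structured_data_sufficient, content_visible,
                    needs_precise_targeting, js_rendering_detected,
                    api_or_download_available):
    """Return "A", "B", "C", "E", or "REPORT" following the decision tree."""
    if structured_data_sufficient:
        return "E"  # Extract JSON-LD / microdata directly
    if content_visible:
        # Content is fully visible in WebFetch; choose by required precision
        return "B" if needs_precise_targeting else "A"
    if js_rendering_detected:
        return "B"  # Browser automation
    if api_or_download_available:
        return "C"  # Bash (curl + jq)
    return "REPORT"  # Report access issue to user


choice = select_strategy(structured_data_sufficient=False,
                         content_visible=False,
                         needs_precise_targeting=False,
                         js_rendering_detected=True,
                         api_or_download_available=False)
```

Keyword-only arguments keep call sites readable, since five positional booleans are easy to mix up.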

Strategy A: WebFetch with AI Extraction

Best for: Static pages, articles, simple tables, well-structured HTML.

Use WebFetch with a targeted extraction prompt tailored to the mode:

WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)

Auto-escalation: If WebFetch returns suspiciously few items (fewer than 50% of the count expected from recon) or mostly empty fields, automatically escalate to Strategy B without asking the user. Log the escalation in the notes.
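
The escalation rule can be expressed as a simple check. The 50% item threshold comes from the text above; treating "mostly empty" as "more than half of all field values are missing, blank, or 'N/A'" is an assumption made for this sketch, as are the names `should_escalate` and `EMPTY_VALUES`.

```python
EMPTY_VALUES = (None, "", "N/A")


def should_escalate(items, expected_count, fields):
    """Return True when a Strategy A result looks too thin to trust."""
    # Rule 1: fewer than 50% of the items expected from recon
    if expected_count > 0 and len(items) < 0.5 * expected_count:
        return True
    if not items or not fields:
        return False
    # Rule 2 (assumed definition): more than half of all field values are empty
    empty = sum(1 for item in items for name in fields
                if item.get(name) in EMPTY_VALUES)
    return empty > 0.5 * len(items) * len(fields)


full = [{"name": "Item %d" % i, "price": "$%d" % i} for i in range(10)]
sparse = [{"name": "", "price": "N/A"} for _ in range(10)]
```

With an expected count of 10, `full` passes, `sparse` triggers escalation because its fields are empty, and `full[:3]` triggers it because too few items came back.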

Strategy B: Browser Automation

Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.

Sequence:

  1. Get tab context: tabs_context_mcp(createIfEmpty=true) -> get tabId
  2. Navigate to URL: navigate(url=TARGET_URL, tabId=TAB)
  3. Wait for content to load: computer(action="wait", duration=3, tabId=TAB)
  4. Check for cookie/consent banners: find(query="cookie consent or accept button", tabId=TAB)
    • If

Content truncated.
