web-scraper

Name: web-scraper
Author: Milind-Ranjan

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.

Install

mkdir -p .claude/skills/web-scraper-milind-ranjan && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14099" && unzip -o skill.zip -d .claude/skills/web-scraper-milind-ranjan && rm skill.zip

Installs to .claude/skills/web-scraper-milind-ranjan

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.

No explicit "when" trigger.

About this skill

Web Scraper

Overview

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.

When to Use This Skill

When the user mentions "scraper" or related topics
When the user mentions "scraping" or related topics
When the user mentions "extrair dados web" (extract web data) or related topics
When the user mentions "web scraping" or related topics
When the user mentions "raspar dados" (scrape data) or related topics
When the user mentions "coletar dados site" (collect site data) or related topics

Do Not Use This Skill When

The task is unrelated to web scraping
A simpler, more specific tool can handle the request
The user needs general-purpose assistance without domain expertise

How It Works

Execute phases in strict order. Each phase feeds the next.

1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
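The strict ordering can be sketched in Python as a loop over the phase names. This is a minimal illustration under stated assumptions, not the skill's implementation: `run_pipeline` and the handler mapping are hypothetical names, and only the phase names come from the sequence above.

```python
from typing import Callable, Dict

# Phase names come from the sequence above; everything else is hypothetical.
PHASES = ["CLARIFY", "RECON", "STRATEGY", "EXTRACT", "TRANSFORM", "VALIDATE", "FORMAT"]

Handler = Callable[[dict], dict]


def run_pipeline(handlers: Dict[str, Handler], state: dict) -> dict:
    """Run every phase in order, feeding each phase's output to the next."""
    missing = [name for name in PHASES if name not in handlers]
    if missing:
        # Refuse to run with a gap, so no phase is skipped silently.
        raise ValueError(f"missing handlers for phases: {missing}")
    for name in PHASES:
        state = handlers[name](state)
    return state
```

Each handler receives the state produced by the previous phase, which mirrors the rule that each phase feeds the next.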

Fast path: If the user provides a URL and a clear data target, and the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.

Capabilities

Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
Output formats: Markdown tables (default), JSON, CSV
Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
Multi-URL: extract same structure across sources with comparison and diff
Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
Auto-escalation: WebFetch fails silently -> automatic Browser fallback
Data transforms: cleaning, normalization, deduplication, enrichment
Differential mode: detect changes between scraping runs
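
As an illustration of differential mode, the sketch below compares two runs of extracted records. The source does not describe how changes are detected, so the approach here (matching records on a caller-chosen key field) is an assumption, and `diff_runs` is a hypothetical name.

```python
from typing import Dict, List


def diff_runs(previous: List[dict], current: List[dict], key: str) -> Dict[str, list]:
    """Compare two scraping runs, matching records on the `key` field (assumed unique)."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    return {
        # Present now, absent before.
        "added": [curr[k] for k in curr if k not in prev],
        # Present before, absent now.
        "removed": [prev[k] for k in prev if k not in curr],
        # Same key, different field values: (old, new) pairs.
        "changed": [(prev[k], curr[k]) for k in curr if k in prev and prev[k] != curr[k]],
    }
```

For a price list, `key` might be the product name or URL, so that a price change shows up under `changed` rather than as one removal plus one addition.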

Web Scraper

Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.

Phase 1: Clarify

Establish extraction parameters before touching any URL.

Required Parameters

Parameter	Resolve	Default
Target URL(s)	Which page(s) to scrape?	(required)
Data Target	What specific data to extract?	(required)
Output Format	Markdown table, JSON, CSV, or text?	Markdown table
Scope	Single page, paginated, or multi-URL?	Single page

Optional Parameters

Parameter	Resolve	Default
Pagination	Follow pagination? Max pages?	No, 1 page
Max Items	Maximum number of items to collect?	Unlimited
Filters	Data to exclude or include?	None
Sort Order	How to sort results?	Source order
Save Path	Save to file? Which path?	Display only
Language	Respond in which language?	User's lang
Diff Mode	Compare with previous run?	No
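
The two parameter tables can be captured as one record with defaults. The following is a sketch only: the class name `ScrapeRequest`, the field names, and the string values chosen for formats and scopes are hypothetical, while the defaults mirror the tables above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScrapeRequest:
    # Required parameters: no defaults.
    target_urls: List[str]
    data_target: str
    # Defaults below mirror the tables above.
    output_format: str = "markdown"    # "markdown", "json", "csv", or "text"
    scope: str = "single"              # "single", "paginated", or "multi-url"
    follow_pagination: bool = False
    max_pages: int = 1
    max_items: Optional[int] = None    # None means unlimited
    filters: List[str] = field(default_factory=list)
    sort_order: str = "source"         # keep the order found on the page
    save_path: Optional[str] = None    # None means display only
    language: Optional[str] = None     # None means respond in the user's language
    diff_mode: bool = False

    def __post_init__(self) -> None:
        # Both required parameters must be resolved before touching any URL.
        if not self.target_urls:
            raise ValueError("at least one target URL is required")
        if not self.data_target.strip():
            raise ValueError("a data target is required")
```

Constructing the request with only a URL and a data target yields the default single-page Markdown-table extraction.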

Clarification Rules

If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
Default to Markdown table output. Mention alternatives only if relevant.
Accept requests in any language. Always respond in the user's language.
If user says "everything" or "all data", perform recon first, then present what's available and let user choose.

Discovery Mode

When user has a topic but no specific URL:

Use WebSearch to find the most relevant pages
Present top 3-5 URLs with descriptions
Let user choose which to scrape, or scrape all
Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract

Phase 2: Reconnaissance

Analyze the target page before extraction.

Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)

Step 2.2: Evaluate Fetch Quality

Signal	Interpretation	Action
Rich content with data clearly visible	Static page	Strategy A (WebFetch)
Empty containers, "loading...", minimal text	JS-rendered	Strategy B (Browser)
Login wall, CAPTCHA, 403/401 response	Blocked	Report to user
Content present but poorly structured	Needs precision	Strategy B (Browser)
JSON or XML response body	API endpoint	Strategy C (Bash/curl)
Download links for CSV/Excel available	Direct data file	Strategy C (download)
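
Part of this table can be approximated with simple heuristics. The sketch below is an assumption-laden illustration, not the skill's logic: `evaluate_fetch`, the 200-character threshold, and the marker phrases are all hypothetical, and the CAPTCHA, login-wall, download-link, and "poorly structured" rows are not covered because they need judgment about page content.

```python
import re

# Hypothetical threshold: pages with less visible text are treated as JS-rendered.
MIN_VISIBLE_CHARS = 200
LOADING_MARKERS = re.compile(r"loading\.\.\.|enable javascript", re.IGNORECASE)
TAGS_AND_SCRIPTS = re.compile(
    r"<script.*?</script>|<style.*?</style>|<[^>]+>", re.DOTALL | re.IGNORECASE
)
API_TYPES = {"application/json", "application/xml", "text/xml"}


def evaluate_fetch(status: int, content_type: str, body: str) -> str:
    """Return "A", "B", "C", or "BLOCKED" for a fetched response."""
    if status in (401, 403):
        return "BLOCKED"  # report to the user
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in API_TYPES or media_type.endswith("+json"):
        return "C"  # JSON or XML body: treat as an API endpoint
    # Strip scripts, styles, and tags to estimate how much text is visible.
    visible = " ".join(TAGS_AND_SCRIPTS.sub(" ", body).split())
    if len(visible) < MIN_VISIBLE_CHARS or LOADING_MARKERS.search(visible):
        return "B"  # empty containers or loading text: likely JS-rendered
    return "A"  # rich static content
```

The return values correspond to the strategy letters in the Action column, with "BLOCKED" standing in for "Report to user".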

Step 2.3: Content Classification

Classify into an extraction mode:

Mode	Indicators	Examples
`table`	HTML `<table>`, grid layout with headers	Price comparison, statistics, specs
`list`	Repeated similar elements, card grids	Search results, product listings
`article`	Long-form text with headings/paragraphs	Blog post, news article, docs
`product`	Product name, price, specs, images, rating	E-commerce product page
`contact`	Names, emails, phones, addresses, roles	Team page, staff directory
`faq`	Question-answer pairs, accordions	FAQ page, help center
`pricing`	Plan names, prices, features, tiers	SaaS pricing page
`events`	Dates, locations, titles, descriptions	Event listings, conferences
`jobs`	Titles, companies, locations, salaries	Job boards, career pages
`custom`	User specified CSS selectors or fields	Anything not matching above

Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).

If user asked for "everything", present the available fields and let them choose.
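
Checking for embedded JSON-LD during recon can be done with the Python standard library. This is a minimal sketch, assuming the raw HTML is available as a string; `JsonLdParser` and `extract_json_ld` are hypothetical names, and blocks that fail to parse are skipped rather than repaired.

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.items: list = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of guessing at their content


def extract_json_ld(html: str) -> list:
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

A non-empty result means the "structured data present" item in the recon record is yes. Whether it contains the fields the user wants still has to be checked.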

Phase 3: Strategy Selection

Choose the extraction approach based on recon results.

Decision Tree

Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user

Strategy A: WebFetch with AI Extraction

Best for: Static pages, articles, simple tables, well-structured HTML.

Use WebFetch with a targeted extraction prompt tailored to the mode:

WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)

Auto-escalation: If WebFetch returns suspiciously few items (fewer than 50% of the count expected from recon) or mostly empty fields, automatically escalate to Strategy B without asking the user. Log the escalation in the notes.
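
The escalation check can be sketched as a small predicate. Assumptions: `should_escalate` is a hypothetical name, "mostly empty" is read as more than half of all field values being empty or 'N/A', and an extraction with no fields at all also escalates.

```python
from typing import List, Optional

EMPTY_VALUES = (None, "", "N/A")


def should_escalate(
    items: List[dict], expected_count: Optional[int], empty_limit: float = 0.5
) -> bool:
    """Return True when a Strategy A result looks too thin to trust."""
    # Fewer than 50% of the items counted during recon.
    if expected_count and len(items) < 0.5 * expected_count:
        return True
    total_fields = sum(len(item) for item in items)
    if total_fields == 0:
        return True  # nothing was extracted at all
    empty_fields = sum(1 for item in items for value in item.values() if value in EMPTY_VALUES)
    # "Mostly empty" is interpreted here as more than half of all values.
    return empty_fields / total_fields > empty_limit
```

When the predicate returns True, the caller would switch to Strategy B and record the escalation in the notes.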

Strategy B: Browser Automation

Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.

Sequence:

Get tab context: tabs_context_mcp(createIfEmpty=true) -> get tabId
Navigate to URL: navigate(url=TARGET_URL, tabId=TAB)
Wait for content to load: computer(action="wait", duration=3, tabId=TAB)
Check for cookie/consent banners: find(query="cookie consent or accept button", tabId=TAB)
- If

Content truncated.


Safety

Review before install

Runs shell / code
Network access

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated: 2mo ago
Loads: ~7,109 tokens


