conceptual-captions-a-cleaned-hypernymed-image-alt-text-dataset-for-automatic-image-captioning-arxiv

Name: conceptual-captions-a-cleaned-hypernymed-image-alt-text-dataset-for-automatic-image-captioning-arxiv
Author: feiyang-k

conceptual-captions-a-cleaned-hypernymed-image-alt-text-dataset-for-automatic-image-captioning-arxiv — an agent skill by feiyang-k.

Install

mkdir -p .claude/skills/conceptual-captions-a-cleaned-hypernymed-image-alt-text-dataset-for-automatic-im && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14378" && unzip -o skill.zip -d .claude/skills/conceptual-captions-a-cleaned-hypernymed-image-alt-text-dataset-for-automatic-im && rm skill.zip

Installs to .claude/skills/conceptual-captions-a-cleaned-hypernymed-image-alt-text-dataset-for-automatic-im

About this skill

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

One-line decision

Use this skill when you want to build an image captioning dataset from web alt-text using automated cleaning and hypernym replacement. Avoid it when you need fine-grained or domain-specific captions that web alt-text cannot provide.

Skill metadata

Skill type: alt-text-cleaning-pipeline
Paper kind: operational-method
Actionability: high
Evidence quality: full_paper

Goal

Construct a large-scale image captioning dataset (Conceptual Captions, CC3M) from web alt-text using automated pipelines for text cleaning, hypernym replacement, and image-text relevance filtering.

Problem signature

Modality: image-text pairs derived from web page alt-text with automated cleaning.
Data state: raw web alt-text cleaned through multi-stage filtering and text normalization.
Scale regime: 3.3 million image-caption pairs from billions of web page candidates.
Model requirement: No model training required for dataset construction; uses existing NLP and vision tools for filtering.

Use when

You want to build an image-caption dataset from web alt-text.
You need automated caption cleaning without manual annotation.
You want a general-purpose pretraining dataset at million-scale.

Do not use when

You need domain-specific or expert-level captions.
You need fine-grained spatial or attribute descriptions.
You require captions with precise named entities (hypernym replacement removes them).

Required inputs

web_crawl: Web pages with image alt-text attributes.
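nlp_pipeline: POS tagger, NER, and a hypernym source for text cleaning (for example WordNet; the paper itself relies on the Google Cloud Natural Language API and the Google Knowledge Graph Search API).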
image_text_filter: Classifier to measure image-text relevance.

Optional inputs

profanity_filter: Filter to remove offensive or inappropriate content.
deduplication: Near-duplicate image detection for deduplication.
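The optional deduplication input can be illustrated with a minimal sketch. It swaps in a simpler technique than the one named above: it removes only exact duplicates, by hashing the image bytes, and does not detect near-duplicates (which would need a perceptual hash or an embedding comparison). The function name `dedupe_exact` and the `(image_bytes, caption)` record layout are assumptions made for illustration.

```python
import hashlib


def dedupe_exact(records):
    """Keep the first record for each distinct image, compared by SHA-256 of its bytes.

    `records` is an iterable of (image_bytes, caption) tuples (hypothetical layout).
    This catches byte-identical images only, not resized or re-encoded copies.
    """
    seen = set()
    kept = []
    for image_bytes, caption in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen:
            continue  # same image content already kept
        seen.add(digest)
        kept.append((image_bytes, caption))
    return kept
```

Keeping the first occurrence is an arbitrary choice; a pipeline could instead keep the pair with the highest relevance score.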

Outputs

cc3m_dataset: 3.3M cleaned image-caption pairs.
cleaning_pipeline: Reusable multi-stage text cleaning pipeline.

Assumptions and prerequisites

Web alt-text, while noisy, contains useful image descriptions after cleaning.
Hypernym replacement (e.g., 'Barack Obama' → 'a person') improves generalization.
Multi-stage filtering produces captions suitable for pretraining.
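The hypernym replacement assumed above can be sketched as a lookup-and-substitute step. This is a toy illustration, not the paper's method: the `HYPERNYMS` table is a hypothetical stand-in for what an NER system plus a knowledge base would supply, and `hypernymize` is a name chosen here.

```python
import re

# Hypothetical toy lookup. A real pipeline would get entity spans from NER
# and their types from a knowledge base rather than a hand-written table.
HYPERNYMS = {
    "Barack Obama": "person",
    "Paris": "city",
}


def hypernymize(caption, hypernyms=HYPERNYMS):
    """Replace each known entity name in `caption` with its hypernym."""
    # Longer names first, so a multi-word name wins over any shorter name it contains.
    for name in sorted(hypernyms, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        caption = re.sub(pattern, hypernyms[name], caption)
    return caption
```

For example, `hypernymize("Barack Obama visits Paris")` returns `"person visits city"`. The result is not grammatical without an article, which is one reason a real pipeline needs syntactic information as well as entity types.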

Procedure

1. Extract alt-text from web pages. Action: Parse HTML to extract image and alt-text pairs from the web crawl. Why: Alt-text is the primary source of image descriptions on the web.
2. Filter by text quality. Action: Remove alt-text that is too short, too long, contains HTML or boilerplate, or is not English. Why: Low-quality alt-text adds noise to the dataset.
3. Apply hypernym replacement. Action: Replace named entities (persons, locations, brands) with hypernyms, using NER plus a hypernym source such as WordNet or a knowledge graph. Why: Improves generalization by reducing memorization of specific entities. Note: In the paper this transformation is applied after the image-text filter, using the Google Knowledge Graph rather than WordNet.
4. Filter by image-text relevance. Action: Score image-text relevance and remove misaligned pairs; the paper does this by requiring caption tokens to overlap with labels predicted by an image classifier. Why: Many alt-text strings describe the page context rather than the image.
5. Deduplicate and finalize. Action: Remove near-duplicate images and export the final dataset. Why: Deduplication prevents overfitting to repeated examples.
See the paper for details on each step.
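The first two steps of the procedure can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the names `AltTextExtractor`, `extract_alt_text`, and `passes_text_filter` are hypothetical, the `max_words=50` limit is a placeholder, and the ASCII check is only a crude proxy for English-language detection.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt-text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src, alt = attr.get("src"), attr.get("alt")
        if src and alt and alt.strip():
            self.pairs.append((src, alt.strip()))


def extract_alt_text(html):
    """Return all (image URL, alt-text) pairs found in an HTML string."""
    parser = AltTextExtractor()
    parser.feed(html)
    return parser.pairs


def passes_text_filter(caption, min_words=3, max_words=50):
    """Reject captions that are too short, too long, contain markup, or are non-ASCII."""
    words = caption.split()
    if not (min_words <= len(words) <= max_words):
        return False
    if "<" in caption or ">" in caption:
        return False  # leftover HTML markup
    # Crude stand-in for language identification (assumption, not the paper's method).
    return all(ord(ch) < 128 for ch in caption)
```

A production pipeline would replace the ASCII check with a real language identifier and add the part-of-speech and vocabulary checks the text filter calls for.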

Parameters to set

min_caption_length: Minimum number of words in a cleaned caption. How to set: 3-5 words. Default/range: ~3. Effect: Removes trivially short captions.
hypernym_level: How far up the hypernym hierarchy to move when replacing an entity (applies to a hierarchical source such as WordNet; the paper takes hypernym terms from the Google Knowledge Graph instead). How to set: One level up from the entity. Default/range: 1 level. Effect: More abstract hypernyms reduce specificity.
relevance_threshold: Minimum image-text relevance score for a pair to be kept. How to set: Tune on held-out samples. Default/range: Not specified. Effect: A stricter threshold reduces dataset size but improves quality.
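The three parameters can be grouped in a small config object and applied in a filter, as in the sketch below. Everything here is an assumption for illustration: `PipelineConfig`, `relevance_score`, and `keep_pair` are hypothetical names, the score is a simple token-label overlap ratio standing in for whatever relevance model is used, and the `0.1` threshold is a placeholder because no default is specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    min_caption_length: int = 3       # minimum words in a cleaned caption
    hypernym_level: int = 1           # levels to move up the hypernym hierarchy
    relevance_threshold: float = 0.1  # placeholder; tune on held-out samples


def relevance_score(caption, image_labels):
    """Fraction of caption tokens that also appear among the predicted image labels."""
    tokens = {t.lower().strip(".,!?") for t in caption.split()}
    tokens.discard("")
    if not tokens:
        return 0.0
    labels = {label.lower() for label in image_labels}
    return len(tokens & labels) / len(tokens)


def keep_pair(caption, image_labels, cfg):
    """Apply the length and relevance parameters to one image-caption pair."""
    if len(caption.split()) < cfg.min_caption_length:
        return False
    return relevance_score(caption, image_labels) >= cfg.relevance_threshold
```

Raising `relevance_threshold` trades dataset size for alignment quality, so it is worth checking the retained fraction at several values before fixing it.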

Validation checks

Cleaned captions should be grammatically correct and image-relevant.
The dataset should be large enough for effective pretraining (>1M pairs).
Hypernym replacement should improve generalization on downstream tasks.
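Some of these checks can be automated with simple summary statistics; grammaticality and image relevance still need manual or model-based review. The sketch below is illustrative only: `summarize_dataset` and the `caption_cleaned` record key are hypothetical names, and the thresholds are the ones suggested in this skill.

```python
def summarize_dataset(records, min_pairs=1_000_000, min_words=3):
    """Return size and caption-length checks for a list of cleaned records.

    Each record is a dict with a "caption_cleaned" string (hypothetical schema).
    """
    lengths = [len(r["caption_cleaned"].split()) for r in records]
    num_pairs = len(lengths)
    return {
        "num_pairs": num_pairs,
        "meets_size_target": num_pairs >= min_pairs,
        "num_too_short": sum(1 for n in lengths if n < min_words),
        "mean_words": sum(lengths) / num_pairs if num_pairs else 0.0,
    }
```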

Failure modes

Hypernym replacement may remove useful specificity from captions.
Text cleaning rules may be too aggressive and remove valid captions.
Image-text relevance filtering depends on the quality of the classifier.

Adaptation notes for VLM training

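The CC3M cleaning pipeline was extended by CC12M, which relaxes its filters to reach a larger scale.
For domain-specific applications, adapt or disable hypernym replacement so that entity names the domain depends on are preserved.
CC3M can serve as a pretraining alignment dataset for VLMs; LLaVA, for example, uses a filtered CC3M subset for feature alignment.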

Implementation notes

Use spaCy or a similar NLP toolkit for efficient NER and POS tagging.
Batch image-text relevance scoring for efficiency.
Store both original and cleaned captions for analysis.
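The batching and storage notes can be sketched as follows. The helper names `batched` and `to_jsonl_record`, and the field names in the JSON record, are assumptions chosen for illustration; the scoring function that would consume each batch is left out.

```python
import json
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items, for batch scoring."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_jsonl_record(image_url, original, cleaned):
    """Serialize one pair as a JSON line that keeps both caption versions."""
    return json.dumps(
        {
            "image_url": image_url,
            "caption_original": original,
            "caption_cleaned": cleaned,
        },
        ensure_ascii=False,
    )
```

Keeping the original caption beside the cleaned one makes it possible to audit what each cleaning stage changed without re-crawling.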

Evidence from the paper

Conceptual Captions (CC3M) provides 3.3M automatically cleaned image-caption pairs from web alt-text.
The hypernyming step replaces named entities with generic concepts, improving model generalization.
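The paper reports that captioning models trained on CC3M are competitive with models trained on manually annotated COCO captions, and that human raters prefer the CC3M-trained models on out-of-domain images.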
The automated pipeline enables dataset construction at scale without manual annotation.

Source paper

Title: Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
Year: 2018
Venue: ACL
Paper ID: arxiv-1809.00470v1
URL: http://arxiv.org/abs/1809.00470v1
arXiv ID: 1809.00470v1

Safety

No risk patterns found

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated: 1mo ago
License: Apache-2.0
Loads: ~1,427 tokens
