bio-methylation-array-preprocessing

Name: bio-methylation-array-preprocessing
Author: bg-szy

Install

mkdir -p .claude/skills/bio-methylation-array-preprocessing && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15995" && unzip -o skill.zip -d .claude/skills/bio-methylation-array-preprocessing && rm skill.zip

Installs to .claude/skills/bio-methylation-array-preprocessing

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Turns raw Illumina Infinium methylation BeadChip IDATs (450K, EPIC, EPICv2) into a defensible beta/M matrix with sesame (openSesame/SigDF) or minfi (RGChannelSet -> MethylSet -> GenomicRatioSet). Covers Type I vs Type II probe chemistry and why raw Type II beta is compressed, the signal-to-beta math (beta = M/(M+U+100)) and M-value logit, detection-p / pOOBAH masking including the out-of-band deletion-artifact catch, dye-bias correction, and the normalization decision (noob, funnorm, quantile, SWAN, BMIQ, dasen, sesame QCDPB). Use when reading IDATs, choosing a normalization for a 450K/EPIC/EPICv2 cohort, deciding beta vs M, masking failed probes, or producing the corrected matrix before testing. For probe/sample filtering, EPICv2 replicate collapse, and sample-identity QC see array-qc-filtering; for native long-read 5mC see long-read-sequencing/nanopore-methylation (a different platform).

About this skill

Version Compatibility

Reference examples tested with: sesame 1.20+, minfi 1.48+, ChAMP 2.32+, wateRmelon 2.8+.

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters

If code fails with an R error such as "could not find function" or "unused argument", introspect the installed package and adapt the example to match the actual API rather than retrying.
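A minimal sketch of the version check in base R, assuming the tested minimums listed above. The helper name check_min_versions and the tested vector are illustrative, not part of any package.

```r
# Minimum versions the reference examples were tested with (from the list above).
tested <- c(sesame = "1.20", minfi = "1.48", ChAMP = "2.32", wateRmelon = "2.8")

# Hypothetical helper: report, per package, whether it is installed and
# whether the installed version meets the tested minimum.
check_min_versions <- function(minimums) {
  installed <- vapply(names(minimums), function(pkg) {
    # system.file() returns "" for a missing package and does not load it.
    if (nzchar(system.file(package = pkg))) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
  meets <- mapply(function(have, need) {
    !is.na(have) && numeric_version(have) >= numeric_version(need)
  }, installed, minimums)
  data.frame(package = names(minimums), installed = unname(installed),
             required = unname(minimums), ok = unname(meets),
             row.names = NULL, stringsAsFactors = FALSE)
}

print(check_min_versions(tested))
```

A row with ok = FALSE means the example code may not match the installed API; check ?function_name before adapting it.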

The ARRAY VERSION and GENOME BUILD are versions that matter as much as the package. EPICv2 REQUIRES sesame (mainstream minfi does not auto-detect it and returns "Unknown"); the manifest/annotation packages are array-version- and genome-build-specific (450K and EPICv1 are hg19; EPICv2 is hg38-native). sesame pulls platform/address data from its own hub, so sesameDataCache() must run once before processing. Record the array (450K/EPIC/EPICv2) and the genome build in any output, the way a sequencing run records its reference.
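One way to record the array and genome build alongside the matrix is to attach them as attributes, sketched below in base R. The helper tag_provenance, the toy matrix, and the probe names are hypothetical; the build lookup restates the mapping given above.

```r
# Genome build each array's annotation is native to (as stated above).
native_build <- c("450K" = "hg19", "EPIC" = "hg19", "EPICv2" = "hg38")

# Hypothetical helper: tag a beta matrix with its array version and genome build.
tag_provenance <- function(betas, array) {
  stopifnot(is.matrix(betas), array %in% names(native_build))
  attr(betas, "array") <- array
  attr(betas, "genome_build") <- unname(native_build[array])
  betas
}

# Toy matrix with made-up probe names, not real data.
toy <- matrix(c(0.1, 0.9, 0.5, 0.4), nrow = 2,
              dimnames = list(c("probe1", "probe2"), c("S1", "S2")))
tagged <- tag_provenance(toy, "EPICv2")
print(attributes(tagged)[c("array", "genome_build")])
```

Attributes are dropped by many subsetting operations, so a file header or sidecar metadata file is a sturdier place for the same two fields when the matrix is written to disk.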

Array Preprocessing

"Give me a clean methylation matrix from my IDATs" -> Read the raw two-channel intensities, correct the Type I/II design mismatch, dye/background bias, and failed probes, then emit beta (for reporting) and M (for testing) - because an Infinium beta is a two-chemistry fluorescence ratio, not a methylation value, until those corrections are applied.

R: openSesame(idat_dir, prep='QCDPB', func=getBetas) (sesame) or preprocessFunnorm(rgSet) (minfi)

Scope: IDAT -> corrected, masked, normalized beta/M matrix for one array version. Probe filtering (cross-reactive/SNP/sex), EPICv2 replicate collapse, and sample-identity QC -> array-qc-filtering. Per-CpG testing -> differential-cpg-testing. Region calling -> dmr-detection. Native long-read 5mC -> long-read-sequencing/nanopore-methylation. Bisulfite-sequencing (Bismark/WGBS/RRBS) is the other modality in this category, not this skill.

The Single Most Important Modern Insight -- An Infinium Beta Value Is a Two-Chemistry Fluorescence Ratio, Not a Methylation Measurement

An Infinium array does not measure methylation - it measures the relative fluorescence of a methylated vs unmethylated allele at a fixed, manufacturer-chosen set of CpGs, glued together from two incompatible chemistries. A raw beta becomes a comparable methylation estimate only after preprocessing; preprocessing IS the measurement, not optional cleanup. Three corollaries every misuse violates:

The manifest is the experiment. The array interrogates <3% of human CpGs, and a DIFFERENT <3% across 450K (~485K), EPIC (~865K), and EPICv2 (~935K). "Absent" almost always means "not on this array," and a 450K-trained clock or EWAS does not transfer to EPICv2 without intersecting probe sets. Do not start from a supplied beta matrix when IDATs exist - the raw two-channel intensities, control probes, and out-of-band signal that noob/funnorm/pOOBAH need are already gone.
Type I and Type II betas disagree by design. Type II probes (one bead, two dyes) have a narrower dynamic range and dye-incorporation bias, so raw Type II betas are compressed toward 0.5 relative to Type I (two beads, one channel). Mixing the two chemistries without a design correction (BMIQ/SWAN/sesame matchDesign) injects a probe-type artifact that can exceed the biological effect. The diagnostic is a per-type beta-density plot showing two mismatched peaks.
Raw beta is uninterpretable until detection-masked. Failed probes - low signal, germline/somatic deletions, cross-reactive, SNP-hit - return confident-looking betas that are pure noise. pOOBAH / detection-p masking is what separates a number from a measurement of nothing; pOOBAH additionally catches deletion-driven false-intermediate methylation that negative-control detection-p misses.
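The per-type density diagnostic from the second corollary can be sketched in base R. Everything below is synthetic: the bimodal "truth", the alternating probe-type labels, and the shrink factor of 0.85 are arbitrary assumptions chosen only to make the compression visible. On real data the betas and probe types would come from the array and its manifest.

```r
set.seed(1)
n <- 5000
# Synthetic "true" methylation: bimodal, mostly unmethylated or mostly methylated.
truth <- c(rbeta(n / 2, 2, 20), rbeta(n / 2, 20, 2))
probe_type <- rep(c("I", "II"), length.out = n)

# Toy model of Type II compression: shrink betas toward 0.5.
# The factor 0.85 is an arbitrary illustration, not a measured value.
raw_beta <- ifelse(probe_type == "II", 0.5 + 0.85 * (truth - 0.5), truth)

# Locate the unmethylated and methylated density peaks for one probe type.
peak_modes <- function(b) {
  lo <- density(b[b < 0.5])
  hi <- density(b[b >= 0.5])
  c(unmeth = lo$x[which.max(lo$y)], meth = hi$x[which.max(hi$y)])
}
modes <- sapply(split(raw_beta, probe_type), peak_modes)
print(round(modes, 2))

# Overlay the two densities (drawn only in an interactive session).
if (interactive()) {
  plot(density(raw_beta[probe_type == "I"]),
       main = "Beta density by probe type", xlab = "beta")
  lines(density(raw_beta[probe_type == "II"]), lty = 2)
  legend("top", legend = c("Type I", "Type II"), lty = c(1, 2))
}
```

Type II peaks sitting closer to 0.5 than the Type I peaks is the signature to look for; after a design correction the two sets of peaks should line up.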

Organize the work around DELIVERING a defensible matrix (read -> correct -> mask -> normalize), not around listing minfi functions.

Three Modalities of the Same Biology

DNA methylation is measured three ways, each with different tradeoffs - state which one the data is before choosing tools:

Modality	Readout	Coverage	Cohort-comparability	This skill
Infinium array (450K/EPIC/EPICv2)	intensity ratio, no depth	fixed <3% of CpGs, regulatory-enriched	high (shared manifest, no alignment)	YES
WGBS / RRBS bisulfite	count ratio, depth-gated	genome-wide (WGBS) or enriched (RRBS)	needs alignment + matched genome	-> bismark-alignment, methylkit-analysis
Long-read native (ONT/PacBio)	per-molecule modification calls	genome-wide, phased	growing	-> long-read-sequencing/nanopore-methylation

Arrays dominate human epigenetic epidemiology (essentially every published clock and large EWAS is array-based) because cost is a fraction of WGBS and the fixed manifest makes cohorts directly comparable.

Object Models (do not start from a beta matrix)

The raw output per sample is a pair of binary IDATs (_Grn.idat, _Red.idat); background, dye, and detection-p correction REQUIRE these plus the control probes.

minfi: RGChannelSet (raw red/green) -> a preprocess* step -> MethylSet (M/U intensities) -> RatioSet (beta/M) -> GenomicRatioSet (genome-mapped). read.metharray.exp() reads IDATs; getBeta(), getM(), getCN() extract values.
sesame: a SigDF (one signal data.frame per sample). readIDATpair() reads one sample; openSesame() drives the whole pipeline across a directory and returns a betas matrix directly.
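The minfi object chain above can be walked explicitly, as in the sketch below. It assumes minfi is installed and that the directory holds IDAT pairs plus an Illumina-style sample sheet; the function name idat_to_grset and the path are placeholders, and preprocessNoob stands in for whichever preprocess* step is chosen.

```r
# Sketch only: needs the minfi package plus a directory of IDAT pairs and a
# sample sheet. Defining the function does not require minfi to be loaded.
idat_to_grset <- function(idat_dir) {
  targets <- minfi::read.metharray.sheet(idat_dir)         # sample sheet -> data.frame
  rg_set  <- minfi::read.metharray.exp(targets = targets)  # RGChannelSet (raw red/green)
  m_set   <- minfi::preprocessNoob(rg_set)                 # MethylSet (M/U intensities)
  r_set   <- minfi::ratioConvert(m_set, what = "both")     # RatioSet (beta and M)
  minfi::mapToGenome(r_set)                                # GenomicRatioSet (genome-mapped)
}

# Usage, assuming IDATs and a sample sheet live in "idat_dir" (placeholder path):
# gr_set <- idat_to_grset("idat_dir")
# beta   <- minfi::getBeta(gr_set)
# m_vals <- minfi::getM(gr_set)
```

Some preprocess* functions (preprocessFunnorm, preprocessQuantile) return a GenomicRatioSet directly, in which case the ratioConvert and mapToGenome steps are not needed.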

Tool Taxonomy

Tool	Citation	Mechanism / role	When
sesame	Zhou 2018 Nucleic Acids Res 46:e123	SigDF; openSesame QCDPB; pOOBAH OOB masking; EPICv2-native	EPICv2; best detection masking; the modern default
minfi	Aryee 2014 Bioinformatics 30:1363	RGChannelSet->GenomicRatioSet; noob/funnorm/quantile/SWAN	450K/EPICv1; large downstream ecosystem (DMRcate, conumee)
ChAMP	Tian 2017 Bioinformatics 33:3982	end-to-end pipeline; BMIQ default	one-call newcomer pipeline on 450K/EPICv1
wateRmelon	Pidsley 2013 BMC Genomics 14:293	dasen/nasen + metric-driven normalization eval	dasen default; normalization benchmarking

Normalization Decision Tree by Scenario

Separate the two correction layers that get conflated: (a) background + dye bias (within-sample): noob, sesame dyeBias, dasen background step; (b) Type I/II design correction + between-array harmonization: SWAN, BMIQ, quantile, funnorm, dasen quantile step. A complete pipeline does both.

Scenario	Recommended	Why
EPICv2 (any design)	sesame openSesame(prep='QCDPB')	EPICv2-native; pOOBAH; minfi mis-handles duplicate IDs
Cancer / cross-tissue (global differences expected)	minfi preprocessFunnorm (noob + control-PCs)	preserves real global shifts; quantile would erase them
Subtle blood EWAS (no global difference expected)	preprocessQuantile or wateRmelon dasen	marginal distributions assumed equal; safe to harmonize
Strong Type I/II design correction wanted	BMIQ (Teschendorff 2013) or SWAN (Maksimovic 2012)	dilate Type II onto the Type I distribution; pair with a between-array step
Single-sample / clinical / streaming	ssNoob or per-IDAT openSesame	reproducible without re-normalizing the cohort
Probe/sample filtering, EPICv2 collapse, identity	-> array-qc-filtering	this skill stops at the corrected matrix
Per-CpG testing on the matrix	-> differential-cpg-testing	test on M-values; report delta-beta

There is no universally best normalization (Pidsley 2013 favored dasen; Fortin 2014 favored funnorm for global-difference studies; Welsh 2023 ranked a sesame/pOOBAH pipeline best and quantile worst on EPIC replicate-concordance). Key the choice on array version + whether global differences are expected + single-sample vs cohort, and verify against current benchmarks rather than hard-coding one method.
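The three keys named above (array version, whether global differences are expected, single-sample vs cohort) can be written down as a small lookup. This is a hypothetical helper that only restates the scenario table; it is not a benchmark result, and the returned strings are labels, not code to evaluate.

```r
# Hypothetical helper encoding the scenario table above.
choose_normalization <- function(array = c("450K", "EPIC", "EPICv2"),
                                 global_shift_expected = FALSE,
                                 single_sample = FALSE) {
  array <- match.arg(array)
  # EPICv2 goes to sesame regardless of study design.
  if (array == "EPICv2") return("sesame openSesame(prep = 'QCDPB')")
  # Single-sample work cannot rely on cohort-level normalization.
  if (single_sample) return("ssNoob or per-IDAT openSesame")
  # Expected global shifts rule out quantile-style harmonization.
  if (global_shift_expected) return("minfi preprocessFunnorm")
  "minfi preprocessQuantile or wateRmelon dasen"
}

choose_normalization("EPIC", global_shift_expected = TRUE)
choose_normalization("450K")
```

Keeping the rule in one function makes the choice explicit and easy to revise when a newer benchmark changes the recommendation.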

Signal -> Beta -> M

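Beta: beta = M_int / (M_int + U_int + alpha), where M_int is the methylated-allele intensity, U_int the unmethylated-allele intensity, and alpha = 100 (the Illumina convention; minfi's getBeta applies it with type = "Illumina" and otherwise defaults to offset = 0) stabilizes the ratio when both intensities are near zero. Beta lies in [0,1] and is interpretable but HETEROSCEDASTIC (variance collapses near 0 and 1), violating the constant-variance assumption of linear models.
M-value: M = log2((M_int + alpha) / (U_int + alpha)), with a small offset (Du 2010 use alpha = 1 by default, not the 100 used for beta). With the offsets ignored this is the base-2 logit of beta, M = log2(beta / (1 - beta)). Approximately homoscedastic; the correct scale for limma/t-tests (Du 2010 BMC Bioinformatics 11:587). Rule: test on M-values, report delta-beta for effect size - the same rule as bisulfite sequencing.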
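A worked example of the signal-to-beta and M-value arithmetic in base R. The intensities are toy numbers, and the helper names are illustrative. The beta offset of 100 follows the text above; the M-value offset of 1 is an assumption following the Du 2010 convention.

```r
# Beta from intensities, with a stabilizing offset in the denominator.
beta_from_signal <- function(meth, unmeth, offset = 100) {
  meth / (meth + unmeth + offset)
}
# M-value from intensities; offset = 1 is an assumed default (Du 2010 convention).
mvalue_from_signal <- function(meth, unmeth, offset = 1) {
  log2((meth + offset) / (unmeth + offset))
}
# Offset-free conversions between the two scales (base-2 logit and its inverse).
beta_to_m <- function(beta) log2(beta / (1 - beta))
m_to_beta <- function(m) 2^m / (2^m + 1)

meth   <- c(9000, 5000, 200)    # toy intensities, not real data
unmeth <- c(1000, 5000, 9000)
beta <- beta_from_signal(meth, unmeth)
mval <- mvalue_from_signal(meth, unmeth)
print(data.frame(meth, unmeth, beta = round(beta, 3), mval = round(mval, 3)))

# The same one-unit shift in M is a large delta-beta mid-range and a small one
# near the ceiling, which is why testing happens on M and reporting on beta.
mid_shift  <- m_to_beta(1) - m_to_beta(0)
edge_shift <- m_to_beta(6) - m_to_beta(5)
print(c(mid = mid_shift, edge = edge_shift))
```

Equal intensities give beta just under 0.5 (the offset pulls it down slightly) and an M-value of exactly 0.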

Process IDATs with sesame (the EPICv2-safe default)

Goal: Produce a corrected, detection-masked betas matrix from a directory of IDAT pairs without manually juggling manifest packages.

Approach: Cache the sesame data hub once, then run openSesame with the default QCDPB prep (qualityMask, inferInfiniumIChannel, dyeBiasNL, pOOBAH, noob, in that order), which auto-detects the platform and returns betas; pOOBAH writes NA into failed probes in place.

library(sesame)
sesameDataCache()                          # once per machine; pulls platform/address data
betas <- openSesame('idat_dir', prep = 'QCDPB', func = getBetas)
# prep codes: Q qualityMask  C inferInfiniumIChannel  D dyeBiasNL  P pOOBAH  B noob
# pOOBAH uses out-of-band signal as the empirical background and masks (sets NA) probes whose
# total signal is indistinguishable from it, catching deletion-driven false-intermediate
# methylation that negative-control detection-p misses
mvals <- log2(betas / (1 - betas))         # M-values for testing (logit of beta); NA stays NA, beta of exactly 0 or 1 gives -Inf/+Inf

For EPICv2, openSesame detects the platform automatically; the replicate-probe collapse (betasCollapseToPfx) belongs to the next stage and is documented in array-qc-filtering.
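Two checks are useful on the returned matrix before it moves on: how much of each sample was masked, and whether the logit will produce infinite M-values. The base R sketch below uses a toy matrix standing in for openSesame output; the probe names, the helper safe_logit2, and the clamp bound of 1e-3 are illustrative assumptions.

```r
# Toy masked beta matrix (NA = probe masked by detection filtering); not real data.
betas <- matrix(c(0.02, NA,   0.97, 1.00,
                  0.05, 0.48, NA,   0.00),
                nrow = 4,
                dimnames = list(paste0("probe", 1:4), c("S1", "S2")))

# Fraction of probes masked per sample; an unusually high value flags a weak sample.
masked_fraction <- colMeans(is.na(betas))
print(masked_fraction)

# Base-2 logit with clamping so betas of exactly 0 or 1 stay finite.
# eps = 1e-3 is an arbitrary illustrative bound; masked (NA) entries stay NA.
safe_logit2 <- function(b, eps = 1e-3) {
  b <- pmin(pmax(b, eps), 1 - eps)
  log2(b / (1 - b))
}
mvals <- safe_logit2(betas)
print(round(mvals, 2))
```

Clamping changes only the extreme values, so the choice of eps should be recorded with the matrix if downstream tests are sensitive to the tails.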

Process IDATs with minfi (450K / EPICv1)

Goal: Build a normalized GenomicRatioSet and extract beta and M, choosing the normalizati

Content truncated.
