pydeseq2

Name: pydeseq2
Author: boisenoise

Differential gene expression analysis for bulk RNA-seq with PyDESeq2, including formulaic designs, Wald tests, FDR correction, LFC shrinkage, and result visualization.

Install

mkdir -p .claude/skills/pydeseq2-boisenoise && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13445" && unzip -o skill.zip -d .claude/skills/pydeseq2-boisenoise && rm skill.zip

Installs to .claude/skills/pydeseq2-boisenoise

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Differential gene expression analysis for bulk RNA-seq with PyDESeq2, including formulaic designs, Wald tests, FDR correction, LFC shrinkage, and result visualization.

167 charsno explicit “when” trigger

About this skill

PyDESeq2

Overview

PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including formulaic single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.

When to Use This Skill

This skill should be used when:

Analyzing bulk RNA-seq count data for differential expression
Comparing gene expression between experimental conditions (e.g., treated vs control)
Performing multi-factor designs accounting for batch effects or covariates
Converting R-based DESeq2 workflows to Python
Integrating differential expression analysis into Python-based pipelines
Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"

Quick Start Workflow

For users who want to perform a standard differential expression analysis:

import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

# 1. Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T  # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)

# 2. Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

# 3. Make the reference level explicit and fit DESeq2
metadata["condition"] = pd.Categorical(
    metadata["condition"], categories=["control", "treated"]
)
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True,
    inference=inference,
)
dds.deseq2()

# 4. Perform statistical testing
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    inference=inference,
)
ds.summary()

# 5. Access results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")

Core Workflow Steps

Step 1: Data Preparation

Input requirements:

Count matrix: Samples × genes DataFrame with non-negative integer read counts
Metadata: Samples × variables DataFrame with experimental factors

Common data loading patterns:

# From CSV (typical format: genes × samples, needs transpose)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)

# From TSV
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T

# From AnnData
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs

Data filtering:

# Remove low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]

Step 2: Design Specification

The design formula specifies how gene expression is modeled.

Single-factor designs:

design = "~condition"  # Simple two-group comparison

Multi-factor designs:

design = "~batch + condition"  # Control for batch effects
design = "~age + condition"     # Include continuous covariate
design = "~group + condition + group:condition"  # Interaction effects

Design formula guidelines:

Use formulaic/Wilkinson formula notation (R-style)
Put adjustment variables (e.g., batch) before the main variable of interest
Ensure variables exist as columns in the metadata DataFrame
Use appropriate data types; continuous variables are detected from the formula, and categorical variables can be forced with C(variable) or a pandas categorical dtype
Do not use deprecated design_factors, continuous_factors, or ref_level arguments in new workflows

Step 3: DESeq2 Fitting

Initialize the DeseqDataSet and run the complete pipeline:

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference

inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True,  # Refit after removing outliers
    inference=inference,
    low_memory=False,
)

# Run the complete DESeq2 pipeline
dds.deseq2()

What deseq2() does:

Computes size factors (normalization)
Fits genewise dispersions
Fits dispersion trend curve
Computes dispersion priors
Fits MAP dispersions (shrinkage)
Fits log fold changes
Calculates Cook's distances (outlier detection)
Refits if outliers detected (optional)

Step 4: Statistical Testing

Perform Wald tests to identify differentially expressed genes:

from pydeseq2.ds import DeseqStats

ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],  # Test treated vs control
    alpha=0.05,                # Significance threshold
    cooks_filter=True,         # Filter outliers
    independent_filter=True    # Filter low-power tests
)

ds.summary()

Contrast specification:

Format: [variable, test_level, reference_level]
Example: ["condition", "treated", "control"] tests treated vs control
Use a numeric contrast vector for continuous variables or complex coefficients
Default contrasts are no longer supported in PyDESeq2 0.5.x; always provide contrast

Result DataFrame columns:

baseMean: Mean normalized count across samples
log2FoldChange: Log2 fold change between conditions
lfcSE: Standard error of LFC
stat: Wald test statistic
pvalue: Raw p-value
padj: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)

Step 5: Optional LFC Shrinkage

Apply shrinkage to reduce noise in fold change estimates:

ds.lfc_shrink(coeff="condition[T.treated]")  # Applies apeGLM shrinkage

When to use LFC shrinkage:

For visualization (volcano plots, heatmaps)
For ranking genes by effect size
When prioritizing genes for follow-up experiments

Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.

Step 6: Result Export

Save results and intermediate objects:

# Export results as CSV
ds.results_df.to_csv("deseq2_results.csv")

# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")

# Save a portable AnnData object for later inspection
dds.to_picklable_anndata().write_h5ad("dds_result.h5ad")

Avoid loading pickle files from untrusted sources. For exchange between agents, pipelines, or collaborators, prefer CSV results and .h5ad AnnData files.

Common Analysis Patterns

Two-Group Comparison

Standard case-control comparison:

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()

ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()

results = ds.results_df
significant = results[results.padj < 0.05]

Multiple Comparisons

Testing multiple treatment groups against control:

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()

treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}

for treatment in treatments:
    ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
    ds.summary()
    all_results[treatment] = ds.results_df

    sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
    print(f"{treatment}: {sig_count} significant genes")

Accounting for Batch Effects

Control for technical variation:

# Include batch in design
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()

# Test condition while controlling for batch
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()

Continuous Covariates

Include continuous variables like age or dosage:

# Ensure continuous variable is numeric
metadata["age"] = pd.to_numeric(metadata["age"])

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()

ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()

Using the Analysis Script

This skill includes a complete command-line script for standard analyses:

# Basic usage
python scripts/run_deseq2_analysis.py \
  --counts counts.csv \
  --metadata metadata.csv \
  --design "~condition" \
  --contrast condition treated control \
  --output results/

# With additional options
python scripts/run_deseq2_analysis.py \
  --counts counts.csv \
  --metadata metadata.csv \
  --design "~batch + condition" \
  --contrast condition treated control \
  --output results/ \
  --min-counts 10 \
  --alpha 0.05 \
  --n-cpus 4 \
  --shrink-coeff "condition[T.treated]" \
  --plots

Script features:

Automatic data loading and validation
Gene and sample filtering
Complete DESeq2 pipeline execution
Statistical testing with customizable parameters
Result export (CSV and portable AnnData/H5AD)
Explicit LFC shrinkage coefficient support for PyDESeq2 0.5.x
Optional visualization (volcano and MA plots)

Refer users to scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.

Result Interpretation

Identifying Significant Genes

# Filter by adjusted p-value
significant = ds.results_df[ds.results_df.padj < 0.05]

# Filter by both significance and effect size
sig_and_large = ds.results_df[
    (ds.results_df.padj < 0.05) &
    (abs(ds.results_df.log2FoldChange) > 1)
]

# Separate up- and down-re

---

*Content truncated.*

More by boisenoise

View all by boisenoise →

omero-integration

boisenoise

Microscopy data management platform. Access images via Python, retrieve datasets, analyze pixels, manage ROIs/annotations, batch processing, for high-content screening and microscopy workflows.

fillout-forms-automation

boisenoise

Automate Fillout tasks via Rube MCP (Composio). Always search tools first for current schemas.

codeinterpreter-automation

boisenoise

Automate Codeinterpreter tasks via Rube MCP (Composio). Always search tools first for current schemas.

social-metadata-hardening

boisenoise

Fix social sharing previews so URLs render as rich cards on Facebook, LinkedIn, X/Twitter, WhatsApp, Telegram, and more. Covers OG tags, Twitter cards, absolute image URLs, and debugging.

Sharing Skills

boisenoise

Contribute skills back to upstream via branch and PR

Install

mkdir -p .claude/skills/pydeseq2-boisenoise && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13445" && unzip -o skill.zip -d .claude/skills/pydeseq2-boisenoise && rm skill.zip

Installs to .claude/skills/pydeseq2-boisenoise

Safety

Review before install

Runs shell / code
Bundles scripts

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

18d ago

License

MIT license

Repo stars

Loads

~4,252 tokens

Stars are for the whole repository, not this skill alone.

Stats

Views

Installs

Author

boisenoise

6 skills published

Links

Source code

pydeseq2

Install

Activation

About this skill

PyDESeq2

Overview

When to Use This Skill

Quick Start Workflow

Core Workflow Steps

Step 1: Data Preparation

Step 2: Design Specification

Step 3: DESeq2 Fitting

Step 4: Statistical Testing

Step 5: Optional LFC Shrinkage

Step 6: Result Export

Common Analysis Patterns

Two-Group Comparison

Multiple Comparisons

Accounting for Batch Effects

Continuous Covariates

Using the Analysis Script

Result Interpretation

Identifying Significant Genes

More by boisenoise

omero-integration

fillout-forms-automation

codeinterpreter-automation

social-metadata-hardening

Sharing Skills

Search skills