pydeseq2
Differential gene expression analysis for bulk RNA-seq with PyDESeq2, including formulaic designs, Wald tests, FDR correction, LFC shrinkage, and result visualization.
Install
mkdir -p .claude/skills/pydeseq2-boisenoise && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13445" && unzip -o skill.zip -d .claude/skills/pydeseq2-boisenoise && rm skill.zipInstalls to .claude/skills/pydeseq2-boisenoise
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Differential gene expression analysis for bulk RNA-seq with PyDESeq2, including formulaic designs, Wald tests, FDR correction, LFC shrinkage, and result visualization.About this skill
PyDESeq2
Overview
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including formulaic single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
When to Use This Skill
This skill should be used when:
- Analyzing bulk RNA-seq count data for differential expression
- Comparing gene expression between experimental conditions (e.g., treated vs control)
- Performing multi-factor designs accounting for batch effects or covariates
- Converting R-based DESeq2 workflows to Python
- Integrating differential expression analysis into Python-based pipelines
- Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"
Quick Start Workflow
For users who want to perform a standard differential expression analysis:
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
# 1. Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)
# 2. Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# 3. Make the reference level explicit and fit DESeq2
metadata["condition"] = pd.Categorical(
metadata["condition"], categories=["control", "treated"]
)
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True,
inference=inference,
)
dds.deseq2()
# 4. Perform statistical testing
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"],
inference=inference,
)
ds.summary()
# 5. Access results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
Core Workflow Steps
Step 1: Data Preparation
Input requirements:
- Count matrix: Samples × genes DataFrame with non-negative integer read counts
- Metadata: Samples × variables DataFrame with experimental factors
Common data loading patterns:
# From CSV (typical format: genes × samples, needs transpose)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
# From TSV
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
# From AnnData
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
Data filtering:
# Remove low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
Step 2: Design Specification
The design formula specifies how gene expression is modeled.
Single-factor designs:
design = "~condition" # Simple two-group comparison
Multi-factor designs:
design = "~batch + condition" # Control for batch effects
design = "~age + condition" # Include continuous covariate
design = "~group + condition + group:condition" # Interaction effects
Design formula guidelines:
- Use formulaic/Wilkinson formula notation (R-style)
- Put adjustment variables (e.g., batch) before the main variable of interest
- Ensure variables exist as columns in the metadata DataFrame
- Use appropriate data types; continuous variables are detected from the formula, and categorical variables can be forced with
C(variable)or a pandas categorical dtype - Do not use deprecated
design_factors,continuous_factors, orref_levelarguments in new workflows
Step 3: DESeq2 Fitting
Initialize the DeseqDataSet and run the complete pipeline:
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True, # Refit after removing outliers
inference=inference,
low_memory=False,
)
# Run the complete DESeq2 pipeline
dds.deseq2()
What deseq2() does:
- Computes size factors (normalization)
- Fits genewise dispersions
- Fits dispersion trend curve
- Computes dispersion priors
- Fits MAP dispersions (shrinkage)
- Fits log fold changes
- Calculates Cook's distances (outlier detection)
- Refits if outliers detected (optional)
Step 4: Statistical Testing
Perform Wald tests to identify differentially expressed genes:
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"], # Test treated vs control
alpha=0.05, # Significance threshold
cooks_filter=True, # Filter outliers
independent_filter=True # Filter low-power tests
)
ds.summary()
Contrast specification:
- Format:
[variable, test_level, reference_level] - Example:
["condition", "treated", "control"]tests treated vs control - Use a numeric contrast vector for continuous variables or complex coefficients
- Default contrasts are no longer supported in PyDESeq2 0.5.x; always provide
contrast
Result DataFrame columns:
baseMean: Mean normalized count across sampleslog2FoldChange: Log2 fold change between conditionslfcSE: Standard error of LFCstat: Wald test statisticpvalue: Raw p-valuepadj: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)
Step 5: Optional LFC Shrinkage
Apply shrinkage to reduce noise in fold change estimates:
ds.lfc_shrink(coeff="condition[T.treated]") # Applies apeGLM shrinkage
When to use LFC shrinkage:
- For visualization (volcano plots, heatmaps)
- For ranking genes by effect size
- When prioritizing genes for follow-up experiments
Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
Step 6: Result Export
Save results and intermediate objects:
# Export results as CSV
ds.results_df.to_csv("deseq2_results.csv")
# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
# Save a portable AnnData object for later inspection
dds.to_picklable_anndata().write_h5ad("dds_result.h5ad")
Avoid loading pickle files from untrusted sources. For exchange between agents, pipelines, or collaborators, prefer CSV results and .h5ad AnnData files.
Common Analysis Patterns
Two-Group Comparison
Standard case-control comparison:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]
Multiple Comparisons
Testing multiple treatment groups against control:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}
for treatment in treatments:
ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
ds.summary()
all_results[treatment] = ds.results_df
sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
print(f"{treatment}: {sig_count} significant genes")
Accounting for Batch Effects
Control for technical variation:
# Include batch in design
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()
# Test condition while controlling for batch
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Continuous Covariates
Include continuous variables like age or dosage:
# Ensure continuous variable is numeric
metadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Using the Analysis Script
This skill includes a complete command-line script for standard analyses:
# Basic usage
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~condition" \
--contrast condition treated control \
--output results/
# With additional options
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~batch + condition" \
--contrast condition treated control \
--output results/ \
--min-counts 10 \
--alpha 0.05 \
--n-cpus 4 \
--shrink-coeff "condition[T.treated]" \
--plots
Script features:
- Automatic data loading and validation
- Gene and sample filtering
- Complete DESeq2 pipeline execution
- Statistical testing with customizable parameters
- Result export (CSV and portable AnnData/H5AD)
- Explicit LFC shrinkage coefficient support for PyDESeq2 0.5.x
- Optional visualization (volcano and MA plots)
Refer users to scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.
Result Interpretation
Identifying Significant Genes
# Filter by adjusted p-value
significant = ds.results_df[ds.results_df.padj < 0.05]
# Filter by both significance and effect size
sig_and_large = ds.results_df[
(ds.results_df.padj < 0.05) &
(abs(ds.results_df.log2FoldChange) > 1)
]
# Separate up- and down-re
---
*Content truncated.*