esm2-sequence-scoring

Name: esm2-sequence-scoring
Author: BioTender-max

Install

mkdir -p .claude/skills/esm2-sequence-scoring && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16391" && unzip -o skill.zip -d .claude/skills/esm2-sequence-scoring && rm skill.zip

Installs to .claude/skills/esm2-sequence-scoring

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

ESM2 protein language model for sequence scoring, embeddings, and plausibility checks. Use this skill when: (1) Computing pseudo-log-likelihood (PLL) scores, (2) Getting protein embeddings for clustering, (3) Filtering designs by sequence plausibility, (4) Zero-shot variant effect prediction, (5) Analyzing sequence-function relationships. For structure prediction, use chai1-structure-prediction or boltz-structure-prediction. For QC thresholds, use protein-design-qc.

470 chars✓ has a “when” triggerlonger than Claude Code's old 250-char listing cap (fine on current versions)

About this skill

ESM2 Protein Language Model

Plain-language role: Use ESM when you want sequence-level scoring or embeddings rather than 3D structure prediction.

Prerequisites

Requirement	Minimum	Recommended
Python	3.8+	3.10
PyTorch	1.10+	2.0+
CUDA	11.0+	11.7+
GPU VRAM	8GB	24GB (A10G)
RAM	16GB	32GB

How to run

First time? See Installation Guide to set up Modal and biomodals.

Option 1: Modal

cd biomodals
modal run modal_esm2_predict_masked.py \
  --input-faa sequences.fasta \
  --out-dir embeddings/

GPU: A10G (24GB) | Timeout: 300s default

Option 2: Python API (recommended)

import torch
import esm2-sequence-scoring

# Load model
model, alphabet = esm2-sequence-scoring.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model = model.eval().cuda()

# Process sequences
data = [("seq1", "MKTAYIAKQRQISFVK...")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens.cuda(), repr_layers=[33])

# Get embeddings
embeddings = results["representations"][33]

Key parameters

ESM2 Models

Model	Parameters	Speed	Quality
esm2_t6_8M	8M	Fastest	Fast screening
esm2_t12_35M	35M	Fast	Good
esm2_t33_650M	650M	Medium	Better
esm2_t36_3B	3B	Slow	Best

Output format

embeddings/
├── embeddings.npy       # (N, 1280) array
├── pll_scores.csv       # PLL for each sequence
└── metadata.json        # Sequence info

Sample output

Successful run

$ modal run modal_esm2_predict_masked.py --input-faa designs.fasta
[INFO] Loading ESM2-650M model...
[INFO] Processing 100 sequences...
[INFO] Computing pseudo-log-likelihood...

embeddings/pll_scores.csv:
sequence_id,pll,pll_normalized,length
design_0,-0.82,0.15,78
design_1,-0.95,0.08,85
design_2,-1.23,-0.12,72
...

Summary:
  Mean PLL: -0.91
  Sequences with PLL > 0: 42/100 (42%)

What good output looks like:

PLL_normalized: > 0.0 (more natural-like)
Embeddings shape: (N, 1280) for 650M model
Higher PLL = more natural sequence

Decision tree

Should I use ESM2?
│
├─ What do you need?
│  ├─ Sequence plausibility score → ESM2 PLL ✓
│  ├─ Embeddings for clustering → ESM2 ✓
│  ├─ Variant effect prediction → ESM2 ✓
│  └─ Structure prediction → Use ESMFold
│
├─ What model size?
│  ├─ Fast screening → esm2_t12_35M
│  ├─ Standard use → esm2_t33_650M ✓
│  └─ Best quality → esm2_t36_3B
│
└─ Use case?
   ├─ QC filtering → PLL > 0.0 threshold
   ├─ Diversity analysis → Mean-pooled embeddings
   └─ Mutation scanning → Per-position log-odds

PLL interpretation

Normalized PLL	Interpretation
> 0.2	Very natural sequence
0.0 - 0.2	Good, natural-like
-0.5 - 0.0	Acceptable
< -0.5	May be unnatural

Typical performance

Campaign Size	Time (A10G)	Cost (Modal)	Notes
100 sequences	5-10 min	~$1	Quick screen
1000 sequences	30-60 min	~$5	Standard
5000 sequences	2-3h	~$20	Large batch

Throughput: ~100-200 sequences/minute with 650M model.

Verify

wc -l embeddings/pll_scores.csv  # Should match input + 1 (header)

Troubleshooting

OOM errors: Use smaller model or batch sequences Slow processing: Use esm2_t12_35M for speed Low PLL scores: May indicate unusual/designed sequences

Error interpretation

Error	Cause	Fix
`RuntimeError: CUDA out of memory`	Sequence too long or large batch	Reduce batch size
`KeyError: representation`	Wrong layer requested	Use layer 33 for 650M model
`ValueError: sequence`	Invalid amino acid	Check for non-standard AAs

Next: Structure prediction with chai1-structure-prediction or boltz-structure-prediction → protein-design-qc for filtering.

Inputs

Protein sequences in FASTA or CSV form for scoring, embedding, or variant analysis.
A selected ESM model size and any layer or batching preferences.
Optional metadata linking each sequence back to a design campaign.

Outputs

Sequence-level embeddings, pseudo-log-likelihood scores, or mutation effect estimates.
A plausibility signal that can be merged into protein-design-qc composite ranking.
Sequence features that support clustering, deduplication, or active-learning loops.

Next Step

Merge ESM-derived scores into protein-design-qc, then send surviving candidates to chai1-structure-prediction or boltz-structure-prediction for structure validation.

More by BioTender-max

View all by BioTender-max →

Bulk RNAseq differential expression (DeSeq2)

BioTender-max

Core DESeq2 workflow for RNA-seq differential expression analysis with count data.

dnasp

BioTender-max

bio-population-genetics-plink-basics

BioTender-max

PLINK file formats, format conversion, and quality control filtering for population genetics. Convert between VCF, BED/BIM/FAM, and PED/MAP formats, apply MAF, genotyping rate, and HWE filters using PLINK 1.9 and 2.0. Use when working with PLINK format files or running QC.

tooluniverse-precision-medicine-stratification

BioTender-max

Comprehensive patient stratification for precision medicine by integrating genomic, clinical, and therapeutic data. Given a disease/condition, genomic data (germline variants, somatic mutations, expression), and optional clinical parameters, performs multi-phase analysis across 9 phases covering dis

torch-geometric

BioTender-max

Guide for building Graph Neural Networks with PyTorch Geometric (PyG). Use this skill whenever the user asks about graph neural networks, GNNs, node classification, link prediction, graph classification, message passing networks, heterogeneous graphs, neighbor sampling, or any task involving torch_g

query-alphafold

BioTender-max

Query AlphaFold protein structure predictions. Use when user asks about protein structure, 3D structure, protein folding, or structure prediction. Triggers on "alphafold", "protein structure", "3D structure", "folding", "pLDDT", "structure prediction".

Install

mkdir -p .claude/skills/esm2-sequence-scoring && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16391" && unzip -o skill.zip -d .claude/skills/esm2-sequence-scoring && rm skill.zip

Installs to .claude/skills/esm2-sequence-scoring

Safety

Review before install

Runs shell / code

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

1mo ago

License

MIT

Repo stars

Loads

~1,200 tokens

Stars are for the whole repository, not this skill alone.

Stats

Views

Installs

Author

BioTender-max

7 skills published

Links

Source code

esm2-sequence-scoring

Install

Activation

About this skill

ESM2 Protein Language Model

Prerequisites

How to run

Option 1: Modal

Option 2: Python API (recommended)

Key parameters

ESM2 Models

Output format

Sample output

Successful run

Decision tree

PLL interpretation

Typical performance

Verify

Troubleshooting

Error interpretation

Inputs

Outputs

Next Step

More by BioTender-max

Bulk RNAseq differential expression (DeSeq2)

dnasp

bio-population-genetics-plink-basics

tooluniverse-precision-medicine-stratification

torch-geometric

query-alphafold

Search skills