agentskills.codes
BI

bio-sam-bam-basics

View, convert, and understand SAM/BAM/CRAM alignment files using samtools and pysam. Use when inspecting alignments, converting between formats, or understanding alignment file structure.

Install

mkdir -p .claude/skills/bio-sam-bam-basics && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16179" && unzip -o skill.zip -d .claude/skills/bio-sam-bam-basics && rm skill.zip

Installs to .claude/skills/bio-sam-bam-basics

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

View, convert, and understand SAM/BAM/CRAM alignment files using samtools and pysam. Use when inspecting alignments, converting between formats, or understanding alignment file structure.
187 chars✓ has a “when” trigger

About this skill

Version Compatibility

Reference examples tested with: pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

SAM/BAM/CRAM Basics

"Read a BAM file" → Open a binary alignment file and iterate over aligned reads with their mapping coordinates, flags, and quality scores.

  • Python: pysam.AlignmentFile() (pysam)
  • CLI: samtools view (samtools)
  • R: scanBam() (Rsamtools)

View and convert alignment files using samtools and pysam.

Format Overview

FormatDescriptionUse Case
SAMText format, human-readableDebugging, small files
BAMBinary compressed SAMStandard storage format
CRAMReference-based compressionLong-term archival, smaller than BAM

SAM Format Structure

@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
@RG ID:sample1 SM:sample1
@PG ID:bwa PN:bwa VN:0.7.17
read1  0   chr1  100  60  50M  *  0  0  ACGT...  FFFF...  NM:i:0

Header lines start with @:

  • @HD - Header metadata (version, sort order)
  • @SQ - Reference sequence dictionary
  • @RG - Read group information
  • @PG - Program used to create file

Alignment fields (tab-separated):

  1. QNAME - Read name
  2. FLAG - Bitwise flag
  3. RNAME - Reference name
  4. POS - 1-based position
  5. MAPQ - Mapping quality
  6. CIGAR - Alignment description
  7. RNEXT - Mate reference
  8. PNEXT - Mate position
  9. TLEN - Template length
  10. SEQ - Read sequence
  11. QUAL - Base qualities
  12. Optional tags (NM:i:0, MD:Z:50, etc.)

samtools view

View BAM as SAM

samtools view input.bam | head

View with Header

samtools view -h input.bam | head -100

View Header Only

samtools view -H input.bam

View Specific Region

samtools view input.bam chr1:1000-2000

Count Alignments

samtools view -c input.bam

Format Conversion

Goal: Convert between SAM (text), BAM (binary), and CRAM (reference-compressed) alignment formats.

Approach: Use samtools view with format flags (-b for BAM, -C for CRAM, -h for SAM with header). CRAM requires a reference FASTA with -T.

BAM to SAM

samtools view -h -o output.sam input.bam

SAM to BAM

samtools view -b -o output.bam input.sam

BAM to CRAM

samtools view -C -T reference.fa -o output.cram input.bam

CRAM to BAM

samtools view -b -T reference.fa -o output.bam input.cram

Pipe Conversion

samtools view -b input.sam > output.bam

Common Flags

FlagDecimalMeaning
0x11Paired
0x22Proper pair
0x44Unmapped
0x88Mate unmapped
0x1016Reverse strand
0x2032Mate reverse strand
0x4064First in pair
0x80128Second in pair
0x100256Secondary alignment
0x200512Failed QC
0x4001024PCR duplicate
0x8002048Supplementary

Decode Flags (Bidirectional)

# Number to mnemonics
samtools flags 147
# 0x93 147 PAIRED,PROPER_PAIR,REVERSE,READ2

# Mnemonics to number
samtools flags PAIRED,PROPER_PAIR,REVERSE,READ2   # 147

Secondary vs Supplementary (Different Semantics)

Two different concepts that are routinely conflated:

BitNameMeaningFilter implication
0x100 (256)SecondaryAn alternative candidate alignment for the same read; not the primary location-F 256 is correct for SNV/indel calling on short reads
0x800 (2048)SupplementaryA piece of a chimeric/split alignment (the read is split across loci)Carries SA:Z tag; required by SV callers (Manta, Sniffles, cuteSV, GRIDSS, Delly)

-F 2304 removes both. Strip supplementary only when downstream is small-variant calling; keep supplementary for SV calling, fusion detection, or any analysis that follows split-reads.

MAPQ Is Not Portable Across Aligners

samtools view -q 30 does different things depending on what produced the BAM. MAPQ is an aligner-specific scale, not a universal probability:

AlignerMAPQ scale"Unique" sentinelCommon gotcha
BWA-MEM / BWA-MEM20-6060-q 30 is sensible "high confidence"
minimap2 (DNA / pbmm2)0-6060Spec-compliant
HISAT20-6060Spec-compliant
Bowtie20-4242 (rare)-q 60 drops everything; -q 23 is the established 99% threshold
STAR0, 1, 2, 3, 255255 = uniquely mapped (sentinel, not a quality)-q 255 for "unique only"; -q 30 accidentally keeps unique only too
DRAGEN0 to --mapq-max (often ~250)varies-q 30 still meaningful; distribution shape differs
Cell Ranger / STARsoloinherits STAR255Same trap as STAR

Verify the actual scale of any unfamiliar BAM:

samtools view input.bam | awk '{print $5}' | sort -un | head
samtools view -H input.bam | grep '^@PG' | head -1   # which aligner produced this BAM

0-Based vs 1-Based Coordinates (Footgun)

ContextCoordinate system
SAM text POS1-based, inclusive
samtools view chr1:100-2001-based, closed interval
samtools faidx chr1:100-2001-based, closed interval
BAM binary internal0-based, half-open
pysam read.reference_start0-based
bam.fetch('chr1', 100, 200)0-based, half-open
BED files0-based, half-open
VCF1-based
GFF/GTF1-based, inclusive

samtools view bam chr1:100-200 and bam.fetch('chr1', 100, 200) return different read sets at boundaries.

CIGAR Operations

OpDescription
MAlignment match (can be mismatch)
IInsertion to reference
DDeletion from reference
NSkipped region (introns in RNA-seq; do NOT count as covered bases)
SSoft clipping (sequence in SEQ but not aligned)
HHard clipping (sequence not in SEQ)
=Sequence match (explicit)
XSequence mismatch (explicit)
PPadding (rare; multiple-sequence-alignment context)

Example: 50M2I30M = 50 bases match, 2 base insertion, 30 bases match

CIGAR M is overloaded -- it is the union of = and X. Some aligners (minimap2, BWA with -Y) emit =/X directly; bcftools / Picard often need M and rebuild MD/NM with samtools calmd. N operations break naive coverage calculations: a 1000 bp RNA-seq read with one 50 kb intron does not cover 50 kb. Distinguish soft-clip (S, bases retained) from hard-clip (H, bases discarded -- irreversible).

Context-Specific Tags

Beyond the standard fields, downstream tools depend on optional tags whose presence depends on aligner and assay. Inspect with samtools view input.bam | head -1 | tr '\t' '\n' or pysam read.get_tag('XX').

TagSet byMeaningRequired by
NM:ibwa, samtools calmdEdit distance to referencemapDamage, many filters
MD:Zbwa, samtools calmdMismatch positions (text)bcftools mpileup BAQ, IGV mismatch coloring
MC:Zsamtools fixmate -mMate CIGARsamtools markdup
MS:isamtools fixmate -mMate scoresamtools markdup
RG:Zaligner from -RRead group IDGATK BQSR, MarkDuplicates LB lookup
SA:ZAll split-read alignersComma-list of supplementary coordsSniffles, Manta, cuteSV, GRIDSS, Delly
NH:iSTAR, HISAT2Number of reported hitsfeatureCounts multimapper handling, Salmon
HI:iSTARHit index (0-based among NH)RSEM
XS:ASTAR, HISAT2, minimap2 -ax spliceStrand inferred from splice motifStringTie, Cufflinks
CB:ZCell Ranger, STARsoloCorrected cell barcodescRNA quantification
UB:ZCell Ranger, STARsoloCorrected UMIUMI-aware dedup
RX:Zfgbio AnnotateBamWithUmisRaw UMI (bulk)fgbio GroupReadsByUmi
MI:Zfgbio CallMolecularConsensusReadsMolecular identifier (consensus)Duplex calling
cs:Zminimap2 --csCompact CIGAR-with-basespaftools, SV tools

Missing tags fail in two modes: silently wrong (featureCounts ignoring multimappers without NH; markdup marking nothing without MC/MS) or loudly (consensus tools rejecting input without MD).

Provenance: @PG Chain

The @PG lines record every tool that touched the BAM, linked through PP (previous program) tags. This is the audit trail.

samtools view -H input.bam | grep '^@PG'

A clean germline pipeline:

@PG ID:bwa-mem PN:bwa VN:0.7.17
@PG ID:samtools.1 PN:samtools VN:1.20 PP:bwa-mem CL:samtools sort
@PG ID:samtools.2 PN:samtools VN:1.20 PP:samtools.1 CL:samtools fixmate
@PG ID:samtools.3 PN:samtools VN:1.20 PP:samtools.2 CL:samtools markdup

A broken/missing chain (no PP, unknown tools, gaps) means the BAM cannot be reliably reproduced. Production pipelines often reject inputs without a complete chain.

CRAM Reference Resolution (Critical)

CRAM stores reads relative to a reference; without it, the file is unreadable. htslib resolves the reference in this order:

  1. Command-line -T ref.fa / --reference
  2. REF_CACHE env var (local MD5-named cache)
  3. REF_PATH env var (colon-separated; can include URLs)
  4. UR: URL in SAM @SQ header
  5. Last resort: EBI ENA download via MD5 in M5: tag (fails on offline HPC)

On HPC nodes without internet, populate a local cache once:

mkdir -p $HOME/cram_cache
seq_cache_populate.pl -root $HOME/cram_cache reference.fa
export REF_CACHE=$HOME/cram_cache/%2s/%2s/%s
export REF_PATH=$REF_CACHE   # disables ENA fallback

samtools quickcheck -v file.cram   # header + EOF only


---

*Content truncated.*

More by BioTender-max

View all by BioTender-max

Search skills

Search the agent skills registry