Create a sample-to-data-relationship format (SDRF) file (usually from another type of samplesheet)
Install
mkdir -p .claude/skills/create-sdrf && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14284" && unzip -o skill.zip -d .claude/skills/create-sdrf && rm skill.zip
Installs to .claude/skills/create-sdrf
About this skill
Create a sample-to-data-relationship format (SDRF) file based on the specification document below. By default, generate a TSV file and quote values where necessary. If required columns are missing, tell the user and ask questions to help fill them in. Finally, self-validate the file by running:
curl -LsSf https://astral.sh/uv/install.sh | sh
uvx --from sdrf-pipelines parse_sdrf validate-sdrf -s "$YOUR_GENERATED_SDRF"
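The "generate a TSV, quote where necessary" step can be sketched in Python with the standard `csv` module. This is a minimal sketch, not part of the skill itself: the function name `write_sdrf` and the example column names are illustrative assumptions. Rows are passed as lists rather than dicts so that repeated column names remain possible.

```python
import csv
import io


def write_sdrf(header, rows, handle):
    """Write a header and rows as tab-separated text.

    csv.QUOTE_MINIMAL quotes a cell only when it contains the delimiter,
    a quote character, or a line break, which matches "quote where necessary".
    """
    writer = csv.writer(
        handle, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerow(header)
    for row in rows:
        # Every row needs exactly one cell per column.
        if len(row) != len(header):
            raise ValueError(
                f"expected {len(header)} cells, found {len(row)}: {row!r}"
            )
        writer.writerow(row)


if __name__ == "__main__":
    # Hypothetical column names and values, for illustration only.
    buffer = io.StringIO()
    write_sdrf(
        ["source name", "comment[data file]"],
        [["sample 1", "run_01.raw"]],
        buffer,
    )
    print(buffer.getvalue(), end="")
```

Writing to a file works the same way: open it with `open(path, "w", newline="", encoding="utf-8")` and pass the handle.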
Additional online resources:
- https://github.com/bigbio/proteomics-sample-metadata (specification repo)
- https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects (example annotated projects)
- https://github.com/bigbio/sdrf-pipelines (python validator)
Specification document:
[[status]] == Status of this document
This document provides information to the proteomics community about a proposed standard for sample metadata annotations in public repositories called the Sample and Data Relationship File (SDRF)-Proteomics format. Distribution is unlimited.
Version 1.0.1 - 2023-05-24
[[abstract]] == Abstract
The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange, and verification. This document presents a specification for a sample metadata annotation of proteomics experiments.
Further detailed information, including any updates to this document, implementations, and examples is available at https://github.com/bigbio/proteomics-metadata-standard. The official PSI web page for the document is the following: http://psidev.info/sdrf.
[[introduction]] == Introduction
Many resources have emerged that provide raw or integrated proteomics data in the public domain. While these are valuable individually, their integration through re-analysis represents a huge asset for the community [1]. Unfortunately, proteomics experimental design and sample-related information are often missing in public repositories or stored in very diverse ways and formats. For example, the CPTAC consortium (https://cptac-data-portal.georgetown.edu/) provides for every dataset a set of Excel files with the information on each sample (e.g. https://cptac-data-portal.georgetown.edu/study-summary/S048), including tumor size and origin, but also how every sample is related to a specific raw file (e.g. instrument configuration parameters). ProteomicsDB, a resource that routinely re-analyses public datasets, captures for each sample in the database a minimum number of properties to describe the sample and the related experimental protocol, such as tissue, digestion method and instrument (e.g. https://www.proteomicsdb.org/#projects/4267/6228). Such heterogeneity often prevents data interpretation, reproducibility, and integration of data from different resources. This is why we propose a homogeneous standard for proteomics metadata annotation. For every proteomics dataset we propose to capture at least three levels of metadata: (i) the dataset description; (ii) the sample and data file related information; and (iii) the technical/proteomics-specific information in standard data file formats (e.g. the PSI formats mzIdentML, mzML, or mzTab, among others).
The general description includes minimum information to describe the study overall: title, description, date of publication, type of experiment (e.g. http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD016060.0-1&outputMode=XML). The standard data files contain mostly the technical metadata associated with the dataset including search engine settings, scores, workflows, configuration files, but do not include information about the sample metadata and/or the experimental design. Currently, all ProteomeXchange partners mandate this information for each dataset. However, the information regarding the sample and its relation to the data files (Figure 1) is mostly missing [1].
These three levels of metadata are combined in the well-established data formats ISA-TAB [2] (https://www.isacommons.org/) or MAGE-TAB [3], which are used in other omics fields such as metabolomics and transcriptomics. In both data formats, a tab-delimited file, the sample and data relationship format (SDRF) file, is used to annotate the sample metadata and link it to the corresponding data file(s). Both data formats encode the properties and sample attributes as columns, and each row represents a sample in the study. However, more important than the file format itself, general guidelines are needed about what information should be encoded to enable reproducibility of the proteomics results. The lack of guidelines to annotate information such as disease stage, cell line code, or organism part, or the analytical information about labelling channels (e.g. TMT, SILAC), makes the data representation incomplete. The consequence is that it is not possible to understand the original experiment or to perform a re-analysis of the dataset with all the information necessary for reproducibility. If the information about the fractions, labelling channels, or enrichment methods is not annotated, the reuse and reproduction of the original results will be challenging, if possible at all.
Figure 1: SDRF-Proteomics file format stores the information of the sample and its relation to the data files in the dataset. The file format includes not only information about the sample but also about how the data was acquired and processed.
[[requirements]] === Requirements
The SDRF-Proteomics format describes the sample characteristics and the relationships between samples and data files included in a dataset. The information in SDRF files is organised so that it follows the natural flow of a proteomics experiment. The main requirements to be fulfilled by the SDRF-Proteomics format are:
- The SDRF file is a tab-delimited format where each ROW corresponds to a relationship between a Sample and a Data file (and, in multiplexed experiments, the MS signal corresponding to a labelling channel).
- Each column MUST correspond to an attribute/property of the Sample or the Data file.
- Each value in each cell MUST be the property for a given Sample or Data file.
- The file MUST begin with columns describing the samples of origin and continue with the data files generated from their MS analyses.
- Support for handling unknown values/characteristics.
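The structural requirements listed above can be illustrated with a small Python check. This is a hedged sketch, not the official validator: the function name `check_structure` is hypothetical, and the column prefixes `characteristics[` (sample properties) and `comment[` (data file properties) are assumed from the wider SDRF-Proteomics conventions, which this section does not list. It checks only that every row has one cell per column and that sample columns come before data file columns.

```python
import csv
import io


def check_structure(sdrf_text):
    """Return a list of human-readable structural problems in SDRF text."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return ["file is empty"]

    header = [name.strip().lower() for name in rows[0]]
    problems = []

    # Each row must provide one cell per column.
    # Row numbers count non-empty lines, with the header as row 1.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number}: expected {len(header)} cells, found {len(row)}"
            )

    # Assumed prefixes: characteristics[...] describes the sample,
    # comment[...] describes the data file.
    sample_idx = [i for i, n in enumerate(header) if n.startswith("characteristics[")]
    data_idx = [i for i, n in enumerate(header) if n.startswith("comment[")]
    if sample_idx and data_idx and max(sample_idx) > min(data_idx):
        problems.append("sample columns appear after data file columns")

    return problems
```

An empty result means only that these two checks passed; the full validation rules are those enforced by the validators named in this document.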
[[issues-addressed]] === Issues to be addressed
The main issues to be addressed by the SDRF are:
- It MUST be able to represent the sample metadata and the data files generated by the instruments or the analyses.
- It MUST be able to represent the experimental design including the way samples and data have been collected.
[[notation-conventions]] == Notational Conventions
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMEND/RECOMMENDED”, “MAY”, “COULD BE”, and “OPTIONAL” are to be interpreted as described in RFC 2119 (2).
[[document-structure]] == Documentation
The official website for the SDRF-Proteomics project is https://github.com/bigbio/proteomics-metadata-standard. New use cases, changes to the specification and examples can be added by using pull requests or issues in GitHub (see introduction to GitHub - https://lab.github.com/githubtraining/introduction-to-github).
A set of examples and annotated projects from ProteomeXchange can be found here: https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects
Multiple tools have been implemented to validate SDRF-Proteomics files for users familiar with Python and Java:
- sdrf-pipelines (Python - https://github.com/bigbio/sdrf-pipelines): This tool validates SDRF-Proteomics files. It can also convert SDRF files into configuration files for popular pipelines and software such as MaxQuant or OpenMS.
- jsdrf (Java - https://github.com/bigbio/jsdrf): This Java library and tool validates SDRF-Proteomics files. It also includes a generic data model that can be used by Java applications.
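For scripted use, the sdrf-pipelines validator can be called from Python through its command line. The sketch below wraps the `uvx --from sdrf-pipelines parse_sdrf validate-sdrf -s` invocation given earlier in this skill; the function names are hypothetical. Treating a non-zero exit code as "validation failed" is an assumption, so inspect the returned output rather than relying on the boolean alone.

```python
import shutil
import subprocess


def build_validate_command(sdrf_path):
    """Build the validator invocation as an argument list (no shell quoting needed)."""
    return [
        "uvx", "--from", "sdrf-pipelines",
        "parse_sdrf", "validate-sdrf", "-s", str(sdrf_path),
    ]


def validate(sdrf_path):
    """Run the validator and return (exit_code_was_zero, combined_output)."""
    if shutil.which("uvx") is None:
        raise RuntimeError("uvx not found on PATH; install uv first")
    result = subprocess.run(
        build_validate_command(sdrf_path),
        capture_output=True,
        text=True,
    )
    # Assumption: exit code 0 means the file passed validation.
    return result.returncode == 0, result.stdout + result.stderr
```

Passing the command as a list avoids shell interpretation of file names that contain spaces or other special characters.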
[[relationship-specifications]] == Relationship to other specifications
SDRF-Proteomics is fully compatible with the SDRF file format that is part of https://www.ebi.ac.uk/arrayexpress/help/magetab_spec.html[MAGE-TAB]. MAGE-TAB is the file format used to store metadata and sample information for transcriptomics experiments. When the ProteomeXchange project file is converted to an IDF file (the project description in MAGE-TAB) and combined with the SDRF-Proteomics file, a valid MAGE-TAB is obtained.
SDRF-Proteomics sample information can be embedded into mzTab metadata files. The sample metadata section of mzTab uses the SDRF-Proteomics column names as properties and the corresponding sample cell values as values.
SDRF-Proteomics aims to capture the sample metadata and its relationship with the data files (e.g. raw files from mass spectrometers). It does not aim to capture the downstream analysis part of the experimental design, such as which samples should be compared, how they can be combined, or the parameters for the downstream analysis (FDR or p-value thresholds). The HUPO-PSI community will work in the future to include this information in other file formats such as mzTab or a new type of file format.
[[ontologies-supported]] == Ontologies/Controlled Vocabularies Supported
The supported ontologies/controlled vocabularies (CVs) are:
- PSI Mass Spectrometry CV (PSI-MS)
- Experimental Factor Ontology (EFO)
- Unimod protein modification database for mass spectrometry
- PSI-MOD CV (PSI-MOD)
- Cell line ontology
- Drosophila anatomy ontology
- Cell ontology
- Plant ontology
- Uber-anatomy ontology
- Zebrafish anatomy and development ontology
- Zebrafish developmental stages ontology
- Plant Environment Ontology
- FlyBase Developmental Ontology
- Rat Strain Ontology
- Chemical Entities of Biologica
Content truncated.