Data Analysis and Statistics
TMT quantitative mass spec pipeline
To ensure consistency across all publicly available mass spectrometry datasets that used TMT isobaric mass tagging, we applied the same standardized analysis pipeline previously published in Farley et al. (2024, A), Farley et al. (2024, B), and Thomas et al. (2025)1–3. This approach allows for better comparability between datasets. Several of the datasets included in this repository were generated and analyzed at the European Molecular Biosciences Laboratories (EMBL) in Heidelberg, Germany, where Dr. Frank Stein developed and implemented the data analysis pipeline.
In summary, we processed the mass spectrometry (MS) data using FragPipe4. The protein search was performed using a UniProt FASTA database with one entry per gene, excluding isoforms and minimizing TrEMBL entries. Common contaminants and reversed sequences were also included for validation. Proteins were included in the final dataset if they contained at least two razor peptides, as FragPipe applies the Occam’s Razor principle, assigning shared peptides to the protein with the highest supporting evidence.
For the database search, we included post-translational modifications (PTMs) such as oxidation (M), acetylation (N-term), and carbamidomethylation (C). The raw files were analyzed using FragPipe’s protein.tsv output, which was used as the basis for downstream quantification.
To account for potential technical variations across experiments, we applied a normalization step using the ‘removeBatchEffect’ function from the limma package to correct for batch effects in log2-transformed TMT reporter ion intensities5,6. Further normalization was carried out using the ‘normalizeVSN’ function of the same package. We identified differentially expressed proteins using the moderated t-test within limma, incorporating replicate information into the statistical model through the ‘lmFit’ function.
For the protein-lipid probe interaction experiments (+/- UV exposure), proteins were categorized based on their enrichment in the +UV condition. Proteins showing a log2 fold-change of at least 1 and a p-value below 0.05 were classified as “enriched hits,” while those showing similar enrichment trends but with p-values above 0.05 were labeled as “enriched candidates.”
To better understand the biological significance of the detected proteins, we conducted gene ontology (GO) enrichment analysis using the ‘compareCluster’ function from the ‘clusterProfiler’ package7. This method evaluates whether specific GO terms are overrepresented in the dataset compared to a background gene set. We performed enrichment analysis for three GO categories: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). The reference database used was ‘org.Pf.plasmo.db.’ To quantify enrichment, we calculated the odds ratio by comparing the proportion of genes associated with each GO term in our dataset (‘GeneRatio’) with the proportion in the reference background set (‘BgRatio’). A value greater than 1 suggests that a given GO term is more prevalent in our dataset than expected by chance.
Visualizing proteomics data
Volcano Plots
- Visualize enrichment data versus the statistical p value associated with every protein identified in a dataset.
- In each of the datasets depicted on this site, data points to the right of the x-axis (logFC > 0) denote proteins which were enriched when the lipid probe was irradiated with UV light – stimulating chemical crosslinking.
- The p value of each protein was -log10 transformed, meaning that the smaller the p value, the higher the point is.
- In the datasets depicted on this site, there is little meaning to points with logFC < 0; as such, these proteins are ignored regardless of their statistical significance
Rank-ordered Plots
- Line up each protein identified in the data set from lowest logFC to highest.
- Enables the viewer to see the shape of the dataset – for example, some datasets will have only a few proteins substantially enriched, and this is reflected by a sharp “elbow” at the edges of the plot.
- This can be helpful in selecting logFC cutoffs when determining significance cutoffs.
MA Plots
- Depict the logFC of each protein (M) versus the average m/z intensity of the peptides (A) that contributed to that proteins identification.
- Gives greater insight into the quality of a datapoint. For example, a point on the far left of the MA plot had relatively low total abundance and may be more sensitive to dramatic changes in enrichment – whereas points to the right were substantially more abundant in the sample and will be less sensitive to change.
- Thus, “significantly enriched” proteins with higher A intensity may be more trustworthy than those with lower A values.
Heat Maps
Gene Ontology Overview
- Gene Ontology (GO) analyses take a list of proteins and determines whether there is a greater than random enrichment of a biological feature. These features can include pathways, molecular functions, cellular compartments or biological processes.
- In each of the GO analyses presented on this site, we filtered according to a significance cutoff – these cutoffs are explained on the respective data pages (either “enriched candidates” or “enriched hits”, as in the TMT-based datasets, or unique subsets of probe enrichment, as in the PSM-based datasets).
- Enriched pathways are given statistical significance by assessing the likelihood that a pathway would be overrepresented through random chance.