Preamble

Quantitative proteomics using isobaric tagging such as Tandem Mass Tags (TMT) has a considerable benefit over Label-Free Quantification (LFQ) in that up to 18 samples can be quantified for each Peptide Spectrum Match (PSM). This greatly reduces the extent of missing values that may be present between different MS runs due to the limited number of ions that can be fragmented in each run and the associated issue of peptides being identified in only a subset of runs (O’Connell et al. 2018). As such, it standardises the features quantified in each sample, simplifying the comparison between samples and increasing quantification accuracy of summarised features such as proteins.

However, TMT does suffer from ratio compression, which should be avoiding by performing quantification with SPS MS3 (McAlister et al. 2014). Since quantification is performed in MS3, the TMT ions being quantified are therefore relatively low intensity and as such, removal of low signal:noise PSMs is recommended prior to summarisation to protein-level abundances. In addition, PSMs with high estimated co-isolation/interference should be removed, since the quantification values will be less accurate.

Here, we will assess the following:

We will then:

library(camprotR)
#> Warning in fun(libname, pkgname): mzR has been built against a different Rcpp version (1.0.7)
#> than is installed on your system (1.0.8.3). This might lead to errors
#> when loading mzR. If you encounter such issues, please send a report,
#> including the output of sessionInfo() to the Bioc support forum at 
#> https://support.bioconductor.org/. For details see also
#> https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue.
library(MSnbase)
library(ggplot2)
library(tidyr)
library(dplyr)

Input data

We start by reading in a file containing PSM-level output from Proteome Discoverer (PD). This data comes from a published benchmark experiment where yeast peptides were spiked into human peptides at 3 known amounts to provide ground truth fold changes (see below). For more details, see (Smith et al. 2022)

The data we will use is available through the Proteomics.analysis.data package.

psm_data <- read.delim(
  system.file("extdata", 'benchmark_TMT', 'benchmark_TMT_PSMs.txt.gz',
              package = "Proteomics.analysis.data"))

The first step is to remove contaminant proteins. These were defined using the cRAP database. Below, we parse the cRAP fasta to extract the IDs for the cRAP proteins, in both ‘cRAP’ format and Uniprot IDs for these proteins.

crap_fasta_inf <- system.file(
  "extdata", "cRAP_20190401.fasta.gz", 
  package = "Proteomics.analysis.data"
)

# Load the cRAP FASTA used for the PD search
crap.fasta <- Biostrings::fasta.index(crap_fasta_inf, seqtype = "AA")

# Extract the non cRAP UniProt accessions associated with each cRAP protein
crap.accessions <- crap.fasta %>% 
  pull(desc) %>% 
  stringr::str_extract_all(pattern = "(?<=\\|).*?(?=\\|)") %>% 
  unlist()

We can then supply these cRAP protein IDs to parse_features which will remove features which may originate from contaminants, as well as features which don’t have a unique master protein. See ?parse_features for further details, including the removal of ‘associated cRAP’.

psm_data_flt <- parse_features(
  psm_data, 
  crap_proteins = crap.accessions, 
  TMT = TRUE, 
  level = 'PSM'
)
#> Parsing features...
#> 113935 features found from 11259 master proteins => Input
#> 242 cRAP proteins supplied
#> 1064 proteins identified as 'cRAP associated'
#> 112643 features found from 11211 master proteins => cRAP features removed
#> 112046 features found from 11180 master proteins => associated cRAP features removed
#> 112010 features found from 11179 master proteins => features without a master protein removed
#> 106462 features found from 10103 master proteins => features with non-unique master proteins removed

We now store the filtered PSM data in an MSnSet, the standard data object for proteomics in R. See the vignette("msnset", package="camprotR") for more details.

# Abundance columns for TMT PD-output start with Abundance 
abundance_cols <- colnames(psm_data_flt)[grepl('Abundance.', colnames(psm_data_flt))]

psm.e <- as.matrix(psm_data_flt[, abundance_cols])
psm.f <- psm_data_flt[, setdiff(colnames(psm_data_flt), abundance_cols)]

# update the column names to remove the 'Abundance.` prefix
colnames(psm.e) <- gsub('Abundance.', '', colnames(psm.e))

# we don't have 'phenotype' data to add so we just define the 
# 'expression' data and 'feature' data

psm.p <- data.frame(spike=rep(factor(c('x1', 'x2', 'x6')), c(4,3,3)), row.names=colnames(psm.e))
  
psm <- MSnbase::MSnSet(exprs = psm.e, fData = psm.f, pData=psm.p)

Plot intensity distributions

plot_quant(log(psm, base=2), method='density')
#> Warning: Removed 18092 rows containing non-finite values (stat_density).
TMT intensities

TMT intensities

Removing low quality PSMs

We want to remove low Signal:Noise (S:N) PSMs, since the quantification values will be less accurate and there will be more missing values. We can inspect the relationship between S:N and missing values using the plot_missing_SN function.

Note that where the signal:noise > 5, there are far fewer missing values.

plot_missing_SN(psm, bins = 40)
Missing values per PSM, in relation to the signal:noise ratio

Missing values per PSM, in relation to the signal:noise ratio

We can also look into this relationship at the tag level using plot_missing_SN_per_sample. In this case, there is no tag which appears to have a high proportion of missing values when signal:noise > 5. If there were, this may warrant further exploration, e.g was one of the sample preparations inadequate such that fewer peptides were labeled?

plot_missing_SN_per_sample(psm, bins = 40)
Missing values per tag, in relation to the signal:noise ratio

Missing values per tag, in relation to the signal:noise ratio

Based on the above, we will filter the PSMs to only retain those with S:N > 5 using filter_TMT_PSMs. Using the same function, we will also remove PSMs with interference/co-isolation >50%.

psm_filt_sn_int <- filter_TMT_PSMs(psm, inter_thresh = 50, sn_thresh = 5)
#> Filtering PSMs...
#> 105709 features found from 10094 master proteins => No quant filtering
#> 103307 features found from 10037 master proteins => Co-isolation filtering
#> 100612 features found from 9963 master proteins => S:N ratio filtering

Summarising to protein-level abundances

Now that we have inspected the PSM-level quantification and filtered the PSMs, we can summarise the PSMs to protein-level abundances.

Discussion

  • How do you think we should summarise the protein-level abundances from the PSM-level quantification?
  • What downsides do you perceive in this approach?

Solution

# PSM to protein-level summarisation is much less problematic for TMT quantification
# and many approaches are valid.
#
# In most cases, there are few missing values and summarisation by naive methods
# such as mean/median/sum is adequate
#
# - Mean/sum is is sensitive to outliers
# - Median is less sensitive to outliers, but sensitive to the intensities
# of PSMs around the median intensity.
# 
# PSM-level intensities can be across orders of magnitude, with the most intense
# likely to be the most accurate. Mean/sum summarisation is largely drive by a 
# subset of high intensity PSMs. Sum has a subtle added advantage that the it
# is also higher for proteins with more PSMs. This can be useful in the downstream
# statistical analysis, as shall see. Sum is a sensible default
# summarisation for TMT data
#
# In the (rare) cases where there are a lot of missing values, the 'robust'
# summarisation approach in MSnbase::combineFeatures() may be appropriate
# (following log-transformation) since this handles missing data appropriately

Solution end

For PSM to protein summarisation, we will use naive ‘sum’ summarisation (MSnbase::combineFeatures(method = 'sum')). This approach does not appropriately handle missing values, since it either returns NA if any value is missing, or, with na.rm=TRUE included, replaces NA with zero where there is at least one finite quantification value for a protein. As such, we will remove the few PSMs with any missing values

psm_filt_sn_int_missing <- psm_filt_sn_int %>% 
  MSnbase::filterNA()

Typically, one removes proteins with a single PSM, the so-called ‘one-hit wonders’, on the basis that these are more likely to be false positive identifications, and the quantification is only drawn from a single observation.

psm_filt_sn_int_missing_n_features <- psm_filt_sn_int_missing %>%
  camprotR::restrict_features_per_protein(min_features=2, plot=FALSE)

Below, we perform the summarisation.

protein <- psm_filt_sn_int_missing %>%
  MSnbase::combineFeatures(
    groupBy = fData(psm_filt_sn_int_missing)$Master.Protein.Accession,
    method = 'sum')

Finally, we assess the quantification distribution and normalise the protein-level abundances

plot_quant(log(protein, base=2), method='density')

For this dataset, we see that the protein-level intensity distributions are very similar. This benchmark experiment was performed by generating 10 samples from defined mixes of 2 peptide samples (human and yeast). As such, we expect near identical distributions for all samples from the same group. In a typical TMT experiment, you may see more variable distributions.

Regardless, the next step should be to normalise the protein-level intensities.

Here we will apply median normalisation such that all column (sample) medians match the grand median. In MSnbase::normalise, this is called diff.median. Since the intensities are log-Gaussian distributed, we log2-transform them before performing the normalisation.

Median normalisation is a relatively naive form of normalisation, since we are only applying a transformation using a single correction factor for each sample. This is most likely to be appropriate when the samples being compared are similar to one another, which is the case here.

protein_norm <- MSnbase::normalise(log(protein, base=2), method='diff.median')

plot_quant(protein_norm, method='density')

Remember that we can check the processing information for our MSnSet if we are in doubt about the processing. Here, it tells us that we log2 transformed and then used diff.median normalisation.

processingData(protein_norm)
#> - - - Processing information - - -
#> Subset [106462,10][105709,10] Fri Sep 16 16:00:02 2022 
#> Subset [105709,10][103307,10] Fri Sep 16 16:00:02 2022 
#> Subset [103307,10][100612,10] Fri Sep 16 16:00:02 2022 
#> Subset [100612,10][100347,10] Fri Sep 16 16:00:02 2022 
#> Removed features with more than 0 NAs: Fri Sep 16 16:00:02 2022 
#> Dropped featureData's levels Fri Sep 16 16:00:02 2022 
#> Combined 100347 features into 9952 using sum: Fri Sep 16 16:00:20 2022 
#> Log transformed (base 2) Fri Sep 16 16:00:21 2022 
#> Normalised (diff.median): Fri Sep 16 16:00:21 2022 
#>  MSnbase version: 2.20.4

Now we have filtered our PSM-level quantification, summarised to protein-level and normalised. We can use this object to perform downstream visualisation, data exploration and statistical analysis etc.

We save the object to disk so we can read it back into memory when we need it

saveRDS(psm_filt_sn_int_missing, './results/psm_filt.rds')
saveRDS(protein_norm, './results/tmt_protein.rds')
length(setdiff(fData(psm_filt_sn_int_missing)$Master.Protein.Accessions,
               fData(protein_norm)$Master.Protein.Accessions))
#> [1] 0

To add - TMM normalisation

References

McAlister, Graeme C., David P. Nusinow, Mark P. Jedrychowski, Martin Wühr, Edward L. Huttlin, Brian K. Erickson, Ramin Rad, Wilhelm Haas, and Steven P. Gygi. 2014. “MultiNotch MS3 Enables Accurate, Sensitive, and Multiplexed Detection of Differential Expression Across Cancer Cell Line Proteomes.” Analytical Chemistry 86 (14): 7150–58. https://doi.org/10.1021/ac502040v.
O’Connell, Jeremy D., Joao A. Paulo, Jonathon J. O’Brien, and Steven P. Gygi. 2018. “Proteome-Wide Evaluation of Two Common Protein Quantification Methods.” Journal of Proteome Research 17 (5): 1934–42. https://doi.org/10.1021/acs.jproteome.8b00016.
Smith, Tom S., Anna Andrejeva, Josie Christopher, Oliver M. Crook, Mohamed Elzek, and Kathryn S. Lilley. 2022. “Prior Signal Acquisition Software Versions for Orbitrap Underestimate Low Isobaric Mass Tag Intensities, Without Detriment to Differential Abundance Experiments.” ACS Measurement Science Au, March. https://doi.org/10.1021/acsmeasuresciau.1c00053.