Quantitative proteomics using isobaric tagging such as Tandem Mass Tags (TMT) has a considerable benefit over Label-Free Quantification (LFQ) in that up to 18 samples can be quantified for each Peptide Spectrum Match (PSM). This greatly reduces the extent of missing values that may be present between different MS runs due to the limited number of ions that can be fragmented in each run and the associated issue of peptides being identified in only a subset of runs (O’Connell et al. 2018). As such, it standardises the features quantified in each sample, simplifying the comparison between samples and increasing quantification accuracy of summarised features such as proteins.
However, TMT does suffer from ratio compression, which should be avoiding by performing quantification with SPS MS3 (McAlister et al. 2014). Since quantification is performed in MS3, the TMT ions being quantified are therefore relatively low intensity and as such, removal of low signal:noise PSMs is recommended prior to summarisation to protein-level abundances. In addition, PSMs with high estimated co-isolation/interference should be removed, since the quantification values will be less accurate.
Here, we will assess the following:
The overall distribution of tag intensities
How signal:noise relates to missing values
We will then:
Filter the PSMs based on signal:noise and co-isolation/interference
Summarise to protein-level abundances
library(camprotR)
#> Warning in fun(libname, pkgname): mzR has been built against a different Rcpp version (1.0.7)
#> than is installed on your system (1.0.8.3). This might lead to errors
#> when loading mzR. If you encounter such issues, please send a report,
#> including the output of sessionInfo() to the Bioc support forum at
#> https://support.bioconductor.org/. For details see also
#> https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue.
library(MSnbase)
library(ggplot2)
library(tidyr)
library(dplyr)
We start by reading in a file containing PSM-level output from Proteome Discoverer (PD). This data comes from a published benchmark experiment where yeast peptides were spiked into human peptides at 3 known amounts to provide ground truth fold changes (see below). For more details, see (Smith et al. 2022)
The data we will use is available through the
Proteomics.analysis.data
package.
psm_data <- read.delim(
system.file("extdata", 'benchmark_TMT', 'benchmark_TMT_PSMs.txt.gz',
package = "Proteomics.analysis.data"))
The first step is to remove contaminant proteins. These were defined using the cRAP database. Below, we parse the cRAP fasta to extract the IDs for the cRAP proteins, in both ‘cRAP’ format and Uniprot IDs for these proteins.
crap_fasta_inf <- system.file(
"extdata", "cRAP_20190401.fasta.gz",
package = "Proteomics.analysis.data"
)
# Load the cRAP FASTA used for the PD search
crap.fasta <- Biostrings::fasta.index(crap_fasta_inf, seqtype = "AA")
# Extract the non cRAP UniProt accessions associated with each cRAP protein
crap.accessions <- crap.fasta %>%
pull(desc) %>%
stringr::str_extract_all(pattern = "(?<=\\|).*?(?=\\|)") %>%
unlist()
We can then supply these cRAP protein IDs to
parse_features
which will remove features which may
originate from contaminants, as well as features which don’t have a
unique master protein. See ?parse_features
for further
details, including the removal of ‘associated cRAP’.
psm_data_flt <- parse_features(
psm_data,
crap_proteins = crap.accessions,
TMT = TRUE,
level = 'PSM'
)
#> Parsing features...
#> 113935 features found from 11259 master proteins => Input
#> 242 cRAP proteins supplied
#> 1064 proteins identified as 'cRAP associated'
#> 112643 features found from 11211 master proteins => cRAP features removed
#> 112046 features found from 11180 master proteins => associated cRAP features removed
#> 112010 features found from 11179 master proteins => features without a master protein removed
#> 106462 features found from 10103 master proteins => features with non-unique master proteins removed
We now store the filtered PSM data in an MSnSet, the standard data
object for proteomics in R. See the
vignette("msnset", package="camprotR")
for more
details.
# Abundance columns for TMT PD-output start with Abundance
abundance_cols <- colnames(psm_data_flt)[grepl('Abundance.', colnames(psm_data_flt))]
psm.e <- as.matrix(psm_data_flt[, abundance_cols])
psm.f <- psm_data_flt[, setdiff(colnames(psm_data_flt), abundance_cols)]
# update the column names to remove the 'Abundance.` prefix
colnames(psm.e) <- gsub('Abundance.', '', colnames(psm.e))
# we don't have 'phenotype' data to add so we just define the
# 'expression' data and 'feature' data
psm.p <- data.frame(spike=rep(factor(c('x1', 'x2', 'x6')), c(4,3,3)), row.names=colnames(psm.e))
psm <- MSnbase::MSnSet(exprs = psm.e, fData = psm.f, pData=psm.p)
Plot intensity distributions
plot_quant(log(psm, base=2), method='density')
#> Warning: Removed 18092 rows containing non-finite values (stat_density).
TMT intensities
We want to remove low Signal:Noise (S:N) PSMs, since the
quantification values will be less accurate and there will be more
missing values. We can inspect the relationship between S:N and missing
values using the plot_missing_SN
function.
Note that where the signal:noise > 5, there are far fewer missing values.
plot_missing_SN(psm, bins = 40)
Missing values per PSM, in relation to the signal:noise ratio
We can also look into this relationship at the tag level using
plot_missing_SN_per_sample
. In this case, there is no tag
which appears to have a high proportion of missing values when
signal:noise > 5. If there were, this may warrant further
exploration, e.g was one of the sample preparations inadequate such that
fewer peptides were labeled?
plot_missing_SN_per_sample(psm, bins = 40)
Missing values per tag, in relation to the signal:noise ratio
Based on the above, we will filter the PSMs to only retain those with
S:N > 5 using filter_TMT_PSMs
. Using the same function,
we will also remove PSMs with interference/co-isolation >50%.
psm_filt_sn_int <- filter_TMT_PSMs(psm, inter_thresh = 50, sn_thresh = 5)
#> Filtering PSMs...
#> 105709 features found from 10094 master proteins => No quant filtering
#> 103307 features found from 10037 master proteins => Co-isolation filtering
#> 100612 features found from 9963 master proteins => S:N ratio filtering
Now that we have inspected the PSM-level quantification and filtered the PSMs, we can summarise the PSMs to protein-level abundances.
Discussion
- How do you think we should summarise the protein-level abundances from the PSM-level quantification?
- What downsides do you perceive in this approach?
Solution
# PSM to protein-level summarisation is much less problematic for TMT quantification
# and many approaches are valid.
#
# In most cases, there are few missing values and summarisation by naive methods
# such as mean/median/sum is adequate
#
# - Mean/sum is is sensitive to outliers
# - Median is less sensitive to outliers, but sensitive to the intensities
# of PSMs around the median intensity.
#
# PSM-level intensities can be across orders of magnitude, with the most intense
# likely to be the most accurate. Mean/sum summarisation is largely drive by a
# subset of high intensity PSMs. Sum has a subtle added advantage that the it
# is also higher for proteins with more PSMs. This can be useful in the downstream
# statistical analysis, as shall see. Sum is a sensible default
# summarisation for TMT data
#
# In the (rare) cases where there are a lot of missing values, the 'robust'
# summarisation approach in MSnbase::combineFeatures() may be appropriate
# (following log-transformation) since this handles missing data appropriately
Solution end
For PSM to protein summarisation, we will use naive ‘sum’
summarisation (MSnbase::combineFeatures(method = 'sum')
).
This approach does not appropriately handle missing values, since it
either returns NA if any value is missing, or, with
na.rm=TRUE
included, replaces NA with zero where there is
at least one finite quantification value for a protein. As such, we will
remove the few PSMs with any missing values
psm_filt_sn_int_missing <- psm_filt_sn_int %>%
MSnbase::filterNA()
Typically, one removes proteins with a single PSM, the so-called ‘one-hit wonders’, on the basis that these are more likely to be false positive identifications, and the quantification is only drawn from a single observation.
psm_filt_sn_int_missing_n_features <- psm_filt_sn_int_missing %>%
camprotR::restrict_features_per_protein(min_features=2, plot=FALSE)
Below, we perform the summarisation.
protein <- psm_filt_sn_int_missing %>%
MSnbase::combineFeatures(
groupBy = fData(psm_filt_sn_int_missing)$Master.Protein.Accession,
method = 'sum')
Finally, we assess the quantification distribution and normalise the protein-level abundances
plot_quant(log(protein, base=2), method='density')
For this dataset, we see that the protein-level intensity distributions are very similar. This benchmark experiment was performed by generating 10 samples from defined mixes of 2 peptide samples (human and yeast). As such, we expect near identical distributions for all samples from the same group. In a typical TMT experiment, you may see more variable distributions.
Regardless, the next step should be to normalise the protein-level intensities.
Here we will apply median normalisation such that all column (sample)
medians match the grand median. In MSnbase::normalise
, this
is called diff.median
. Since the intensities are
log-Gaussian distributed, we log2-transform them before
performing the normalisation.
Median normalisation is a relatively naive form of normalisation, since we are only applying a transformation using a single correction factor for each sample. This is most likely to be appropriate when the samples being compared are similar to one another, which is the case here.
protein_norm <- MSnbase::normalise(log(protein, base=2), method='diff.median')
plot_quant(protein_norm, method='density')
Remember that we can check the processing information for our MSnSet
if we are in doubt about the processing. Here, it tells us that we log2
transformed and then used diff.median
normalisation.
processingData(protein_norm)
#> - - - Processing information - - -
#> Subset [106462,10][105709,10] Fri Sep 16 16:00:02 2022
#> Subset [105709,10][103307,10] Fri Sep 16 16:00:02 2022
#> Subset [103307,10][100612,10] Fri Sep 16 16:00:02 2022
#> Subset [100612,10][100347,10] Fri Sep 16 16:00:02 2022
#> Removed features with more than 0 NAs: Fri Sep 16 16:00:02 2022
#> Dropped featureData's levels Fri Sep 16 16:00:02 2022
#> Combined 100347 features into 9952 using sum: Fri Sep 16 16:00:20 2022
#> Log transformed (base 2) Fri Sep 16 16:00:21 2022
#> Normalised (diff.median): Fri Sep 16 16:00:21 2022
#> MSnbase version: 2.20.4
Now we have filtered our PSM-level quantification, summarised to protein-level and normalised. We can use this object to perform downstream visualisation, data exploration and statistical analysis etc.
We save the object to disk so we can read it back into memory when we need it
saveRDS(psm_filt_sn_int_missing, './results/psm_filt.rds')
saveRDS(protein_norm, './results/tmt_protein.rds')
length(setdiff(fData(psm_filt_sn_int_missing)$Master.Protein.Accessions,
fData(protein_norm)$Master.Protein.Accessions))
#> [1] 0
To add - TMM normalisation