There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means complete:
limma
(Ritchie et al. 2015): Proteomics experiments are typically lowly replicated (e.g n << 10). Variance estimates are therefore inaccurate. limma
is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.DEqMS
(Zhu et al. 2020): limma assumes there is a relationship between protein abundance and variance. This is usually the case, although
DEqMS
is preferred over limma for SILAC data.Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
Here, we will use the SILAC data processed in Processing and QC of SILAC data. Please see the previous notebook for details of the experimental design and aim and data processing.
In brief, we wish to determine the proteins which are significantly enriched by UV crosslinking (CL) vs non-CL control (NC).
pep_res <- readRDS('results/pep_res.rds')
prot_res <- readRDS('results/prot_res.rds')
To identify proteins with significantly increased abundance in CL vs NC, we will use DEqMS (Zhu et al. 2020), which you can think of as an extension of limma (Ritchie et al. 2015) specifically for proteomics.
The analysis steps are taken from the DEqMS vignette. The idea is that we want to create an MArrayLM
object as per normal limma
analysis and then add a $count
column to the MArrayLM
object and then use the spectraCounteBayes
function to perform the Bayesian shrinkage using the count column (which describes the number of pepitdes per protein) rather than the $Amean
column (which describes the mean CL:NC ratio).
First, we create the MArrayLM
object. We filter out proteins with too few quantification values across the 4 replicates.
dat <- prot_res %>%
filterNA(pNA = 0.5) %>% # select proteins in min. 2/4 reps
exprs()
# Perfoming the equivalent of a one-sample t-test, so the design matrix is just an intercept
design <- cbind(Intercept = rep(1, ncol(dat)))
fit <- lmFit(dat, design)
efit <- eBayes(fit)
Next, we define the $count
column, which DEqMS
will use. In the DEqMS paper, they suggest that the best summarisation metric to use is the minimum value across the samples, so our count
columns is the minimum number of peptides per protein.
protein_ratio_long <- prot_res %>%
exprs() %>%
data.frame() %>%
tibble::rownames_to_column('Master.Protein.Accessions') %>%
pivot_longer(cols=-Master.Protein.Accessions, values_to='protein_ratio', names_to='sample')
# Obtain the min peptide count across the samples and determine the minimum value across
# samples
min_pep_count <- camprotR::count_features_per_protein(pep_res) %>%
merge(protein_ratio_long, by=c('Master.Protein.Accessions', 'sample')) %>%
filter(is.finite(protein_ratio)) %>% # We only want to consider samples with a ratio quantified
group_by(Master.Protein.Accessions) %>%
summarise(min_pep_count = min(n))
# add the min peptide count
efit$count <- min_pep_count %>%
filter(Master.Protein.Accessions %in% rownames(efit$coefficients)) %>%
pull(min_pep_count)
And now we run spectraCounteBayes
from DEqMS
to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(efit))
Below, we inspect the peptide count vs variance relationship which DEqMS
is using in the statistical test.
In this case the relationship between peptide count and variance is only really apparent when the minimum number of peptides from which a protein-level ratio is obtained is very low. Typically, one might remove quantifications where there is just a single peptide for a protein. Here, have keep them, and we can refer to the count
column in the final results object if we want to check the minimum number of peptides observed per sample.
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")
Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a ‘volcano’ plot to visualise this. Note that all proteins with a statistically significant change are increased in CL, so the plot looks more like a fire-hose than a volcano!
deqms_results <- outputResult(efit_deqms, coef_col = 1)
table(ifelse(deqms_results$sca.adj.pval < 0.01, 'sig.', 'not sig.'),
ifelse(deqms_results$logFC > 0, 'Higher in CL', 'Lower in CL'))
#>
#> Higher in CL Lower in CL
#> not sig. 409 77
#> sig. 246 0
deqms_results %>%
ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
geom_point() +
theme_camprot() +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')
Finally, we can explore and, if we wish, export our results. Importantly, the $t
, $P.Value
and $adj.P.Val
columnts are from limma
. The columns prefixed with sca
are the from DEqMS
.
head(deqms_results)
#> logFC AveExpr t P.Value adj.P.Val B gene
#> P15880 5.843093 5.843093 28.13353 2.300892e-07 0.0001590016 7.528938 P15880
#> P22626 6.561859 6.561859 25.16140 4.344306e-07 0.0001590016 7.084582 P22626
#> Q13148 4.920545 4.920545 22.69799 7.802209e-07 0.0001607499 6.640967 Q13148
#> P43243 4.424089 4.424089 18.70247 2.336893e-06 0.0001607499 5.728685 P43243
#> P23588 5.250793 5.250793 20.13712 1.538278e-06 0.0001607499 6.088209 P23588
#> P84103 5.857123 5.857123 20.66152 1.329806e-06 0.0001607499 6.210104 P84103
#> count sca.t sca.P.Value sca.adj.pval
#> P15880 4 35.89893 4.910590e-08 3.594552e-05
#> P22626 7 29.58960 1.498581e-07 5.484808e-05
#> Q13148 1 24.37400 4.578182e-07 7.341608e-05
#> P43243 7 22.91838 6.522450e-07 7.341608e-05
#> P23588 3 22.86445 6.611324e-07 7.341608e-05
#> P84103 1 21.57023 9.237939e-07 7.341608e-05