Differential abundance testing with SILAC ratios

There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means complete:

t-test: If we assume that the log-transformed SILAC ratios are Gaussian distributed, a t-test may be appropriate when we have just one condition (ratio is comparison of conditions = one-sample t-test) or two conditions (ratio captures one condition, with another condition performed across samples = two-sample t-test)
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, on the same assumptions at the t-test.
limma (Ritchie et al. 2015): Proteomics experiments are typically lowly replicated (e.g n << 10). Variance estimates are therefore inaccurate. limma is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
DEqMS (Zhu et al. 2020): limma assumes there is a relationship between protein abundance and variance. This is usually the case, although
- The relationship with variance is often stronger with the number of peptide spectrum matches, or peptides
- For SILAC, the quantification value is the ratio, not protein abundance. There is no reason to believe there will be a relationship between the mean ratio and variance
As such, DEqMS is preferred over limma for SILAC data.

Load dependencies

Load the required libraries.

library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)

Here, we will use the SILAC data processed in Processing and QC of SILAC data. Please see the previous notebook for details of the experimental design and aim and data processing.

In brief, we wish to determine the proteins which are significantly enriched by UV crosslinking (CL) vs non-CL control (NC).

pep_res <- readRDS('results/pep_res.rds')
prot_res <- readRDS('results/prot_res.rds')

Differential abundance testing

To identify proteins with significantly increased abundance in CL vs NC, we will use DEqMS (Zhu et al. 2020), which you can think of as an extension of limma (Ritchie et al. 2015) specifically for proteomics.

The analysis steps are taken from the DEqMS vignette. The idea is that we want to create an MArrayLM object as per normal limma analysis and then add a $count column to the MArrayLM object and then use the spectraCounteBayes function to perform the Bayesian shrinkage using the count column (which describes the number of pepitdes per protein) rather than the $Amean column (which describes the mean CL:NC ratio).

First, we create the MArrayLM object. We filter out proteins with too few quantification values across the 4 replicates.

dat <- prot_res %>% 
  filterNA(pNA = 0.5) %>% # select proteins in min. 2/4 reps
  exprs()

# Perfoming the equivalent of a one-sample t-test, so the design matrix is just an intercept
design <- cbind(Intercept = rep(1, ncol(dat))) 

fit <- lmFit(dat, design)
efit <- eBayes(fit)

Next, we define the $count column, which DEqMS will use. In the DEqMS paper, they suggest that the best summarisation metric to use is the minimum value across the samples, so our count columns is the minimum number of peptides per protein.

protein_ratio_long <- prot_res %>%
  exprs() %>%
  data.frame() %>%
  tibble::rownames_to_column('Master.Protein.Accessions') %>%
  pivot_longer(cols=-Master.Protein.Accessions, values_to='protein_ratio', names_to='sample')

# Obtain the min peptide count across the samples and determine the minimum value across 
# samples
min_pep_count <- camprotR::count_features_per_protein(pep_res) %>%
  merge(protein_ratio_long, by=c('Master.Protein.Accessions', 'sample')) %>%
  filter(is.finite(protein_ratio)) %>%  # We only want to consider samples with a ratio quantified
  group_by(Master.Protein.Accessions) %>%
  summarise(min_pep_count = min(n))

# add the min peptide count
efit$count <- min_pep_count %>% 
  filter(Master.Protein.Accessions %in% rownames(efit$coefficients)) %>% 
  pull(min_pep_count)

And now we run spectraCounteBayes from DEqMS to perform the statistical test.

# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(efit))

Below, we inspect the peptide count vs variance relationship which DEqMS is using in the statistical test.

In this case the relationship between peptide count and variance is only really apparent when the minimum number of peptides from which a protein-level ratio is obtained is very low. Typically, one might remove quantifications where there is just a single peptide for a protein. Here, have keep them, and we can refer to the count column in the final results object if we want to check the minimum number of peptides observed per sample.

# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")

Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a ‘volcano’ plot to visualise this. Note that all proteins with a statistically significant change are increased in CL, so the plot looks more like a fire-hose than a volcano!

deqms_results <- outputResult(efit_deqms, coef_col = 1)

table(ifelse(deqms_results$sca.adj.pval < 0.01, 'sig.', 'not sig.'),
      ifelse(deqms_results$logFC > 0, 'Higher in CL', 'Lower in CL'))
#>           
#>            Higher in CL Lower in CL
#>   not sig.          409          77
#>   sig.              246           0

deqms_results %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point() +
  theme_camprot() +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
  labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')

Finally, we can explore and, if we wish, export our results. Importantly, the $t, $P.Value and $adj.P.Val columnts are from limma. The columns prefixed with sca are the from DEqMS.

head(deqms_results)
#>           logFC  AveExpr        t      P.Value    adj.P.Val        B   gene
#> P15880 5.843093 5.843093 28.13353 2.300892e-07 0.0001590016 7.528938 P15880
#> P22626 6.561859 6.561859 25.16140 4.344306e-07 0.0001590016 7.084582 P22626
#> Q13148 4.920545 4.920545 22.69799 7.802209e-07 0.0001607499 6.640967 Q13148
#> P43243 4.424089 4.424089 18.70247 2.336893e-06 0.0001607499 5.728685 P43243
#> P23588 5.250793 5.250793 20.13712 1.538278e-06 0.0001607499 6.088209 P23588
#> P84103 5.857123 5.857123 20.66152 1.329806e-06 0.0001607499 6.210104 P84103
#>        count    sca.t  sca.P.Value sca.adj.pval
#> P15880     4 35.89893 4.910590e-08 3.594552e-05
#> P22626     7 29.58960 1.498581e-07 5.484808e-05
#> Q13148     1 24.37400 4.578182e-07 7.341608e-05
#> P43243     7 22.91838 6.522450e-07 7.341608e-05
#> P23588     3 22.86445 6.611324e-07 7.341608e-05
#> P84103     1 21.57023 9.237939e-07 7.341608e-05

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Zhu, Yafeng, Lukas M. Orre, Yan Zhou Tran, Georgios Mermelekas, Henrik J. Johansson, Alina Malyutina, Simon Anders, and Janne Lehtiö. 2020. “DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis.” Molecular & Cellular Proteomics : MCP 19 (6): 1047–57. https://doi.org/10.1074/mcp.TIR119.001646.

Differential abundance testing with SILAC ratios

Tom Smith

2022-05-11

Load dependencies

Differential abundance testing