There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data; the approaches covered below are by no means an exhaustive list.

These are examples only and the code herein is unlikely to be directly applicable to your own dataset.


Load dependencies

Load the required libraries.

library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(broom)
library(biobroom)
library(uniprotREST)

Input data

Here, we will start with the TMT data processed in DQC PSM-level quantification and summarisation to protein-level abundance. Please see the previous notebook for details of the experimental design, aims and data processing.

As a reminder, this data comes from a published benchmark experiment where yeast peptides were spiked into human peptides at three known amounts to provide ground-truth fold changes (e.g. yeast proteins should increase 2-fold from the 1x to the 2x samples). For more details, see Smith et al. (2022).

First, we read in the protein-level quantification data.

tmt_protein <- readRDS('./results/tmt_protein.rds')

To keep things simple, we will just focus on the comparison between the 1x and 2x yeast spike-in samples (the first 7 TMT tags).

tmt_protein <- tmt_protein[,1:7]

# this is needed to make sure the spike factor doesn't contain unused levels (x6)
pData(tmt_protein)$spike <- droplevels(pData(tmt_protein)$spike)

Testing for differential abundance

We will use three approaches to identify proteins with significant differences in abundance:

- two-sample t-test
- moderated two-sample t-test (limma)
- moderated two-sample t-test (DEqMS)

T-test

To perform a t-test for each protein, we want to extract the quantification values in a long ‘tidy’ format. We can do this using the biobroom package.

tmt_protein_tidy <- tmt_protein %>%
  biobroom::tidy.MSnSet(addPheno=TRUE) %>% # addPheno=TRUE adds the pData so we have the sample information too
  filter(is.finite(value))
#> Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.

As an example of how to run a single t-test, let’s subset to a single protein. First, we extract the quantification values for this single protein.


example_protein <- 'P40414'

tmt_protein_tidy_example <- tmt_protein_tidy %>%
  filter(protein==example_protein) %>%
  select(-sample)

print(tmt_protein_tidy_example)
#> # A tibble: 7 x 3
#>   protein spike value
#>   <chr>   <fct> <dbl>
#> 1 P40414  x1     7.51
#> 2 P40414  x1     7.39
#> 3 P40414  x1     7.44
#> 4 P40414  x1     7.65
#> 5 P40414  x2     8.30
#> 6 P40414  x2     8.09
#> 7 P40414  x2     8.13

Then we use t.test to perform the t-test.


t.test.res <- t.test(formula=value~spike,
                     data=tmt_protein_tidy_example,
                     alternative='two.sided')

print(t.test.res)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  value by spike
#> t = -8.0003, df = 4.5053, p-value = 0.0007918
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.9034124 -0.4528401
#> sample estimates:
#> mean in group x1 mean in group x2 
#>         7.496690         8.174816

We can use tidy from the broom package to return the t-test results in a tidy tibble. The value of this will be seen in the next code chunk.

tidy(t.test.res)
#> # A tibble: 1 x 10
#>   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
#>      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
#> 1   -0.678      7.50      8.17     -8.00 7.92e-4      4.51   -0.903    -0.453
#> # … with 2 more variables: method <chr>, alternative <chr>

We can now apply a t-test to every protein using dplyr’s group_by and do, making use of tidy.

t.test.res.all <- tmt_protein_tidy %>%
  group_by(protein) %>%
  do(tidy(t.test(formula=value~spike,
                 data=.,
                 alternative='two.sided')))
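
As an aside, dplyr’s do is superseded in recent dplyr versions. A minimal sketch of an equivalent approach using tidyr::nest and purrr::map (assuming the purrr package is installed) would be:

# nest the data for each protein, map the t-test over the nested tibbles,
# then unnest the tidied results
t.test.res.all.alt <- tmt_protein_tidy %>%
  tidyr::nest(data = -protein) %>%
  mutate(res = purrr::map(data, ~ tidy(t.test(value ~ spike, data = .x)))) %>%
  select(-data) %>%
  tidyr::unnest(res)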

Here are the results of the t-test for the example protein. As we can see, the ‘estimate’ column in t.test.res.all is the difference in mean protein abundance between the groups, with ‘estimate1’ and ‘estimate2’ holding the group means. The ‘statistic’ column is the t-statistic and the ‘parameter’ column is the degrees of freedom for the t-statistic. All the values are identical since we have performed the exact same test with both approaches.

print(t.test.res)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  value by spike
#> t = -8.0003, df = 4.5053, p-value = 0.0007918
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.9034124 -0.4528401
#> sample estimates:
#> mean in group x1 mean in group x2 
#>         7.496690         8.174816
t.test.res.all %>% filter(protein==example_protein)
#> # A tibble: 1 x 11
#> # Groups:   protein [1]
#>   protein estimate estimate1 estimate2 statistic p.value parameter conf.low
#>   <chr>      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
#> 1 P40414    -0.678      7.50      8.17     -8.00 7.92e-4      4.51   -0.903
#> # … with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

When you are performing a lot of statistical tests at the same time, it’s recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution between 0 and 1 for those tests where the null hypothesis is true, with statistically significant tests showing as a peak of very low p-values. Clear skews away from uniformity, or peaks anywhere other than in the smallest p-value bin, may indicate that the assumptions of the test are not valid for some or all tests.
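
To build intuition for what a well-behaved null looks like, here is a quick simulation (a sketch, separate from our analysis): p-values from t-tests comparing two groups drawn from the same distribution should be approximately uniform.

# simulate 5000 t-tests where the null hypothesis is true: two groups
# (n=4 and n=3, matching our design) drawn from the same normal distribution
set.seed(42)
null_p <- replicate(5000, t.test(rnorm(4), rnorm(3))$p.value)
hist(null_p, 20)

Compare this with the distribution for our actual tests, plotted below.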


hist(t.test.res.all$p.value, 20)

Discussion 1

What would you conclude from the p-value distribution above?

Solution

# Here, we have so many significant tests that the uniform distribution is hard to assess!
# Note that, beyond the clear peak for very low p-values (<0.05), we also have a
# slight skew towards low p-values in the range 0.05-0.2.
# This may indicate insufficient statistical power to detect some proteins that
# are truly differentially abundant.

Solution end

Since we have performed multiple tests, we want to calculate adjusted p-values to control the rate of type I errors (false positives).

Here, we are using the Benjamini-Hochberg method (Benjamini and Hochberg 1995) to control the False Discovery Rate (FDR), i.e. the expected proportion of false positives among the rejected null hypotheses.

t.test.res.all$padj <- p.adjust(t.test.res.all$p.value, method='BH')
table(t.test.res.all$padj<0.01)
#> 
#> FALSE  TRUE 
#>  6915  3037

At an FDR of 1%, we have 3037 proteins with a significant difference.

Moderated t-test (limma)

Proteomics experiments are typically lowly replicated (e.g. n << 10), so variance estimates are inaccurate. limma (Ritchie et al. 2015) is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to improve the variance estimates. This decreases false positives and increases statistical power.
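
To see why this matters, here is a quick simulation (separate from our analysis): with only four replicates, estimates of one and the same true variance (here, 1) range over orders of magnitude.

# sample variances from 1000 draws of n=4 observations with true variance 1
set.seed(1)
summary(replicate(1000, var(rnorm(4))))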

Next, we extract the quantification matrix and create the design matrix. We supply these to limma::lmFit to fit the linear model according to the design (returning an MArrayLM object) and then use limma::eBayes to compute moderated test statistics.


exprs_for_limma <- exprs(tmt_protein)

# Performing the equivalent of a two-sample t-test
spike <- pData(tmt_protein)$spike

limma_design <- model.matrix(formula(~spike))

limma_fit <- lmFit(exprs_for_limma, limma_design)
limma_fit <- eBayes(limma_fit, trend=TRUE)

We can visualise the relationship between the average abundance and the variance using the limma::plotSA function.

limma::plotSA(limma_fit)

Discussion 2

How would you interpret the plot above?

Solution

# There's a really clear relationship between protein abundance and variance!

Solution end

In this case, the variances will be shrunk towards a value that depends on the mean protein abundance vs variance trend.
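
We can visualise this shrinkage directly by comparing each protein’s raw residual variance (limma_fit$sigma^2) with its moderated, posterior variance (limma_fit$s2.post). A minimal base-graphics sketch:

# raw variances in black, moderated variances in red; moderation pulls each
# protein's variance estimate towards the fitted mean-variance trend
plot(limma_fit$Amean, limma_fit$sigma^2, pch=16, cex=0.3, log='y',
     xlab='Mean abundance', ylab='Variance (log scale)')
points(limma_fit$Amean, limma_fit$s2.post, pch=16, cex=0.3, col='red')
legend('topright', legend=c('raw', 'moderated'), pch=16, col=c('black', 'red'))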

We can extract a results table like so.


# use colnames(limma_fit$coefficients) to identify the coefficient names
limma_results <- topTable(limma_fit, n=Inf, coef='spikex2')

Below, we summarise the number of proteins with statistically different abundance in 2x vs 1x and plot a ‘volcano’ plot to visualise this.


table(limma_results$adj.P.Val<0.01)
#> 
#> FALSE  TRUE 
#>  4024  5928


limma_results %>%
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
  geom_point(size=0.5) +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
  labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)')

We can also compare the limma results for our example protein with the t-test results from earlier.

t.test.res.all %>% filter(protein==example_protein)
#> # A tibble: 1 x 12
#> # Groups:   protein [1]
#>   protein estimate estimate1 estimate2 statistic p.value parameter conf.low
#>   <chr>      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
#> 1 P40414    -0.678      7.50      8.17     -8.00 7.92e-4      4.51   -0.903
#> # … with 4 more variables: conf.high <dbl>, method <chr>, alternative <chr>,
#> #   padj <dbl>
limma_results[example_protein,]
#>            logFC  AveExpr        t      P.Value    adj.P.Val        B
#> P40414 0.6781263 7.787315 9.726588 4.507344e-07 5.006969e-06 6.480212

Discussion 3

Given the experimental design, would you expect proteins to have significant changes in both directions?

Solution

# Yes! We are mixing known quantities of human and yeast proteins together such
# that yeast proteins increase 2-fold in abundance from the 1x to the 2x samples, and
# human proteins decrease slightly in abundance (to balance out the total amount of
# protein in the samples)

Solution end

It would make more sense to split the volcano plot by the species from which each protein derives. We can obtain this information from UniProt like so.

species <- uniprot_map(
  ids = rownames(tmt_protein),
  from = "UniProtKB_AC-ID",
  to = "UniProtKB",
  fields = "organism_name"
) %>% rename(c('UNIPROTKB'='From'))
#> Job ID: 07bf9376998483634bbed73a66024a310bb28df4
#> Page 1 of 20
#> Success
#> ...
#> Page 20 of 20
#> Success

Exercise 1

Merge the species and limma_results data.frames and re-plot the volcano plot, with one panel for each species. An example of the desired output is shown below.

Hint: see ?facet_wrap

Solution


limma_results %>%
  merge(species, by.x='row.names', by.y='UNIPROTKB') %>%
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
  geom_point(size=0.5) +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
  labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)') +
  facet_wrap(~(gsub('\\(.*', '', Organism)))

Solution end

We can now compare the results from the t-test and the moderated t-test (limma). Below, we update the column names so it’s easier to see which column comes from which test and then merge the two test results.

tmt_compare_tests <- merge(
  setNames(limma_results, paste0('limma.', colnames(limma_results))),
  setNames(t.test.res.all, paste0('t.test.', colnames(t.test.res.all))),
  by.x='row.names',
  by.y='t.test.protein')

Below, we can compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with limma than the standard t-test.

p <- ggplot(tmt_compare_tests) +
  aes(log10(t.test.p.value), log10(limma.P.Value)) +
  geom_point(size=0.2, alpha=0.2) +
  geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
  theme_camprot(border=FALSE) +
  labs(x='T-test log10(p-value)', y='limma log10(p-value)')


print(p)

Finally, we can compare the number of proteins with a significant difference according to each test. Using the t-test, there are 3037 significant differences at 1% FDR, while 6321 proteins have a limma p-value below 0.01 (5928 at 1% FDR, as we saw above).

tmt_compare_tests %>%
  group_by(t.test.padj<0.01,
           limma.P.Value<0.01) %>%
  tally()
#> # A tibble: 4 x 3
#> # Groups:   t.test.padj < 0.01 [2]
#>   `t.test.padj < 0.01` `limma.P.Value < 0.01`     n
#>   <lgl>                <lgl>                  <int>
#> 1 FALSE                FALSE                   3616
#> 2 FALSE                TRUE                    3299
#> 3 TRUE                 FALSE                     15
#> 4 TRUE                 TRUE                    3022
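
Note that the chunk above thresholds the raw limma p-value. For a like-for-like comparison at 1% FDR for both tests, you could threshold the limma adjusted p-value instead (output not shown; the limma counts will differ from the table above):

# both tests thresholded on the FDR-adjusted p-value
tmt_compare_tests %>%
  group_by(t.test.padj<0.01,
           limma.adj.P.Val<0.01) %>%
  tally()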

Moderated t-test (DEqMS)

limma assumes there is a relationship between protein abundance and variance. This is usually the case, and we saw above that it clearly holds for our data. For LFQ data, the relationship between the variance and the number of peptides may be stronger still.

DEqMS (Zhu et al. 2020) is an extension of limma (Ritchie et al. 2015) specifically for proteomics, which uses the number of peptides/PSMs rather than the mean abundance to share information between proteins.

The analysis steps are taken from the DEqMS vignette. We start from the MArrayLM object we created for the limma analysis, simply add a $count column describing the number of peptides (here, PSMs) per protein, and use the spectraCounteBayes function to perform the Bayesian shrinkage based on this count column. This is in contrast to limma, which uses the $Amean column, describing the mean protein abundance.

To define the $count column, we need to summarise the number of PSMs per protein. In the DEqMS paper, they suggest that the best summarisation metric to use is the minimum value across the samples, so our count column is the minimum number of PSMs per protein.


tmt_psm_res <- readRDS('./results/psm_filt.rds')

# Obtain the number of PSMs per protein in each sample, then take the
# minimum count across the samples
min_psm_count <- camprotR::count_features_per_protein(tmt_psm_res) %>%
  merge(tmt_protein_tidy,
        by.x=c('Master.Protein.Accessions', 'sample'),
        by.y=c('protein', 'sample')) %>%
  group_by(Master.Protein.Accessions) %>%
  summarise(min_psm_count = min(n))

# add the min PSM count, matching on protein IDs so the counts are in the
# same order as the proteins in the MArrayLM object
limma_fit$count <- min_psm_count$min_psm_count[
  match(rownames(limma_fit$coefficients), min_psm_count$Master.Protein.Accessions)]

And now we run spectraCounteBayes from DEqMS to perform the statistical test.

# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(limma_fit))

Below, we inspect the peptide count vs variance relationship which DEqMS is using in the statistical test.

# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "PSMs")

Below, we summarise the number of proteins with statistically different abundance in 2x vs 1x and plot a ‘volcano’ plot to visualise this.

deqms_results <- outputResult(efit_deqms, coef_col=2)


table(deqms_results$sca.adj.pval<0.01)
#> 
#> FALSE  TRUE 
#>  4344  5608


deqms_results %>%
  merge(species, by.x='row.names', by.y='UNIPROTKB') %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point(size=0.5) +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
  labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)') +
  facet_wrap(~(gsub('\\(.*', '', Organism)))

We can compare the results of limma and DEqMS by considering the number of significant differences. Note that the limma results are also contained within the results from DEqMS: the $t, $P.Value and $adj.P.Val columns are from limma, while the columns prefixed with sca. are from DEqMS.
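
For example, we can place the two sets of test statistics side by side (output not shown):

# limma columns (t, P.Value, adj.P.Val) alongside the DEqMS sca.* columns
head(deqms_results[, c('t', 'P.Value', 'adj.P.Val',
                       'sca.t', 'sca.P.Value', 'sca.adj.pval')])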

deqms_results %>%
  group_by(limma_sig=adj.P.Val<0.01,
           DEqMS_sig=sca.adj.pval<0.01) %>%
  tally()
#> # A tibble: 4 x 3
#> # Groups:   limma_sig [2]
#>   limma_sig DEqMS_sig     n
#>   <lgl>     <lgl>     <int>
#> 1 FALSE     FALSE      3909
#> 2 FALSE     TRUE        115
#> 3 TRUE      FALSE       435
#> 4 TRUE      TRUE       5493

We can compare the results of limma and DEqMS by considering the p-values.


deqms_results %>%
  ggplot() +
  aes(P.Value, sca.P.Value) +
  geom_point(size=0.5, alpha=0.5) +
  geom_abline(slope=1, colour=get_cat_palette(2)[2], size=1, linetype=2) +
  theme_camprot(border=FALSE) +
  labs(x='limma p-value', y='DEqMS p-value') +
  scale_x_log10() +
  scale_y_log10()

Discussion 4

The p-values from limma and DEqMS are very well correlated. This is despite the two methods using a different dependent variable to shrink the variance (limma = mean expression, DEqMS = the number of PSMs). Why might this be?

Solution


# This suggests that the dependent variables are likely to be highly correlated,
# which makes sense since we are using sum summarisation from PSM -> protein,
# so proteins with more PSMs will likely have a higher abundance.
# We can check this like so (assuming the Hmisc package is installed)

deqms_results %>%
  ggplot() +
  aes(Hmisc::cut2(count, g=20), AveExpr) +
  geom_boxplot() +
  theme_camprot(border=FALSE) +
  labs(x='PSMs', y='Average abundance') +
  theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))

Solution end

Finally, at this point we can save any one of the data.frames containing the statistical test results, either to a compressed format (.rds) to read back into a later R notebook, or a flat file (.tsv) to read with e.g. Excel.

# These lines are not run and are examples only

saveRDS(deqms_results, 'filename_to_save_to.rds')
write.table(deqms_results, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE, quote=FALSE)

References

Session info

#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] uniprotREST_0.0.0.9000 biobroom_1.20.0        broom_0.7.0           
#>  [4] tidyr_1.1.2            dplyr_1.0.4            DEqMS_1.6.0           
#>  [7] limma_3.44.3           MSnbase_2.14.2         ProtGenerics_1.20.0   
#> [10] S4Vectors_0.26.1       mzR_2.22.0             Rcpp_1.0.5            
#> [13] Biobase_2.48.0         BiocGenerics_0.34.0    ggplot2_3.3.2         
#> [16] camprotR_0.0.0.9000   
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.3.1            vsn_3.56.0            splines_4.0.3        
#>  [4] jsonlite_1.7.2        foreach_1.5.0         bslib_0.2.4          
#>  [7] assertthat_0.2.1      BiocManager_1.30.10   affy_1.66.0          
#> [10] highr_0.8             blob_1.2.1            yaml_2.2.1           
#> [13] robustbase_0.93-6     impute_1.62.0         pillar_1.4.6         
#> [16] backports_1.1.10      lattice_0.20-41       glue_1.6.2           
#> [19] digest_0.6.27         colorspace_1.4-1      htmltools_0.5.1.1    
#> [22] preprocessCore_1.50.0 plyr_1.8.6            MALDIquant_1.19.3    
#> [25] XML_3.99-0.5          pkgconfig_2.0.3       zlibbioc_1.34.0      
#> [28] purrr_0.3.4           scales_1.1.1          affyio_1.58.0        
#> [31] BiocParallel_1.22.0   tibble_3.0.3          farver_2.0.3         
#> [34] generics_0.0.2        IRanges_2.22.2        ellipsis_0.3.1       
#> [37] withr_2.3.0           cli_3.3.0             magrittr_1.5         
#> [40] crayon_1.3.4          evaluate_0.15         ncdf4_1.17           
#> [43] fansi_0.4.1           doParallel_1.0.15     MASS_7.3-53          
#> [46] httr2_0.2.1           tools_4.0.3           lifecycle_0.2.0      
#> [49] stringr_1.4.0         munsell_0.5.0         pcaMethods_1.80.0    
#> [52] compiler_4.0.3        jquerylib_0.1.3       mzID_1.26.0          
#> [55] rlang_1.0.5           grid_4.0.3            iterators_1.0.12     
#> [58] rstudioapi_0.11       rappdirs_0.3.1        labeling_0.3         
#> [61] rmarkdown_2.10        gtable_0.3.0          codetools_0.2-16     
#> [64] curl_4.3              DBI_1.1.0             R6_2.4.1             
#> [67] knitr_1.39            utf8_1.1.4            stringi_1.5.3        
#> [70] vctrs_0.3.6           DEoptimR_1.0-8        tidyselect_1.1.0     
#> [73] xfun_0.31

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological) 57 (1): 289–300.

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Smith, Tom S., Anna Andrejeva, Josie Christopher, Oliver M. Crook, Mohamed Elzek, and Kathryn S. Lilley. 2022. “Prior Signal Acquisition Software Versions for Orbitrap Underestimate Low Isobaric Mass Tag Intensities, Without Detriment to Differential Abundance Experiments.” ACS Measurement Science Au, March. https://doi.org/10.1021/acsmeasuresciau.1c00053.

Zhu, Yafeng, Lukas M. Orre, Yan Zhou Tran, Georgios Mermelekas, Henrik J. Johansson, Alina Malyutina, Simon Anders, and Janne Lehtiö. 2020. “DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis.” Molecular & Cellular Proteomics : MCP 19 (6): 1047–57. https://doi.org/10.1074/mcp.TIR119.001646.