Statistical tests

There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data; the approaches used here are by no means exhaustive. Of note, DEqMS extends the moderated t-test from limma to also take into account the number of peptides per protein; as such, DEqMS should be preferred over limma for proteomics data.

Here, we will perform statistical analyses on SILAC data.

These are examples only and the code herein is unlikely to be directly applicable to your own dataset.


Load dependencies

Load the required libraries.

library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(broom)
library(biobroom)

Input data

Here, we will start with the SILAC data processed in Processing and QC of SILAC data. Please see the previous notebook for details of the experimental design, aims and data processing.

In brief, we wish to determine the proteins which are significantly enriched by UV crosslinking (CL) vs non-CL control (NC).

First, we read in the protein-level ratios obtained in the above notebook. We will filter out proteins which were not quantified in at least 3 of the 4 replicates.

silac_protein <- readRDS('results/prot_res.rds') %>%
  filterNA(pNA = 1/4) # max 1/4 missing.
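As a quick sanity check, we can confirm the filtering behaved as expected. Below is a minimal sketch using exprs() from MSnbase (also used later in this notebook) to tabulate the missing values per protein; after filtering, no protein should have more than one NA.

# Tabulate the number of missing values per protein after filtering
table(rowSums(is.na(exprs(silac_protein))))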

Testing for differential abundance

We will use two approaches: a standard t-test and a moderated t-test (DEqMS).

T-test

To perform a t-test for each protein, we want to extract the quantification values in a long ‘tidy’ format. We can do this using the biobroom package.

silac_protein_tidy <- silac_protein %>%
  biobroom::tidy.MSnSet() %>%
  filter(is.finite(value))
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

As an example of how to run a single t-test, let’s subset to a single protein. First, we extract the quantification values for this protein.


example_protein <- 'O00159'

silac_protein_tidy_example <- silac_protein_tidy %>%
  filter(protein==example_protein)

print(silac_protein_tidy_example)
#> # A tibble: 4 × 3
#>   protein sample.id value
#>   <chr>   <chr>     <dbl>
#> 1 O00159  ratio_1    1.40
#> 2 O00159  ratio_2    1.62
#> 3 O00159  ratio_4    1.68
#> 4 O00159  ratio_3    1.80

Then we use t.test to perform the t-test. Since we are supplying only a single vector of values, a one-sample t-test is performed.

t.test.res <- t.test(silac_protein_tidy_example$value,
                     alternative='two.sided')

print(t.test.res)
#> 
#>  One Sample t-test
#> 
#> data:  silac_protein_tidy_example$value
#> t = 19.143, df = 3, p-value = 0.0003113
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>  1.353325 1.893016
#> sample estimates:
#> mean of x 
#>   1.62317

We can use tidy from the broom package to return the t-test results in a tidy tibble. The value of this will be seen in the next code chunk.

tidy(t.test.res)
#> # A tibble: 1 × 8
#>   estimate statistic  p.value parameter conf.low conf.high method    alternative
#>      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>     <chr>      
#> 1     1.62      19.1 0.000311         3     1.35      1.89 One Samp… two.sided

We can now apply a t-test to every protein using dplyr group_by and do, making use of tidy.

t.test.res.all <- silac_protein_tidy %>%
  group_by(protein) %>%
  do(tidy(t.test(.$value, alternative='two.sided')))
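As an aside, do() is superseded in recent versions of dplyr. A sketch of an equivalent approach, assuming dplyr >= 1.0, where summarise() can return a data frame whose columns are unpacked into the result (t.test.res.all.alt is an arbitrary name):

# Equivalent to group_by + do: summarise() unpacks the tidied t-test results
t.test.res.all.alt <- silac_protein_tidy %>%
  group_by(protein) %>%
  summarise(tidy(t.test(value, alternative='two.sided')))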

Here are the results of the t-test for the example protein. As we can see, the ‘estimate’ column in t.test.res.all is the mean log2 ratio. The ‘statistic’ column is the t-statistic and the ‘parameter’ column is the degrees of freedom for the t-statistic. All the values are identical since we have performed the exact same test with both approaches.

print(t.test.res)
#> 
#>  One Sample t-test
#> 
#> data:  silac_protein_tidy_example$value
#> t = 19.143, df = 3, p-value = 0.0003113
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>  1.353325 1.893016
#> sample estimates:
#> mean of x 
#>   1.62317
t.test.res.all %>% filter(protein==example_protein)
#> # A tibble: 1 × 9
#> # Groups:   protein [1]
#>   protein estimate statistic  p.value parameter conf.low conf.high method       
#>   <chr>      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>        
#> 1 O00159      1.62      19.1 0.000311         3     1.35      1.89 One Sample t…
#> # … with 1 more variable: alternative <chr>

When you are performing a lot of statistical tests at the same time, it’s recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution from 0-1 for those tests where the null hypothesis should not be rejected. Statistically significant tests will show as a peak of very low p-values. If there are very clear skews in the uniform distribution, or strange peaks other than in the smallest p-value bin, that may indicate the assumptions of the test are not valid, for some or all tests.

hist(t.test.res.all$p.value, 20)

Discussion 1

What would you conclude from the p-value distribution above?

Solution

# Here, we have so many significant tests that the uniform distribution is hard to assess!
# Note that, beyond the clear peak for very low p-values (<0.05), we also have a
# slight skew towards low p-values in the range 0.05-0.15.
# This may indicate insufficient statistical power to detect some proteins that are truly enriched upon CL.

Solution end

Since we have performed multiple tests, we want to calculate adjusted p-values to control the rate of type I errors (false positives).

Here, we are using the Benjamini & Hochberg (1995) method to control the False Discovery Rate (FDR), i.e. the expected proportion of false positives among the rejected null hypotheses.

t.test.res.all$padj <- p.adjust(t.test.res.all$p.value, method='BH')
table(t.test.res.all$padj<0.01)
#> 
#> FALSE  TRUE 
#>   424   113

At an FDR of 1%, we have 113 proteins with a significant difference.
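If you want the identities of these proteins, e.g. for downstream annotation, a one-line sketch (sig_proteins is an arbitrary name for illustration):

# Subset to the proteins passing the 1% FDR threshold
sig_proteins <- filter(t.test.res.all, padj < 0.01)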

Finally, a differential abundance test wouldn’t be complete without a volcano plot!

Note that all proteins with a statistically significant change are increased in CL, so it doesn’t resemble the typical volcano plot.


t.test.res.all %>%
  ggplot(aes(x = estimate, y = -log10(p.value), colour = padj < 0.01)) +
  geom_point() +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
  labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')

Discussion 2

SILAC ratios are equivalent to the absolute abundance ratios between two or more samples. If this experiment had been performed using LFQ or TMT, which quantify the relative abundances in each sample, how would you expect the above volcano plot to look?

Solution

# It would likely be much more symmetrical, since the trend towards increased abundance
# in CL would be masked by the comparison of the relative abundances in CL and NC.
# Proteins which are highly enriched upon CL would have reduced ratios and those with
# more modest enrichment may even appear to be relatively depleted!

Solution end

Moderated t-test (DEqMS)

To identify proteins with significantly increased abundance in CL vs NC, we will use DEqMS (Zhu et al. 2020), which you can think of as an extension of limma (Ritchie et al. 2015) specifically for proteomics.

The analysis steps are taken from the DEqMS vignette. We first create an MArrayLM object, as per a standard limma analysis, and then add a $count column to the MArrayLM object and use the spectraCounteBayes function to perform the Bayesian shrinkage using the count column, which describes the number of peptides per protein. This is in contrast to limma, which uses the $Amean column, which describes the mean CL:NC ratio.

First, we create the MArrayLM object.

dat <- silac_protein  %>%
  exprs()

# Performing the equivalent of a one-sample t-test, so the design matrix is just an intercept
design <- cbind(Intercept = rep(1, ncol(dat)))

fit <- lmFit(dat, design)
efit <- eBayes(fit)
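With an intercept-only design, the null hypothesis tested for each protein is that the mean log2 CL vs NC ratio is zero, mirroring the one-sample t-test above. For comparison, here is a sketch of what the design could look like for a two-group experiment (e.g. separate CL and NC quantification columns rather than ratios); the condition factor is entirely hypothetical:

# Hypothetical two-group design (not applicable to these ratio data)
condition <- factor(rep(c('CL', 'NC'), each=2))
design_two_group <- model.matrix(~condition)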

Next, we define the $count column, which DEqMS will use. In the DEqMS paper, they suggest that the best summarisation metric to use is the minimum value across the samples, so our count column is the minimum number of peptides per protein.

protein_ratio_long <- silac_protein %>%
  exprs() %>%
  data.frame() %>%
  tibble::rownames_to_column('Master.Protein.Accessions') %>%
  pivot_longer(cols=-Master.Protein.Accessions, values_to='protein_ratio', names_to='sample')

pep_res <- readRDS('results/pep_res.rds')

# Obtain the number of peptides per protein in each sample, then take the
# minimum count across samples
min_pep_count <- camprotR::count_features_per_protein(pep_res) %>%
  merge(protein_ratio_long, by=c('Master.Protein.Accessions', 'sample')) %>%
  filter(is.finite(protein_ratio)) %>%  # We only want to consider samples with a ratio quantified
  group_by(Master.Protein.Accessions) %>%
  summarise(min_pep_count = min(n))

# add the min peptide count
efit$count <- min_pep_count$min_pep_count
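Note that this assignment assumes the rows of min_pep_count are in the same order as the proteins in the fit object. A more defensive sketch, matching on protein accession explicitly (this assumes, as here, that the rownames of the coefficients matrix are the protein accessions):

# Align the counts to the fit object by accession rather than relying on row order
efit$count <- min_pep_count$min_pep_count[
  match(rownames(efit$coefficients), min_pep_count$Master.Protein.Accessions)]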

And now we run spectraCounteBayes from DEqMS to perform the statistical test.

# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(efit))

Below, we inspect the peptide count vs variance relationship which DEqMS is using in the statistical test.

In this case the relationship between peptide count and variance is only really apparent when the minimum number of peptides from which a protein-level ratio is obtained is very low. Typically, one might remove quantification values where there is just a single peptide for a protein. Here, we have kept them, and we can refer to the count column in the final results object if we want to check the minimum number of peptides observed per sample.

# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")
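If we want to see how many proteins rely on very few peptides, a one-liner sketch using the count column added above:

# Distribution of the minimum peptide count per protein
table(efit_deqms$count)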

Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a ‘volcano’ plot to visualise this.

Here, all proteins with a statistically significant change are increased in CL, so the plot looks more like a fire-hose than a volcano!

deqms_results <- outputResult(efit_deqms, coef_col = 1)

deqms_results %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point() +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
  labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')

Finally, we can explore and, if we wish, export our results. Importantly, the $t, $P.Value and $adj.P.Val columns are from limma. The columns prefixed with sca are from DEqMS.

head(deqms_results)
#>           logFC  AveExpr        t      P.Value    adj.P.Val        B   gene
#> P15880 5.843093 5.843093 28.13619 1.386402e-07 6.903875e-05 8.065292 P15880
#> P22626 6.561859 6.561859 25.35616 2.571276e-07 6.903875e-05 7.615847 P22626
#> P23588 5.250793 5.250793 20.29291 9.613338e-07 7.413109e-05 6.546409 P23588
#> P43243 4.424089 4.424089 18.79343 1.512070e-06 7.413109e-05 6.148035 P43243
#> P07910 5.003172 5.003172 18.75219 1.531772e-06 7.413109e-05 6.136436 P07910
#> P49207 4.811506 4.811506 19.47370 1.226059e-06 7.413109e-05 6.334277 P49207
#>        count    sca.t  sca.P.Value sca.adj.pval
#> P15880     4 32.84685 1.803524e-08 9.684921e-06
#> P22626     7 28.92009 4.105560e-08 1.102343e-05
#> P23588     3 22.32533 2.175142e-07 2.195040e-05
#> P43243     7 21.88917 2.469094e-07 2.195040e-05
#> P07910     6 21.18579 3.045174e-07 2.195040e-05
#> P49207     2 20.97258 3.249394e-07 2.195040e-05

We can now compare the results from the t-test and the moderated t-test (DEqMS). Below, we update the column names so it’s easier to see which column comes from which test.

colnames(t.test.res.all) <- paste0('t.test.', colnames(t.test.res.all))
colnames(deqms_results) <- paste0('deqms.', colnames(deqms_results))

And then merge the two test results.

silac_compare_tests <- merge(deqms_results,
      t.test.res.all,
      by.x='row.names',
      by.y='t.test.protein')

Exercise

Compare the effect size estimates from the two methods with a scatter plot. Can you explain what you observe?

Hints:

  • For the t-test, you want the ‘t.test.estimate’ column
  • For the moderated t-test, you want the ‘deqms.logFC’ column

Solution

ggplot(silac_compare_tests) +
  aes(t.test.estimate, deqms.logFC) +
  geom_point() +
  geom_abline(slope=1) +
  theme_camprot(border=FALSE) +
  labs(x='t.test mean ratio', y='DEqMS mean ratio')

# The ratios are the same. DEqMS (and limma) are moderating the test statistics, but not the estimated fold-change.

Solution end

We can also compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with DEqMS than the standard t-test.

p <- ggplot(silac_compare_tests) +
  aes(log10(t.test.p.value), log10(deqms.sca.P.Value)) +
  geom_point() +
  geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
  theme_camprot(border=FALSE) +
  labs(x='T-test log10(p-value)', y='DEqMS log10(p-value)')

print(p)
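We can quantify this claim directly; a one-liner sketch comparing the two p-value columns:

# Proportion of proteins where the DEqMS p-value is lower than the t-test p-value
mean(silac_compare_tests$deqms.sca.P.Value < silac_compare_tests$t.test.p.value)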

Finally, we can compare the number of proteins with a significant difference (using a 1% FDR threshold) according to each test. Using DEqMS more than doubles the number of null hypothesis rejections. Only 3 proteins are significant with the t-test but not the moderated t-test.

silac_compare_tests %>%
  group_by(t.test.padj<0.01,
           deqms.sca.adj.pval<0.01) %>%
  tally()
#> # A tibble: 4 × 3
#> # Groups:   t.test.padj < 0.01 [2]
#>   `t.test.padj < 0.01` `deqms.sca.adj.pval < 0.01`     n
#>   <lgl>                <lgl>                       <int>
#> 1 FALSE                FALSE                         266
#> 2 FALSE                TRUE                          158
#> 3 TRUE                 FALSE                           3
#> 4 TRUE                 TRUE                          110

At this point, we can save any one of the data.frames containing the statistical test results, either to a compressed binary format (.rds) to read back into a later R notebook, or a flatfile (.tsv) to read with e.g. Excel.

# These lines are not run and are examples only

# Saving to R binary format. Can read back into memory with readRDS().
saveRDS(silac_compare_tests, 'filename_to_save_to.rds')

# Saving to tab-separated (human-readable) flatfile
write.table(silac_compare_tests, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE)

Extended Exercise (Optional)

Compare the number of significant differences (at 1% FDR) for each method, stratifying by the number of peptides per protein. An example of how to visualise this is below, though you do not need to replicate this visualisation precisely!

How do you interpret this?

Hints:

  • For the number of peptides, you want the deqms.count column. You need to bin this into reasonable bins. See e.g. ?Hmisc::cut2 for this.

Solution


# Start by tallying as in code chunk above
silac_compare_tests %>%
  group_by('min_peptides'=Hmisc::cut2(deqms.count, cuts=c(1:4,10)),
           't-test'=t.test.padj<0.01,
           'deqms'=deqms.sca.adj.pval<0.01) %>%
  tally() %>%

  # make a single column describing yes/no significant with each method and
  # rename to make it more intuitive
  mutate(compare_tests=recode(interaction(`t-test`, deqms),
                              'FALSE.FALSE'='Neither',
                              'TRUE.TRUE'='Both',
                              'TRUE.FALSE'='t-test',
                              'FALSE.TRUE'='DEqMS')) %>%
  mutate(compare_tests=factor(compare_tests, levels=c('t-test', 'DEqMS', 'Both', 'Neither'))) %>%

  ggplot(aes(min_peptides, n, fill=compare_tests)) +
  geom_bar(stat='identity',position='fill') +
  geom_text(position=position_fill(vjust=0.5), aes(label=n)) +
  theme_camprot(border=FALSE, base_size=15) +
  scale_fill_manual(values=c(get_cat_palette(3), 'grey'), name='') +
  labs(x='Minimum peptides per sample', y='Fraction of proteins')


# Note that the 3 proteins which are only significant with the t-test have a
# minimum of just 1-2 peptides per sample. These are likely to be poorly quantified
# and falsely identified as CL-enriched.

Solution end

References

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Zhu, Yafeng, Lukas M. Orre, Yan Zhou Tran, Georgios Mermelekas, Henrik J. Johansson, Alina Malyutina, Simon Anders, and Janne Lehtiö. 2020. “DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis.” Molecular & Cellular Proteomics: MCP 19 (6): 1047–57. https://doi.org/10.1074/mcp.TIR119.001646.

Session info

#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] biobroom_1.26.0     broom_0.7.12        tidyr_1.2.0        
#>  [4] dplyr_1.0.8         DEqMS_1.12.1        limma_3.50.1       
#>  [7] matrixStats_0.61.0  MSnbase_2.20.4      ProtGenerics_1.26.0
#> [10] S4Vectors_0.32.3    mzR_2.28.0          Rcpp_1.0.8.3       
#> [13] Biobase_2.54.0      BiocGenerics_0.40.0 ggplot2_3.3.5      
#> [16] camprotR_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] RColorBrewer_1.1-2    doParallel_1.0.17     tools_4.1.2          
#>  [4] backports_1.4.1       bslib_0.3.1           utf8_1.2.2           
#>  [7] R6_2.5.1              affyio_1.64.0         rpart_4.1.16         
#> [10] Hmisc_4.6-0           DBI_1.1.2             colorspace_2.0-3     
#> [13] nnet_7.3-17           withr_2.5.0           gridExtra_2.3        
#> [16] tidyselect_1.1.2      compiler_4.1.2        preprocessCore_1.56.0
#> [19] cli_3.2.0             htmlTable_2.4.0       labeling_0.4.2       
#> [22] sass_0.4.0            checkmate_2.0.0       scales_1.1.1         
#> [25] DEoptimR_1.0-10       robustbase_0.93-9     affy_1.72.0          
#> [28] stringr_1.4.0         digest_0.6.29         foreign_0.8-82       
#> [31] rmarkdown_2.12        jpeg_0.1-9            base64enc_0.1-3      
#> [34] pkgconfig_2.0.3       htmltools_0.5.2       fastmap_1.1.0        
#> [37] highr_0.9             htmlwidgets_1.5.4     rlang_1.0.2          
#> [40] rstudioapi_0.13       impute_1.68.0         jquerylib_0.1.4      
#> [43] generics_0.1.2        farver_2.1.0          jsonlite_1.8.0       
#> [46] mzID_1.32.0           BiocParallel_1.28.3   magrittr_2.0.2       
#> [49] Formula_1.2-4         MALDIquant_1.21       Matrix_1.4-0         
#> [52] munsell_0.5.0         fansi_1.0.2           MsCoreUtils_1.6.2    
#> [55] lifecycle_1.0.1       vsn_3.62.0            stringi_1.7.6        
#> [58] yaml_2.3.5            MASS_7.3-55           zlibbioc_1.40.0      
#> [61] plyr_1.8.6            grid_4.1.2            parallel_4.1.2       
#> [64] crayon_1.5.0          lattice_0.20-45       splines_4.1.2        
#> [67] knitr_1.37            pillar_1.7.0          codetools_0.2-18     
#> [70] XML_3.99-0.9          glue_1.6.2            evaluate_0.15        
#> [73] latticeExtra_0.6-29   data.table_1.14.2     pcaMethods_1.86.0    
#> [76] BiocManager_1.30.16   png_0.1-7             vctrs_0.3.8          
#> [79] foreach_1.5.2         gtable_0.3.0          purrr_0.3.4          
#> [82] clue_0.3-60           assertthat_0.2.1      xfun_0.30            
#> [85] ncdf4_1.19            survival_3.3-1        tibble_3.1.6         
#> [88] iterators_1.0.14      IRanges_2.28.0        cluster_2.1.2        
#> [91] ellipsis_0.3.2