There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means complete:
t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For TMT, log-transformed abundances can be assumed to be Gaussian distributed. When we have one condition variable and we are comparing between two values variable in a TMT experiment (e.g samples are treatment or control), a two-sample t-test is appropriate.
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, on the same assumptions at the t-test.
limma
(Ritchie et al. 2015): Proteomics experiments are typically lowly replicated (e.g n << 10). Variance estimates are therefore inaccurate. limma
is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
DEqMS
(Zhu et al. 2020): limma assumes there is a relationship between protein abundance and variance. This is usually the case, although the relationship with variance is sometimes stronger with the number of peptide spectrum matches (for TMT experiments),
These are examples only and the code herein is unlikely to be directly applicable to your own dataset.
Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(ggplot2)
library(broom)
library(biobroom)
library(uniprotREST)
Here, we will start with the TMT data processed in DQC PSM-level quantification and summarisation to protein-level abundance. Please see the previous notebook for details of the experimental design and aim and data processing.
As a reminder, this data comes from a published benchmark experiment where yeast peptides were spiked into human peptides at 3 known amounts to provide ground truth fold changes (see below). For more details, see: (Smith et al. 2022)
First, we read in the protein-level quantification data.
tmt_protein <- readRDS('./results/tmt_protein.rds')
To keep things simple, we will just focus on the comparison between the 2x and 6x yeast spike-in samples (the last 6 TMT tags).
tmt_protein <- tmt_protein[,1:7]
# this is need to make sure the spike factor doesn't contain unused levels (x6)
pData(tmt_protein)$spike <- droplevels(pData(tmt_protein)$spike)
We will use two approaches to identify proteins with significant differences in abundance: - two-sample t-test - moderated two-sample t-test (limma) - moderated two-sample t-test (DEqMS)
To perform a t-test for each protein, we want to extract the quantification values in a long ‘tidy’ format. We can do this using the biobroom package
tmt_protein_tidy <- tmt_protein %>%
biobroom::tidy.MSnSet(addPheno=TRUE) %>% # addPheno=TRUE adds the pData so we have the sample information too
filter(is.finite(value))
#> Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
As an example of how to run a single t-test, let’s subset to a single protein. First, we extract the quantification values for this single protein
example_protein <- 'P40414'
tmt_protein_tidy_example <- tmt_protein_tidy %>%
filter(protein==example_protein) %>%
select(-sample)
print(tmt_protein_tidy_example)
#> # A tibble: 7 x 3
#> protein spike value
#> <chr> <fct> <dbl>
#> 1 P40414 x1 7.51
#> 2 P40414 x1 7.39
#> 3 P40414 x1 7.44
#> 4 P40414 x1 7.65
#> 5 P40414 x2 8.30
#> 6 P40414 x2 8.09
#> 7 P40414 x2 8.13
Then we use t.test
to perform the t-test.
t.test.res <- t.test(formula=value~spike,
data=tmt_protein_tidy_example,
alternative='two.sided')
print(t.test.res)
#>
#> Welch Two Sample t-test
#>
#> data: value by spike
#> t = -8.0003, df = 4.5053, p-value = 0.0007918
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.9034124 -0.4528401
#> sample estimates:
#> mean in group x1 mean in group x2
#> 7.496690 8.174816
We can use tidy
from the broom
package to return the t-test results in a tidy tibble. The value of this will be seen in the next code chunk.
tidy(t.test.res)
#> # A tibble: 1 x 10
#> estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.678 7.50 8.17 -8.00 7.92e-4 4.51 -0.903 -0.453
#> # … with 2 more variables: method <chr>, alternative <chr>
We can now apply a t-test to every protein using dplyr group
and do
, making use of tidy
.
t.test.res.all <- tmt_protein_tidy %>%
group_by(protein) %>%
do(tidy(t.test(formula=value~spike,
data=.,
alternative='two.sided')))
Here are the results for the t-test for the example protein. As we can see, the ‘estimate’ column in t.text.res.all
is the mean protein abundance. The ‘statistic’ column is the t-statistic and the ‘parameter’ column is the degrees of freedom for the t-statistic. All the values are identical since have performed the exact same test with both approaches.
print(t.test.res)
#>
#> Welch Two Sample t-test
#>
#> data: value by spike
#> t = -8.0003, df = 4.5053, p-value = 0.0007918
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.9034124 -0.4528401
#> sample estimates:
#> mean in group x1 mean in group x2
#> 7.496690 8.174816
t.test.res.all %>% filter(protein==example_protein)
#> # A tibble: 1 x 11
#> # Groups: protein [1]
#> protein estimate estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 P40414 -0.678 7.50 8.17 -8.00 7.92e-4 4.51 -0.903
#> # … with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
When you are performing a lot of statistical tests at the same time, it’s recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution from 0-1 for those tests where the null hypothesis should not be rejected. Statistically significant tests will show as a peak of very low p-values. If there are very clear skews in the uniform distribution, or strange peaks other than in the smallest p-value bin, that may indicate the assumptions of the test are not valid, for some or all tests.
Here, we have so many significant tests that the uniform distribution is hard to assess. Note that, beyond the clear peak for very low p-values (<0.05) we also have a slight skew towards low p-values in the range 0.05-0.2. This may indicate insufficient statistical power to detect some proteins that are truly differentially abundant.
hist(t.test.res.all$p.value, 20)
Discussion 1
What would you conclude from the p-value distribution above?
Solution
# Here, we have so many significant tests that the uniform distribution is hard to assess!
# Note that, beyond the clear peak for very low p-values (<0.05), we also have a
# slight skew towards low p-values in the range 0.05-0.2.
# This may indicate insufficient statistical power to detect some proteins that
# are truly differentially abundant.
Solution end
Since we have performed multiple tests, we want to calculate an adjusted p-value to avoid type I errors (false positives).
Here, are using the Benjamini, Y., and Hochberg, Y. (1995) method to estimate the False Discovery Rate, e.g the proportion of false positives among the rejected null hypotheses.
t.test.res.all$padj <- p.adjust(t.test.res.all$p.value, method='BH')
table(t.test.res.all$padj<0.01)
#>
#> FALSE TRUE
#> 6915 3037
At an FDR of 1%, we have 3037 proteins with a significant difference.
Proteomics experiments are typically lowly replicated (e.g n << 10). Variance estimates are therefore inaccurate. limma
(Ritchie et al. 2015) is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
Next, we create the MArrayLM
object and a design model. We then supply these to limma::lmFit
to fit the linear model according to the design and then use limma::eBayes
to compute moderated test statistics.
exprs_for_limma <- exprs(tmt_protein)
# Performing the equivalent of a two-sample t-test
spike <- pData(tmt_protein)$spike
limma_design <- model.matrix(formula(~spike))
limma_fit <- lmFit(exprs_for_limma, limma_design)
limma_fit <- eBayes(limma_fit, trend=TRUE)
We can visualise the relationship between the average abundance and the variance using the limma::plotSA
function.
limma::plotSA(limma_fit)
Discussion 2
How would you interpret the plot above?
Solution
# There's a really clear relationship between protein abundance and variance!
Solution end
In this case, the variances will be shrunk towards a value that depends on the mean protein abundance vs variance trend
We can extract a results table like so.
# use colnames(limma_fit$coefficients) to identify the coefficient names
limma_results <- topTable(limma_fit, n=Inf, coef='spikex2')
Below, we summarise the number of proteins with statistically different abundance in 2x vs 1x and plot a ‘volcano’ plot to visualise this.
table(limma_results$adj.P.Val<0.01)
#>
#> FALSE TRUE
#> 4024 5928
limma_results %>%
ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
geom_point(size=0.5) +
theme_camprot(border=FALSE, base_size=15) +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)')
t.test.res.all %>% filter(protein==example_protein)
#> # A tibble: 1 x 12
#> # Groups: protein [1]
#> protein estimate estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 P40414 -0.678 7.50 8.17 -8.00 7.92e-4 4.51 -0.903
#> # … with 4 more variables: conf.high <dbl>, method <chr>, alternative <chr>,
#> # padj <dbl>
limma_results[example_protein,]
#> logFC AveExpr t P.Value adj.P.Val B
#> P40414 0.6781263 7.787315 9.726588 4.507344e-07 5.006969e-06 6.480212
Discussion 3
Given the experimental design, would you expect proteins to have signficant changes in both directions?
Solution
# Yes! We are mixing known quantities of human and yeast proteins together such
# that yeast proteins increase 2-fold in abundance between 2x and 1x samples, and human
# proteins decrease in abundance (to balance out the total amount of protein in the samples)
Solution end
It would make more sense to split the volcano plot by the species from which the protein derived. We can obtain this information from uniprot like so.
species <- uniprot_map(
ids = rownames(tmt_protein),
from = "UniProtKB_AC-ID",
to = "UniProtKB",
fields = "organism_name",
) %>% rename(c('UNIPROTKB'='From'))
#> Job ID: 07bf9376998483634bbed73a66024a310bb28df4
#> Page 1 of 20
#> Success
#> Page 2 of 20
#> Success
#> Page 3 of 20
#> Success
#> Page 4 of 20
#> Success
#> Page 5 of 20
#> Success
#> Page 6 of 20
#> Success
#> Page 7 of 20
#> Success
#> Page 8 of 20
#> Success
#> Page 9 of 20
#> Success
#> Page 10 of 20
#> Success
#> Page 11 of 20
#> Success
#> Page 12 of 20
#> Success
#> Page 13 of 20
#> Success
#> Page 14 of 20
#> Success
#> Page 15 of 20
#> Success
#> Page 16 of 20
#> Success
#> Page 17 of 20
#> Success
#> Page 18 of 20
#> Success
#> Page 19 of 20
#> Success
#> Page 20 of 20
#> Success
Exercise 1
Merge the
species
andlimma_results
data.frames
and re-plot the volcano plot, with one panel for each species. An example of the desired output is shown belowHint: see
?facet_wrap
Solution
limma_results %>%
merge(species, by.x='row.names', by.y='UNIPROTKB') %>%
ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
geom_point(size=0.5) +
theme_camprot(border=FALSE, base_size=15) +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)') +
facet_wrap(~(gsub('\\(.*', '', Organism)))
Solution end
We can now compare the results from the t-test and the moderated t-test (limma). Below, we update the column names so it’s easier to see which column comes from which test and then merge the two test results.
tmt_compare_tests <- merge(
setNames(limma_results, paste0('limma.', colnames(limma_results))),
setNames(t.test.res.all, paste0('t.test.', colnames(t.test.res.all))),
by.x='row.names',
by.y='t.test.protein')
Below, we can compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with limma
than the standard t-test.
p <- ggplot(tmt_compare_tests) +
aes(log10(t.test.p.value), log10(limma.P.Value)) +
geom_point(size=0.2, alpha=0.2) +
geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
theme_camprot(border=FALSE) +
labs(x='T-test log10(p-value)', y='limma log10(p-value)')
print(p)
Finally, we can compare the number of proteins with a significant difference (Using 1% FDR threshold) according to each test. Using the t-test, there are 3037 significant differences, but with limma 6321 proteins have a significant difference.
tmt_compare_tests %>%
group_by(t.test.padj<0.01,
limma.P.Value<0.01) %>%
tally()
#> # A tibble: 4 x 3
#> # Groups: t.test.padj < 0.01 [2]
#> `t.test.padj < 0.01` `limma.P.Value < 0.01` n
#> <lgl> <lgl> <int>
#> 1 FALSE FALSE 3616
#> 2 FALSE TRUE 3299
#> 3 TRUE FALSE 15
#> 4 TRUE TRUE 3022
limma assumes there is a relationship between protein abundance and variance. This is usually the case, although we have seen above that this isn’t so with our data. For LFQ, the relationship between variance and the number of peptides may be stronger.
DEqMS (Zhu et al. 2020), is an alternative to limma, which you can think of as an extension of limma (Ritchie et al. 2015) specifically for proteomics, which uses the number of peptides rather than mean abundance to share information between proteins.
The analysis steps are taken from the DEqMS vignette. We start from the MArrayLM
we created for limma
analysis and then simply add a $count
column to the MArrayLM
object and use the spectraCounteBayes
function to perform the Bayesian shrinkage using the count column, which describes the number of pepitdes per protein. This is contrast to limma
, which uses the $Amean
column, which describes the mean protein abundance.
To define the $count
column, we need to summarise the number of PSMs per protein. In the DEqMS paper, they suggest that the best summarisation metric to use is the minimum value across the samples, so our count
column is the minimum number of PSMs per protein.
tmt_psm_res <- readRDS('./results/psm_filt.rds')
# Obtain the min peptide count across the samples and determine the minimum value across
# samples
min_psm_count <- camprotR::count_features_per_protein(tmt_psm_res) %>%
merge(tmt_protein_tidy,
by.x=c('Master.Protein.Accessions', 'sample'),
by.y=c('protein', 'sample')) %>%
group_by(Master.Protein.Accessions) %>%
summarise(min_psm_count = min(n))
# add the min peptide count
limma_fit$count <- min_psm_count$min_psm_count
And now we run spectraCounteBayes
from DEqMS
to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(limma_fit))
Below, we inspect the peptide count vs variance relationship which DEqMS
is using in the statistical test.
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "PSMs")
Below, we summarise the number of proteins with statistically different abundance in RNase +/- and plot a ‘volcano’ plot to visualise this.
deqms_results <- outputResult(efit_deqms, coef_col=2)
table(deqms_results$sca.adj.pval<0.01)
#>
#> FALSE TRUE
#> 4344 5608
deqms_results %>%
merge(species, by.x='row.names', by.y='UNIPROTKB') %>%
ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
geom_point(size=0.5) +
theme_camprot(border=FALSE, base_size=15) +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)') +
facet_wrap(~(gsub('\\(.*', '', Organism)))
We can compare the results of limma and DEqMS by considering the number of significant differences. Note that the limma results are also contained with the results from DEqMS. The
$t
, $P.Value
and $adj.P.Val
columns are from limma
. The columns prefixed with sca
are the from DEqMS
.
deqms_results %>%
group_by(limma_sig=adj.P.Val<0.01,
DEqMS_sig=sca.adj.pval<0.01) %>%
tally()
#> # A tibble: 4 x 3
#> # Groups: limma_sig [2]
#> limma_sig DEqMS_sig n
#> <lgl> <lgl> <int>
#> 1 FALSE FALSE 3909
#> 2 FALSE TRUE 115
#> 3 TRUE FALSE 435
#> 4 TRUE TRUE 5493
We can compare the results of limma and DEqMS by considering the p-values.
deqms_results %>%
ggplot() +
aes(P.Value, sca.P.Value) +
geom_point(size=0.5, alpha=0.5) +
geom_abline(slope=1, colour=get_cat_palette(2)[2], size=1, linetype=2) +
theme_camprot(border=FALSE) +
labs(x='limma p-value', y='DEqMS p-value') +
scale_x_log10() +
scale_y_log10()
Discussion 4
The p-values from limma and DEqMS are very well correlated. This is despite the two methods using a different depedent variable to shrink the variance (limma = mean expression, DEqMS = the number of PSMs)
Solution
# This suggests that the dependent variables are likely to be highly correlated,
# which makes sense since we are using sum summarisation from PSM -> protein,
# so proteins with more PSMs will likely have a higher abundance.
# We can check this like so
deqms_results %>%
ggplot() +
aes(Hmisc::cut2(count, g=20), AveExpr) +
geom_boxplot() +
theme_camprot(border=FALSE) +
labs(x='PSMs', y='Average abundance') +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))
Solution end
Finally, at this point we can save any one of the data.frames
containing the statistical test results, either to a compressed format (rds
) to read back into a later R notebook, or a flatfile (.tsv
) to read with e.g excel.
# These lines are not run and are examples only
saveRDS(deqms_results, 'filename_to_save_to.rds')
write.csv(deqms_results, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE)
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 parallel stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] uniprotREST_0.0.0.9000 biobroom_1.20.0 broom_0.7.0
#> [4] tidyr_1.1.2 dplyr_1.0.4 DEqMS_1.6.0
#> [7] limma_3.44.3 MSnbase_2.14.2 ProtGenerics_1.20.0
#> [10] S4Vectors_0.26.1 mzR_2.22.0 Rcpp_1.0.5
#> [13] Biobase_2.48.0 BiocGenerics_0.34.0 ggplot2_3.3.2
#> [16] camprotR_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.3.1 vsn_3.56.0 splines_4.0.3
#> [4] jsonlite_1.7.2 foreach_1.5.0 bslib_0.2.4
#> [7] assertthat_0.2.1 BiocManager_1.30.10 affy_1.66.0
#> [10] highr_0.8 blob_1.2.1 yaml_2.2.1
#> [13] robustbase_0.93-6 impute_1.62.0 pillar_1.4.6
#> [16] backports_1.1.10 lattice_0.20-41 glue_1.6.2
#> [19] digest_0.6.27 colorspace_1.4-1 htmltools_0.5.1.1
#> [22] preprocessCore_1.50.0 plyr_1.8.6 MALDIquant_1.19.3
#> [25] XML_3.99-0.5 pkgconfig_2.0.3 zlibbioc_1.34.0
#> [28] purrr_0.3.4 scales_1.1.1 affyio_1.58.0
#> [31] BiocParallel_1.22.0 tibble_3.0.3 farver_2.0.3
#> [34] generics_0.0.2 IRanges_2.22.2 ellipsis_0.3.1
#> [37] withr_2.3.0 cli_3.3.0 magrittr_1.5
#> [40] crayon_1.3.4 evaluate_0.15 ncdf4_1.17
#> [43] fansi_0.4.1 doParallel_1.0.15 MASS_7.3-53
#> [46] httr2_0.2.1 tools_4.0.3 lifecycle_0.2.0
#> [49] stringr_1.4.0 munsell_0.5.0 pcaMethods_1.80.0
#> [52] compiler_4.0.3 jquerylib_0.1.3 mzID_1.26.0
#> [55] rlang_1.0.5 grid_4.0.3 iterators_1.0.12
#> [58] rstudioapi_0.11 rappdirs_0.3.1 labeling_0.3
#> [61] rmarkdown_2.10 gtable_0.3.0 codetools_0.2-16
#> [64] curl_4.3 DBI_1.1.0 R6_2.4.1
#> [67] knitr_1.39 utf8_1.1.4 stringi_1.5.3
#> [70] vctrs_0.3.6 DEoptimR_1.0-8 tidyselect_1.1.0
#> [73] xfun_0.31
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Smith, Tom S., Anna Andrejeva, Josie Christopher, Oliver M. Crook, Mohamed Elzek, and Kathryn S. Lilley. 2022. “Prior Signal Acquisition Software Versions for Orbitrap Underestimate Low Isobaric Mass Tag Intensities, Without Detriment to Differential Abundance Experiments.” ACS Measurement Science Au, March. https://doi.org/10.1021/acsmeasuresciau.1c00053.
Zhu, Yafeng, Lukas M. Orre, Yan Zhou Tran, Georgios Mermelekas, Henrik J. Johansson, Alina Malyutina, Simon Anders, and Janne Lehtiö. 2020. “DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis.” Molecular & Cellular Proteomics : MCP 19 (6): 1047–57. https://doi.org/10.1074/mcp.TIR119.001646.