There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means exhaustive.
t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For SILAC, log-transformed ratios can be assumed to be Gaussian distributed. When we have one condition variable in a SILAC experiment, and we are comparing between two values (e.g SILAC ratio is a comparison between treatment and control), a one-sample t-test is appropriate. If the SILAC ratio captures one condition variable, with another condition variable across samples (e.g ratio is treatment vs control and samples are from two different cell lines), a two-sample t-test is appropriate.
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, on the same assumptions at the t-test.
limma
(Ritchie et al. 2015): Proteomics experiments
are typically lowly replicated (e.g n << 10). Variance estimates
are therefore inaccurate. limma
is an R package that
extends the t-test/ANOVA/linear model testing framework to enable
sharing of information across features (here, proteins) to update the
variance estimates. This decreases false positives and increases
statistical power.
DEqMS
(Zhu
et al. 2020): limma assumes there is a relationship between
protein abundance and variance. This is usually the case, although for
SILAC
As such, DEqMS
should be preferred over limma.
Here, we will perform statistical analyses on SILAC data.
These are examples only and the code herein is unlikely to be directly applicable to your own dataset.
Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(ggplot2)
library(broom)
library(biobroom)
Here, we will start with the SILAC data processed in Processing and QC of SILAC data. Please see the previous notebook for details of the experimental design and aim and data processing.
In brief, we wish to determine the proteins which are significantly enriched by UV crosslinking (CL) vs non-CL control (NC).
First, we read in the protein-level ratios obtained in the above notebooks. We will filter out proteins which are not present in at least 3/4 replicates.
silac_protein <- readRDS('results/prot_res.rds') %>%
filterNA(pNA = 1/4) # max 1/4 missing.
We will use two approaches:
To perform a t-test for each protein, we want to extract the
quantification values in a long ‘tidy’ format. We can do this using the
biobroom
package
silac_protein_tidy <- silac_protein %>%
biobroom::tidy.MSnSet() %>%
filter(is.finite(value))
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
As an example of how to run a single t-test, let’s subset to a single protein. First, we extract the quantification values for this single protein
example_protein <- 'O00159'
silac_protein_tidy_example <- silac_protein_tidy %>%
filter(protein==example_protein)
print(silac_protein_tidy_example)
#> # A tibble: 4 × 3
#> protein sample.id value
#> <chr> <chr> <dbl>
#> 1 O00159 ratio_1 1.40
#> 2 O00159 ratio_2 1.62
#> 3 O00159 ratio_4 1.68
#> 4 O00159 ratio_3 1.80
Then we use t.test
to perform the t-test. Since we are
giving only a single vector of values, a one sample t-test is
performed.
t.test.res <- t.test(silac_protein_tidy_example$value,
alternative='two.sided')
print(t.test.res)
#>
#> One Sample t-test
#>
#> data: silac_protein_tidy_example$value
#> t = 19.143, df = 3, p-value = 0.0003113
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#> 1.353325 1.893016
#> sample estimates:
#> mean of x
#> 1.62317
We can use tidy
from the broom
package to
return the t-test results in a tidy tibble. The value of this will be
seen in the next code chunk.
tidy(t.test.res)
#> # A tibble: 1 × 8
#> estimate statistic p.value parameter conf.low conf.high method alternative
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1.62 19.1 0.000311 3 1.35 1.89 One Samp… two.sided
We can now apply a t-test to every protein using dplyr
group
and do
, making use of
tidy
.
t.test.res.all <- silac_protein_tidy %>%
group_by(protein) %>%
do(tidy(t.test(.$value, alternative='two.sided')))
Here are the results for the t-test for the example protein. As we
can see, the ‘estimate’ column in t.text.res.all
is the
mean log2 ratio. The ‘statistic’ column is the t-statistic and the
‘parameter’ column is the degrees of freedom for the t-statistic. All
the values are identical since have performed the exact same test with
both approaches.
print(t.test.res)
#>
#> One Sample t-test
#>
#> data: silac_protein_tidy_example$value
#> t = 19.143, df = 3, p-value = 0.0003113
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#> 1.353325 1.893016
#> sample estimates:
#> mean of x
#> 1.62317
t.test.res.all %>% filter(protein==example_protein)
#> # A tibble: 1 × 9
#> # Groups: protein [1]
#> protein estimate statistic p.value parameter conf.low conf.high method
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 O00159 1.62 19.1 0.000311 3 1.35 1.89 One Sample t…
#> # … with 1 more variable: alternative <chr>
When you are performing a lot of statistical tests at the same time, it’s recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution from 0-1 for those tests where the null hypothesis should not be rejected. Statistically significant tests will show as a peak of very low p-values. If there are very clear skews in the uniform distribution, or strange peaks other than in the smallest p-value bin, that may indicate the assumptions of the test are not valid, for some or all tests.
hist(t.test.res.all$p.value, 20)
Discussion 1
What would you conclude from the p-value distribution above?
Solution
# Here, we have so many significant tests that the uniform distribution is hard to assess!
# Note that, beyond the clear peak for very low p-values (<0.05), we also have a
# slight skew towards low p-values in the range 0.05-0.15.
# This may indicate insufficient statistical power to detect some proteins that are truly enriched upon CL.
Solution end
Since we have performed multiple tests, we want to calculate an adjusted p-value to avoid type I errors (false positives).
Here, are using the Benjamini, Y., and Hochberg, Y. (1995) method to estimate the False Discovery Rate, e.g the proportion of false positives among the rejected null hypotheses.
t.test.res.all$padj <- p.adjust(t.test.res.all$p.value, method='BH')
table(t.test.res.all$padj<0.01)
#>
#> FALSE TRUE
#> 424 113
At an FDR of 1%, we have 113 proteins with a significant difference.
Finally, a differential abundance test wouldn’t be complete without a volcano plot!
Note that all proteins with a statistically significant change are increased in CL, so it doesn’t resemble the typical volcano plot.
t.test.res.all %>%
ggplot(aes(x = estimate, y = -log10(p.value), colour = padj < 0.01)) +
geom_point() +
theme_camprot(border=FALSE, base_size=15) +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')
Discussion 2
SILAC ratios are equivalent to the absolute abundance ratios between two or more samples. If this experiment had been performed using LFQ or TMT, which quantify the relative abundances in each sample, how would you expect the above volcano plot to look?
Solution
# It would like be much more symetrical, since the trend towards increased abundance
# in CL would be masked by the comparison of the relative abundances in CL and NC.
# Protein which are highly enriched with CL would have reduced ratios and those with more slight
# enrichments may even appear to be relatively depleted!
Solution end
To identify proteins with significantly increased abundance in CL vs NC, we will use DEqMS (Zhu et al. 2020), which you can think of as an extension of limma (Ritchie et al. 2015) specifically for proteomics.
The analysis steps are taken from the DEqMS
vignette. We first want to create an MArrayLM
object as
per normal limma
analysis and then add a
$count
column to the MArrayLM
object and use
the spectraCounteBayes
function to perform the Bayesian
shrinkage using the count column, which describes the number of pepitdes
per protein. This is contrast to limma
, which uses the
$Amean
column, which describes the mean CL:NC ratio.
First, we create the MArrayLM
object.
dat <- silac_protein %>%
exprs()
# Performing the equivalent of a one-sample t-test, so the design matrix is just an intercept
design <- cbind(Intercept = rep(1, ncol(dat)))
fit <- lmFit(dat, design)
efit <- eBayes(fit)
Next, we define the $count
column, which
DEqMS
will use. In the DEqMS paper, they suggest that the
best summarisation metric to use is the minimum value across the
samples, so our count
column is the minimum number of
peptides per protein.
protein_ratio_long <- silac_protein %>%
exprs() %>%
data.frame() %>%
tibble::rownames_to_column('Master.Protein.Accessions') %>%
pivot_longer(cols=-Master.Protein.Accessions, values_to='protein_ratio', names_to='sample')
pep_res <- readRDS('results/pep_res.rds')
# Obtain the min peptide count across the samples and determine the minimum value across
# samples
min_pep_count <- camprotR::count_features_per_protein(pep_res) %>%
merge(protein_ratio_long, by=c('Master.Protein.Accessions', 'sample')) %>%
filter(is.finite(protein_ratio)) %>% # We only want to consider samples with a ratio quantified
group_by(Master.Protein.Accessions) %>%
summarise(min_pep_count = min(n))
# add the min peptide count
efit$count <- min_pep_count$min_pep_count
And now we run spectraCounteBayes
from
DEqMS
to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(efit))
Below, we inspect the peptide count vs variance relationship which
DEqMS
is using in the statistical test.
In this case the relationship between peptide count and variance is
only really apparent when the minimum number of peptides from which a
protein-level ratio is obtained is very low. Typically, one might remove
quantification values where there is just a single peptide for a
protein. Here, we have kept them, and we can refer to the
count
column in the final results object if we want to
check the minimum number of peptides observed per sample.
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")
Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a ‘volcano’ plot to visualise this.
Here, all proteins with a statistically significant change are increased in CL, so the plot looks more like a fire-hose than a volcano!
deqms_results <- outputResult(efit_deqms, coef_col = 1)
deqms_results %>%
ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
geom_point() +
theme_camprot(border=FALSE, base_size=15) +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')
Finally, we can explore and, if we wish, export our results.
Importantly, the $t
, $P.Value
and
$adj.P.Val
columns are from limma
. The columns
prefixed with sca
are the from DEqMS
.
head(deqms_results)
#> logFC AveExpr t P.Value adj.P.Val B gene
#> P15880 5.843093 5.843093 28.13619 1.386402e-07 6.903875e-05 8.065292 P15880
#> P22626 6.561859 6.561859 25.35616 2.571276e-07 6.903875e-05 7.615847 P22626
#> P23588 5.250793 5.250793 20.29291 9.613338e-07 7.413109e-05 6.546409 P23588
#> P43243 4.424089 4.424089 18.79343 1.512070e-06 7.413109e-05 6.148035 P43243
#> P07910 5.003172 5.003172 18.75219 1.531772e-06 7.413109e-05 6.136436 P07910
#> P49207 4.811506 4.811506 19.47370 1.226059e-06 7.413109e-05 6.334277 P49207
#> count sca.t sca.P.Value sca.adj.pval
#> P15880 4 32.84685 1.803524e-08 9.684921e-06
#> P22626 7 28.92009 4.105560e-08 1.102343e-05
#> P23588 3 22.32533 2.175142e-07 2.195040e-05
#> P43243 7 21.88917 2.469094e-07 2.195040e-05
#> P07910 6 21.18579 3.045174e-07 2.195040e-05
#> P49207 2 20.97258 3.249394e-07 2.195040e-05
We can now compare the results from the t-test and the moderate t-test (DEqMS). Below, we update the column names so it’s easier to see which column comes from which test.
colnames(t.test.res.all) <- paste0('t.test.', colnames(t.test.res.all))
colnames(deqms_results) <- paste0('deqms.', colnames(deqms_results))
And then merge the two test results.
silac_compare_tests <- merge(deqms_results,
t.test.res.all,
by.x='row.names',
by.y='t.test.protein')
Exercise
Compare the effect size estimates from the two methods with a scatter plot. Can you explain what you observe?
Hints:
- For the t-test, you want the ‘t.test.estimate’ column
- For the moderated t-test, you want the ‘deqms.logFC’ column
Solution
ggplot(silac_compare_tests) +
aes(t.test.estimate, deqms.logFC) +
geom_point() +
geom_abline(slope=1) +
theme_camprot(border=FALSE) +
labs(x='t.test mean ratio', y='DEqMS mean ratio')
# The ratios are the same. DEqMS (and limma) are moderating the test statistics, but not the estimated fold-change.
Solution end
We can also compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with DEqMS than the standard t-test.
p <- ggplot(silac_compare_tests) +
aes(log10(t.test.p.value), log10(deqms.P.Value)) +
geom_point() +
geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
theme_camprot(border=FALSE) +
labs(x='T-test log10(p-value)', y='DEqMS log10(p-value)')
print(p)
Finally, we can compare the number of proteins with a significant difference (Using 1% FDR threshold) according to each test. Using DEqMS more than doubles the number of null hypothesis rejections. Only 3 proteins are significant with a t-test but not the moderated t-test.
silac_compare_tests %>%
group_by(t.test.padj<0.01,
deqms.sca.adj.pval<0.01) %>%
tally()
#> # A tibble: 4 × 3
#> # Groups: t.test.padj < 0.01 [2]
#> `t.test.padj < 0.01` `deqms.sca.adj.pval < 0.01` n
#> <lgl> <lgl> <int>
#> 1 FALSE FALSE 266
#> 2 FALSE TRUE 158
#> 3 TRUE FALSE 3
#> 4 TRUE TRUE 110
Finally, at this point we can save any one of the
data.frames
containing the statistical test results, either
to a compressed format (rds
) to read back into a later R
notebook, or a flatfile (.tsv
) to read with e.g excel.
# These lines are not run and are examples only
# Saving to R binary format. Can read back into memory with readRDS().
saveRDS(silac_compare_tests, 'filename_to_save_to.rds')
# Saving to tab-separated (human-readable) flatfile
write.csv(silac_compare_tests, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE)
Extended Exercise (Optional)
Compare the number of significant differences (at 1% FDR) for each method, stratifying by the number of peptides per protein. An example of how to visualise this is below, though you do not need to replicate this visualisation precisely!
How do you interpet this?
Hints:
- For the number of peptides, you want the deqms.count column. You need to bin this into reasonable bins. See e.g ?Hmisc::cut2 for this.
Solution
# Start by tallying as in code chunk above
silac_compare_tests %>%
group_by('min_peptides'=Hmisc::cut2(deqms.count, cuts=c(1:4,10)),
't-test'=t.test.padj<0.01,
'deqms'=deqms.sca.adj.pval<0.01) %>%
tally() %>%
# make a single column describing yes/no significant with each method and
# rename to make it more intuitive
mutate(compare_tests=recode(interaction(`t-test`, deqms),
'FALSE.FALSE'='Neither',
'TRUE.TRUE'='Both',
'TRUE.FALSE'='t-test',
'FALSE.TRUE'='DEqMS')) %>%
mutate(compare_tests=factor(compare_tests, levels=c('t-test', 'DEqMS', 'Both', 'Neither'))) %>%
ggplot(aes(min_peptides, n, fill=compare_tests)) +
geom_bar(stat='identity',position='fill') +
geom_text(position=position_fill(vjust=0.5), aes(label=n)) +
theme_camprot(border=FALSE, base_size=15) +
scale_fill_manual(values=c(get_cat_palette(3), 'grey'), name='') +
labs(x='Minimum peptides per sample', y='Fraction of proteins')
# Note that the 3 proteins which are only significant with the t-test have a
# minimum of 1/2 peptides per sample. These are likely to be poorly quantified
# and falsely identified as CL-enriched.
Solution end
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] biobroom_1.26.0 broom_0.7.12 tidyr_1.2.0
#> [4] dplyr_1.0.8 DEqMS_1.12.1 limma_3.50.1
#> [7] matrixStats_0.61.0 MSnbase_2.20.4 ProtGenerics_1.26.0
#> [10] S4Vectors_0.32.3 mzR_2.28.0 Rcpp_1.0.8.3
#> [13] Biobase_2.54.0 BiocGenerics_0.40.0 ggplot2_3.3.5
#> [16] camprotR_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-2 doParallel_1.0.17 tools_4.1.2
#> [4] backports_1.4.1 bslib_0.3.1 utf8_1.2.2
#> [7] R6_2.5.1 affyio_1.64.0 rpart_4.1.16
#> [10] Hmisc_4.6-0 DBI_1.1.2 colorspace_2.0-3
#> [13] nnet_7.3-17 withr_2.5.0 gridExtra_2.3
#> [16] tidyselect_1.1.2 compiler_4.1.2 preprocessCore_1.56.0
#> [19] cli_3.2.0 htmlTable_2.4.0 labeling_0.4.2
#> [22] sass_0.4.0 checkmate_2.0.0 scales_1.1.1
#> [25] DEoptimR_1.0-10 robustbase_0.93-9 affy_1.72.0
#> [28] stringr_1.4.0 digest_0.6.29 foreign_0.8-82
#> [31] rmarkdown_2.12 jpeg_0.1-9 base64enc_0.1-3
#> [34] pkgconfig_2.0.3 htmltools_0.5.2 fastmap_1.1.0
#> [37] highr_0.9 htmlwidgets_1.5.4 rlang_1.0.2
#> [40] rstudioapi_0.13 impute_1.68.0 jquerylib_0.1.4
#> [43] generics_0.1.2 farver_2.1.0 jsonlite_1.8.0
#> [46] mzID_1.32.0 BiocParallel_1.28.3 magrittr_2.0.2
#> [49] Formula_1.2-4 MALDIquant_1.21 Matrix_1.4-0
#> [52] munsell_0.5.0 fansi_1.0.2 MsCoreUtils_1.6.2
#> [55] lifecycle_1.0.1 vsn_3.62.0 stringi_1.7.6
#> [58] yaml_2.3.5 MASS_7.3-55 zlibbioc_1.40.0
#> [61] plyr_1.8.6 grid_4.1.2 parallel_4.1.2
#> [64] crayon_1.5.0 lattice_0.20-45 splines_4.1.2
#> [67] knitr_1.37 pillar_1.7.0 codetools_0.2-18
#> [70] XML_3.99-0.9 glue_1.6.2 evaluate_0.15
#> [73] latticeExtra_0.6-29 data.table_1.14.2 pcaMethods_1.86.0
#> [76] BiocManager_1.30.16 png_0.1-7 vctrs_0.3.8
#> [79] foreach_1.5.2 gtable_0.3.0 purrr_0.3.4
#> [82] clue_0.3-60 assertthat_0.2.1 xfun_0.30
#> [85] ncdf4_1.19 survival_3.3-1 tibble_3.1.6
#> [88] iterators_1.0.14 IRanges_2.28.0 cluster_2.1.2
#> [91] ellipsis_0.3.2