There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means complete.
t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For LFQ, log-transformed abundances can be assumed to be Gaussian distributed. When we have one condition variable and we are comparing between two values variable in an LFQ experiment (e.g samples are treatment or control), a two-sample t-test is appropriate.
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, on the same assumptions at the t-test.
limma
(Ritchie et al. 2015): Proteomics experiments
are typically lowly replicated (e.g n << 10). Variance estimates
are therefore inaccurate. limma
is an R package that
extends the t-test/ANOVA/linear model testing framework to enable
sharing of information across features (here, proteins) to update the
variance estimates. This decreases false positives and increases
statistical power.
DEqMS
(Zhu
et al. 2020): limma assumes there is a relationship between
protein abundance and variance. This is usually the case, although with
LFQ the relationship between variance and the number of peptides may be
stronger.
Here, we will perform statistical analyses on LFQ data.
These are examples only and the code herein is unlikely to be directly applicable to your own dataset.
Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(ggplot2)
library(broom)
library(biobroom)
Here, we will start with the LFQ data processed in Data processing and QC of LFQ data. Please see the previous notebook for details of the experimental design and aim and data processing.
First, we read in the protein-level ratios obtained in the above notebooks.
lfq_protein <- readRDS('./results/lfq_prot_robust.rds')
In brief, we wish to determine the proteins which are significantly depleted by RNase treatment.
We will use three approaches: - paired t-test - moderated paired t-test (limma) - moderated paired t-test (DEqMS)
To perform a t-test for each protein, we want to extract the quantification values in a long ‘tidy’ format and then re-structure so we have one column each for RNase +/-. We can do this using the biobroom package
We will also filter out proteins which are not present in both samples in at least 3/4 replicates.
lfq_protein_tidy <- lfq_protein %>%
biobroom::tidy.MSnSet() %>%
separate(sample.id, into=c(NA, 'RNase', 'replicate')) %>%
pivot_wider(names_from=RNase, values_from=value) %>%
filter(is.finite(neg), is.finite(pos)) %>%
group_by(protein) %>%
filter(length(protein)>=3)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
As an example of how to run a single t-test, let’s subset to a single protein. First, we extract the quantification values for this single protein
example_protein <- 'A5YKK6'
lfq_protein_tidy_example <- lfq_protein_tidy %>% filter(protein==example_protein)
print(lfq_protein_tidy_example)
#> # A tibble: 3 × 4
#> # Groups: protein [1]
#> protein replicate neg pos
#> <chr> <chr> <dbl> <dbl>
#> 1 A5YKK6 1 20.3 19.9
#> 2 A5YKK6 2 20.4 19.6
#> 3 A5YKK6 3 20.1 19.9
Then we use t.test
to perform the t-test. We are giving
two vectors of values and the switch paired=TRUE
so that a
paired two sample t-test is performed.
t.test.example <- t.test(
lfq_protein_tidy_example$pos,
lfq_protein_tidy_example$neg,
alternative='two.sided',
var.equal=FALSE,
paired=TRUE)
print(t.test.example)
#>
#> Paired t-test
#>
#> data: lfq_protein_tidy_example$pos and lfq_protein_tidy_example$neg
#> t = -3.1641, df = 2, p-value = 0.08704
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -1.1133965 0.1697793
#> sample estimates:
#> mean of the differences
#> -0.4718086
We can use tidy
from the broom
package to
return the t-test results in a tidy tibble. The value of this will be
seen in the next code chunk.
head(broom::tidy(t.test.example))
#> # A tibble: 1 × 8
#> estimate statistic p.value parameter conf.low conf.high method alternative
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 -0.472 -3.16 0.0870 2 -1.11 0.170 Paired t-… two.sided
We can now apply a t-test to every protein using dplyr
group
and do
, making use of
tidy
.
t.test.all <- lfq_protein_tidy %>%
group_by(protein) %>%
do(tidy(t.test(.$pos, .$neg, paired=TRUE, alternative='two.sided')))
Here are the results for the t-test for the example protein. As we
can see, the ‘estimate’ column in t.text.res.all
is the
mean log2 ratio. The ‘statistic’ column is the t-statistic and the
‘parameter’ column is the degrees of freedom for the t-statistic. All
the values are identical since have performed the exact same test with
both approaches.
print(t.test.example)
#>
#> Paired t-test
#>
#> data: lfq_protein_tidy_example$pos and lfq_protein_tidy_example$neg
#> t = -3.1641, df = 2, p-value = 0.08704
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -1.1133965 0.1697793
#> sample estimates:
#> mean of the differences
#> -0.4718086
t.test.all %>% filter(protein==example_protein)
#> # A tibble: 1 × 9
#> # Groups: protein [1]
#> protein estimate statistic p.value parameter conf.low conf.high method
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 A5YKK6 -0.472 -3.16 0.0870 2 -1.11 0.170 Paired t-test
#> # … with 1 more variable: alternative <chr>
When you are performing a lot of statistical tests at the same time, it’s recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution from 0-1 for those tests where the null hypothesis should not be rejected. Statistically significant tests will show as a peak of very low p-values. If there are very clear skews in the uniform distribution, or strange peaks other than in the smallest p-value bin, that may indicate the assumptions of the test are not valid, for some or all tests.
There is a clear peak for very low p-values (<0.05) and an approximately uniform distribution across the rest of the p-value range, which is what we want.
hist(t.test.all$p.value)
Since we have performed multiple tests, we want to calculate an adjusted
p-value to avoid type I errors (false positives).
Here, are using the Benjamini, Y., and Hochberg, Y. (1995) method to estimate the False Discovery Rate, e.g the proportion of false positives among the rejected null hypotheses.
t.test.all$padj <- p.adjust(t.test.all$p.value, method='BH')
At an FDR of 1%, we have 0 proteins with a significant difference.
sum(t.test.all$padj<0.01)
#> [1] 0
Proteomics experiments are typically lowly replicated (e.g n <<
10). Variance estimates are therefore inaccurate. limma
(Ritchie et al. 2015) is an R package that
extends the t-test/ANOVA/linear model testing framework to enable
sharing of information across features (here, proteins) to update the
variance estimates. This decreases false positives and increases
statistical power.
We will first reconstruct an MSnset from our filtered data since it’s easier to work with limma using this standard proteomics object.
filtered_exprs <- lfq_protein_tidy %>%
pivot_longer(cols=c(neg, pos), names_to='RNase') %>%
mutate(sample=paste0('RNase_', RNase, '.', replicate)) %>%
pivot_wider(names_from=sample, values_from=value, id_cols=protein) %>%
tibble::column_to_rownames('protein') %>%
as.matrix()
filtered_lfq_protein <- MSnSet(exprs=filtered_exprs,
fData=fData(lfq_protein)[rownames(filtered_exprs),],
pData=pData(lfq_protein)[colnames(filtered_exprs),])
Next, we create the MArrayLM
object and a design model.
We then supply these to limma::lmFit
to fit the linear
model according to the design and then use limma::eBayes
to
compute moderated test statistics.
exprs_for_limma <- exprs(filtered_lfq_protein)
# Performing the equivalent of a two-sample t-test
condition <- pData(filtered_lfq_protein)$Condition
replicate <- pData(filtered_lfq_protein)$Replicate
limma_design <- model.matrix(formula(~replicate+condition))
limma_fit <- lmFit(exprs_for_limma, limma_design)
limma_fit <- eBayes(limma_fit, trend=TRUE)
We can visualise the relationship between the average abundance and
the variance using the limma::plotSA
function.
limma::plotSA(limma_fit)
Discussion 1
How would you interpret the plot above?
Solution
# Surprisingly, there's no clear relationship between protein abundance and variance
Solution end
Despite the lack of a strong relationship between protein abundance and variance, we will continue with limma regardless, since it will still increase the effective degrees of freedom with which the gene-wise variances are estimated. In this case, the variances will be shrunk towards a similar value, regardless of the mean protein abundance.
We can extract a results table like so
# use colnames(limma_fit$coefficients) to identify the coefficient names
limma_results <- topTable(limma_fit, n=Inf, coef='conditionRNase_pos')
Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a ‘volcano’ plot to visualise this.
table(limma_results$adj.P.Val<0.01)
#>
#> FALSE TRUE
#> 261 136
limma_results %>%
ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
geom_point() +
theme_camprot(border=FALSE, base_size=15) +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'RNase +/- Sig.') +
labs(x = 'RNase +/- (Log2)', y = '-log10(p-value)')
Discussion 2
- Given the experimental design, would you expect proteins to have signficant changes in both directions?
- How would you interpret the fold changes identified with respect to the absolute protein abundances?
- Does your answer to question 2 affect your expectation in question 1?
Solution
# 1. We are starting from a single OOPS interface and adding +/- RNase, then recollecting the interface. RBPs should be depleted, but there is no way for any protein to be enriched by this process. Thus, we wouldn't expect proteins with a positive RNase +/- ratio!
# 2. The protein abundances are relative to the total amount of protein in each sample. Thus, the fold changes are relative to the overall fold-change in the amount of protein in each sample. For example, if there is a global loss of protein in RNase + samples, a positive RNase +/- ratio only represents a increase relative to the global loss, and could be a negative RNase +/- ratio in absolute terms!
# 3. We are seeing positive RNase +/- ratios because there is a global difference in the amount of protein and we are using a relative protein abundance quantification approach.
Solution end
We can now compare the results from the t-test and the moderated t-test (limma). Below, we update the column names so it’s easier to see which column comes from which test and then merge the two test results.
lfq_compare_tests <- merge(
setNames(limma_results, paste0('limma.', colnames(limma_results))),
setNames(t.test.all, paste0('t.test.', colnames(t.test.all))),
by.x='row.names',
by.y='t.test.protein')
Exercise
Compare the effect size estimates from t-test vs the moderated t-tests. Can you explain what you observe?
Hints:
- For the t-test, you want the ‘t.test.estimate’ column
- For the moderated t-test, you can use the ‘limma.logFC’ column
Solution
ggplot(lfq_compare_tests) +
aes(t.test.estimate, limma.logFC) +
geom_point() +
geom_abline(slope=1) +
theme_camprot(border=FALSE) +
labs(x='t-test logFC', y='limma logFC')
# The logFC are the same! Remember that limma is not changing the underlying data,
# just moderating the test statistics.
Solution end
We can also compare the p-values from the two tests. Note that the
p-value is almost always lower for the moderated t-test with
limma
than the standard t-test.
p <- ggplot(lfq_compare_tests) +
aes(log10(t.test.p.value), log10(limma.P.Value)) +
geom_point() +
geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
theme_camprot(border=FALSE) +
labs(x='T-test log10(p-value)', y='limma log10(p-value)')
print(p)
Finally, we can compare the number of proteins with a significant difference (Using 1% FDR threshold) according to each test. Using the t-test, there are 0 significant differences, but with limma 182 proteins have a significant difference.
lfq_compare_tests %>%
group_by(t.test.padj<0.01,
limma.P.Value<0.01) %>%
tally()
#> # A tibble: 2 × 3
#> # Groups: t.test.padj < 0.01 [1]
#> `t.test.padj < 0.01` `limma.P.Value < 0.01` n
#> <lgl> <lgl> <int>
#> 1 FALSE FALSE 215
#> 2 FALSE TRUE 182
limma assumes there is a relationship between protein abundance and variance. This is usually the case, although we have seen above that this isn’t so with our data. For LFQ, the relationship between variance and the number of peptides may be stronger.
DEqMS (Zhu et al. 2020), is an alternative to limma, which you can think of as an extension of limma (Ritchie et al. 2015) specifically for proteomics, which uses the number of peptides rather than mean abundance to share information between proteins.
The analysis steps are taken from the DEqMS
vignette. We start from the MArrayLM
we created for
limma
analysis and then simply add a $count
column to the MArrayLM
object and use the
spectraCounteBayes
function to perform the Bayesian
shrinkage using the count column, which describes the number of pepitdes
per protein. This is contrast to limma
, which uses the
$Amean
column, which describes the mean protein
abundance.
To define the $count
column, we need to summarise the
number of peptides per protein. In the DEqMS paper, they suggest that
the best summarisation metric to use is the minimum value across the
samples, so our count
column is the minimum number of
peptides per protein.
filtered_lfq_protein_long <- filtered_lfq_protein %>%
exprs() %>%
data.frame() %>%
tibble::rownames_to_column('Master.Protein.Accessions') %>%
pivot_longer(cols=-Master.Protein.Accessions, values_to='abundance', names_to='sample')
lfq_pep_res <- readRDS('results/lfq_pep_restricted.rds')
# Obtain the min peptide count across the samples and determine the minimum value across
# samples
min_pep_count <- camprotR::count_features_per_protein(lfq_pep_res) %>%
merge(filtered_lfq_protein_long, by=c('Master.Protein.Accessions', 'sample')) %>%
filter(is.finite(abundance)) %>% # We only want to consider samples with a ratio quantified
group_by(Master.Protein.Accessions) %>%
summarise(min_pep_count = min(n))
# add the min peptide count
limma_fit$count <- min_pep_count$min_pep_count
And now we run spectraCounteBayes
from
DEqMS
to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(limma_fit))
Below, we inspect the peptide count vs variance relationship which
DEqMS
is using in the statistical test.
In this case the relationship between peptide count and variance is not clear at all. We press on regardless.As with the limma analysis, the variance will be shrunk towards a global mean rather than one informed by the number of peptides
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")
Below, we summarise the number of proteins with statistically different abundance in RNase +/- and plot a ‘volcano’ plot to visualise this.
deqms_results <- outputResult(efit_deqms, coef_col=3)
table(deqms_results$sca.adj.pva<0.01)
#>
#> FALSE TRUE
#> 267 130
deqms_results %>%
ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
geom_point() +
theme_camprot(border=FALSE, base_size=15) +
scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'RNase +/- Sig.') +
labs(x = 'RNase +/- (Log2)', y = '-log10(p-value)')
We can compare the results of limma and DEqMS by considering the number
of significant differences. Note that the limma results are also
contained with the results from DEqMS. The
$t
,
$P.Value
and $adj.P.Val
columns are from
limma
. The columns prefixed with sca
are the
from DEqMS
.
deqms_results %>%
group_by(limma_sig=adj.P.Val<0.01,
DEqMS_sig=sca.adj.pval<0.01) %>%
tally()
#> # A tibble: 4 × 3
#> # Groups: limma_sig [2]
#> limma_sig DEqMS_sig n
#> <lgl> <lgl> <int>
#> 1 FALSE FALSE 257
#> 2 FALSE TRUE 4
#> 3 TRUE FALSE 10
#> 4 TRUE TRUE 126
We can compare the results of limma and DEqMS by considering the p-values. Note that here, they are very well correlated. This is because the methods failed to identify a strong trend between the mean abundance (limma) or number of peptides (DEqMS) and the variance. Thus, both shrunk the variance towards a global mean and similarly increased the effective degrees of freedom.
deqms_results %>%
ggplot() +
aes(P.Value, sca.P.Value) +
geom_point() +
geom_abline(slope=1) +
theme_camprot(border=FALSE) +
labs(x='limma p-value', y='DEqMS p-value') +
scale_x_log10() +
scale_y_log10()
Finally, at this point we can save any one of the
data.frames
containing the statistical test results, either
to a compressed format (rds
) to read back into a later R
notebook, or a flatfile (.tsv
) to read with e.g excel.
# These lines are not run and are examples only
saveRDS(deqms_results, 'filename_to_save_to.rds')
write.csv(deqms_results, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE)
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] biobroom_1.26.0 broom_0.7.12 tidyr_1.2.0
#> [4] dplyr_1.0.8 DEqMS_1.12.1 limma_3.50.1
#> [7] matrixStats_0.61.0 MSnbase_2.20.4 ProtGenerics_1.26.0
#> [10] S4Vectors_0.32.3 mzR_2.28.0 Rcpp_1.0.8.3
#> [13] Biobase_2.54.0 BiocGenerics_0.40.0 ggplot2_3.3.5
#> [16] camprotR_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.0 vsn_3.62.0 splines_4.1.2
#> [4] jsonlite_1.8.0 foreach_1.5.2 bslib_0.3.1
#> [7] assertthat_0.2.1 highr_0.9 BiocManager_1.30.16
#> [10] affy_1.72.0 yaml_2.3.5 robustbase_0.93-9
#> [13] impute_1.68.0 backports_1.4.1 pillar_1.7.0
#> [16] lattice_0.20-45 glue_1.6.2 digest_0.6.29
#> [19] colorspace_2.0-3 htmltools_0.5.2 preprocessCore_1.56.0
#> [22] plyr_1.8.6 MALDIquant_1.21 XML_3.99-0.9
#> [25] pkgconfig_2.0.3 zlibbioc_1.40.0 purrr_0.3.4
#> [28] scales_1.1.1 affyio_1.64.0 BiocParallel_1.28.3
#> [31] tibble_3.1.6 farver_2.1.0 generics_0.1.2
#> [34] IRanges_2.28.0 ellipsis_0.3.2 withr_2.5.0
#> [37] cli_3.2.0 magrittr_2.0.2 crayon_1.5.0
#> [40] evaluate_0.15 ncdf4_1.19 fansi_1.0.2
#> [43] doParallel_1.0.17 MASS_7.3-55 tools_4.1.2
#> [46] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0
#> [49] cluster_2.1.2 pcaMethods_1.86.0 compiler_4.1.2
#> [52] jquerylib_0.1.4 mzID_1.32.0 rlang_1.0.2
#> [55] grid_4.1.2 iterators_1.0.14 MsCoreUtils_1.6.2
#> [58] labeling_0.4.2 rmarkdown_2.12 gtable_0.3.0
#> [61] codetools_0.2-18 DBI_1.1.2 R6_2.5.1
#> [64] knitr_1.37 fastmap_1.1.0 utf8_1.2.2
#> [67] clue_0.3-60 stringi_1.7.6 parallel_4.1.2
#> [70] vctrs_0.3.8 DEoptimR_1.0-10 tidyselect_1.1.2
#> [73] xfun_0.30