Load dependencies

Load the required libraries.


library(ggplot2)
library(MSnbase)
library(biobroom)
library(camprotR)
library(Proteomics.analysis.data)
library(dplyr)
library(tidyr)

Although we recommend using the ‘robust’ summarisation, as demonstrated in the LFQ Data processing and QC notebook, there are other algorithms that are commonly applied. For example MaxLFQ (Cox et al. 2014), which is implemented within MaxQuant, but also available via the maxLFQ() function in the iq package.

We start by reading in the peptide-level quantification we created in the LFQ Data processing and QC notebook. We’ll use the same filtered peptides we used for the robust summarisation, so we can directly compare between the summarisation methods.

pep_restricted <- readRDS('results/lfq_pep_restricted.rds')
# You may wish to retain more feature columns that this in your experimental data!
feature_coloumns_to_retain <- c("Master.Protein.Accessions")

pep_data_for_summarisation <- pep_restricted %>%
  exprs() %>%
  data.frame() %>%
  merge(fData(pep_restricted)[, feature_coloumns_to_retain, drop = FALSE], 
        by = 'row.names') %>%
  select(-Row.names)

Below, we use iq::maxLFQ() to perform the summarisation manually. It should be possible to use this function within MSnbase::combineFeatures() too in theory since it accommodates user-defined functions.

# Define a function to perform MaxLFQ on a single protein and return a data.frame
# as required. 
get_maxlfq_estimate <- function(obj) {
  prot <- iq::maxLFQ(as.matrix(obj))$estimate
  
  data.frame(t(prot)) %>% 
    setNames(colnames(obj))
}

# Group by the features we want to retain and use MaxLFQ on each protein
maxlfq_estimates <- pep_data_for_summarisation %>%
  
  # group by the columns we want to retain. These define a unique protein ID
  group_by(across(all_of(feature_coloumns_to_retain))) %>% 
  
  # Perform maxLFQ summarisation for each group
  dplyr::group_modify(~ get_maxlfq_estimate(.)) %>%
  ungroup()

# Create the protein-level MSnSet
maxlfq.e <- as.matrix(select(maxlfq_estimates, -all_of(feature_coloumns_to_retain)))
maxlfq.f <- data.frame(select(maxlfq_estimates, all_of(feature_coloumns_to_retain)))
maxlfq.p <- pData(pep_restricted)

prot_maxlfq <- MSnSet(exprs = maxlfq.e,
                      fData = maxlfq.f,
                      pData = maxlfq.p)

# Update the rownames to be the protein IDs
rownames(prot_maxlfq) <- maxlfq_estimates$Master.Protein.Accessions

Comparing robust and maxLFQ

We can now compare the protein-level abundance estimates. First we read in the protein-level abundance estimates we obtained with robust summarisation.

prot_robust <- readRDS('results/lfq_prot_robust.rds')

And then we combine the summarisations from the two approaches.

# Define a function to extract the protein abundances in long form and
# add a column annotating the method
get_long_form_prot_exp <- function(obj, method_name) {
  tidy(obj) %>%
    rename(abundance=value) %>%
    mutate(method = method_name)
}

# Single object with protein inference from both methods 
compare_protein_abundances <- rbind(
  get_long_form_prot_exp(prot_maxlfq, 'MaxLFQ'),
  get_long_form_prot_exp(prot_robust, 'Robust')
)

Exercise

Compare the protein-abundance estimates using ‘robust’ and ‘maxLFQ’ using a scatter plot (see example below but don’t worry if it doesn’t exactly match). What does you conclude.

Hint: You will need to use pivot_wider to pivot the data so that you have one column for each method

Solution

#>          Robust   MaxLFQ
#> Robust 1.000000 0.990267
#> MaxLFQ 0.990267 1.000000

Solution end

There is a very good overall correlation. Let’s inspect a few proteins with the largest differences between the two approaches to see what’s going on for the edge cases.

# Identify proteins with largest difference between the protein summarisation methods
proteins_of_interest <- compare_protein_abundances %>%
  pivot_wider(names_from = method, values_from = abundance) %>%
  mutate(diff = MaxLFQ-Robust) %>%
  arrange(desc(abs(diff))) %>%
  pull(protein) %>%
  unique() %>%
  head(5)

Below we define a function to plot the peptide and protein abundances for the two methods for a single protein. We can ignore the details since it’s the plots themselves we are interested in.

plot_pep_and_protein <- function(protein_of_interest) {
  
  to_plot_compare <- compare_protein_abundances %>% 
    filter(protein == protein_of_interest)
  
  pep_restricted[fData(pep_restricted)$Master.Protein.Accession == protein_of_interest] %>%
    exprs() %>%
    data.frame() %>%
    tibble::rownames_to_column('id') %>%
    pivot_longer(cols = -id) %>%
    ggplot(aes(x = name, y = value)) +
    geom_line(aes(group = id), colour = 'grey', alpha = 0.5) +
    geom_point(colour = 'grey', alpha = 0.5) +
    geom_line(data = to_plot_compare,
              aes(x = sample.id, y = abundance, colour = method, group = method)) +
    geom_point(data = to_plot_compare,
               aes(x = sample.id, y = abundance, colour = method)) +
    scale_colour_manual(values = get_cat_palette(2), name = 'LFQ summarisation method') +
    theme_camprot(base_size = 15, border = FALSE, base_family = 'sans') +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(
      title = protein_of_interest,
      x = '',
      y = 'Protein abundance (log2)'
    )
}

Below, we apply the above function to each of our proteins of interest.

proteins_of_interest %>% lapply(plot_pep_and_protein)
#> [[1]]

#> 
#> [[2]]

#> 
#> [[3]]

#> 
#> [[4]]

#> 
#> [[5]]

Looking at these examples, we can see that MaxLFQ is often estimating slightly higher abundances but with a very similar profile across the samples is very similar, so the summarisation approach is unlikely to affect the downstream analysis. It’s not clear which of the two approaches is more correct in the examples above, but the publication proposing the robust protein inference (see here) does indicate it gives more accurate fold-change estimates overall.

Session info

#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] tidyr_1.2.0                    dplyr_1.0.8                   
#>  [3] Proteomics.analysis.data_0.1.0 camprotR_0.0.0.9000           
#>  [5] biobroom_1.26.0                broom_0.7.12                  
#>  [7] MSnbase_2.20.4                 ProtGenerics_1.26.0           
#>  [9] S4Vectors_0.32.3               mzR_2.28.0                    
#> [11] Rcpp_1.0.8.3                   Biobase_2.54.0                
#> [13] BiocGenerics_0.40.0            ggplot2_3.3.5                 
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.0            vsn_3.62.0            jsonlite_1.8.0       
#>  [4] foreach_1.5.2         bslib_0.3.1           assertthat_0.2.1     
#>  [7] highr_0.9             BiocManager_1.30.16   affy_1.72.0          
#> [10] iq_1.9.1              robustbase_0.93-9     yaml_2.3.5           
#> [13] impute_1.68.0         pillar_1.7.0          backports_1.4.1      
#> [16] lattice_0.20-45       glue_1.6.2            limma_3.50.1         
#> [19] digest_0.6.29         colorspace_2.0-3      htmltools_0.5.2      
#> [22] preprocessCore_1.56.0 plyr_1.8.6            MALDIquant_1.21      
#> [25] XML_3.99-0.9          pkgconfig_2.0.3       zlibbioc_1.40.0      
#> [28] purrr_0.3.4           scales_1.1.1          affyio_1.64.0        
#> [31] BiocParallel_1.28.3   tibble_3.1.6          farver_2.1.0         
#> [34] generics_0.1.2        IRanges_2.28.0        ellipsis_0.3.2       
#> [37] withr_2.5.0           cli_3.2.0             magrittr_2.0.2       
#> [40] crayon_1.5.0          evaluate_0.15         ncdf4_1.19           
#> [43] fansi_1.0.2           doParallel_1.0.17     MASS_7.3-55          
#> [46] tools_4.1.2           lifecycle_1.0.1       stringr_1.4.0        
#> [49] munsell_0.5.0         cluster_2.1.2         pcaMethods_1.86.0    
#> [52] compiler_4.1.2        jquerylib_0.1.4       mzID_1.32.0          
#> [55] rlang_1.0.2           grid_4.1.2            iterators_1.0.14     
#> [58] MsCoreUtils_1.6.2     labeling_0.4.2        rmarkdown_2.12       
#> [61] gtable_0.3.0          codetools_0.2-18      DBI_1.1.2            
#> [64] R6_2.5.1              knitr_1.37            fastmap_1.1.0        
#> [67] utf8_1.2.2            clue_0.3-60           stringi_1.7.6        
#> [70] parallel_4.1.2        vctrs_0.3.8           DEoptimR_1.0-10      
#> [73] tidyselect_1.1.2      xfun_0.30

References

Cox, Jürgen, Marco Y. Hein, Christian A. Luber, Igor Paron, Nagarjuna Nagaraj, and Matthias Mann. 2014. “Accurate Proteome-Wide Label-Free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed MaxLFQ*.” Molecular & Cellular Proteomics 13 (9): 2513–26. https://doi.org/10.1074/mcp.M113.031591.