With Label-Free Quantification, it’s usually appropriate to apply a naive normalisation such as ‘center-median’ normalisation, (diff.median in MSnbase::normalise) or Trimmed Mean of the M-values (TMM) normalisation. This is on the basis that the quantification should be relative to the total amount of protein in each sample.

There are occasions when you will want to normalisation to a set of reference proteins, for example endogenous ‘house-keeping’ proteins, or exogenous spiked in proteins.

Here, we consider yet another normalisation approach where we have a strong prior expectation about the abundance ratio of a subset of proteins between two conditions. Our data is from the LFQ Data processing and QC notebook. Please see the previous notebook for an explanation of the experiment. The important point here is that OOPS is known to enrich glycoproteins at the interface. We can therefore use these proteins as an internal set of reference proteins which we expect to have no difference in abundance between RNase +/-.

Load dependencies

Load the required libraries.


library(ggplot2)
library(MSnbase)
library(biobroom)
library(camprotR)
library(Proteomics.analysis.data)
library(dplyr)
library(tidyr)
library(uniprotREST)

Load data

We start by reading in the protein-level quantification we created in the LFQ Data processing and QC notebook

prot_robust <- readRDS('results/lfq_prot_robust.rds')

First, we need to calculate the RNase +/- ratios, since this is the value we want to normalise in this case.

ratios <- exprs(prot_robust[,pData(prot_robust)$Condition == 'RNase_neg']) - 
  exprs(prot_robust[,pData(prot_robust)$Condition == 'RNase_pos'])

prot_ratios <- MSnSet(exprs = ratios,
                      fData = fData(prot_robust),
                      pData = (pData(prot_robust) %>% filter(Condition == 'RNase_neg')))

Get annotations for glycoproteins and GO-‘RNA binding’

Next, we need annotations regarding which proteins are glycoproteins and GO-annotated RBPs.

Here, we query UniProt programmatically to obtain the GO terms and features for our proteins, which we will further parse to determine the GO-RBPs and glycoproteins.


glyco_res <-  uniprot_map(
  ids = rownames(prot_robust),
  from = "UniProtKB_AC-ID",
  to = "UniProtKB",
  fields = "ft_carbohyd",
) %>% rename(c('UNIPROTKB'='From'))
#> Job ID: 78e84369fd78a8e63608de318ef34d83919d3a18

go_res <-  uniprot_map(
  ids = rownames(prot_robust),
  from = "UniProtKB_AC-ID",
  to = "UniProtKB",
  fields = "go",
) %>% rename(c('UNIPROTKB'='From'))
#> Job ID: 78e84369fd78a8e63608de318ef34d83919d3a18

Get the GO-RBPs. Note that we are also identifying proteins annotated with offsprings of the RNA-binding GO term, since some proteins will only be annnotated with the child term.


# GO term for RNA-binding
go_rbp <- "GO:0003723"

# All offspring (since some proteins will only be annotated with an offspring term)
go_rbp_offspring <- AnnotationDbi::get("GO:0003723", GO.db::GOMFOFFSPRING)
#> 

# Identify all GO-RBPs
rbps <- go_res %>%
  separate_rows(Gene.Ontology..GO., sep='; ') %>%
  mutate(go=gsub('.*\\[(\\S+)\\]', '\\1', Gene.Ontology..GO.)) %>%
  filter(go %in% c(go_rbp, go_rbp_offspring)) %>% 
  pull(UNIPROTKB) %>%
  unique()

Get the glycoproteins. Note that we are only keeping proteins with at least 3 glycosylated sites to ensure we have confidence in their glycosylation

glycoproteins <- glyco_res %>% 
  # remove proteins with empty glycoprotein annotation
  filter(Glycosylation!='') %>%
  # separate the glycoprotein annotation into constituent parts
  separate_rows(Glycosylation, sep = "; ") %>%
  # keep the annotations relating to glycosylated position
  filter(grepl("CARBOHYD", Glycosylation)) %>%
  # count the number of glycosylated sites per protein
  group_by(UNIPROTKB) %>%  tally() %>%
  # keep proteins with 3 or more glycosylations
  filter(n>=3) %>%
  pull(UNIPROTKB)
  

Add feature columns describing the glycoprotein and GO-RBP status of the proteins.

fData(prot_ratios) <- fData(prot_ratios) %>%
  mutate(Glycoprotein = rownames(prot_ratios) %in% glycoproteins) %>%
  mutate(GO.RBP = rownames(prot_ratios) %in% rbps) %>%
  mutate(Glyco.RBP = interaction(Glycoprotein, GO.RBP)) %>%
  mutate(Glyco.RBP = factor(recode(
    Glyco.RBP,
    'TRUE.TRUE'='GO:RBGP',
    'FALSE.TRUE'='GO:RBP',
    'TRUE.FALSE'='Glycoprotein',
    'FALSE.FALSE'='Other'),
    levels = c('GO:RBP', 'GO:RBGP', 'Other', 'Glycoprotein'))
  )

Finally, we define a function to plot the ratios for each functional sub-type of proteins.

plot_ratios <- function(obj) {
  to_plot <- merge(
    exprs(obj),
    fData(obj)[,'Glyco.RBP',drop = FALSE],
    by = 'row.names'
  ) %>%
    pivot_longer(cols = -c(Row.names, Glyco.RBP), names_to = 'sample', values_to = 'ratio') %>%
    merge(pData(obj), by.x = 'sample', by.y = 'row.names') %>%
    filter(is.finite(ratio))
  
  p <- to_plot %>% 
    ggplot(aes(x = Replicate, y = ratio, 
               group = interaction(Glyco.RBP, Replicate), 
               colour = factor(Glyco.RBP))) +
    geom_boxplot(position = position_dodge()) +
    theme_camprot(border = FALSE, base_family = 'sans', base_size = 15) +
    scale_colour_manual(values = c(get_cat_palette(3), 'black'), name = '') +
    geom_hline(yintercept = 0, linetype = 2, colour = 'grey') +
    labs(
      x = "Replicate",
      y = "RNase -/+ ratio"
    )
  
  print(p)
  
  invisible(to_plot)
}

Normalising against the glycoproteins

Now we have everything in place, we can look at how the protein ratios look pre-normalisation…

plot_ratios(prot_ratios)

OK, so the glycoproteins are not centered at zero and there are GO-annotated RBPs with negative log RNase -/+ ratios (as much as ~25% in replicate 2)

Below, we perform the center-median normalisation with respect to the reference proteins (here, the glycoproteins).

glycoprotein_medians <- prot_ratios[fData(prot_ratios)$Glyco.RBP == 'Glycoprotein',] %>% 
  camprotR::get_medians()

prot_ratios_norm <- camprotR::center_normalise_to_ref(
  prot_ratios,
  glycoprotein_medians,
  center_to_zero = TRUE, # We want to center the glycoproteins around zero
  on_log_scale = TRUE # The quantifications are on a log scale (log2 ratios)
)

And plot the protein ratios post-normalisation.

plot_ratios(prot_ratios_norm)

saveRDS(prot_ratios_norm, './results/lfq_prot_robust_glyco_norm.rds')

Now, the median log2 RNase -/+ ratio for glycoproteins is zero for all replicates and we have far fewer GO-annotated RBPs with negative log RNase -/+ ratios.

Session info

#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] uniprotREST_0.0.0.9000         tidyr_1.2.0                   
#>  [3] dplyr_1.0.8                    Proteomics.analysis.data_0.1.0
#>  [5] camprotR_0.0.0.9000            biobroom_1.26.0               
#>  [7] broom_0.7.12                   MSnbase_2.20.4                
#>  [9] ProtGenerics_1.26.0            S4Vectors_0.32.3              
#> [11] mzR_2.28.0                     Rcpp_1.0.8.3                  
#> [13] Biobase_2.54.0                 BiocGenerics_0.40.0           
#> [15] ggplot2_3.3.5                 
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-7           bit64_4.0.5            httr_1.4.2            
#>  [4] doParallel_1.0.17      GenomeInfoDb_1.30.1    tools_4.1.2           
#>  [7] backports_1.4.1        bslib_0.3.1            utf8_1.2.2            
#> [10] R6_2.5.1               affyio_1.64.0          DBI_1.1.2             
#> [13] colorspace_2.0-3       withr_2.5.0            tidyselect_1.1.2      
#> [16] bit_4.0.4              curl_4.3.2             compiler_4.1.2        
#> [19] preprocessCore_1.56.0  httr2_0.2.1            cli_3.2.0             
#> [22] labeling_0.4.2         sass_0.4.0             scales_1.1.1          
#> [25] DEoptimR_1.0-10        robustbase_0.93-9      affy_1.72.0           
#> [28] rappdirs_0.3.3         stringr_1.4.0          digest_0.6.29         
#> [31] rmarkdown_2.12         XVector_0.34.0         pkgconfig_2.0.3       
#> [34] htmltools_0.5.2        highr_0.9              fastmap_1.1.0         
#> [37] limma_3.50.1           rlang_1.0.2            RSQLite_2.2.10        
#> [40] impute_1.68.0          farver_2.1.0           jquerylib_0.1.4       
#> [43] generics_0.1.2         jsonlite_1.8.0         mzID_1.32.0           
#> [46] BiocParallel_1.28.3    RCurl_1.98-1.6         magrittr_2.0.2        
#> [49] GO.db_3.14.0           GenomeInfoDbData_1.2.7 MALDIquant_1.21       
#> [52] munsell_0.5.0          fansi_1.0.2            MsCoreUtils_1.6.2     
#> [55] lifecycle_1.0.1        vsn_3.62.0             stringi_1.7.6         
#> [58] yaml_2.3.5             MASS_7.3-55            zlibbioc_1.40.0       
#> [61] plyr_1.8.6             blob_1.2.2             grid_4.1.2            
#> [64] parallel_4.1.2         crayon_1.5.0           lattice_0.20-45       
#> [67] Biostrings_2.62.0      KEGGREST_1.34.0        knitr_1.37            
#> [70] pillar_1.7.0           codetools_0.2-18       XML_3.99-0.9          
#> [73] glue_1.6.2             evaluate_0.15          pcaMethods_1.86.0     
#> [76] BiocManager_1.30.16    png_0.1-7              vctrs_0.3.8           
#> [79] foreach_1.5.2          gtable_0.3.0           purrr_0.3.4           
#> [82] clue_0.3-60            assertthat_0.2.1       cachem_1.0.6          
#> [85] xfun_0.30              ncdf4_1.19             tibble_3.1.6          
#> [88] iterators_1.0.14       memoise_2.0.1          AnnotationDbi_1.56.2  
#> [91] IRanges_2.28.0         cluster_2.1.2          ellipsis_0.3.2

References