Processing and QC of SILAC data

Preamble
- Load dependencies
Processing and QC of SILAC experimental data

Preamble

Stable Isotope Labelling by/with Amino acids in Cell culture (SILAC) is a form of quantitative proteomics where different conditions are quantified in the same run by differential metabolic labelling of proteins using amino acids containing stable isotopes (Ong et al. 2002). A typical SILAC experiment involved growing cells in two different types of SILAC media, one containing ‘light’ arginine and lysine, and the other containing ‘heavy’ arginine and lysine. Over time the cells incorporate these amino acids into their proteins, changing the mass of the peptides detected by LC-MS. The two pools of cells can then be used to compare e.g two experimental conditions.

The principle of SILAC. Source: https://www.creative-proteomics.com/blog/index.php/stable-isotope-labeling-using-amino-acids-in-cell-culture-silac-principles-workflow-and-applications

SILAC was initially designed to provide pairwise comparisons between cell cultures, but has now been extended to more than two labels and even whole organisms (Krüger et al. 2008).

This elegant experimental design enables quantification of peptide/protein abundance ratios between conditions with very little technical variation, since the samples from separate conditions are pooled as early as possible in the sample handling process. For example, cell cultures can be treated with drug/control and then collected and pooled together for all downstream protocol steps. The use of different isotope labels has been extended to study the protein turnover (e.g switch from one label to another) in pulsed SILAC, and relative turnover between conditions (e.g two condition on the same label and then switch each condition to a different label; requires triple SILAC).

The analysis of SILAC data is relatively straightforward, since technical noise is low and normalisation is not normally required. Typically, one is interested in the intensity (peak area or peak height depending on your PD settings) ratio of the light (L) and heavy (H) MS peaks for a given peptide. This ratio forms the quantification value which we wish to perform statistical tests and exploratory analysis on.

Load dependencies

Load the required libraries.

library(Proteomics.analysis.data)
library(camprotR)
library(Biostrings)
library(ggplot2)
library(MSnbase)
library(dplyr)
library(tidyr)
library(tibble)

Typically, the incorporation rate testing will be performed for you by the mass4tox Proteomics service. However, should you wish to do this yourself, see Asssessing the SILAC isotope incorporation rate

Processing and QC of SILAC experimental data

Here, we will use data from an OOPS (Queiroz et al. 2019) experiment designed to identify proteins that are significantly enriched upon UV crosslinking (CL) vs non-crosslinked control cells (NC). RNA-binding proteins (RBPs) should be much more abundant in CL samples, since this covalently crosslinks the RBPs to RNA, and retains them in the OOPS interface, from which the sample is taken.

In total, we have 4 SILAC runs, representing 4 separate plates of U-2 OS cells. Two plates (replicate 1 & 2) were Heavy for CL and two plates (replicates 3 & 4) were Light for CL. This is called a ‘label’ swap/switch and helps to ensure the results are independent of the labelling scheme.

Replicate	Heavy	Light
1	CL	NC
2	CL	NC
3	NC	CL
4	NC	CL

Below, we define the filepaths for the ‘PeptideGroups.txt’ files we will use. These are part the Proteomics.analysis.data package. As such, their complete filepath can be identified using system.file.

replicates <- 1:4
pep_infiles <- file.path('OOPS_SILAC', paste0('OOPS_', replicates, '_PeptideGroups.txt'))
names(pep_infiles) <- replicates

print(pep_infiles)
#>                                     1                                     2 
#> "OOPS_SILAC/OOPS_1_PeptideGroups.txt" "OOPS_SILAC/OOPS_2_PeptideGroups.txt" 
#>                                     3                                     4 
#> "OOPS_SILAC/OOPS_3_PeptideGroups.txt" "OOPS_SILAC/OOPS_4_PeptideGroups.txt"

Below, we read in one PeptideGroups.txt file and print the column names.

infdata <- read.delim(system.file("extdata", pep_infiles[[1]], package = "Proteomics.analysis.data"))
print(colnames(infdata))
#>  [1] "Peptide.Groups.Peptide.Group.ID"                      
#>  [2] "Checked"                                              
#>  [3] "Confidence"                                           
#>  [4] "Sequence"                                             
#>  [5] "Modifications"                                        
#>  [6] "Qvality.PEP"                                          
#>  [7] "Qvality.q.value"                                      
#>  [8] "Number.of.Protein.Groups"                             
#>  [9] "Number.of.Proteins"                                   
#> [10] "Number.of.PSMs"                                       
#> [11] "Master.Protein.Accessions"                            
#> [12] "Protein.Accessions"                                   
#> [13] "Number.of.Missed.Cleavages"                           
#> [14] "Theo.MHplus.in.Da"                                    
#> [15] "Abundance.F26.Light.Sample"                           
#> [16] "Abundance.F26.Heavy.Sample"                           
#> [17] "Precursor.Quan.Result.ID"                             
#> [18] "Quan.Info"                                            
#> [19] "Quan.Channel"                                         
#> [20] "Found.in.Sample.in.S42.F26.Heavy.Sample"              
#> [21] "Found.in.Sample.in.S26.F26.Light.Sample"              
#> [22] "Confidence.by.Search.Engine.MS.Amanda.20"             
#> [23] "Confidence.by.Search.Engine.Sequest.HT"               
#> [24] "Percolator.q.Value.by.Search.Engine.MS.Amanda.20"     
#> [25] "Percolator.q.Value.by.Search.Engine.Sequest.HT"       
#> [26] "Percolator.PEP.by.Search.Engine.MS.Amanda.20"         
#> [27] "Percolator.PEP.by.Search.Engine.Sequest.HT"           
#> [28] "Amanda.Score.by.Search.Engine.MS.Amanda.20"           
#> [29] "CharmeRT.Combined.Score.by.Search.Engine.MS.Amanda.20"
#> [30] "XCorr.by.Search.Engine.Sequest.HT"                    
#> [31] "Top.Apex.RT.in.min"

Exercise 1

Examine the column names and answer the following:

What do you the difference is between “Protein.Accessions” and “Master.Protein.Accessions”?

What columns contain the abundances for the the Light and Heavy-labelled samples?

How do the Light and Heavy samples relatea to the conditions (UV crosslinking +/-)?

Solution

# 1. Protein.Accessions are all the proteins a sequence can match to.
# Master.Protein.Accessions are the master protein(s) that a peptide is deemed 
# to be most likely to originate from.
# 2. "Abundance.F26.Light.Sample" & "Abundance.F26.Heavy.Sample"
# 3. It's not possible to tell! We'll deal with this later.

Solution end

Your PD output may have additional columns. For example, if all your SILAC samples were simultaneously processed in PD, you will have a ‘Light’ and ‘Heavy’ column for each one.

Parse PeptideGroups.txt files

To simply the process of reading in the data and performing initial filtering, we will use camprotR::parse_features. This function will read in the data and remove contaminant proteins and features without quantification data. Contaminant proteins were defined using the cRAP database and provided to PD. We need to obtain their accessions and provide these to camprotR::parse_features. Below, we parse the cRAP FASTA to extract the IDs for the cRAP proteins, in both ‘cRAP’ format and Uniprot IDs for these proteins.

crap_fasta_inf <- system.file(
  "extdata", "cRAP_20190401.fasta.gz", 
  package = "Proteomics.analysis.data"
)

# Load the cRAP FASTA used for the PD search. For your experiment,
# make sure you're using the same file as used in the PD workflow
crap_fasta <- Biostrings::fasta.index(crap_fasta_inf, seqtype = "AA")

# Extract the non-cRAP UniProt accessions associated with each cRAP protein
crap_accessions <- crap_fasta %>% 
  pull(desc) %>% 
  stringr::str_extract_all(pattern="(?<=\\|).*?(?=\\|)") %>% 
  unlist()

We can then supply these cRAP protein IDs to camprotR::parse_features which will parse the PD output and remove features which may originate from contaminants, as well as features which don’t have a unique master protein. See ?parse_features for further details, including the removal of ‘associated cRAP’. parse_features will output messages about the number of features at each stage in the filtering.

# We use lapply below to run the same function on each element of the list pep_data
# This is easier than running a for loop since the output is a new named list 
pep_data_parsed <- lapply(
  pep_infiles, function(infile) { # define the function to run
    
    infdata <- read.delim(
      system.file("extdata", infile, package = "Proteomics.analysis.data"))
    
    parse_features(infdata,
                   silac = TRUE,
                   level = 'peptide',
                   crap_proteins = crap_accessions,
                   unique_master = FALSE)
  }
)

Annotate with experimental conditions

Next, we add new columns to describe the intensities with respect to the CL and NC conditions, based on how H & L map to CL & NC for each replicate. We will also subset to only those columns we need.


# Define a function name to take the replicate number and annotate the data

annotate_parsed_data <- function(rep_n){
  
  pep_data <- pep_data_parsed[[rep_n]]
  
  # Identify the column names for the light and heavy intensities
  # (these names are not consistent between samples, yours may be different!)
  abundance_light_col <- grep('Abundance.*.Light.Sample', colnames(pep_data), value = TRUE)
  abundance_heavy_col <- grep('Abundance.*.Heavy.Sample', colnames(pep_data), value = TRUE)
  
  # Label-swap info defines whether heavy or light is CL
  cl_col <- ifelse(rep_n %in% 1:2, abundance_heavy_col, abundance_light_col) 
  nc_col <- ifelse(rep_n %in% 1:2, abundance_light_col, abundance_heavy_col)
  
  pep_data <- pep_data %>%
    # Add new columns with CL or NC intensities 
    mutate('CL' = !!sym(cl_col), # !!sym(var) allows us to supply a string var for tidy evaluation
           'NC' = !!sym(nc_col),
           Replicate = rep_n) %>%
    # And subset to these columns
    select(Master.Protein.Accessions,
           Sequence,
           Modifications,
           CL,
           NC,
           Replicate)
  
  return(pep_data)
}

# Apply the function to all peptide data.frames
annot_pep_data_parsed <- lapply(names(pep_data_parsed), # annotate_parsed_data function using the name
                          annotate_parsed_data)

# Re-annotate with the names
names(annot_pep_data_parsed) <- names(pep_data_parsed)

Now that we have added the replicate number as a column, we can bind together the rows from all replicates to keep all our data in a single data.frame.

abundance_data_complete <- do.call('rbind', annot_pep_data_parsed) %>% 
  remove_rownames()

Calculate ratios

Exercise 2

Add a new column to abundance_data_complete which gives the CL/NC ratio

Solution

abundance_data_complete %>% mutate(ratio=CL/NC)

Solution end

Here, we will calculate the CL/NC ratio using camprotR::get_ratio, which will also add a missing column to describe whether one of the quantification values is missing. This is useful for quality control purposes. camprotR::get_ratio expects the quantification values to be on a log scale and returns log-transformed ratios. We will use base 2 for our log transformation, since this is relatively intuitive.

ratios <- abundance_data_complete %>%
  filter((is.finite(CL) | is.finite(NC))) %>% # Retain peptides where either CL and/or NC is finite
  mutate(CL = log2(CL), NC = log2(NC)) %>% # log2-transform quantification
  get_ratio(CL, NC, bind = TRUE) # Obtain CL/NC ratio

Missing values

Below, we tally and plot the missing values per replicate for each method. Note that ~40% of peptides have a missing quantification value and it’s much more likely to be missing in NC than CL. This is what we expect since the OOPS method is designed to capture RBPs that are UV crosslinked to RNA, so without crosslinking, there should be very little protein present in the interface.

# Tally the missing values
missing_tallies <- ratios %>%
  group_by(missing) %>%
  tally()

missing_tallies %>%
  ggplot(aes(x = missing, y = n)) +
  theme_camprot(border = FALSE) + # theme_camprot is a ggplot2 theme defined in camprotR
  geom_bar(stat = 'identity', colour = 'grey50') +
  labs(y='Peptides', x='') +
  theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))

Exercise 3

Modify the code above so that rather than showing one bar each for every level of missing status, it shows a stacked bar for each replicate, where the fill denotes the missing status (see below for how it should look)

Solution

missing_tallies <- ratios %>%
  group_by(Replicate, missing) %>%
  tally()

missing_tallies %>%
  ggplot(aes(x = Replicate, y = n, fill = missing)) +
  theme_camprot(border = FALSE) + 
  geom_bar(stat = 'identity', position = 'fill') +
  labs(y = 'Fraction', fill='')

> Solution end

Use the PSM files to summarise spectrum matches for each quantification value.

Given the large number of missing values, there is a concern that protein abundance in NC may be very low. Therefore, NC peptides identified by mass shift rather than peptide spectrum matching could be incorrect. Note that peptide sequences identified by PSM are referred to as having been ‘matched’ in some code blocks below.

To get the information about what quantification values come from PSMs, we need to interrogate the PSM-level output.

# These files are part of the Proteomics.analysis.data package
psm_infiles <- gsub('_PeptideGroups.txt', '_PSMs.txt', pep_infiles)

Below we use camprotR::silac_psm_seq_int to identify which quantification values are from PSMs.

psm_matched_data <- lapply(
  psm_infiles, function(infile) {
    # Read in the PSM file
    infdata <- read.delim(
      system.file("extdata", infile, package = "Proteomics.analysis.data"))
    
    # If you export from PD including the 'Sequence' column, 
    # you don't need this step.
    # Below, we convert the Annotated.Sequence column into a Sequence column
    infdata$Sequence <- toupper(infdata$Annotated.Sequence)
    
    # Summarise spectrum matches for each peptide
    camprotR::silac_psm_seq_int(infdata, sequence_col = 'Sequence')
  }
)

Then we bind together the spectrum matched data into one data frame, ready to merge with the peptide quantification.

all_psm_matched_data <- psm_matched_data %>%
  names() %>%
  lapply(function(rep_n) {
    psm_matched_data[[rep_n]] %>%
      mutate(Replicate = rep_n)
  }) %>% 
  bind_rows()

Below, we merge the quantification and PSM information and make new columns to describe whether the CL/NC quantification is from a PSM. Note that we need to update the Modifications and Sequence columns in the peptide-level output so they can be matched to the columns in the PSM-level columns.

# merge the matched information and add new columns for the CL/NC matched information
ratios_matched <- ratios %>%
  
  # Update the modifications column for the peptide object so it 
  # doesn't include SILAC modifications 
  mutate(Modifications = remove_silac_modifications(Modifications, level = 'peptide')) %>%
  
  # Update the sequence column to all uppercase
  rowwise() %>% mutate(Sequence = toupper(Sequence)) %>%
  
  # Merge with the sequenced information
  merge(all_psm_matched_data, 
        by = c('Sequence', 'Modifications', 'Replicate'))  %>%
  
  # Add new columns with CL/NC matched information using the
  # matched_Heavy and matched_Light columns and the label swap information
  # (rep1/2, CL=H; rep3/4 CL=L)
  mutate('Matched_CL' = ifelse(Replicate %in% 1:2, matched_Heavy, matched_Light),
         'Matched_NC' = ifelse(Replicate %in% 1:2, matched_Light, matched_Heavy)) %>%
  
  # And subset to these columns
  select(Master.Protein.Accessions,
         Sequence, 
         Modifications,
         Replicate,
         ratio,
         CL, 
         NC,
         Matched_CL,
         Matched_NC,
         missing)

# Add a single column to describe the matched information across the two conditions
ratios_matched <- ratios_matched %>%
  mutate(
    matched = interaction(Matched_CL, Matched_NC),
    matched = factor(recode(matched,
                            'TRUE.TRUE'='Both spectrum matched',
                            'TRUE.FALSE'='CL spectrum matched',
                            'FALSE.TRUE'='NC spectrum matched'), 
                     levels = c('Both spectrum matched', 
                                'CL spectrum matched', 
                                'NC spectrum matched'))
  )

Below, we consider how often the ratio comes from a peptide where both CL and NC were spectrum matched, or just one spectrum matched, and the other one therefore being by mass shift. Note that a minority of peptides have both CL and NC sequenced and the majority are just sequenced in CL.

# Tally the peptide match status
matched_tallies <- ratios_matched %>%
  filter(is.finite(ratio)) %>%
  group_by(Replicate, matched) %>% 
  tally()

# Plot as stacked bar plot
matched_tallies %>% 
  ggplot(aes(x = Replicate, y = n, fill = matched)) +
  geom_bar(stat = 'identity', position='fill', colour = 'grey50') +
  theme_camprot(border = FALSE) +
  scale_fill_manual(values = get_cat_palette(3), name = '') +
  xlab('Replicate') +
  ylab('Fraction of peptides')

Discussion Why are more peptides spectrum matched in just CL?

Solution

# The CL peptides are higher intensity

Solution end

Not all peptide ratios are equally accurate

Now we have summarise the ‘spectrum matched’ information, we can then consider how this relates to the correlation between CL and NC.

ratios_matched %>%
  ggplot(aes(x = CL, y = NC)) +
  geom_point(size = 0.5, alpha=0.5) +
  geom_abline(slope = 1, linetype = 2, colour = 'grey50') + # line at CL==NC
  theme_camprot(base_size = 15, border = FALSE) +
  facet_grid(Replicate~ matched) +
  xlab('CL (log2)') +
  ylab('NC (log2)')
#> Warning: Removed 4715 rows containing missing values (geom_point).

Below, we quantify the correlations

cl_nc_correlations <- ratios_matched %>%
  filter(is.finite(CL), is.finite(NC)) %>%
  group_by(matched, Replicate) %>%
  summarise(cor = cor(CL, NC))
#> `summarise()` has grouped output by 'matched'. You can override using the
#> `.groups` argument.

cl_nc_correlations %>%
  ggplot(aes(matched, cor, colour=Replicate)) +
  geom_point() +
  theme_camprot(border = FALSE, base_size=15) +
  theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1)) +
  xlab('') +
  ylab('Pearson correlation') +
  scale_colour_manual(values=get_cat_palette(4))

Note that when both CL and NC are spectrum matched, there are two clear populations, especially for replicates 1 & 2.

CL == NC (on grey dashed line). These are the peptides not enriched by UV CL and therefore from non-RBPs
CL >> NC (below grey dashed line). These peptides are heavily enriched by UV CL and therefore from RBPs.

When just CL is matched, the correlation is less clear and the two populations are not obvious. One plausible explanation is that many of the NC quantification values obtained when CL is identified by spectrum matching, but NC is identified by mass shift, are from ions erroneously assigned to a peptide. This would be a concern.

Whenever a SILAC experiment is performed where complete absence of a protein in one condition is likely, it is recommended to perform this QC step. If the correlation is very poor when just one condition is spectrum matched, it may be necessary to exclude these peptides.

In this vignette example, we will proceed with all the peptides, but retain information about the spectrum matching in the protein quantification, should we need to investigate this further.

Creating an `MSnSet`

Now, we want to make an MSnSet from the peptide-level CL/NC ratios. For this, we need to generate a matrix of quantification values (CL/NC ratios), where rows are features (peptides), and a separate data.frame with information about the features, e.g master protein assignment. See the vignette("msnset", package="camprotR") for more details about MSnSets.

For this experiment, each of the SILAC samples was processed separately. This is sometimes more suitable than processing all samples together when we don’t want PD to share information between samples and attempt to find peptides that we missed in one sample but spectrum matched in another. However, this also means the assignments of peptides -> master proteins will not be 100% consistent across samples, since this assignment depends on the total set of observed peptides. This will cause issues when we try and create the MSnSet since one peptide could be assigned to different proteins across the samples and we therefore need to update the master protein column.

camprotR::get_parsimony_pep2prot() will take a list of files, each with Sequence, Protein.Accessions and Master.Protein.Accessions columns and generate a unified peptide sequence to master protein assignment. This is done by an approximately parsimonious approach, which is not identical to PD. The function therefore also summarises the difference between the PD assignments and the updated assignment. This is presented in two tables with the number and percentage of peptides with single or multiple master proteins in the input data (Original) and the updated peptide to protein assignments (Updated). Finally, a plot is shown with the number of peptide sequences per protein using the original and updated assignments. Typically, the differences between the original PD assignments and the updated assignments is small.

Note If all your SILAC samples were processed in one single PD workflow, e.g you have just one file for all samples, this step is not neccessary. You do not need to create new_seq_to_master or update the Master.Protein.Accessions column.


# Create a unified peptide sequence to protein assignment
new_seq_to_master <- pep_infiles %>%
  lapply(function(x) system.file("extdata", x, package = "Proteomics.analysis.data")) %>%
  camprotR::get_parsimony_pep2prot()
#> With original assignments, 1951 master proteins. With update, 1820 master proteins
#> Storing counts in `nn`, as `n` already present in input
#> Storing counts in `nn`, as `n` already present in input
#> Comparing Master Protein IDs
#> ℹ Use `name = "new_name"` to pick a new name.

#> .
#> Different ID(s)      Same ID(s) 
#>            1188           18372 
#> .
#> Different ID(s)      Same ID(s) 
#>             6.1            93.9 
#>                   Updated:Multiple Updated:Single
#> Original:Multiple              106            629
#> Original:Single                384          18441
#>                   Updated:Multiple Updated:Single
#> Original:Multiple             0.54           3.22
#> Original:Single               1.96          94.28

Next, we remove peptides without both CL and NC, e.g no quantified ratio, and then merge with the updated master protein assignments

# Remove peptides without both CL + NC quantified
flt_ratios <- ratios_matched %>%
  filter(is.finite(ratio)) %>%
  merge(new_seq_to_master, by='Sequence') %>%
  # replace Master.Protein.Accessions with Updated.Master.Protein.Accessions
  mutate(Master.Protein.Accessions=Updated.Master.Protein.Accessions)

Now we have our sanitised ratios, we can create the MSnSet. This object contains 3 elements: - A quantification data matrix (rows=features, e.g peptides/proteins, columns=samples) - Feature data (rows=features, columns=feature annotations, e.g peptide master protein assignment) - Experimental details (rows=samples, columns=experimental details, e.g treatment)

Our ratios are in a ‘long’ format, so we need to make them wider, so that each sample is a column in a quantification matrix.

# Create a wide table with unique ids as row names
flt_ratios_wide <- flt_ratios %>% 
  select(Master.Protein.Accessions,
         Sequence, Modifications,
         Replicate, Matched_NC, ratio) %>% 
  pivot_wider(names_from = "Replicate", values_from = c("Matched_NC", "ratio")) %>% 
  # Create an id column from the sequence and modification columns
  unite(id, Sequence, Modifications, remove = FALSE) %>%
  column_to_rownames(var = "id")

# Create expression matrix (exprs)
exprs_data <- flt_ratios_wide %>% 
  select(matches("ratio_[1-4]")) %>% # select unique ids and ratio data
  as.matrix()

For the peptide features, we will take all the columns other than the peptide ratios

# Create feature metadata data frame (fData)
feat_data <- flt_ratios_wide %>% 
  select(!matches("ratio")) # select unique ids and everything but ratio data

We then create an MSnSet. Note that here, we do not provide any experimental condition to the MSnSet constructor, so ‘pData’ returns a data.frame with no columns

# Create MSnSet
pep_res <- MSnSet(exprs = exprs_data,
                  fData = feat_data)

print(pData(pep_res))
#> data frame with 0 columns and 4 rows

The only important experimental condition here is the replicate number, which we add below

# Add replicate number to phenotype data
pData(pep_res)$Replicate <- 1:ncol(exprs_data)

Below, we print the object to see a summary.

print(pep_res)
#> MSnSet (storageMode: lockedEnvironment)
#> assayData: 3839 features, 4 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: ratio_1 ratio_2 ratio_4 ratio_3
#>   varLabels: Replicate
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: AAAAAAAAAAAAAAAGAGAGAK_ AAAAAAAAAAGAAGGR_1xAcetyl
#>     [N-Term] ... YYVTIIDAPGHR_ (3839 total)
#>   fvarLabels: Master.Protein.Accessions Sequence ... Matched_NC_3 (7
#>     total)
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:  
#> - - - Processing information - - -
#>  MSnbase version: 2.20.4

Exercise 4

How many peptides are there?

How many feature columns are there?

Solution

nrow(pep_res) #1
length(fvarLabels(pep_res)) #2

Solution end

Note that there is also a section call ‘Processing information’. This will retain information about any processing we perform on our data.

See the vignette("msnset", package="camprotR") for more details about MSnSets.

print(pep_res)
#> MSnSet (storageMode: lockedEnvironment)
#> assayData: 3839 features, 4 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: ratio_1 ratio_2 ratio_4 ratio_3
#>   varLabels: Replicate
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: AAAAAAAAAAAAAAAGAGAGAK_ AAAAAAAAAAGAAGGR_1xAcetyl
#>     [N-Term] ... YYVTIIDAPGHR_ (3839 total)
#>   fvarLabels: Master.Protein.Accessions Sequence ... Matched_NC_3 (7
#>     total)
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:  
#> - - - Processing information - - -
#>  MSnbase version: 2.20.4

Then we summarise the peptide-level ratios to protein-level ratios. Here, we’ll use the median. Note the message about missing values. We’re given the option na.rm=TRUE, which will be passed onto median() to ensure missing values are handled appropriately, e.g median(x, y, NA) will return the equivalent to medium(x, y).

prot_res <- combineFeatures(
  pep_res, 
  groupBy = fData(pep_res)$Master.Protein.Accessions,
  method = "median",
  na.rm = TRUE
) 
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.

print(nrow(pep_res))
#> [1] 3839
print(nrow((prot_res)))
#> [1] 1128

Note that the summarisation is occurring on the expression data, but will affect the feature data too (the values in the first feature are taken). If we want to retain information about whether the peptide was ‘matched’ in both conditions, we will need to generate this ourselves and re-attach to the MSnSet.

prot_matched_nc <- fData(pep_res) %>%
  select(-Sequence, -Modifications) %>% 
  pivot_longer(cols = -Master.Protein.Accessions, values_to = 'Matched_NC') %>%
  group_by(Master.Protein.Accessions) %>%
  filter(!is.na(Matched_NC)) %>% 
  summarise(any_not_sequenced = any(!Matched_NC))


fData(prot_res) <- fData(prot_res) %>%
  rownames_to_column(var = "id") %>% 
  left_join(prot_matched_nc, by = "Master.Protein.Accessions") %>% 
  column_to_rownames(var = "id")

We can now inspect the completeness of our protein-level data using MSnbase::plotNA(). In this case 383/1128 proteins have complete data (all 4 replicates)

MSnbase::plotNA(prot_res, pNA = 0)

At this point, we have obtained an MSnSet containing protein-level ratios for each sample. You will likely wish to perform bespoke visualisations and statistical analyses from this point.

Below, we save the peptide and protein level objects to disk, so we can read them back into memory in downstream analyses. We use saveRDS to save them in compressed R binary format.

saveRDS(prot_res, 'results/prot_res.rds')
saveRDS(pep_res, 'results/pep_res.rds')

For an example of differential abundance testing from the SILAC CL/NC ratios, see Differential abundance testing with SILAC ratios

Krüger, Marcus, Markus Moser, Siegfried Ussar, Ingo Thievessen, Christian A. Luber, Francesca Forner, Sarah Schmidt, Sara Zanivan, Reinhard Fässler, and Matthias Mann. 2008. “SILAC Mouse for Quantitative Proteomics Uncovers Kindlin-3 as an Essential Factor for Red Blood Cell Function.” Cell 134 (2): 353–64. https://doi.org/10.1016/j.cell.2008.05.033.

Ong, Shao-En, Blagoy Blagoev, Irina Kratchmarova, Dan Bach Kristensen, Hanno Steen, Akhilesh Pandey, and Matthias Mann. 2002. “Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple and Accurate Approach to Expression Proteomics*.” Molecular & Cellular Proteomics 1 (5): 376–86. https://doi.org/10.1074/mcp.M200025-MCP200.

Queiroz, Rayner M. L., Tom Smith, Eneko Villanueva, Maria Marti-Solano, Mie Monti, Mariavittoria Pizzinga, Dan-Mircea Mirea, et al. 2019. “Comprehensive Identification of RNA–Protein Interactions in Any Organism Using Orthogonal Organic Phase Separation (OOPS).” Nature Biotechnology 37 (2): 169. https://doi.org/10.1038/s41587-018-0001-2.

Processing and QC of SILAC data

Tom Smith

2022-09-16

Preamble

Load dependencies

Processing and QC of SILAC experimental data

Exercise 1

Parse PeptideGroups.txt files

Annotate with experimental conditions

Calculate ratios

Exercise 2

Missing values

Exercise 3

Use the PSM files to summarise spectrum matches for each quantification value.

Not all peptide ratios are equally accurate

Creating an MSnSet

Exercise 4

Creating an `MSnSet`