Stable Isotope Labelling by/with Amino acids in Cell culture (SILAC) is a form of quantitative proteomics where different conditions are quantified in the same run by differential metabolic labelling of proteins using amino acids containing stable isotopes (Ong et al. 2002). A typical SILAC experiment involved growing cells in two different types of SILAC media, one containing ‘light’ arginine and lysine, and the other containing ‘heavy’ arginine and lysine. Over time the cells incorporate these amino acids into their proteins, changing the mass of the peptides detected by LC-MS. The two pools of cells can then be used to compare e.g two experimental conditions.
The principle of SILAC. Source: https://www.creative-proteomics.com/blog/index.php/stable-isotope-labeling-using-amino-acids-in-cell-culture-silac-principles-workflow-and-applications
SILAC was initially designed to provide pairwise comparisons between cell cultures, but has now been extended to more than two labels and even whole organisms (Krüger et al. 2008).
This elegant experimental design enables quantification of peptide/protein abundance ratios between conditions with very little technical variation, since the samples from separate conditions are pooled as early as possible in the sample handling process. For example, cell cultures can be treated with drug/control and then collected and pooled together for all downstream protocol steps. The use of different isotope labels has been extended to study the protein turnover (e.g switch from one label to another) in pulsed SILAC, and relative turnover between conditions (e.g two condition on the same label and then switch each condition to a different label; requires triple SILAC).
The analysis of SILAC data is relatively straightforward, since technical noise is low and normalisation is not normally required. Typically, one is interested in the intensity (peak area or peak height depending on your PD settings) ratio of the light (L) and heavy (H) MS peaks for a given peptide. This ratio forms the quantification value which we wish to perform statistical tests and exploratory analysis on.
Load the required libraries.
library(Proteomics.analysis.data)
library(camprotR)
library(Biostrings)
library(ggplot2)
library(MSnbase)
library(dplyr)
library(tidyr)
library(tibble)
Typically, the incorporation rate testing will be performed for you by the mass4tox Proteomics service. However, should you wish to do this yourself, see Asssessing the SILAC isotope incorporation rate
Here, we will use data from an OOPS (Queiroz et al. 2019) experiment designed to identify proteins that are significantly enriched upon UV crosslinking (CL) vs non-crosslinked control cells (NC). RNA-binding proteins (RBPs) should be much more abundant in CL samples, since this covalently crosslinks the RBPs to RNA, and retains them in the OOPS interface, from which the sample is taken.
In total, we have 4 SILAC runs, representing 4 separate plates of U-2 OS cells. Two plates (replicate 1 & 2) were Heavy for CL and two plates (replicates 3 & 4) were Light for CL. This is called a ‘label’ swap/switch and helps to ensure the results are independent of the labelling scheme.
Replicate | Heavy | Light |
---|---|---|
1 | CL | NC |
2 | CL | NC |
3 | NC | CL |
4 | NC | CL |
Below, we define the filepaths for the ‘PeptideGroups.txt’ files we
will use. These are part the Proteomics.analysis.data
package. As such, their complete filepath can be identified using
system.file
.
replicates <- 1:4
pep_infiles <- file.path('OOPS_SILAC', paste0('OOPS_', replicates, '_PeptideGroups.txt'))
names(pep_infiles) <- replicates
print(pep_infiles)
#> 1 2
#> "OOPS_SILAC/OOPS_1_PeptideGroups.txt" "OOPS_SILAC/OOPS_2_PeptideGroups.txt"
#> 3 4
#> "OOPS_SILAC/OOPS_3_PeptideGroups.txt" "OOPS_SILAC/OOPS_4_PeptideGroups.txt"
Below, we read in one PeptideGroups.txt file and print the column names.
infdata <- read.delim(system.file("extdata", pep_infiles[[1]], package = "Proteomics.analysis.data"))
print(colnames(infdata))
#> [1] "Peptide.Groups.Peptide.Group.ID"
#> [2] "Checked"
#> [3] "Confidence"
#> [4] "Sequence"
#> [5] "Modifications"
#> [6] "Qvality.PEP"
#> [7] "Qvality.q.value"
#> [8] "Number.of.Protein.Groups"
#> [9] "Number.of.Proteins"
#> [10] "Number.of.PSMs"
#> [11] "Master.Protein.Accessions"
#> [12] "Protein.Accessions"
#> [13] "Number.of.Missed.Cleavages"
#> [14] "Theo.MHplus.in.Da"
#> [15] "Abundance.F26.Light.Sample"
#> [16] "Abundance.F26.Heavy.Sample"
#> [17] "Precursor.Quan.Result.ID"
#> [18] "Quan.Info"
#> [19] "Quan.Channel"
#> [20] "Found.in.Sample.in.S42.F26.Heavy.Sample"
#> [21] "Found.in.Sample.in.S26.F26.Light.Sample"
#> [22] "Confidence.by.Search.Engine.MS.Amanda.20"
#> [23] "Confidence.by.Search.Engine.Sequest.HT"
#> [24] "Percolator.q.Value.by.Search.Engine.MS.Amanda.20"
#> [25] "Percolator.q.Value.by.Search.Engine.Sequest.HT"
#> [26] "Percolator.PEP.by.Search.Engine.MS.Amanda.20"
#> [27] "Percolator.PEP.by.Search.Engine.Sequest.HT"
#> [28] "Amanda.Score.by.Search.Engine.MS.Amanda.20"
#> [29] "CharmeRT.Combined.Score.by.Search.Engine.MS.Amanda.20"
#> [30] "XCorr.by.Search.Engine.Sequest.HT"
#> [31] "Top.Apex.RT.in.min"
Exercise 1
Examine the column names and answer the following:
- What do you the difference is between “Protein.Accessions” and “Master.Protein.Accessions”?
- What columns contain the abundances for the the Light and Heavy-labelled samples?
- How do the Light and Heavy samples relatea to the conditions (UV crosslinking +/-)?
Solution
# 1. Protein.Accessions are all the proteins a sequence can match to.
# Master.Protein.Accessions are the master protein(s) that a peptide is deemed
# to be most likely to originate from.
# 2. "Abundance.F26.Light.Sample" & "Abundance.F26.Heavy.Sample"
# 3. It's not possible to tell! We'll deal with this later.
Solution end
Your PD output may have additional columns. For example, if all your SILAC samples were simultaneously processed in PD, you will have a ‘Light’ and ‘Heavy’ column for each one.
To simply the process of reading in the data and performing initial
filtering, we will use camprotR::parse_features
. This
function will read in the data and remove contaminant proteins and
features without quantification data. Contaminant proteins were defined
using the cRAP database and
provided to PD. We need to obtain their accessions and provide these to
camprotR::parse_features
. Below, we parse the cRAP FASTA to
extract the IDs for the cRAP proteins, in both ‘cRAP’ format and Uniprot
IDs for these proteins.
crap_fasta_inf <- system.file(
"extdata", "cRAP_20190401.fasta.gz",
package = "Proteomics.analysis.data"
)
# Load the cRAP FASTA used for the PD search. For your experiment,
# make sure you're using the same file as used in the PD workflow
crap_fasta <- Biostrings::fasta.index(crap_fasta_inf, seqtype = "AA")
# Extract the non-cRAP UniProt accessions associated with each cRAP protein
crap_accessions <- crap_fasta %>%
pull(desc) %>%
stringr::str_extract_all(pattern="(?<=\\|).*?(?=\\|)") %>%
unlist()
We can then supply these cRAP protein IDs to
camprotR::parse_features
which will parse the PD output and
remove features which may originate from contaminants, as well as
features which don’t have a unique master protein. See
?parse_features
for further details, including the removal
of ‘associated cRAP’. parse_features
will output messages
about the number of features at each stage in the filtering.
# We use lapply below to run the same function on each element of the list pep_data
# This is easier than running a for loop since the output is a new named list
pep_data_parsed <- lapply(
pep_infiles, function(infile) { # define the function to run
infdata <- read.delim(
system.file("extdata", infile, package = "Proteomics.analysis.data"))
parse_features(infdata,
silac = TRUE,
level = 'peptide',
crap_proteins = crap_accessions,
unique_master = FALSE)
}
)
Next, we add new columns to describe the intensities with respect to the CL and NC conditions, based on how H & L map to CL & NC for each replicate. We will also subset to only those columns we need.
# Define a function name to take the replicate number and annotate the data
annotate_parsed_data <- function(rep_n){
pep_data <- pep_data_parsed[[rep_n]]
# Identify the column names for the light and heavy intensities
# (these names are not consistent between samples, yours may be different!)
abundance_light_col <- grep('Abundance.*.Light.Sample', colnames(pep_data), value = TRUE)
abundance_heavy_col <- grep('Abundance.*.Heavy.Sample', colnames(pep_data), value = TRUE)
# Label-swap info defines whether heavy or light is CL
cl_col <- ifelse(rep_n %in% 1:2, abundance_heavy_col, abundance_light_col)
nc_col <- ifelse(rep_n %in% 1:2, abundance_light_col, abundance_heavy_col)
pep_data <- pep_data %>%
# Add new columns with CL or NC intensities
mutate('CL' = !!sym(cl_col), # !!sym(var) allows us to supply a string var for tidy evaluation
'NC' = !!sym(nc_col),
Replicate = rep_n) %>%
# And subset to these columns
select(Master.Protein.Accessions,
Sequence,
Modifications,
CL,
NC,
Replicate)
return(pep_data)
}
# Apply the function to all peptide data.frames
annot_pep_data_parsed <- lapply(names(pep_data_parsed), # annotate_parsed_data function using the name
annotate_parsed_data)
# Re-annotate with the names
names(annot_pep_data_parsed) <- names(pep_data_parsed)
Now that we have added the replicate number as a column, we can bind
together the rows from all replicates to keep all our data in a single
data.frame
.
abundance_data_complete <- do.call('rbind', annot_pep_data_parsed) %>%
remove_rownames()
Exercise 2
Add a new column to
abundance_data_complete
which gives the CL/NC ratio
Solution
abundance_data_complete %>% mutate(ratio=CL/NC)
Solution end
Here, we will calculate the CL/NC ratio using
camprotR::get_ratio
, which will also add a
missing
column to describe whether one of the
quantification values is missing. This is useful for quality control
purposes. camprotR::get_ratio
expects the quantification
values to be on a log scale and returns log-transformed ratios. We will
use base 2 for our log transformation, since this is relatively
intuitive.
ratios <- abundance_data_complete %>%
filter((is.finite(CL) | is.finite(NC))) %>% # Retain peptides where either CL and/or NC is finite
mutate(CL = log2(CL), NC = log2(NC)) %>% # log2-transform quantification
get_ratio(CL, NC, bind = TRUE) # Obtain CL/NC ratio
Below, we tally and plot the missing values per replicate for each method. Note that ~40% of peptides have a missing quantification value and it’s much more likely to be missing in NC than CL. This is what we expect since the OOPS method is designed to capture RBPs that are UV crosslinked to RNA, so without crosslinking, there should be very little protein present in the interface.
# Tally the missing values
missing_tallies <- ratios %>%
group_by(missing) %>%
tally()
missing_tallies %>%
ggplot(aes(x = missing, y = n)) +
theme_camprot(border = FALSE) + # theme_camprot is a ggplot2 theme defined in camprotR
geom_bar(stat = 'identity', colour = 'grey50') +
labs(y='Peptides', x='') +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))
Exercise 3
Modify the code above so that rather than showing one bar each for every level of missing status, it shows a stacked bar for each replicate, where the fill denotes the missing status (see below for how it should look)
Solution
missing_tallies <- ratios %>%
group_by(Replicate, missing) %>%
tally()
missing_tallies %>%
ggplot(aes(x = Replicate, y = n, fill = missing)) +
theme_camprot(border = FALSE) +
geom_bar(stat = 'identity', position = 'fill') +
labs(y = 'Fraction', fill='')
> Solution end
Given the large number of missing values, there is a concern that protein abundance in NC may be very low. Therefore, NC peptides identified by mass shift rather than peptide spectrum matching could be incorrect. Note that peptide sequences identified by PSM are referred to as having been ‘matched’ in some code blocks below.
To get the information about what quantification values come from PSMs, we need to interrogate the PSM-level output.
# These files are part of the Proteomics.analysis.data package
psm_infiles <- gsub('_PeptideGroups.txt', '_PSMs.txt', pep_infiles)
Below we use camprotR::silac_psm_seq_int
to identify
which quantification values are from PSMs.
psm_matched_data <- lapply(
psm_infiles, function(infile) {
# Read in the PSM file
infdata <- read.delim(
system.file("extdata", infile, package = "Proteomics.analysis.data"))
# If you export from PD including the 'Sequence' column,
# you don't need this step.
# Below, we convert the Annotated.Sequence column into a Sequence column
infdata$Sequence <- toupper(infdata$Annotated.Sequence)
# Summarise spectrum matches for each peptide
camprotR::silac_psm_seq_int(infdata, sequence_col = 'Sequence')
}
)
Then we bind together the spectrum matched data into one data frame, ready to merge with the peptide quantification.
all_psm_matched_data <- psm_matched_data %>%
names() %>%
lapply(function(rep_n) {
psm_matched_data[[rep_n]] %>%
mutate(Replicate = rep_n)
}) %>%
bind_rows()
Below, we merge the quantification and PSM information and make new
columns to describe whether the CL/NC quantification is from a PSM. Note
that we need to update the Modifications
and
Sequence
columns in the peptide-level output so they can be
matched to the columns in the PSM-level columns.
# merge the matched information and add new columns for the CL/NC matched information
ratios_matched <- ratios %>%
# Update the modifications column for the peptide object so it
# doesn't include SILAC modifications
mutate(Modifications = remove_silac_modifications(Modifications, level = 'peptide')) %>%
# Update the sequence column to all uppercase
rowwise() %>% mutate(Sequence = toupper(Sequence)) %>%
# Merge with the sequenced information
merge(all_psm_matched_data,
by = c('Sequence', 'Modifications', 'Replicate')) %>%
# Add new columns with CL/NC matched information using the
# matched_Heavy and matched_Light columns and the label swap information
# (rep1/2, CL=H; rep3/4 CL=L)
mutate('Matched_CL' = ifelse(Replicate %in% 1:2, matched_Heavy, matched_Light),
'Matched_NC' = ifelse(Replicate %in% 1:2, matched_Light, matched_Heavy)) %>%
# And subset to these columns
select(Master.Protein.Accessions,
Sequence,
Modifications,
Replicate,
ratio,
CL,
NC,
Matched_CL,
Matched_NC,
missing)
# Add a single column to describe the matched information across the two conditions
ratios_matched <- ratios_matched %>%
mutate(
matched = interaction(Matched_CL, Matched_NC),
matched = factor(recode(matched,
'TRUE.TRUE'='Both spectrum matched',
'TRUE.FALSE'='CL spectrum matched',
'FALSE.TRUE'='NC spectrum matched'),
levels = c('Both spectrum matched',
'CL spectrum matched',
'NC spectrum matched'))
)
Below, we consider how often the ratio comes from a peptide where both CL and NC were spectrum matched, or just one spectrum matched, and the other one therefore being by mass shift. Note that a minority of peptides have both CL and NC sequenced and the majority are just sequenced in CL.
# Tally the peptide match status
matched_tallies <- ratios_matched %>%
filter(is.finite(ratio)) %>%
group_by(Replicate, matched) %>%
tally()
# Plot as stacked bar plot
matched_tallies %>%
ggplot(aes(x = Replicate, y = n, fill = matched)) +
geom_bar(stat = 'identity', position='fill', colour = 'grey50') +
theme_camprot(border = FALSE) +
scale_fill_manual(values = get_cat_palette(3), name = '') +
xlab('Replicate') +
ylab('Fraction of peptides')
Discussion Why are more peptides spectrum matched in just CL?
Solution
# The CL peptides are higher intensity
Solution end
Now we have summarise the ‘spectrum matched’ information, we can then consider how this relates to the correlation between CL and NC.
ratios_matched %>%
ggplot(aes(x = CL, y = NC)) +
geom_point(size = 0.5, alpha=0.5) +
geom_abline(slope = 1, linetype = 2, colour = 'grey50') + # line at CL==NC
theme_camprot(base_size = 15, border = FALSE) +
facet_grid(Replicate~ matched) +
xlab('CL (log2)') +
ylab('NC (log2)')
#> Warning: Removed 4715 rows containing missing values (geom_point).
Below, we quantify the correlations
cl_nc_correlations <- ratios_matched %>%
filter(is.finite(CL), is.finite(NC)) %>%
group_by(matched, Replicate) %>%
summarise(cor = cor(CL, NC))
#> `summarise()` has grouped output by 'matched'. You can override using the
#> `.groups` argument.
cl_nc_correlations %>%
ggplot(aes(matched, cor, colour=Replicate)) +
geom_point() +
theme_camprot(border = FALSE, base_size=15) +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1)) +
xlab('') +
ylab('Pearson correlation') +
scale_colour_manual(values=get_cat_palette(4))
Note that when both CL and NC are spectrum matched, there are two clear populations, especially for replicates 1 & 2.
When just CL is matched, the correlation is less clear and the two populations are not obvious. One plausible explanation is that many of the NC quantification values obtained when CL is identified by spectrum matching, but NC is identified by mass shift, are from ions erroneously assigned to a peptide. This would be a concern.
Whenever a SILAC experiment is performed where complete absence of a protein in one condition is likely, it is recommended to perform this QC step. If the correlation is very poor when just one condition is spectrum matched, it may be necessary to exclude these peptides.
In this vignette example, we will proceed with all the peptides, but retain information about the spectrum matching in the protein quantification, should we need to investigate this further.
MSnSet
Now, we want to make an MSnSet
from the peptide-level
CL/NC ratios. For this, we need to generate a matrix of quantification
values (CL/NC ratios), where rows are features (peptides), and a
separate data.frame with information about the features, e.g master
protein assignment. See the
vignette("msnset", package="camprotR")
for more details
about MSnSets.
For this experiment, each of the SILAC samples was processed
separately. This is sometimes more suitable than processing all samples
together when we don’t want PD to share information between samples and
attempt to find peptides that we missed in one sample but spectrum
matched in another. However, this also means the assignments of peptides
-> master proteins will not be 100% consistent across samples, since
this assignment depends on the total set of observed peptides. This will
cause issues when we try and create the MSnSet
since one
peptide could be assigned to different proteins across the samples and
we therefore need to update the master protein column.
camprotR::get_parsimony_pep2prot()
will take a list of
files, each with Sequence, Protein.Accessions and
Master.Protein.Accessions columns and generate a unified peptide
sequence to master protein assignment. This is done by an approximately
parsimonious approach, which is not identical to PD. The function
therefore also summarises the difference between the PD assignments and
the updated assignment. This is presented in two tables with the number
and percentage of peptides with single or multiple master proteins in
the input data (Original
) and the updated peptide to
protein assignments (Updated
). Finally, a plot is shown
with the number of peptide sequences per protein using the original and
updated assignments. Typically, the differences between the original PD
assignments and the updated assignments is small.
Note If all your SILAC samples were processed in one single PD workflow, e.g you have just one file for all samples, this step is not neccessary. You do not need to create
new_seq_to_master
or update theMaster.Protein.Accessions
column.
# Create a unified peptide sequence to protein assignment
new_seq_to_master <- pep_infiles %>%
lapply(function(x) system.file("extdata", x, package = "Proteomics.analysis.data")) %>%
camprotR::get_parsimony_pep2prot()
#> With original assignments, 1951 master proteins. With update, 1820 master proteins
#> Storing counts in `nn`, as `n` already present in input
#> Storing counts in `nn`, as `n` already present in input
#> Comparing Master Protein IDs
#> ℹ Use `name = "new_name"` to pick a new name.
#> .
#> Different ID(s) Same ID(s)
#> 1188 18372
#> .
#> Different ID(s) Same ID(s)
#> 6.1 93.9
#> Updated:Multiple Updated:Single
#> Original:Multiple 106 629
#> Original:Single 384 18441
#> Updated:Multiple Updated:Single
#> Original:Multiple 0.54 3.22
#> Original:Single 1.96 94.28
Next, we remove peptides without both CL and NC, e.g no quantified ratio, and then merge with the updated master protein assignments
# Remove peptides without both CL + NC quantified
flt_ratios <- ratios_matched %>%
filter(is.finite(ratio)) %>%
merge(new_seq_to_master, by='Sequence') %>%
# replace Master.Protein.Accessions with Updated.Master.Protein.Accessions
mutate(Master.Protein.Accessions=Updated.Master.Protein.Accessions)
Now we have our sanitised ratios, we can create the
MSnSet
. This object contains 3 elements: - A quantification
data matrix (rows=features, e.g peptides/proteins, columns=samples) -
Feature data (rows=features, columns=feature annotations, e.g peptide
master protein assignment) - Experimental details (rows=samples,
columns=experimental details, e.g treatment)
Our ratios are in a ‘long’ format, so we need to make them wider, so that each sample is a column in a quantification matrix.
# Create a wide table with unique ids as row names
flt_ratios_wide <- flt_ratios %>%
select(Master.Protein.Accessions,
Sequence, Modifications,
Replicate, Matched_NC, ratio) %>%
pivot_wider(names_from = "Replicate", values_from = c("Matched_NC", "ratio")) %>%
# Create an id column from the sequence and modification columns
unite(id, Sequence, Modifications, remove = FALSE) %>%
column_to_rownames(var = "id")
# Create expression matrix (exprs)
exprs_data <- flt_ratios_wide %>%
select(matches("ratio_[1-4]")) %>% # select unique ids and ratio data
as.matrix()
For the peptide features, we will take all the columns other than the peptide ratios
# Create feature metadata data frame (fData)
feat_data <- flt_ratios_wide %>%
select(!matches("ratio")) # select unique ids and everything but ratio data
We then create an MSnSet
. Note that here, we do not
provide any experimental condition to the MSnSet
constructor, so ‘pData’ returns a data.frame with no columns
# Create MSnSet
pep_res <- MSnSet(exprs = exprs_data,
fData = feat_data)
print(pData(pep_res))
#> data frame with 0 columns and 4 rows
The only important experimental condition here is the replicate number, which we add below
# Add replicate number to phenotype data
pData(pep_res)$Replicate <- 1:ncol(exprs_data)
Below, we print the object to see a summary.
print(pep_res)
#> MSnSet (storageMode: lockedEnvironment)
#> assayData: 3839 features, 4 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: ratio_1 ratio_2 ratio_4 ratio_3
#> varLabels: Replicate
#> varMetadata: labelDescription
#> featureData
#> featureNames: AAAAAAAAAAAAAAAGAGAGAK_ AAAAAAAAAAGAAGGR_1xAcetyl
#> [N-Term] ... YYVTIIDAPGHR_ (3839 total)
#> fvarLabels: Master.Protein.Accessions Sequence ... Matched_NC_3 (7
#> total)
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
#> - - - Processing information - - -
#> MSnbase version: 2.20.4
Exercise 4
- How many peptides are there?
- How many feature columns are there?
Solution
nrow(pep_res) #1
length(fvarLabels(pep_res)) #2
Solution end
Note that there is also a section call ‘Processing information’. This will retain information about any processing we perform on our data.
See the vignette("msnset", package="camprotR")
for more
details about MSnSets.
print(pep_res)
#> MSnSet (storageMode: lockedEnvironment)
#> assayData: 3839 features, 4 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: ratio_1 ratio_2 ratio_4 ratio_3
#> varLabels: Replicate
#> varMetadata: labelDescription
#> featureData
#> featureNames: AAAAAAAAAAAAAAAGAGAGAK_ AAAAAAAAAAGAAGGR_1xAcetyl
#> [N-Term] ... YYVTIIDAPGHR_ (3839 total)
#> fvarLabels: Master.Protein.Accessions Sequence ... Matched_NC_3 (7
#> total)
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
#> - - - Processing information - - -
#> MSnbase version: 2.20.4
Then we summarise the peptide-level ratios to protein-level ratios.
Here, we’ll use the median. Note the message about missing values. We’re
given the option na.rm=TRUE
, which will be passed onto
median()
to ensure missing values are handled
appropriately, e.g median(x, y, NA)
will return the
equivalent to medium(x, y)
.
prot_res <- combineFeatures(
pep_res,
groupBy = fData(pep_res)$Master.Protein.Accessions,
method = "median",
na.rm = TRUE
)
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.
print(nrow(pep_res))
#> [1] 3839
print(nrow((prot_res)))
#> [1] 1128
Note that the summarisation is occurring on the expression data, but will affect the feature data too (the values in the first feature are taken). If we want to retain information about whether the peptide was ‘matched’ in both conditions, we will need to generate this ourselves and re-attach to the MSnSet.
prot_matched_nc <- fData(pep_res) %>%
select(-Sequence, -Modifications) %>%
pivot_longer(cols = -Master.Protein.Accessions, values_to = 'Matched_NC') %>%
group_by(Master.Protein.Accessions) %>%
filter(!is.na(Matched_NC)) %>%
summarise(any_not_sequenced = any(!Matched_NC))
fData(prot_res) <- fData(prot_res) %>%
rownames_to_column(var = "id") %>%
left_join(prot_matched_nc, by = "Master.Protein.Accessions") %>%
column_to_rownames(var = "id")
We can now inspect the completeness of our protein-level data using
MSnbase::plotNA()
. In this case 383/1128 proteins have
complete data (all 4 replicates)
MSnbase::plotNA(prot_res, pNA = 0)
At this point, we have obtained an MSnSet
containing
protein-level ratios for each sample. You will likely wish to perform
bespoke visualisations and statistical analyses from this point.
Below, we save the peptide and protein level objects to disk, so we
can read them back into memory in downstream analyses. We use
saveRDS
to save them in compressed R binary format.
saveRDS(prot_res, 'results/prot_res.rds')
saveRDS(pep_res, 'results/pep_res.rds')
For an example of differential abundance testing from the SILAC CL/NC ratios, see Differential abundance testing with SILAC ratios