Laurent Gatto
Computational Proteomics Unit, University of Cambridge
https://lgatto.github.io
lg390@cam.ac.uk, @lgatt0
Link to slides: http://bit.ly/20170623pmf
These slides are available under a Creative Commons CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.
Data analysis ● Proteomics ● R/Bioconductor ● Conclusions
(Bioconductor CSAMA 2017 workshop, Mount Plose)
Data analysis is the process by which data becomes understanding, knowledge and insight. (Hadley Wickham)
The ability to prepare and explore data, identify patterns (good and pathological ones) and convincingly demonstrate that the patterns are genuine (rather than random).
It’s not analysing data, it’s investigating data - requires flexibility.
Data programming, but:
To understand and communicate data:
Graphics reveal data.
Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you. (Hadley Wickham)
It is not for the tool/software to tell me what plotting/analysis to perform; it is for me to apply the most appropriate analysis or visualisation.
It is not for the tool/software to tell me what plotting/analysis to perform; it is for me to ask the most appropriate question.
Data analysis tools should enable you to manipulate your data, give some guarantees about the integrity of the data, support effective extraction/subsetting of components of the data, visualise them, enable transformation of the data, give access to infrastructure for statistical analysis, and enable annotation of the data.
Bioconductor provides tools for the analysis and comprehension of high-throughput biology data. It uses the R statistical programming language.
Collaborative project: open source and open development, involving biologists, statisticians, programmers, …
Huber W et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015 Jan 29;12(2):115-21.
~ 1400 packages ● 62 for mass spectrometry ● 92 for proteomics
mzR package: Efficient access to raw (netCDF, mzData, mzXML, mzML) and identification (mzIdentML) data.
Chambers et al. A cross-platform toolkit for mass spectrometry and proteomics. Nature Biotechnology (2012).
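A minimal sketch of raw data access with mzR. It assumes the msdata package is installed as a source of example files (an assumption; any mzML or mzXML file path would do in its place).

```r
library(mzR)
library(msdata)  # assumption: example raw files shipped with this data package

# Pick one example raw file (hypothetical choice, replace with your own path)
f <- proteomics(full.names = TRUE)[1]

ms <- openMSfile(f)   # open a handle on the file; spectra are read on demand
hd <- header(ms)      # data.frame with one row of metadata per spectrum
table(hd$msLevel)     # how many MS1 and MS2 spectra the file contains
pk <- peaks(ms, 1)    # two-column matrix (m/z, intensity) for the first spectrum
head(pk)
close(ms)             # release the file handle
```

The handle-based design means that only the spectra that are requested are loaded into memory.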
MSnbase package: Convenient infrastructure for mass spectrometry and proteomics data analysis.
Laurent Gatto and Kathryn S. Lilley. MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 28, 288-289 (2012).
MSnSet class for quantitative data
Can be subsetted, transformed, visualised, annotated, analysed statistically, …
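A small sketch of what these operations can look like on an MSnSet. The data are simulated and all names (prot1, S1, group, note, ...) are hypothetical, chosen only for illustration.

```r
library(MSnbase)

set.seed(1)
# Hypothetical toy data: 5 proteins quantified in 4 samples
e <- matrix(rlnorm(20, meanlog = 10), nrow = 5,
            dimnames = list(paste0("prot", 1:5), paste0("S", 1:4)))
fd <- data.frame(gene = LETTERS[1:5], row.names = rownames(e))
pd <- data.frame(group = c("ctrl", "ctrl", "trt", "trt"),
                 row.names = colnames(e))

# Quantitative values, feature metadata and sample metadata in one object
x <- MSnSet(exprs = e, fData = fd, pData = pd)

# Subset: expression data and both metadata tables stay in sync
x2 <- x[1:3, x$group == "ctrl"]
dim(x2)

# Transform: log2 of the quantitative values
xl <- log(x, base = 2)

# Annotate: add a feature-level variable
fData(x)$note <- "toy"

# Visualise: distribution of log2 intensities per sample
boxplot(exprs(xl))
```

Keeping the three tables in a single container is what provides the integrity guarantee: a subset cannot leave the annotation out of step with the quantitative data.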
pRoloc package: A unifying analysis framework for spatial proteomics: visualisation, classification, novelty detection, transfer learning, Bayesian learning (coming soon).
Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley KS. Mass-spectrometry-based spatial proteomics data analysis using pRoloc and pRolocdata. Bioinformatics. 2014 May 1;30(9):1322-4.
Breckels LM, Mulvey CM, Lilley KS and Gatto L. A Bioconductor workflow for processing and analysing spatial proteomics data. F1000Research 2016, 5:2926 (doi: 10.12688/f1000research.10411.1).
plot2D(msnset, fcol = "loc",
method = "PCA")
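For a self-contained version of the call above, here is a sketch using an example dataset from pRolocdata. The choice of tan2009r1 and of its "markers" feature variable is an assumption made for illustration; the slide's own msnset object and "loc" column are not available here.

```r
library(pRoloc)
library(pRolocdata)

# Example spatial proteomics dataset (assumed stand-in for the slide's msnset)
data(tan2009r1)

# PCA plot of the protein profiles, coloured by the organelle marker annotation
plot2D(tan2009r1, fcol = "markers", method = "PCA")
addLegend(tan2009r1, fcol = "markers", where = "bottomright", cex = 0.6)
```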
Thank you for your attention