Published July 18, 2020 | Version 1.0
Journal article Open

mzMLb: a future-proof raw mass spectrometry data format based on standards-compliant mzML and optimized for speed and storage requirements

  • 1. Department of Population Health Sciences and Bristol Veterinary School, University of Bristol BS8 2BN, United Kingdom
  • 2. School of Biosciences and Phenome Centre Birmingham, University of Birmingham, Birmingham B15 2TT, United Kingdom
  • 3. Institute for Systems Biology, Seattle, Washington 98109, United States
  • 4. Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom

Description

With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise XML representation for data interchange, mzML, which has received substantial uptake; nevertheless, storage and file access efficiency have not been the main focus. We propose an HDF5 file format, ‘mzMLb’, that is optimised for both read/write speed and storage of the raw mass spectrometry data. We provide extensive validation of write speed, random read speed and storage size, demonstrating a flexible format that, with or without compression, is faster than all existing approaches in virtually all cases and, with compression, is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb’s design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilised by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.

Data files used in the paper.

Files

16_PBQC-10_522-16470.d.zip

Files (10.2 GB)

MD5 checksum                          Size
md5:fa065241363395928366cf43c8b41964  13.8 MB
md5:1beb7a1738d79db3fef1d6098e945c3c  17.0 MB
md5:d90201e8afa5c0b09454dc998ceb2f91  520.8 MB
md5:eadf6484ac16e6c98a9e758b5cbe1fba  565.0 MB
md5:781a52b2e3e52e7cc592c9ba0a1888ae  1.3 GB
md5:133a7ab6d88ca890d5b0b8bfcc3aad09  2.8 GB
md5:e5e07d53f4ad9cab6c8ff3ec37c179b6  38.1 MB
md5:76b078a169175c8ad5afa06d38ea86dd  527.7 MB
md5:f92c470a4fc40e5d5a65f03f896a7958  42.0 MB
md5:c652ca92d470f83c68006ae8549130e4  211.0 MB
md5:0558e4e30bc365f57fdf864b3fc7ff73  1.7 GB
md5:8cd1fd082622c7586249a7cac3b6d0cd  2.5 GB
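
Because several of these files are in the gigabyte range, it can be worth confirming that a download completed intact before using it. The following is a minimal sketch of checking a local file against one of the md5 values listed above, using only the Python standard library. The function names `md5_of_file` and `matches_listed_checksum` are illustrative and not part of this record or of any mzMLb tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks (1 MiB by default) so that multi-gigabyte
    downloads do not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'.

    The 'md5:' prefix is optional and the comparison ignores letter case.
    """
    expected = listed.strip()
    if expected.lower().startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected.lower()
```

For example, `matches_listed_checksum("download.zip", "md5:fa065241363395928366cf43c8b41964")` would return `True` only if the local file (here the hypothetical name `download.zip`) hashes to the first value in the list. MD5 is adequate for detecting transfer corruption, which is the purpose it serves here, but it is not a defence against deliberate tampering.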

Additional details

Funding

Belgium: Taming the application of statistics in proteomics and metabolomics BB/R021430/1
UK Research and Innovation
Bilateral NSF/BIO-BBSRC: Bayesian Quantitative Proteomics BB/M024954/1
UK Research and Innovation
PhosphoX-db: A web-based bioinformatics platform for studying non-canonical phosphorylation BB/R02216X/1
UK Research and Innovation
MICA: Delivering a production platform and atlas for next-generation biomarker discovery, validation and assay development in clinical proteomics MR/N028457/1
UK Research and Innovation
PROCESS - Proteomics data Collection, Software and Standards to support open access and long term management of data BB/K01997X/1
UK Research and Innovation