Motivation
Description of the need
mzML has served the mass spectrometry community well as an open, vendor-neutral interchange format. As instruments have become faster and experiments larger (ion-mobility separations, imaging, data-independent acquisition), several of mzML's design choices have become limiting:
- Size and access are at odds. mzML stores binary arrays as base64-encoded text inside XML. The encoding inflates files, and the usual remedy (whole-file gzip) destroys the random access needed to read a single spectrum or an extracted-ion chromatogram without decompressing everything.
- Imaging data fits poorly. Spatial coordinates have to be bolted on; imzML exists but integrates awkwardly with the rest of the ecosystem.
- Vendor metadata is lost. Much of what an instrument records has no place in the mzML model and is discarded at conversion time.
- Profile and centroid cannot coexist. A single mzML spectrum cannot carry both its raw profile signal and a centroided peak list at the same time.
- No encryption. mzML offers no mechanism for protecting parts of a file.
mzPeak is designed to remove these limitations while remaining open, language-agnostic, and grounded in the controlled vocabularies the community already uses.
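The size cost of base64 named in the first limitation can be checked with a few lines of Python. This is only an illustrative sketch, not part of mzML or mzPeak: the helper name `base64_overhead` and the synthetic m/z values are assumptions made for the example, and it ignores both the surrounding XML and the optional per-array compression that mzML permits before encoding.

```python
import base64
import struct


def base64_overhead(n_values: int) -> float:
    """Return encoded_size / raw_size for n_values little-endian float64 values.

    Hypothetical helper for illustration; the values are synthetic.
    """
    # Pack a synthetic, evenly spaced m/z axis as raw 64-bit floats.
    values = (100.0 + 0.01 * i for i in range(n_values))
    raw = struct.pack(f"<{n_values}d", *values)
    # base64 emits 4 ASCII characters for every 3 input bytes.
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


ratio = base64_overhead(3000)
print(f"{ratio:.3f}")  # prints 1.333
```

Whatever the array holds, the text form is about one third larger than the bytes it encodes, which is the inflation that whole-file gzip is usually brought in to offset.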
Issues to be addressed
The format is shaped by a small number of concrete goals:
- Compact storage with preserved random access — files smaller than mzML (often by half or more) that can still be sliced by spectrum, m/z range, or time range without decoding the whole file.
- Fast analytical queries — extracted-ion chromatograms and similar reads served efficiently through columnar page indices.
- Profile and centroid signal for the same spectrum, stored side by side.
- Richer metadata — room for the vendor and acquisition metadata mzML cannot express, without forcing it into a rigid schema.
- Additional modalities — ion mobility, imaging coordinates, wavelength spectra, and diagnostic traces, with room to grow.
- Optional encryption of sensitive parts of an archive.
- A stable, cross-language data substrate that many implementations can read and write independently.