Container & Archive¶
An mzPeak archive bundles several Parquet files and a JSON index under a single name. That bundle can be delivered three ways: as a ZIP archive, as an unpacked directory, or as a remote prefix (for example an HTTP server supporting range requests, S3, FTP, or WebDAV).
ZIP archives¶
To pack multiple Parquet tables together under a single file name on disk, we need a container format. mzPeak uses the ZIP archive to bundle the member files together. A ZIP file begins with a header containing its magic bytes, followed by a sequence of (header, file) blocks, and terminates with a central directory that records how to find each member.
Files in a ZIP may be stored compressed or uncompressed.
Members MUST be stored uncompressed
When mzPeak is stored in a ZIP, it MUST store its member files uncompressed. Uncompressed members can be read directly without an intervening decompression step to reveal the Parquet file, and each Parquet file already contains layered compression superior to that of a typical ZIP compressor.
Why not TAR?¶
TAR archives are designed for linear traversal: to learn what files an archive contains you must hop from header entry to header entry until you reach the end. Compared with ZIP's central directory, this is less efficient and more expensive on object stores. TAR also has no per-file encryption, which would make protecting the parts of an archive that are not Parquet files harder.
Unpacked archives¶
If an mzPeak archive is stored as an unpacked directory, the directory name is treated as the name of the run.
Encryption¶
Because Parquet supports modular encryption, individual Parquet files — and even individual columns or the footer metadata — can be encrypted while leaving the archive readable as a whole.
Open item — index visibility vs. encryption
Anything placed in mzpeak_index.json is necessarily visible in cleartext
to all readers unless ZIP encryption is used, and ZIP encryption is widely
known to be flawed and inconsistently implemented. Metadata inside a Parquet
footer's key–value pairs can be encrypted. The index is JSON for
convenience and for easy access from scripting languages; whether sensitive
fields should move into encryptable Parquet metadata is an open design
question.