Spectra
Spectra are described by up to three files: a signal file
(spectra_data.parquet), an optional peaks file
(spectra_peaks.parquet), and a metadata file
(spectra_metadata.parquet).
Profile vs. centroid — where each goes
By consensus at HUPO-PSI 2026, profile data always goes in
spectra_data.parquet and centroid data always goes in
spectra_peaks.parquet. When a dataset stores both representations of the same
spectrum, both files are present and the metadata row carries both
MS_1003060_number_of_data_points
and
MS_1003059_number_of_peaks, so
a reader knows which file(s) to read. A reader exposes a mode flag
(profile / centroid) that lets the caller choose which representation it wants.
For TimsTOF-style data that is centroided in m/z but profiled in ion
mobility, the consensus is to treat it as centroid for the mass-spectrum
dimension and place it in spectra_peaks.parquet.
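The routing rule above can be sketched as a small helper. This is an illustrative sketch, not part of the specification: the function name `file_for_mode` and the plain-dict metadata row are assumptions, and a missing or null count is treated as "representation not stored".

```python
from typing import Optional

DATA_FILE = "spectra_data.parquet"    # profile points
PEAKS_FILE = "spectra_peaks.parquet"  # centroid peaks


def file_for_mode(metadata_row: dict, mode: str) -> Optional[str]:
    """Return the file holding the requested representation, or None if absent.

    `metadata_row` is a hypothetical dict view of one spectrum's metadata row.
    """
    if mode not in ("profile", "centroid"):
        raise ValueError(f"unknown mode: {mode!r}")
    # A missing or null count is treated as zero stored values (assumption).
    n_points = metadata_row.get("MS_1003060_number_of_data_points") or 0
    n_peaks = metadata_row.get("MS_1003059_number_of_peaks") or 0
    if mode == "profile":
        return DATA_FILE if n_points > 0 else None
    return PEAKS_FILE if n_peaks > 0 else None
```

A caller that receives `None` can decide for itself whether to fall back to the other representation or to report that the requested mode is unavailable.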
Spectrum signal data — spectra_data.parquet
The spectrum signal data is encoded using either
point layout or
chunked layout. The entity index column MUST
be named spectrum_index, and a co-located time column, if written, SHOULD
be named spectrum_time. Non-mass spectra (UV, DAD) belong in
wavelength_spectra_data.parquet.
When null marking is used for profile data, follow the null semantics for signal data carefully.
Profile only
Only profile spectra are written here. Centroid spectra — including
centroided views of profile spectra when both modes are stored — MUST
instead be written to spectra_peaks.parquet.
The number of points written here for a spectrum MUST be recorded in the
MS_1003060_number_of_data_points
column of spectra_metadata.parquet, to support read planning.
Recommended Parquet encodings
| Column | Encoding |
|---|---|
| spectrum_index | delta encoding, ideal for repetitive or slowly increasing integers. |
| spectrum_time | byte stream split |
| m/z arrays | byte stream split (byte shuffling), or RLE dictionary when there is ion-mobility data. |
| ion-mobility arrays | RLE dictionary; byte shuffling tends not to help. Consider increasing the dictionary page size. |
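As a rough illustration, the recommendations in the table could be turned into writer options. The sketch below only builds a dictionary of keyword arguments intended for `pyarrow.parquet.write_table` (`use_dictionary`, `column_encoding`, `dictionary_pagesize_limit`). It assumes a point layout with flat columns, the column names `mz` and `ion_mobility` are placeholders that this section does not define, and the 8 MiB dictionary page limit is an arbitrary example value.

```python
def parquet_writer_options(has_ion_mobility: bool) -> dict:
    """Build keyword arguments for a Parquet writer from the table above.

    Column names "mz" and "ion_mobility" are hypothetical placeholders.
    """
    column_encoding = {
        "spectrum_index": "DELTA_BINARY_PACKED",  # delta encoding
        "spectrum_time": "BYTE_STREAM_SPLIT",     # byte stream split
    }
    if has_ion_mobility:
        # RLE dictionary for m/z and ion mobility; these columns must not
        # also appear in column_encoding.
        options = {
            "use_dictionary": ["mz", "ion_mobility"],
            "column_encoding": column_encoding,
            # Larger dictionary pages, as the table suggests (example value).
            "dictionary_pagesize_limit": 8 * 1024 * 1024,
        }
    else:
        column_encoding["mz"] = "BYTE_STREAM_SPLIT"
        options = {"use_dictionary": False, "column_encoding": column_encoding}
    return options
```

The result would be passed as `pq.write_table(table, path, **parquet_writer_options(...))`. Whether a given encoding is accepted for a given column type depends on the writer version, so treat this as a starting point to be checked against the writer in use.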
Spectrum peak data — spectra_peaks.parquet
This file holds the spectrum peak lists, stored separately from the raw signal in
spectra_data.parquet. The entity index column MUST be named
spectrum_index, and a co-located time column, if written, SHOULD be named
spectrum_time. Any centroid spectra MUST be written here, not to
spectra_data.parquet. The number of peaks written for a spectrum MUST be
recorded in the
MS_1003059_number_of_peaks column
of spectra_metadata.parquet, to support read planning.
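Because both count columns exist to support read planning, a writer or validator may want to confirm that they agree with what was actually written. The following is a minimal sketch under the assumption of a point layout (one row per point or peak); the function name, the plain-dict metadata rows, and the `index` key standing in for `spectrum.index` are illustrative only.

```python
from collections import Counter


def count_mismatches(index_column, metadata_rows, count_column):
    """Compare rows per spectrum_index against the count recorded in metadata.

    Returns {spectrum index: (recorded, observed)} for every disagreement.
    Assumes point layout, i.e. one table row per stored point or peak.
    """
    observed = Counter(index_column)
    mismatches = {}
    for row in metadata_rows:
        index = row["index"]
        recorded = row.get(count_column) or 0  # null count read as zero
        if observed.get(index, 0) != recorded:
            mismatches[index] = (recorded, observed.get(index, 0))
    return mismatches
```

The same function can be applied to spectra_data.parquet with `MS_1003060_number_of_data_points` and to spectra_peaks.parquet with `MS_1003059_number_of_peaks`.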
Spectrum metadata — spectra_metadata.parquet
This table uses the
packed parallel metadata table schema. Column
order is generally unspecified, but spectrum.index, scan.source_index,
precursor.source_index, and selected_ion.source_index MUST be the first
column of their respective facets. Where the lists below say MAY, that value
may be stored either as a column or as an entry in the
parameters list; a column
is usually the better choice when the value is present in most rows.
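Since a MAY value can live in either place, a reader typically needs to check both. The sketch below shows one way to do that. It assumes each entry in the parameters list is a dict with `accession` and `value` keys, which is a guess at the shape for illustration and is not defined in this section.

```python
def lookup_value(row: dict, column: str, accession: str):
    """Return a value from a dedicated column, else from the parameters list.

    `row` is a hypothetical dict view of one facet row. Parameter entries are
    assumed to be dicts with "accession" and "value" keys.
    """
    # Prefer the promoted column when it exists and is not null.
    if row.get(column) is not None:
        return row[column]
    # Fall back to scanning the parameters list for the same term.
    for param in row.get("parameters") or []:
        if param.get("accession") == accession:
            return param.get("value")
    return None
```

For example, `lookup_value(row, "MS_1000796_spectrum_title", "MS:1000796")` finds the spectrum title regardless of which storage form the writer chose.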
spectrum (group)
- index (uint64): the ascending 0-based index. MUST increment by 1 per entry and SHOULD be written in time-sorted ascending order. This is the primary key for the spectrum facet and the root unit of addressability.
- id (string): the "nativeID" string identifier, formatted per a native identifier format. The specific format SHOULD be given in the file-level metadata under file_description.source_files[0].parameters, as in mzML.
- time (float64): the data-acquisition start time. SHOULD be replicated from the parallel scan facet for simpler filtering; for a spectrum with multiple scans it SHOULD be the minimum value if the run is in acquisition-time order.
- MS_1000511_ms_level (integer): the MS stage number, or null for non-mass spectra.
- data_processing_ref (string): the identifier of a data_processing that governs this spectrum if it deviates from the default in run.default_data_processing_id; null otherwise.
- parameters (list): controlled or uncontrolled parameters; see the parameters list.
- number_of_auxiliary_arrays (integer): the count of auxiliary arrays in this row's auxiliary_arrays column; lets a reader cheaply decide whether to decode them.
- auxiliary_arrays (list): structures describing arrays that did not fit the arrays-and-columns constraints. These may be large; load eagerly with care.
- mz_delta_model (list of float64): parameters of the m/z delta model used to reconstruct null-marked data.
- MS_1000525_spectrum_representation (CURIE): e.g. MS:1000128 "profile spectrum" or MS:1000127 "centroid spectrum".
- MS_1000465_scan_polarity (integer): 1 (positive), -1 (negative), or null.
- MS_1000559_spectrum_type (CURIE): a child of MS:1000559, e.g. MS1 spectrum (MS:1000579), MSn spectrum (MS:1000580).
- MS_1003060_number_of_data_points (integer): profile points stored in spectra_data.parquet.
- MS_1003059_number_of_peaks (integer): discrete peaks stored in spectra_peaks.parquet.
- MAY supply a child of MS:1003058 (spectrum property) one or more times, e.g. base peak m/z, total ion current.
- MAY supply a child of MS:1000499 (spectrum attribute) one or more times, e.g. MS_1000796_spectrum_title.
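The two ordering rules on index (MUST increment by 1 from 0, SHOULD be time-sorted) lend themselves to a simple check. This is a sketch with a hypothetical function name that takes the index and time columns as plain Python sequences. It reports MUST violations as errors and SHOULD violations as warnings.

```python
def check_spectrum_order(indices, times):
    """Return (errors, warnings) for the spectrum index and time columns."""
    errors, warnings = [], []
    # MUST: index is 0-based and increments by exactly 1 per entry.
    for position, index in enumerate(indices):
        if index != position:
            errors.append(f"row {position}: index is {index}, expected {position}")
    # SHOULD: entries are written in ascending time order.
    for position in range(1, len(times)):
        if times[position] < times[position - 1]:
            warnings.append(f"row {position}: time decreases")
    return errors, warnings
```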
scan (group)
A scan or acquisition from the original raw file used to create a spectrum.
- source_index (uint64): the index of the spectrum this scan belongs to (foreign key).
- scan_index (uint64): the ascending 0-based index, incrementing by 1 per entry; uniquely identifies a scan, especially with multiple scans per spectrum (summing/averaging).
- spectrum_reference (string): another spectrum corresponding to this scan. For local spectra, its id; for external sources, a USI SHOULD be used. For unpublished collections, use USI000000 as the collection identifier with the id of a source file in file_description.source_files.
- instrument_configuration_ref (integer): the instrument_configuration governing this scan.
- parameters (list).
- ion_mobility_value (float64): optional ion-mobility measurement for this scan.
- ion_mobility_type (CURIE): optional; a child of MS:1002892.
- scan_windows.
- MAY supply children of MS:1000503 (scan attribute), MS:1000018 (scan direction, once), and MS:1000019 (scan law, once).
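To illustrate how source_index ties scans back to spectra, the sketch below groups scan rows by their parent spectrum and derives a spectrum-level time as the minimum over its scans, as recommended for the spectrum time column. The scan rows are plain dicts, and the key `scan_start_time` is a hypothetical name: this section does not say which column or parameter carries the per-scan time.

```python
from collections import defaultdict


def scans_by_spectrum(scans):
    """Group scan rows by the spectrum they belong to (source_index)."""
    grouped = defaultdict(list)
    for scan in scans:
        grouped[scan["source_index"]].append(scan)
    return dict(grouped)


def spectrum_time(scans_for_spectrum, time_key="scan_start_time"):
    """Return the earliest scan time, or None when no scan carries a time.

    `time_key` is a hypothetical field name used only for illustration.
    """
    times = [s[time_key] for s in scans_for_spectrum if s.get(time_key) is not None]
    return min(times) if times else None
```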
precursor (group)
The method of precursor-ion selection and activation.
- source_index (uint64): the spectrum this precursor belongs to (foreign key).
- precursor_index (uint64): the spectrum the precursor was created from (foreign key).
- precursor_id (string): the id of the spectrum referenced by precursor_index.
- isolation_window (group): the isolation/selection window.
  - parameters (list).
  - MUST supply children of MS:1000792 (isolation-window attribute) one or more times; promote to columns when available, e.g. isolation-window target m/z, lower offset, upper offset.
- activation (group): the activation/dissociation type and energy.
  - parameters (list).
  - MAY supply children of MS:1000510 (precursor activation attribute).
  - MUST supply MS:1000044 (dissociation method) or a child, one or more times.
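The three isolation-window values mentioned above combine into absolute m/z bounds in the usual way: the lower bound is the target minus the lower offset, and the upper bound is the target plus the upper offset. A minimal sketch follows, with a hypothetical function name and the values passed in directly rather than read from any particular column.

```python
def isolation_window_bounds(target_mz: float, lower_offset: float, upper_offset: float):
    """Return (lower m/z, upper m/z) of the isolation window.

    Inputs correspond to the isolation-window target m/z, lower offset,
    and upper offset; both offsets are expected to be non-negative.
    """
    if lower_offset < 0 or upper_offset < 0:
        raise ValueError("isolation-window offsets are expected to be non-negative")
    return (target_mz - lower_offset, target_mz + upper_offset)
```

A reader could use these bounds to decide, for example, which MS1 peaks fall inside the window that produced an MSn spectrum.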
selected_ion (group)
An ion isolated for dissociation.
- source_index (uint64): the spectrum this selected ion belongs to (foreign key).
- precursor_index (uint64): the spectrum the selected ion was created from (foreign key).
- ion_mobility_value (float64) / ion_mobility_type (CURIE, child of MS:1002892): optional.
- parameters (list).
- MUST supply a child of MS:1000455 (ion selection attribute) one or more times, e.g. selected-ion m/z, charge state, intensity.
Open item — generic ion-mobility storage
Is there a better way to make ion-mobility storage generic over its type ("ion mobility drift time", "inverse reduced ion mobility", "FAIMS compensation voltage")? Left open.