On-disk data format
Introduction
As a user of pdata, you don’t normally need to worry about the on-disk
data format, since you should be using pdata.analysis.dataview
for loading the data. This page is meant mainly for developers of
pdata itself, and other people who run into unexpected issues and need
some understanding of the inner workings of the package.
Warning
Despite the format being self-explanatory in principle,
you should always read in the data using
pdata.analysis.dataview, which provides plenty of useful
functions for automatically parsing data not just from the tabular
data stored with add_points, but also the instrument
parameters stored in the JSON files. Reimplementing these features
is essentially never a wise use of time. You can use
pdata.analysis.dataview for the initial parsing step even if
you use other tools later on in the analysis (see Analyzing
with other tools).
Specification
Each pdata data set is stored in its own directory. The directory contains the following files.
tabular_data.dat
Data table with rows added using add_points, and columns
defined as arguments of run_measurement.
Encoding: Only bytes corresponding to valid UTF-8 code points are allowed. In all comment rows, including header and footer, only ASCII characters 0-127 are allowed. Those are also valid single-byte UTF-8 characters.
New lines are encoded as \n. However, parsers must also
tolerate \r\n, since they can be present in legacy datasets.
Any row starting with # is a comment row. A row is considered
empty if it contains no characters other than the line termination, or
the # character and white space.
All non-empty non-comment rows are data rows. Each data row contains a
single data point as a tab-separated (\t) list of values, one
value per column. The tab-separated values can contain any character
except #, \t, or \n. add_points will
automatically replace these characters with spaces. Each data row,
including the last one, must end in \n.
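The row rules above can be sketched as a small classifier. This is a minimal illustration and not pdata's own parser: the function names classify_row and split_data_row are hypothetical, and the reading of "empty" (a bare line termination, or a # followed only by white space) is an assumption about the wording above.

```python
def classify_row(row: str) -> str:
    """Classify one raw line of tabular_data.dat as 'empty', 'comment' or 'data'."""
    # Tolerate both "\n" and the legacy "\r\n" line termination.
    content = row.rstrip("\r\n")
    if content == "":
        return "empty"
    if content.startswith("#"):
        # A comment row holding only '#' characters and white space counts as empty.
        if content.lstrip("#").strip() == "":
            return "empty"
        return "comment"
    return "data"


def split_data_row(row: str) -> list[str]:
    """Split a data row into its tab-separated values, still as strings."""
    return row.rstrip("\r\n").split("\t")
```

Converting the split strings into numbers is deliberately left out, since that depends on the column data types stored in the header.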
Numerical values must not include extra whitespace at the beginning or
end. Numerical values must not contain a leading plus
sign. Floating-point numbers must be in the locale-independent format
expected by the C++17 from_chars function. In particular, use a period
(.) as the decimal separator and no thousands separator.
Complex numbers must be formatted as <float0>+<float1>j, or
<float0>-<|float1|>j if float1 is negative. Both real
and imaginary parts must be present.
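As an illustration of the complex number rule, the following sketch formats a Python complex value in the <float0>+<float1>j form. The helper name format_complex is hypothetical, and using repr for the float parts is an assumption; it is not necessarily what add_points does internally.

```python
def format_complex(z: complex) -> str:
    """Format z as <float0>+<float1>j, or <float0>-<|float1|>j for a negative imaginary part."""
    sign = "-" if z.imag < 0 else "+"
    # repr() of a float uses '.' as the decimal separator and no thousands separator.
    return f"{z.real!r}{sign}{abs(z.imag)!r}j"


# The built-in complex() constructor parses the same textual form back.
text = format_complex(complex(1.5, -2.0))
value = complex(text)
```

Both the real and the imaginary part are always written out, even when one of them is zero.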
The last non-empty comment row contains the column names and units as
a tab-separated (\t) list of strings. Each string is of the
format <column name> (<units>), where the column names
(units) must match the regular expression [\w\d\s\-+%=/*&]+
([\w\d\s\-+%=/*&]*).
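A column row of this form could be split into (name, units) pairs roughly as follows. This is a sketch under assumptions: parse_column_row is a hypothetical helper, and the row is assumed to begin with a single # followed by white space.

```python
import re

# Character class taken from the specification above.
_CHARS = r"[\w\d\s\-+%=/*&]"
COLUMN_RE = re.compile(rf"({_CHARS}+) \(({_CHARS}*)\)")


def parse_column_row(row: str) -> list[tuple[str, str]]:
    """Parse '# <name> (<units>)\\t<name> (<units>)...' into (name, units) pairs."""
    body = row.rstrip("\r\n")[1:].strip()  # drop the leading '#' and surrounding white space
    columns = []
    for field in body.split("\t"):
        match = COLUMN_RE.fullmatch(field)
        if match is None:
            raise ValueError(f"Malformed column specification: {field!r}")
        columns.append((match.group(1), match.group(2)))
    return columns
```

For example, a row containing "frequency (Hz)" and "S21 (dB)" yields the pairs ("frequency", "Hz") and ("S21", "dB"); a dimensionless column is written with empty parentheses.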
All comment rows preceding the first data row are called the header. All comment rows after the last data row are defined as the footer.
The header contains a version number for the ondisk format, encoded as
# ondisk_format_version = <major>.<minor>.<patch>, following
Semantic Versioning. New in ondisk format
version 1.0.0.
The header contains the version numbers of the pdata, jsondiff, and
numpy packages as well as the version of Python, encoded as
# <package>_version = <major>.<minor>.<patch>. New in ondisk
format version 1.0.0.
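The version rows in the header can be picked out with a regular expression, for instance as sketched below. The helper name parse_version_row is hypothetical, and the sketch assumes a plain <major>.<minor>.<patch> triplet; rows with any other version string are simply not matched.

```python
import re

VERSION_RE = re.compile(r"^#\s*(\w+)\s*=\s*(\d+)\.(\d+)\.(\d+)\s*$")


def parse_version_row(row: str):
    """Return (key, (major, minor, patch)) for a version row, or None for any other row."""
    match = VERSION_RE.match(row.rstrip("\r\n"))
    if match is None:
        return None
    key = match.group(1)
    version = tuple(int(part) for part in match.group(2, 3, 4))
    return key, version
```

Comparing the resulting tuples, e.g. version >= (1, 1, 0), is a simple way to decide whether footer entries introduced in a later ondisk format version can be expected.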
The header contains the column data types, encoded as a tab-separated
(\t) list of strings. Each string has the format
<module>.<dtype>, where <module> is typically either
numpy or builtins. These help
pdata.analysis.dataview parse common dtypes (float, int,
complex, str, etc.) back into the correct type. New in ondisk format
version 1.0.0.
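A single <module>.<dtype> string can be mapped back to a Python type along these lines. This is only a sketch: resolve_dtype and ALLOWED_MODULES are hypothetical names, and restricting the lookup to an allowlist of modules is a design choice made here, not something the format prescribes.

```python
import importlib

# Only import modules that are expected to appear in the dtype row.
ALLOWED_MODULES = {"builtins", "numpy"}


def resolve_dtype(spec: str) -> type:
    """Turn a string such as 'builtins.float' into the corresponding type object."""
    module_name, _, type_name = spec.rpartition(".")
    if module_name not in ALLOWED_MODULES:
        raise ValueError(f"Refusing to import module {module_name!r}")
    return getattr(importlib.import_module(module_name), type_name)
```

The returned type can then be called on the string value from a data row, for example resolve_dtype("builtins.complex")("1.5-2.0j").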
The header (footer) contains a timestamp specifying when the
measurement started (ended), encoded as # Measurement started
(ended) at <timestamp>, where the timestamp has the format
%Y-%m-%d %H:%M:%S.%f. New in ondisk format version 1.0.0.
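The timestamp rows can be parsed with datetime.strptime using the format string given above. The helper name parse_timestamp_row is hypothetical; the row prefixes are taken from the specification.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
PREFIXES = ("# Measurement started at ", "# Measurement ended at ")


def parse_timestamp_row(row: str):
    """Return the datetime stored in a start/end timestamp row, or None for other rows."""
    content = row.rstrip("\r\n")
    for prefix in PREFIXES:
        if content.startswith(prefix):
            return datetime.strptime(content[len(prefix):].strip(), TIMESTAMP_FORMAT)
    return None
```

Subtracting the start timestamp from the end timestamp gives the measurement duration as a timedelta.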
The footer provides the total number of data rows, encoded as #
Number of data rows: <total number of data rows>. This is available
only after the measurement has ended. New in ondisk format version
1.1.0.
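The row count in the footer allows a simple consistency check, sketched below. The function name check_row_count is hypothetical, and the sketch counts every non-empty row that does not start with # as a data row.

```python
import re

ROW_COUNT_RE = re.compile(r"^# Number of data rows: (\d+)\s*$")


def check_row_count(lines):
    """Return (counted, declared); declared is None if the footer entry is absent."""
    counted = 0
    declared = None
    for line in lines:
        content = line.rstrip("\r\n")
        match = ROW_COUNT_RE.match(content)
        if match:
            declared = int(match.group(1))
        elif content and not content.startswith("#"):
            counted += 1
    return counted, declared
```

A declared value of None is expected while the measurement is still running, or for data sets written with ondisk format version 1.0.0.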
The footer contains a list of snapshot diff rows, encoded as #
Snapshot diffs preceding rows (0-based index): <row no>, <another row
no>, ..., where row numbers are defined in the same way as in the
snapshot.row-<n>.diff<m>.json filenames (see below). These are
intended to provide a consistency check, and are available only after
the measurement has ended. New in ondisk format version 1.0.0.
Optionally, this file may be compressed (.gz added to file name).
snapshot.json
Instrument parameter snapshot when run_measurement started,
encoded as a JSON file loadable with the json module, using the
standard decoder.
Optionally, this file may be compressed (.gz added to file name).
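For debugging purposes, the initial snapshot can be inspected directly with the standard library, as in the sketch below. The helper name load_snapshot is hypothetical; for routine analysis, pdata.analysis.dataview remains the recommended route.

```python
import gzip
import json
from pathlib import Path


def load_snapshot(data_dir):
    """Load snapshot.json, or snapshot.json.gz if only the compressed file exists."""
    data_dir = Path(data_dir)
    plain = data_dir / "snapshot.json"
    if plain.exists():
        with open(plain, "r", encoding="utf-8") as f:
            return json.load(f)
    # Fall back to the gzipped variant; gzip.open in text mode decodes on the fly.
    with gzip.open(data_dir / "snapshot.json.gz", "rt", encoding="utf-8") as f:
        return json.load(f)
```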
snapshot.row-<n>.diff<m>.json
jsondiff of parameter changes, recorded when there were <n> data rows in tabular_data.dat. <m> is a simple counter, in case multiple diffs are created for the same row.
Optionally, these files may be combined and compressed into a gzipped tarball (tar.gz added to file name).
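The diff filenames encode the order in which the diffs must be applied. The sketch below sorts them by <n> and then by <m>; the helper name sort_diff_filenames is hypothetical, and names that do not match the pattern are ignored.

```python
import re

DIFF_NAME_RE = re.compile(r"^snapshot\.row-(\d+)\.diff(\d+)\.json$")


def sort_diff_filenames(names):
    """Return the diff filenames ordered by row number <n>, then by counter <m>."""
    keyed = []
    for name in names:
        match = DIFF_NAME_RE.match(name)
        if match:
            keyed.append((int(match.group(1)), int(match.group(2)), name))
    return [name for _, _, name in sorted(keyed)]
```

Sorting numerically matters here: a plain string sort would place row-10 before row-2.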
The diffs are always in the compact format, and produced with
marshal=True and cls=pdata.helpers.PdataJSONDiffer as
options. The purpose of the custom JsonDiffer class is to handle Numpy
ndarrays and lists of only scalars as complete blocks, which is
important for maintaining reasonable speed.
The jsondiff documentation does not seem to include a clear specification of the format, so we specify it here.
Compact jsondiff format with marshal=True
The diff between a source dict and a target dict consists of a structure of nested dictionaries and lists. Let us call a sequence of keys specifying a leaf node or a subset of nodes in that diff a “path”.
If the path does not contain $delete, the target dict is
obtained from the source dict by following the path up to the point where the source and
target
log.txt
A copy of log messages recorded during the measurement (from the logging module).
input-history
A copy of the input given to IPython/Jupyter in the current session, up to the 500 most recent cells. Optional.
A copy of the Jupyter notebook (.ipynb)
A copy of the main measurement script. Optional and disabled by default. Only available in Jupyter Notebook, not in Jupyter Lab.
Changelog
v1.1.0 (from v1.0.0)
Added number of data rows as metadata in footer.
Specified that only ASCII characters 0…127 are allowed in comment rows in tabular_data.dat.
Specified better how numerical values should be formatted.
Specified that the last data row must also end in a new line character.
Motivation for the chosen format
Pdata is geared toward single-lab-scale physics experiments, such as superconducting qubit experiments, IV measurements, etc. This is in contrast to big-data experiments (e.g. collecting machine learning data sets).
An important goal of the data format is to be self-documenting, such that it is in principle straightforward for a competent programmer to figure out how to parse the data, even without the pdata source.
The format also aims to be stable enough that the latest version of
pdata.analysis.dataview is able to read any data set recorded
with any previous version of pdata.
Another important design criterion is that it must be possible to read the latest data in a separate analysis script (i.e. separate process) as soon as new data becomes available from the experiment.
Therefore the data format is:
Stream-like, i.e. the on-disk data set is a valid and up-to-date dataset at all times during an on-going experiment, and not only after the measurement ends.
Relatively verbose. In other words, optimizing file size or speed is not a top priority.
Based on text files and other widespread formats (.gz, .json).
Accompanied by a README file in each data directory.
Accompanied by a copy of the measurement script, if possible.
Note
An advantage of using gzipped files, besides the obvious benefit of smaller file size, is that gzipped files contain a checksum. This ensures that (post-measurement) data corruption does not go unnoticed.
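The checksum can be verified by decompressing the whole file, since Python's gzip module compares the stored CRC with the decompressed data when it reaches the end of the stream. The sketch below uses only the standard library; the helper name gzip_is_intact is hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at path decompresses fully and passes its CRC check."""
    try:
        with gzip.open(path, "rb") as f:
            # Read to the end of the stream; the CRC is checked once the end is reached.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Covers a bad header or CRC, a truncated file, and a corrupt compressed stream.
        # Note that a missing file also raises OSError and is reported as not intact.
        return False
    return True
```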
Note
A downside of the chosen data format is that it’s relatively slow to read from disk to memory. So if you are dealing with larger data sets, it’s highly recommended to split your analysis script into multiple steps and make use of caching parsed values and/or intermediate analysis results in cache files. There are several easy ways of doing that in Python, for example using pickle, numpy, or json.
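One simple way to implement such caching with pickle is sketched below. The helper name load_or_compute and the cache file name in the usage note are hypothetical; the cache file should be treated as disposable and deleted whenever the underlying data or the parsing code changes.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached result if the cache file exists; otherwise compute and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

A typical call would wrap the slow parsing step in a function without arguments, e.g. load_or_compute("parsed.pickle", parse_all), where parse_all stands for whatever slow loading routine the analysis script uses.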
Discussion on alternative formats
Here we have some notes on alternative formats, which are not used by pdata.
To simplify the task of having pdata.analysis.dataview support
all pdata datasets, including ones recorded with earlier versions of
pdata, changes to the on-disk data format are generally to be
avoided without very good reason.
Text based vs binary
Binary formats could offer better write and read speeds, assuming that implementation details are properly tuned. Reaching hardware-limited speed is, however, almost irrelevant for the vast majority of physics experiments that pdata is geared toward.
Binary cache files are also easy to create in Python and can be integrated as part of the data analysis workflow in most cases. Such cache files can (and should) be considered disposable, so they can be native to the system and can therefore provide unbeatable speed.
In general, any binary format is more opaque than a text-based format if you are faced with the challenge of reverse engineering it. With very widespread formats (e.g. .npy/.npz) this is less of a concern.
Numpy .npy/.npz
Numpy .npy/.npz would be a very reasonable binary format for the data rows of tabular data. The format is well-specified and stable and has a design philosophy similar to pdata’s, except that it’s binary.
HDF5
The main argument against using HDF5 is that the HDF5 specification is very complex (see 100+ page HDF5 specification vs .npy/.npz specification), without providing any clear advantage over .npz in the case of pdata. The complexity of the specification isn’t a problem from the point of view of routine use, since one, and only one, HDF5 library implementation exists. However, it could be non-trivial to debug issues in the unlikely event that bugs related to the HDF5 library were encountered.
Note
At first sight it seems tempting to encode snapshots as nested HDF5 groups, which would provide strong data typing. However, the overhead in file size is severe (~kB per group!).
Binary JSON
There are a few JSON-like formats that use binary encoding. These would potentially offer faster read speeds, while also being rather simple. This could be a benefit in use cases with very large snapshot diffs.
The main disadvantage is that there are several slightly-incompatible variants of these formats and none of them seems broadly adopted, although Mathematica supports UBJSON.