On-disk data format

Introduction

As a user of pdata, you don’t normally need to worry about the on-disk data format, since you should be using pdata.analysis.dataview for loading the data. This page is meant mainly for developers of pdata itself, and other people who run into unexpected issues and need some understanding of the inner workings of the package.

Warning

Despite the format being self-explanatory in principle, you should always read in the data using pdata.analysis.dataview, which provides plenty of useful functions for automatically parsing data not just from the tabular data stored with add_points, but also from the instrument parameters stored in the JSON files. Reimplementing these features is essentially never a wise use of time. You can use pdata.analysis.dataview for the initial parsing step even if you use other tools later on in the analysis (see Analyzing with other tools).

Specification

Each pdata data set is stored in its own directory. The directory contains the following files.

tabular_data.dat

Data table with rows added using add_points, and columns defined as arguments of run_measurement.

Encoding: Only bytes corresponding to valid UTF-8 code points are allowed. In all comment rows, including header and footer, only ASCII characters 0-127 are allowed. Those are also valid single-byte UTF-8 characters.

New lines are encoded as \n. However, parsers must also tolerate \r\n, since they can be present in legacy datasets.

Any row starting with # is a comment row. A row is considered empty if it contains no characters other than the line termination, or the # character and white space.

All non-empty non-comment rows are data rows. Each data row contains a single data point as a tab-separated (\t) list of values, one value per column. The tab-separated values can contain any character except #, \t, or \n. add_points will automatically replace these characters with spaces. Each data row, including the last one, must end in \n.
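The row rules above can be illustrated with a minimal sketch. The function names classify_row and split_data_row are hypothetical and not part of pdata. The sketch also assumes that a row consisting only of whitespace (without #) counts as empty, which is one reading of the definition above.

```python
def classify_row(line: str) -> str:
    """Return 'empty', 'comment', or 'data' for one line of tabular_data.dat."""
    body = line.rstrip("\r\n")  # tolerate legacy \r\n line endings
    # Assumption: a row holding only '#' characters and whitespace is empty.
    if not body.strip("# \t"):
        return "empty"
    if body.startswith("#"):
        return "comment"
    return "data"


def split_data_row(line: str) -> list:
    """Split a data row into its tab-separated values (still as strings)."""
    return line.rstrip("\r\n").split("\t")
```

For real analysis work, pdata.analysis.dataview remains the recommended entry point. This sketch is only meant to make the rules concrete.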

Numerical values must not include extra whitespace at the beginning or end. Numerical values must not contain a leading plus sign. Floating-point numbers must be in the locale-independent format expected by the C++17 from_chars function. In particular, use a period (.) as the decimal separator and no thousands separator.

Complex numbers must be formatted as <float0>+<float1>j, or <float0>-<|float1|>j if float1 is negative. Both real and imaginary parts must be present.
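As an illustration of the complex number format, here is a minimal sketch of a formatter and parser. The names format_complex and parse_complex are hypothetical, and whether pdata itself produces exactly the same digits is not stated here. The sketch relies on Python's repr for floats, which is locale-independent and never emits a leading plus sign.

```python
import math


def format_complex(z) -> str:
    """Format z as <float0>+<float1>j, or <float0>-<|float1|>j if float1 is negative."""
    z = complex(z)  # also converts NumPy complex scalars to a plain Python complex
    sign = "-" if math.copysign(1.0, z.imag) < 0 else "+"
    return f"{z.real!r}{sign}{abs(z.imag)!r}j"


def parse_complex(token: str) -> complex:
    """Parse a token written in the format above."""
    return complex(token)
```

Both the real and imaginary parts are always written, so a purely imaginary value such as 1j becomes 0.0+1.0j.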

The last non-empty comment row contains the column names and units as a tab-separated (\t) list of strings. Each string is of the format <column name> (<units>), where the column names (units) must match the regular expression [\w\d\s\-+%=/*&]+ ([\w\d\s\-+%=/*&]*).
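A minimal sketch of parsing the column row is shown below. The function name parse_column_row is hypothetical. The sketch assumes that the literal parentheses around the units are escaped when the expression above is used as an actual regular expression, and that the row starts with # followed by optional spaces.

```python
import re

# Character classes copied from the specification; the parentheses around
# the units are escaped here so that they match literally (an assumption).
_COLUMN_RE = re.compile(
    r"([\w\d\s\-+%=/*&]+) \(([\w\d\s\-+%=/*&]*)\)", re.ASCII
)


def parse_column_row(line: str) -> list:
    """Return a list of (column name, units) tuples from the column row."""
    body = line.rstrip("\r\n")
    if not body.startswith("#"):
        raise ValueError("the column row must be a comment row")
    columns = []
    for field in body[1:].strip(" ").split("\t"):
        match = _COLUMN_RE.fullmatch(field)
        if match is None:
            raise ValueError(f"malformed column label: {field!r}")
        columns.append((match.group(1), match.group(2)))
    return columns
```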

All comment rows preceding the first data row are called the header. All comment rows after the last data row are defined as the footer.

The header contains a version number for the ondisk format, encoded as # ondisk_format_version = <major>.<minor>.<patch>, following Semantic Versioning. New in ondisk format version 1.0.0.

The header contains the version numbers of the pdata, jsondiff, and numpy packages as well as the version of Python, encoded as # <package>_version = <major>.<minor>.<patch>. New in ondisk format version 1.0.0.
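The version rows can be collected with a short sketch like the following. The function name parse_versions is hypothetical, and values are kept as strings because the exact form of every version string (for example pre-release suffixes) is not specified here.

```python
import re

_VERSION_RE = re.compile(r"#\s*(\w+)_version\s*=\s*(\S+)")


def parse_versions(header_lines) -> dict:
    """Map each <name> in '# <name>_version = <value>' rows to its value."""
    versions = {}
    for line in header_lines:
        match = _VERSION_RE.fullmatch(line.strip())
        if match:
            versions[match.group(1)] = match.group(2)
    return versions
```

With this sketch the on-disk format version ends up under the key ondisk_format, alongside the package versions.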

The header contains the column data types, encoded as a tab-separated (\t) list of strings. Each string has the format <module>.<dtype>, where <module> is typically either numpy or builtins. These help pdata.analysis.dataview parse common dtypes (float, int, complex, str, etc.) back into the correct type. New in ondisk format version 1.0.0.

The header (footer) contains a timestamp specifying when the measurement started (ended), encoded as # Measurement started (ended) at <timestamp>, where the timestamp has the format %Y-%m-%d %H:%M:%S.%f. New in ondisk format version 1.0.0.
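The timestamp format corresponds directly to a datetime.strptime format string, as the following sketch shows. The function name parse_timestamp_row is hypothetical. The result is a naive datetime, since no time zone appears in the format.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

_PREFIXES = (
    ("# Measurement started at ", "started"),
    ("# Measurement ended at ", "ended"),
)


def parse_timestamp_row(line: str):
    """Return ('started' | 'ended', datetime), or None for any other row."""
    body = line.strip()
    for prefix, key in _PREFIXES:
        if body.startswith(prefix):
            return key, datetime.strptime(body[len(prefix):], TIMESTAMP_FORMAT)
    return None
```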

The footer provides the total number of data rows, encoded as # Number of data rows: <total number of data rows>. This is available only after the measurement has ended. New in ondisk format version 1.1.0.

The footer contains a list of snapshot diff rows, encoded as # Snapshot diffs preceding rows (0-based index): <row no>, <another row no>, ..., where row numbers are defined in the same way as in the snapshot.row-<n>.diff<m>.json filenames (see below). These are intended to provide a consistency check, and are available only after the measurement has ended. New in ondisk format version 1.0.0.
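The two footer rows above lend themselves to a simple consistency check. The sketch below extracts both values. The function name parse_footer is hypothetical, and the handling of an empty list of snapshot diff rows is an assumption.

```python
ROWS_PREFIX = "# Number of data rows:"
DIFFS_PREFIX = "# Snapshot diffs preceding rows (0-based index):"


def parse_footer(footer_lines):
    """Return (number of data rows, list of snapshot diff rows).

    Either element is None if the corresponding row is missing, for example
    while the measurement is still running.
    """
    n_rows = None
    diff_rows = None
    for line in footer_lines:
        body = line.strip()
        if body.startswith(ROWS_PREFIX):
            n_rows = int(body[len(ROWS_PREFIX):])
        elif body.startswith(DIFFS_PREFIX):
            rest = body[len(DIFFS_PREFIX):]
            # Assumption: an empty list is written as nothing after the colon.
            diff_rows = [int(s) for s in rest.split(",") if s.strip()]
    return n_rows, diff_rows
```

A reader could then compare the first value against the number of data rows actually parsed, and the second against the row numbers found in the snapshot diff filenames.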

Optionally, this file may be compressed (.gz added to file name).

snapshot.json

Instrument parameter snapshot when run_measurement started, encoded as a JSON file loadable with the json module, using the standard decoder.

Optionally, this file may be compressed (.gz added to file name).
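Loading the initial snapshot, whether compressed or not, can be sketched with the standard library alone. The function name load_snapshot is hypothetical, and UTF-8 is assumed as the text encoding of the JSON file.

```python
import gzip
import json
import os


def load_snapshot(data_dir: str):
    """Load snapshot.json, or snapshot.json.gz if only that exists."""
    plain = os.path.join(data_dir, "snapshot.json")
    if os.path.exists(plain):
        with open(plain, "r", encoding="utf-8") as f:
            return json.load(f)
    # Fall back to the gzipped variant; gzip.open in text mode decompresses
    # transparently.
    with gzip.open(plain + ".gz", "rt", encoding="utf-8") as f:
        return json.load(f)
```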

snapshot.row-<n>.diff<m>.json

jsondiff of parameter changes, recorded when there were <n> data rows in tabular_data.dat. <m> is a simple counter, in case multiple diffs are created for the same row.
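The filename pattern can be decoded with a small regular expression, as in this sketch. The function names are hypothetical. Sorting by (n, m) is an assumption about the order in which diffs were recorded, based on <m> being a counter within a row.

```python
import re

_DIFF_NAME_RE = re.compile(r"snapshot\.row-(\d+)\.diff(\d+)\.json")


def parse_diff_filename(name: str):
    """Return (n, m) for snapshot.row-<n>.diff<m>.json, else None."""
    match = _DIFF_NAME_RE.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sort_diff_filenames(names):
    """Keep only diff files and sort them numerically by (n, m)."""
    diffs = [name for name in names if parse_diff_filename(name) is not None]
    return sorted(diffs, key=parse_diff_filename)
```

Sorting numerically rather than alphabetically matters, since row 10 would otherwise be ordered before row 2.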

Optionally, these files may be combined and compressed into a gzipped tarball (tar.gz added to file name).

The diffs are always in the compact format, and produced with marshal=True and cls=pdata.helpers.PdataJSONDiffer as options. The purpose of the custom JsonDiffer class is to handle NumPy ndarrays and lists of only scalars as complete blocks, which is important for maintaining reasonable speed.

The jsondiff documentation does not seem to include a clear specification of the format. We therefore specify it here.

Compact jsondiff format with marshal=True

The diff between a source dict and a target dict consists of a structure of nested dictionaries and lists. Let us call a sequence of keys specifying a leaf node or a subset of nodes in that diff a “path”.

If the path does not contain $delete, the target dict is obtained from the source dict by following the path up to the point where the source and target

log.txt

A copy of log messages recorded during the measurement (from the logging module).

input-history

A copy of input given to IPython/Jupyter in the current session, up to 500 most recent cells. Optional.

A copy of the Jupyter notebook (.ipynb)

A copy of the main measurement script. Optional and disabled by default. Only available in Jupyter Notebook, not in Jupyter Lab.

Changelog

v1.1.0 (from v1.0.0)

Added number of data rows as metadata in footer.

Specified that only ASCII characters 0…127 are allowed in comment rows in tabular_data.dat.

Specified better how numerical values should be formatted.

Specified that the last data row must also end in a new line character.

Motivation for the chosen format

Pdata is geared toward single-lab-scale physics experiments, such as superconducting qubit experiments, IV measurements, etc. This is in contrast to big-data experiments (e.g. collecting machine learning data sets).

An important goal of the data format is to be self-documenting, such that it is in principle straightforward for a competent programmer to figure out how to parse the data, even without the pdata source.

The format also aims to be stable enough that the latest version of pdata.analysis.dataview is able to read any data set recorded with any previous version of pdata.

Another important design criterion is that it must be possible to read the latest data in a separate analysis script (i.e. separate process) as soon as new data becomes available from the experiment.

Therefore, the data format has the following properties:

Stream-like, i.e. the on-disk data set is a valid and up-to-date dataset at all times during an on-going experiment, and not only after the measurement ends.

Relatively verbose. In other words, optimizing file size or speed is not a top priority.

Based on text files and other widespread formats (.gz, .json).

Includes a README file in each data directory.

Includes a copy of the measurement script, if possible.
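The stream-like property can be illustrated with a sketch of how a separate analysis process might poll an uncompressed tabular_data.dat for new rows. The function name read_new_rows is hypothetical. The sketch assumes that a row is complete only once its terminating \n has been written, so any trailing partial row is left for the next call.

```python
def read_new_rows(path: str, offset: int = 0):
    """Return (complete rows after byte offset, new offset).

    Only rows terminated by a newline are returned; bytes of a trailing
    partial row are not consumed.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    end = chunk.rfind(b"\n") + 1  # 0 if the chunk holds no complete row
    complete = chunk[:end]
    # The final split element is always empty because 'complete' ends in \n.
    rows = [
        raw.rstrip(b"\r").decode("utf-8")
        for raw in complete.split(b"\n")[:-1]
    ]
    return rows, offset + end
```

Calling the function repeatedly with the returned offset yields each row exactly once, including comment rows, which the caller still needs to separate from data rows.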

Note

An advantage of using gzipped files, besides the obvious benefit of smaller file size, is that gzipped files contain a checksum. This ensures that (post-measurement) data corruption does not go unnoticed.
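The checksum can be verified from Python simply by decompressing the whole file, since the gzip module compares the stored CRC-32 against the decompressed data when it reaches the end of the stream. The function name gzip_is_intact in this sketch is hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the gzip file at path decompresses without errors.

    Reading to the end of the stream triggers the CRC-32 check. Note that
    a missing or unreadable file also yields False.
    """
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile (a subclass of OSError) signals a bad header or
        # CRC mismatch; EOFError signals a truncated file.
        return False
    return True
```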

Note

A downside of the chosen data format is that it’s relatively slow to read from disk to memory. So if you are dealing with larger data sets, it’s highly recommended to split your analysis script into multiple steps and make use of caching parsed values and/or intermediate analysis results in cache files. There are several easy ways of doing that in Python, for example using pickle, numpy, or json.
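One way to set up such a cache is sketched below using pickle. The helper name cached and its parameters are hypothetical and not part of pdata. The optional freshness check based on file modification times is a simple heuristic, not a guarantee.

```python
import os
import pickle


def cached(cache_path, compute, source_path=None):
    """Return compute(), reusing a pickle cache file when it is still fresh.

    If source_path is given, the cache is reused only if it is at least as
    new as the source (based on modification times).
    """
    if os.path.exists(cache_path):
        fresh = (
            source_path is None
            or os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
        )
        if fresh:
            with open(cache_path, "rb") as f:
                return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Here compute would typically wrap the slow parsing step, so that later runs of the analysis script skip it. Cache files written this way should be treated as disposable and only loaded from trusted locations, since unpickling can execute arbitrary code.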

Discussion on alternative formats

Here we have some notes on alternative formats, which are not used by pdata.

To simplify the task of having pdata.analysis.dataview support all pdata datasets, including ones recorded with earlier versions of pdata, changes to the on-disk data format are generally to be avoided without very good reason.

Text based vs binary

Binary formats could offer better write and read speeds, assuming that implementation details are properly tuned. Reaching hardware-limited speed is, however, almost irrelevant for the vast majority of physics experiments that pdata is geared toward.

Binary cache files are also easy to create in Python and can be integrated as part of the data analysis workflow in most cases. Such cache files can (and should) be considered disposable, so they can be native to the system and can therefore provide unbeatable speed.

In general, any binary format is more opaque than a text-based format if you are faced with the challenge of reverse engineering it. With very widespread formats (e.g. .npy/.npz) this is less of a concern.

Numpy .npy/.npz

Numpy .npy/.npz would be a very reasonable binary format for the data rows of tabular data. The format is well-specified and stable and has a design philosophy similar to pdata’s, except that it’s binary.

HDF5

The main argument against using HDF5 is that the HDF5 specification is very complex (see 100+ page HDF5 specification vs .npy/.npz specification), without providing any clear advantage compared to .npz, in the case of pdata. The complexity of the specification isn’t a problem from the point of view of routine use since one, and only one, HDF5 library implementation exists. However, it could be non-trivial to debug issues in the unlikely event that bugs related to the HDF5 library were encountered.

Note

At first sight it seems tempting to encode snapshots as nested HDF5 groups, which would provide strong data typing. However, the overhead in file size is severe (~kB per group!).

Binary JSON

There are a few variants of JSON-like formats but with binary encoding. These would potentially offer faster read speeds, while also being rather simple. This could be a benefit in use cases with very large snapshot diffs.

The main disadvantage is that there are several slightly-incompatible variants of these formats and none of them seems broadly adopted, although Mathematica supports UBJSON.