pdata: Straightforward and robust data storage for experimental data
This procedural data storage package provides a self-contained interface focused exclusively on storing and reading experimental data, using an approach independent of the specific measurement framework used for instrument control.
The main goals are to provide an interface that:
Automatically stores a lot of metadata, including parameters that change during a measurement.
Is procedural rather than functional in terms of the API the experimenter sees, as procedural programming tends to be easier to understand for a typical experimental physicist.
Uses standard Python flow-control constructs (for, while, if, etc.) for looping over setpoints.
The API aims to be self-explanatory, wherever possible.
In practice, the experimenter starts by defining the columns for a
traditional table of data points, and then calls an explicit
add_points(<new data points>)
to add rows to the table,
typically inside a few nested for loops that loop over setpoints. In
the background, pdata automatically records all changes to
instrument parameters each time add_points
is called.
On the analysis side, the main functionality of pdata is to provide
correct implementations for reading the traditional data table,
concatenating multiple data sets, as well as parsing the
automatically stored instrument parameters (with dataview
and add_virtual_dimension()
).
Pdata also provides basic helpers for quick data visualization (with
basic_plot()
), graphical dataset selection (with
data_selector()
), and live plotting (with
monitor_dir()
).
Pdata does not aim to be a fully-featured plotting and analysis library. Instead, we implement easy-to-use export capabilities of parsed data to other Python and non-Python tools (see Analyzing with other tools).
Below you find a short guide to using the most important features of
pdata. You can find the same content in an interactive Jupyter
notebook format under the examples
directory. For more
detailed documentation of each function, see the API page
Note
The only requirement for the framework used to control the
instruments is that the framework needs to be able to return
instrument parameters as a python dict object. Or more
specifically, a dict compatible with json and
jsondiff. In QCoDeS for example, you would
use qcodes.station.Station.default.snapshot()
.
Saving data in measurement script
Under the examples
directory,
you find an example of running a measurement using QCoDeS for instrument control, while using pdata
for data storage and regular Python for flow control (for loops, etc.).
The essential part of running the measurement is:
# Define a function that gets the current instrument settings from QCoDeS (as a dict)
import qcodes.station
get_qcodes_instrument_snapshot = lambda s=qcodes.station.Station.default: s.snapshot(update=True)
from pdata.procedural_data import run_measurement
# Columns are specified as (<name>, <unit>), or just <name> if the quantity is dimensionless.
with run_measurement(get_qcodes_instrument_snapshot,
columns = [("frequency", "Hz"),
"S21"],
name='power-sweep', # <-- arbitrary str descriptive of measurement type
data_base_dir=data_root) as m:
logging.warning('This test warning will (also) end up in log.txt within the data dir.')
data_path = m.path()
logging.warning(f'Data directory path = {m.path()}.')
for p in [-30, -20, -10]:
vna.power(p) # <-- note that this new value gets automatically stored in the data
freqs, s21 = vna.acquire_S21()
m.add_points({'frequency': freqs, 'S21': s21})
Note
By default, floats and complex numbers are serialized to
strings with 16 significant figures, corresponding to the
precision of 64-bit floats. Other data types are converted
to string by calling str()
, and replacing tabs and
new lines with spaces. The default serializers and data
types are inferred from the data types in the first data row
passed to add_points()
. You can override the
defaults for any column with a column specification like
this: (<column name>, <unit>, <formatter>, <dtype>)
,
where <formatter>
is a single-argument function that
converts the argument to a string (e.g. lambda x:
f"{x:.3e}"
if you wanted to optimize the file size).
<dtype>
is an optional data type class that is
stored in the tabular data header as metadata, and later
helps parse the data correctly.
Interrupting a measurement prematurely
You can call pdata.procedural_data.abort_measurements()
from any other Python
kernel running on the same machine to controllably abort all ongoing
measurements on the machine, after their next call to
add_points()
.
Warning
Do not interrupt a measurement by restarting the Python kernel. This can lead to sporadic and difficult-to-debug communication issues in case the kernel was in the middle of communicating with an external measurement instrument. The external instrument can be left waiting for an answer it never gets.
Reading data in analysis scripts
Here is how you read the data back from the above example using DataView:
from pdata.analysis.dataview import DataView, PDataSingle
# Read the data from disk into a PDataSingle object
# and then feed that into a DataView object for analysis
#
# PDataSingle and Dataview are separate because you can
# concatenate multiple data dirs into one DataView by
# adding multiple PDataSingle's to the array below.
d = DataView([ PDataSingle(data_path), ])
That will read the data table including all of the columns given as
arguments to run_measurement()
. In addition, we can add any
instrument setting as a virtual dimension that looks for the rest of
the analysis just like any other column in the data table:
d.add_virtual_dimension('VNA power', # <-- name of the new column
from_set=('instruments', 'vna',
'parameters', 'power', 'value'))
Note
The path given in the from_set
argument depends on
the framework you use for instrument control. You can
determine it using the graphical helper
dataexplorer.snapshot_explorer(d)
, or by manually
examining d.settings()[0][1]
or snapshot.json.gz.
If you’re using Jupyter, take a look at the HTML-formatted summary of
the parsed data, by typing d
in a new cell.
Often, you would next use basic_plot
to plot the data using a
one-liner, or export the parsed data to e.g. xarray for more involved
analysis. But here, for pedagogical purposes, let’s first take a more
verbose approach to see what DataView
can do.
Let’s run some print statements to see how to access the data columns:
print('Instruments in the snapshot file:')
print(d.settings()[0][1]['instruments'].keys())
# You can now access the columns by name:
print('\nFirst few frequencies: %s' % (d["frequency"][:5]))
print('First few powers: %s' % (d["VNA power"][:5]))
print('\nUnique powers in the data set: %s' % (np.unique(d["VNA power"])))
print('\nSweeps based on a per-sweep-fixed parameter: %s' % d.divide_into_sweeps('VNA power'))
print('\nSweeps based on a per-sweep-swept parameter: %s' % d.divide_into_sweeps('frequency'))
Which outputs the following:
Instruments in the snapshot file:
dict_keys(['vna'])
First few frequencies: [5.900e+09 5.905e+09 5.910e+09 5.915e+09 5.920e+09]
First few powers: [-30. -30. -30. -30. -30.]
Unique powers in the data set: [-30. -20. -10.]
Sweeps based on a per-sweep-fixed parameter: [slice(0, 41, None), slice(41, 82, None), slice(82, 123, None)]
Sweeps based on a per-sweep-swept parameter: [slice(0, 41, None), slice(41, 82, None), slice(82, 123, None)]
Let’s next divide the data into sweeps with
divide_into_sweeps
. We can then plot those sweeps with
matplotlib, as an example:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for s in d.divide_into_sweeps('frequency'):
#for s in d.divide_into_sweeps('VNA power'): # <-- This would work equally well.
dd = d.copy(); dd.mask_rows(s, unmask_instead=True) # Create a copy of d, and mask (i.e. hide) all rows except a single sweep
power = dd.single_valued_parameter('VNA power')
ax.plot(dd['frequency'], dd['S21'], label="%s dBm" % power)
ax.set(xlabel='f (Hz)', ylabel='S21')
ax.set_yscale('log')
ax.legend();
Analyzing with other tools (xarray, pandas, Matlab, etc.)
After you’ve used DataView to parse the data, you can easily export it, including virtual dimensions, to several other tools.
A convience function for conversion to xarray is included:
# Assuming the DataView d from example above:
xa = d.to_xarray("S21", coords=[ "frequency", "VNA power" ])
Note
Xarray is well-suited for N-dimensional parameter/coordinate
sweeps that were executed with nested for loops in which the
looped coordinate values in each loop were selected mostly
independent of other coordinates. More precisely, xarray
will be an efficient representation if the measured
coordinates (x,y,…,z) span most of X⊗Y…⊗Z, where X (Y)
[Z] is a set of all unique x (y) [z] found in the data set.
Otherwise there will be lots of empty values in the xarray,
which are filled with fill_value
(np.nan
by
default).
Note
Usually, you’ll want to use setpoints, rather than measured
values, as coordinates in an xarray. If a coordinate
c
is instead a measured value, you probably want to
specify coarse graining with coarse_graining={c:
<Delta>}
, which causes coordinates differing by at most
<Delta>
to be interpreted as the same coordinate.
Note
Note that if the same coordinate combination is repeated more than once in the data set, only the last measured value will appear in the output xarray. If you want to preserve information about repetitions, add another coordinate for the repetition number.
Note
Spaces, dashes and other special characters in coordinate names are replaced automatically by underscores, as these don’t work well with xarray syntax.
Converting a DataView object d
to a Pandas data frame:
import pandas # Assumes you've installed pandas
dataframe = pandas.DataFrame({col: d[col] for col in d.dimensions()})
You could further convert the Pandas dataframe to a CSV file, which can be read by many languages:
dataframe.to_csv("outputdata.csv")
If you want to work with Matlab, you can use savemat():
from scipy.io import savemat # Assumes you've installed scipy
savemat("outputdata.mat", {col: d[col] for col in d.dimensions()})
Data explorer
There are some helpers in pdata.analysis.dataexplorer
for
quick visualization of data sets.
Interactive data set selector
You can use data_selector
in a Jupyter notebook to create an
interactive element for easily selecting data sets from a given
directory:
sel = dataexplorer.data_selector(data_root)
display(sel)
Basic plots (one liners)
You can combine the interactive selector above with
basic_plot
, in a separate Jupyter notebook cell, to actually
generate a plot of S21 vs frequency for the selected data sets:
dataexplorer.basic_plot(data_root, sel.value, "frequency", "S21", ylog=True)
That will already create a plot similar to the above manually created
one, but to also add VNA power in the legend, you can add a
preprocessor
. In this example, the preprocessor adds VNA power
as a virtual dimensions so that it can be used as a
slowcoordinate
for the plot:
def add_vdims(dd):
dd.add_virtual_dimension('VNA power', units="dBm",
from_set=('instruments', 'vna',
'parameters', 'power', 'value'))
return dd
dataexplorer.basic_plot(data_root, sel.value,
"frequency", "S21",
slowcoordinate="VNA power",
ylog=True,
preprocessor=add_vdims)
Live plotting
monitor_dir
can be used to create the same plot as above. The
difference is that we don’t manually select the data sets. Instead the
plot will keep updating with new data, until you send an interrupt
signal to the kernel:
# Plot data directories that were modified < 600 s ago,
# and include "power-sweep" in the directory name.
# The cell runs indefinitely, until an interrupt is sent to the Python kernel.
dataexplorer.monitor_dir(data_root, name_filter="power-sweep", age_filter=600,
x="frequency", y="S21", ylog=True,
slowcoordinate="VNA power",
preprocessor=add_vdims);
Note
By default, monitor_dir
uses data_selector
for choosing which data directories to plot, and
basic_plot
for plotting them. You can, however,
override them by specifying custom selector
and
plotter
arguments to monitor_dir
, for easy
creation of arbitrarily complex live plots. The custom
functions just need to take the same arguments as
data_selector
and basic_plot
do.