Speed considerations
The script test/time_it.py contains some rudimentary timing
tests. You can run it yourself or check out the time_it.py example below.
fast_parser
The custom C++ pdata.analysis.fast_parser module improves
read speed by almost an order of magnitude, compared to versions
prior to pdata v2.7.0. It relies
largely on fast_float.
The fast_parser module is used by default as long as:
pdata v2.7.0 or newer is used, and
all the columns are of type
float,int,complexorstr, and
pdata.analysis.dataview.FAST_PARSER_ENABLEDis true. You can set this to false at runtime if you want to disablefast_parser.
Otherwise, numpy.genfromtxt is used.
Optimizing number of significant figures
By default, floats and complex numbers are serialized to strings with
16 significant figures, corresponding to the precision of 64-bit
floats. This is usually overkill. Both speed and dataset size can be
improved by providing a custom formatter when passing the column
definition (<column name>, <unit>, <formatter>, <dtype>) to
run_measurement.
For example, if the values you’re storing originate from a 10 or 12
bit ADC, you could safely store just four significant figures by using
(<column name>, <unit>, lambda x: f"{x:.3e})
Another example: If the values you’re storing originate from a digital
voltmeter with 10 µV resolution, you could use (<column name>,
'muV', str, int) and store the values in microvolts. In principle,
you could use 10 uV as units, but it would be less clear and would
provide very little extra performance gain.
Warning
Accidentally storing too few significant figures can be extremely annoying if you only notice the problem once it’s no longer easy to remeasure. Therefore, you should generally not optimize the number of significant figures if your data sets are small anyway. It’s also a good idea to include at least one or two extra digits beyond what you think you need.
time_it.py example
Results using pdata v2.1.1 and a bottom-shelf laptop (Intel Pentium N3700 @ 1.60GHz):
Adding 1M 2-column rows, with format=None and compress=False...
26.328 s per repetition.
Reading it to PDataSingle using fast_parser...
0.489 s per repetition.
Converting it to DataView...
1.582 s per repetition.
Reading it to PDataSingle using np.genfromtxt...
9.002 s per repetition.
Converting it to DataView...
1.649 s per repetition.
Adding 1M 2-column rows, with format=None and compress=True...
45.482 s per repetition.
Reading it to PDataSingle using fast_parser...
1.110 s per repetition.
Converting it to DataView...
1.638 s per repetition.
Reading it to PDataSingle using np.genfromtxt...
9.481 s per repetition.
Converting it to DataView...
1.649 s per repetition.
Adding 1M 2-column rows, with format=lambda x: "%.4e"%x and compress=False...
19.176 s per repetition.
Reading it to PDataSingle using fast_parser...
0.326 s per repetition.
Converting it to DataView...
1.589 s per repetition.
Reading it to PDataSingle using np.genfromtxt...
7.504 s per repetition.
Converting it to DataView...
1.669 s per repetition.
Adding 1M 2-column rows, with format=lambda x: "%.4e"%x and compress=True...
31.987 s per repetition.
Reading it to PDataSingle using fast_parser...
0.565 s per repetition.
Converting it to DataView...
1.617 s per repetition.
Reading it to PDataSingle using np.genfromtxt...
7.535 s per repetition.
Converting it to DataView...
1.580 s per repetition.