One of the capstone ideas of this course is automatically tracking the provenance of scientific data. In art, the "provenance" of a work is the history of who owned it, when, and where. In science, it's the record of how a particular result came to be: what raw data was processed by what version of what program to create which intermediate files, what was used to turn those files into Figure 6 of which paper, and so on.
A lot of people are building interesting tools to track data provenance—if you're interested, have a look at the Provenance Challenge web site (via the Internet Archive) and its successor, the Open Provenance initiative. I expect that their stuff will be in this course in two years' time; before then, though, we can explore the core ideas by extending and combining things we mentioned in our lectures on Python and version control.
To start with, suppose you have a text file combustion.dat in a Subversion repository. Run the following two commands:
$ svn propset svn:keywords Revision combustion.dat
$ svn commit -m "Turning on the 'Revision' keyword" combustion.dat
Now open the file in an editor. Assuming that the file format uses # as a comment character, add the following line somewhere near the top:
# $Revision:$
Save the file, and commit the change:
$ svn commit -m "Inserting the 'Revision' keyword" combustion.dat
If you open the file again, you'll see that Subversion has changed your line to something like:
# $Revision: 143$
i.e., Subversion has inserted the version number after the colon and before the closing $.
Here's what just happened. First, Subversion allows you to set properties for files and directories. These properties aren't in the files or directories themselves, but live in Subversion's own database. One of those properties, svn:keywords, tells Subversion to look in files that are being changed for strings of the form $propertyname: ...$, where propertyname is a string like Revision or Author (about half a dozen such strings are supported), and ... may be blank or not. If it sees such a string, Subversion rewrites it as the commit is taking place, replacing ... with the current version number, the name of the person making the change, or whatever else the property's name tells it to insert. You only have to add the string to the file once: after that, Subversion updates it for you every time the file changes.
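To make the substitution concrete, here is a minimal sketch of how a program might pull the name and value back out of such a string. The helper name extract_keyword and the regular expression are my own assumptions for illustration, not anything Subversion provides; the pattern allows optional whitespace around the value.

```python
import re

# Hypothetical helper (not part of Subversion): matches "$Name: value$",
# allowing optional whitespace on either side of the value.
KEYWORD_PATTERN = re.compile(r'\$(\w+):\s*([^$]*?)\s*\$')

def extract_keyword(text):
    '''Return (name, value) for the first keyword string in text, or
    None if there isn't one. The value is '' if the keyword has not
    been expanded yet.'''
    match = KEYWORD_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(extract_keyword('# $Revision: 143$'))   # ('Revision', '143')
print(extract_keyword('# $Revision:$'))       # ('Revision', '')
print(extract_keyword('# no keyword here'))   # None
```

Using a regular expression rather than slicing means the same helper works whether the keyword sits in a comment, a header line, or a Python string.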
Putting the version number in the file this way can be pretty handy. If you copy the file, for example, it carries its version number with it, so you can tell which version of the file you have even if it's outside version control. We can make it even more useful by being a bit clever about the design of our data format (or our parser, which amounts to the same thing). Suppose our original data format allows files to look like this:
# Raw combustion data
Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Readings:
8.3
7.2
9.6
4.0
8.2
7.0
Here's a bit of Python code that reads in such a file:
def parse_data_file(reader):
    '''Read a combustion data file from reader, returning a dictionary
    of properties and a list of readings.'''
    properties = {}
    values = []
    state = 'header'
    for line in reader:
        # Ignore comments and/or blank lines
        line = line.split('#')[0].strip()
        if line == '':
            continue
        # Still reading header?
        if state == 'header':
            key, value = line.split(':')
            # End of header, switching to actual readings?
            if key == 'Readings':
                state = 'readings'
            # Or store the property?
            else:
                key = key.strip()
                value = value.strip()
                properties[key] = value
        # Actual data readings
        elif state == 'readings':
            val = float(line)
            values.append(val)
        # Whoops: state should only be 'header' or 'readings'
        else:
            assert False, 'Unknown state "%s"' % state
    # Finished reading in loop, so return what we found
    return properties, values
Let's modify the code so that our file can look like this:
# Raw combustion data
$Revision: 143$
Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Readings:
8.3
7.2
9.6
4.0
8.2
7.0
The change is the new elif branch that handles keys starting with $ (and notice in passing that we were able to add this feature without changing anything else in our parsing function, which is a sign that our original design was a good one):
def parse_data_file(reader):
    '''Read a combustion data file from reader, returning a dictionary
    of properties and a list of readings.'''
    properties = {}
    values = []
    state = 'header'
    for line in reader:
        # Ignore comments and/or blank lines
        line = line.split('#')[0].strip()
        if line == '':
            continue
        # Still reading header?
        if state == 'header':
            key, value = line.split(':')
            # End of header, switching to actual readings?
            if key == 'Readings':
                state = 'readings'
            # NEW: a special property name from Subversion?
            elif key[0] == '$':
                assert value[-1] == '$', \
                    'Expected "$" at the end of SVN property value'
                key = key[1:]                # strip '$' off the front of key
                value = value[:-1].strip()   # and '$' plus spaces off the value
                properties[key] = value      # and store it
            # Or store the property?
            else:
                key = key.strip()
                value = value.strip()
                properties[key] = value
        # Actual data readings
        elif state == 'readings':
            val = float(line)
            values.append(val)
        # Whoops: state should only be 'header' or 'readings'
        else:
            assert False, 'Unknown state "%s"' % state
    # Finished reading in loop, so return what we found
    return properties, values
Why make this change? Because now any program that reads a combustion data file will know what version of that file it has, and can copy that version number forward to its output. For example, if the old output was:
Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Sliding-Average:
7.75
8.40
6.80
6.10
7.60
the new data can be:
Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Input-Version: 143
Sliding-Average:
7.75
8.40
6.80
6.10
7.60
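Here is one way the writing side might look: a minimal sketch, assuming the parser stored the file's revision under the key 'Revision' in the properties dictionary. The function name write_results and the fixed list of header keys are illustrative choices of mine, not part of the original program.

```python
import io

def write_results(writer, properties, values):
    '''Write header properties and values, copying the input file's
    revision forward as Input-Version (assumed to be stored under
    the key 'Revision').'''
    for key in ('Experiment-Date', 'Experiment-Site'):
        writer.write('%s: %s\n' % (key, properties[key]))
    # Only record a version if the input file actually carried one
    if 'Revision' in properties:
        writer.write('Input-Version: %s\n' % properties['Revision'].strip())
    writer.write('Sliding-Average:\n')
    for value in values:
        writer.write('%.2f\n' % value)

# Example: write to an in-memory buffer instead of a real file
props = {'Experiment-Date': '2010-10-29',
         'Experiment-Site': 'Hornings Mills',
         'Revision': '143'}
out = io.StringIO()
write_results(out, props, [7.75, 8.4, 6.8, 6.1, 7.6])
print(out.getvalue())
```

If the input file has no revision line, the output simply omits Input-Version rather than inventing one.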
It's only one line, but it's a big change: the output file now has a record of its provenance. And we can feed this forward using the same trick. For example, if we're averaging results of lots of experiments, we could arrange things so that our final output file—the one we hand over to IDL or MATLAB to generate the graph we need for our paper—would look like:
Input-File: 2010-10-29 / Hornings Mills / 143
Input-File: 2010-10-28 / Hornings Mills / 127
Input-File: 2010-10-28 / Danbury / 31
XY-Data:
0 1243
1 1157
2 1161
3 1104
4 1092
i.e., it knows its own provenance too. No more hunting around in a panic the night before the paper is due trying to figure out which data files we need to re-generate Figure 6 because the journal wants it as an SVG instead of as a PNG; no more nightmares about someone accusing us of faking our results because we can't... quite... reproduce that crucial table from three years ago.
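The Input-File lines could be generated from the properties of each input, roughly as follows. This is only a sketch: the function name input_file_lines is hypothetical, and I'm assuming each input's properties dictionary holds 'Experiment-Date', 'Experiment-Site', and 'Revision'.

```python
def input_file_lines(all_properties):
    '''Return one Input-File line per input, built from each input's
    date, site, and revision (assumed keys in each dictionary).'''
    lines = []
    for props in all_properties:
        lines.append('Input-File: %s / %s / %s' % (
            props['Experiment-Date'],
            props['Experiment-Site'],
            props['Revision'].strip()))
    return lines

# Example with the three inputs mentioned above
inputs = [
    {'Experiment-Date': '2010-10-29', 'Experiment-Site': 'Hornings Mills', 'Revision': '143'},
    {'Experiment-Date': '2010-10-28', 'Experiment-Site': 'Hornings Mills', 'Revision': '127'},
    {'Experiment-Date': '2010-10-28', 'Experiment-Site': 'Danbury', 'Revision': '31'},
]
for line in input_file_lines(inputs):
    print(line)
```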
But we're only half done. There's one more thing that we can and should do. Let's go into the program that turns raw combustion data into sliding averages and add the definitions of NAME and VERSION, plus the two lines in store_results that print them:
#!/usr/bin/env python
'''Calculate sliding averages on combustion data.'''

import sys

NAME = 'calc_sliding_average.py'
VERSION = '$Revision: 472$'
CORRECTION_FILTER = [...lots of magic numbers...]

def parse_data_file(reader):
    ...mumble mumble mumble read read read...
    return properties, values

def sliding_average(raw_properties, raw_values, correction_filter):
    ...mumble mumble mumble math math math...
    return result_properties, result_values

def store_results(writer, properties, values):
    # Record which program, and which revision of it, produced this output
    print('Program-Name:', NAME, file=writer)
    print('Version:', VERSION[1:-1].split(':')[1].strip(), file=writer)
    ...mumble mumble mumble write properties and values as before...

if __name__ == '__main__':
    reader = open(sys.argv[1], 'r')
    props, vals = parse_data_file(reader)
    reader.close()
    props, vals = sliding_average(props, vals, CORRECTION_FILTER)
    writer = open(sys.argv[2], 'w')
    store_results(writer, props, vals)
    writer.close()
Please don't be frightened by the line that prints out the program's version number. The expression VERSION[1:-1].split(':')[1].strip() is just VERSION with the leading and trailing $ sliced off, split at the colon, and with the whitespace stripped from the second piece, which leaves the bare revision number. Look instead at the two lines near the top that define the variables NAME and VERSION. The first is, well, it's the name of the program that's calculating sliding averages. The second includes the revision number that exactly identifies which version of the program is used. Crucially, both pieces of data are embedded in strings, rather than in comments, so that the program has access to them and, in particular, can include them in its output. When this program runs, its output doesn't just record the provenance of the data that was manipulated: it also records the identity of the program that did the manipulating.
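The version-extracting expression is easier to follow when it is unpacked one step at a time. The intermediate variable names below are mine, added purely for illustration:

```python
VERSION = '$Revision: 472$'

without_dollars = VERSION[1:-1]       # 'Revision: 472'
pieces = without_dollars.split(':')   # ['Revision', ' 472']
number_with_space = pieces[1]         # ' 472'
number = number_with_space.strip()    # '472'

print(number)
```

Note that the result is still a string; wrap it in int() if you need to compare revisions numerically.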
And we can go even further. Suppose I'm writing a library of data smoothing functions that other people's programs will use, but I'm not writing their I/O routines. How do I get my library's version number included in their output? Short answer: I can't; they have to do it. Long answer: I can't, but I can help them do the right thing by doing this:
# In Hamming library
VERSION = '$Revision: 318$'
so that they can do this:
# In program
import hamming
VERSION = '$Revision: 7$'                # The program's version number
ALL_VERSIONS = [VERSION, hamming.VERSION]  # All relevant version numbers
Now, instead of just printing its own version number, the data smoothing program can also record the identities of the libraries it was using. That way, if the program itself doesn't change, but the Hamming library was upgraded, there's a record of the change in the output file.
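As a rough sketch of how those versions might end up in the output, the snippet below pairs each component's name with its revision string and formats one header line per component. The stand-in hamming object, the Component-Version header name, and the helper functions are all assumptions made for this example; a real program would import the actual library.

```python
from types import SimpleNamespace

# Stand-in for the real library so this example runs on its own
hamming = SimpleNamespace(VERSION='$Revision: 318$')

VERSION = '$Revision: 7$'   # the program's own version string

def revision_number(keyword_string):
    '''Extract '318' from a string like '$Revision: 318$'.'''
    return keyword_string[1:-1].split(':')[1].strip()

def version_lines(components):
    '''Return one header line per (name, keyword_string) pair.
    The 'Component-Version' header name is hypothetical.'''
    return ['Component-Version: %s %s' % (name, revision_number(version))
            for name, version in components]

components = [('calc_sliding_average.py', VERSION),
              ('hamming', hamming.VERSION)]
for line in version_lines(components):
    print(line)
```

Keeping the component names next to the revision numbers matters: a bare list of numbers doesn't say which number belongs to which piece of code.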
And there's one more thing, which a surprising number of people overlook. Suppose my smoothing program's output depends on a sensitivity parameter that's provided on the command line, like this:
$ calc_sliding_average --sensitivity 0.05 todays_data.dat todays_results.dat
When it writes its output file, it must record the value of sensitivity along with everything else, or it won't be possible to reproduce the calculation reliably. Carrying on with the earlier example, here's the change:
Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Sensitivity: 0.05
Input-Version: 143
Sliding-Average:
7.75
8.40
6.80
6.10
7.60
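One way to make sure the parameter is never forgotten is to build the header line directly from the parsed command-line arguments. The sketch below uses Python's standard argparse module; the function names and the decision to make --sensitivity a required option are my assumptions, not something the original program is known to do.

```python
import argparse

def parse_args(argv):
    '''Parse the command line used in the example above.'''
    parser = argparse.ArgumentParser(
        description='Calculate sliding averages on combustion data.')
    # Required here so that there is no hidden default to forget about
    parser.add_argument('--sensitivity', type=float, required=True)
    parser.add_argument('infile')
    parser.add_argument('outfile')
    return parser.parse_args(argv)

def parameter_lines(args):
    '''Return header lines recording every parameter that affects the result.'''
    return ['Sensitivity: %s' % args.sensitivity]

args = parse_args(['--sensitivity', '0.05',
                   'todays_data.dat', 'todays_results.dat'])
for line in parameter_lines(args):
    print(line)
```

If the program grows more options, adding each one to parameter_lines keeps the output file's record in step with the command line.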
At this point, I hope you have realized two things:
This brings us back to the Provenance Challenge and Open Provenance efforts. You shouldn't have to write your own provenance tracking software, any more than you should have to write your own FFT functions. When they're done, tracking provenance in a transferable way—i.e., in the same format as everyone else, so that everyone's tools and libraries can play nicely together—should be as easy as multiplying matrices. We're not there yet, though, so until we are, please: