Teaching basic lab skills
for research computing

Essays: Provenance

One of the capstone ideas of this course is automatically tracking the provenance of scientific data. In art, the "provenance" of a work is the history of who owned it, when, and where. In science, it's the record of how a particular result came to be: what raw data was processed by what version of what program to create which intermediate files, what was used to turn those files into Figure 6 of which paper, and so on.

A lot of people are building interesting tools to track data provenance—if you're interested, have a look at the Provenance Challenge web site (via the Internet Archive) and its successor, the Open Provenance initiative. I expect that their stuff will be in this course in two years' time; before then, though, we can explore the core ideas by extending and combining things we mentioned in our lectures on Python and version control.

To start with, suppose you have a text file combustion.dat in a Subversion repository. Run the following two commands:

$ svn propset svn:keywords Revision combustion.dat
$ svn commit -m "Turning on the 'Revision' keyword" combustion.dat

Now open the file in an editor. Assuming that the file format uses # as a comment character, add the following line somewhere near the top:

# $Revision:$

Save the file, and commit the change:

$ svn commit -m "Inserting the 'Revision' keyword" combustion.dat

If you open the file again, you'll see that Subversion has changed your line to something like:

# $Revision: 143$

i.e., Subversion has inserted the version number after the colon and before the closing $.

Here's what just happened. First, Subversion allows you to set properties for files and directories. These properties aren't in the files or directories themselves, but live in Subversion's own database. One of those properties, svn:keywords, tells Subversion to look in files that are being changed for strings of the form $propertyname: ...$, where propertyname is a string like Revision or Author (about half a dozen such strings are supported), and ... may be blank or not. If it sees such a string, Subversion rewrites it as the commit is taking place to replace ... with the current revision number, the name of the person making the change, or whatever else the property's name tells it to do. You only have to add the string to the file once: after that, Subversion updates it for you every time the file changes.
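The rewriting step described above can be imitated in a few lines of Python. This is only a toy sketch to make the idea concrete: it is not Subversion's code, the function name expand_keywords and the values dictionary are made up for illustration, and the exact spacing Subversion writes inside the expanded string may differ from what is shown here.

```python
import re

def expand_keywords(text, values):
    """Toy imitation of keyword expansion (NOT Subversion's implementation).

    Rewrites '$Name$', '$Name:$' or '$Name: anything$' as '$Name: value$'
    for every Name found in the hypothetical dictionary `values`.
    """
    def replace(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)      # unknown keyword: leave untouched
        return '$%s: %s$' % (name, values[name])

    # '$', a keyword name, an optional ':...' part, then the closing '$'
    return re.sub(r'\$(\w+)(?::[^$\n]*)?\$', replace, text)


before = '# $Revision:$\nReadings:\n8.3\n'
after = expand_keywords(before, {'Revision': '143'})
print(after)

# The string only has to be added once: a later "commit" updates it in place.
again = expand_keywords(after, {'Revision': '144'})
print(again)
```

Because the pattern matches both the blank and the already-filled form, running it a second time updates the number rather than nesting it, which mirrors the "add it once" behavior described above.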

Putting the version number in the file this way can be pretty handy. If you copy the file, for example, it carries its version number with it, so you can tell which version of the file you have even if it's outside version control. We can make it even more useful by being a bit clever about the design of our data format (or our parser, which amounts to the same thing). Suppose our original data format allows files to look like this:

# Raw combustion data

Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Readings:
8.3
7.2
9.6
4.0
8.2
7.0

Here's a bit of Python code that reads in such a file:

def parse_data_file(reader):
  '''Read a combustion data file from reader, returning a dictionary
  of properties and a list of readings.'''

  properties = {}
  values = []
  state = 'header'
  for line in reader:

      # Ignore comments and/or blank lines
      line = line.split('#')[0].strip()
      if line == '':
          continue

      # Still reading header?
      if state == 'header':
          key, value = line.split(':')

          # End of header, switching to actual readings?
          if key == 'Readings':
              state = 'readings'

          # Or store the property?
          else:
              key = key.strip()
              value = value.strip()
              properties[key] = value

      # Actual data readings
      elif state == 'readings':
          val = float(line)
          values.append(val)

      # Whoops: state should only be 'header' or 'readings'
      else:
          assert False, 'Unknown state "%s"' % state

  # Finished reading in loop, so return what we found
  return properties, values

Let's modify the code so that our file can look like this:

# Raw combustion data
$Revision: 143$

Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Readings:
8.3
7.2
9.6
4.0
8.2
7.0

The change is the new elif branch that handles keys beginning with $ (and notice in passing that we were able to add this feature without changing anything else in our parsing function, which is a sign that our original design was a good one):

def parse_data_file(reader):
  '''Read a combustion data file from reader, returning a dictionary
  of properties and a list of readings.'''

  properties = {}
  values = []
  state = 'header'
  for line in reader:

      # Ignore comments and/or blank lines
      line = line.split('#')[0].strip()
      if line == '':
          continue

      # Still reading header?
      if state == 'header':
          key, value = line.split(':')

          # End of header, switching to actual readings?
          if key == 'Readings':
              state = 'readings'

          # A special property name from Subversion?
          elif key[0] == '$':
              assert value.endswith('$'), \
                     'Expected "$" at the end of SVN property value'
              key = key[1:]                # strip '$' off the front of key
              value = value[:-1].strip()   # and off the end of the value
              properties[key] = value      # and store it

          # Or store the property?
          else:
              key = key.strip()
              value = value.strip()
              properties[key] = value

      # Actual data readings
      elif state == 'readings':
          val = float(line)
          values.append(val)

      # Whoops: state should only be 'header' or 'readings'
      else:
          assert False, 'Unknown state "%s"' % state

  # Finished reading in loop, so return what we found
  return properties, values
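To see the parser at work without a file on disk, here is a small sketch that feeds it the sample data through an in-memory reader. The function is a condensed restatement of the one above so that the example runs on its own; it uses Python 3, and it strips whitespace from the revision value, which is an assumption about what callers want.

```python
import io

def parse_data_file(reader):
    """Condensed restatement of the parser above: same states, same rules."""
    properties, values, state = {}, [], 'header'
    for line in reader:
        line = line.split('#')[0].strip()    # drop comments and whitespace
        if line == '':
            continue
        if state == 'header':
            key, value = line.split(':', 1)
            if key == 'Readings':
                state = 'readings'
            elif key.startswith('$'):
                # Subversion keyword: remove the '$' markers on both sides
                assert value.endswith('$'), 'Expected "$" at end of value'
                properties[key[1:]] = value[:-1].strip()
            else:
                properties[key.strip()] = value.strip()
        else:
            values.append(float(line))
    return properties, values


sample = io.StringIO('''# Raw combustion data
$Revision: 143$

Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Readings:
8.3
7.2
''')

props, vals = parse_data_file(sample)
print(props['Revision'])   # the revision travels with the other properties
print(vals)
```

The revision number ends up in the same dictionary as Experiment-Date and Experiment-Site, so downstream code does not need a separate mechanism to find it.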

Why make this change? Because now any program that reads a combustion data file will know what version of that file it has, and can copy that version number forward to its output. For example, if the old output was:

Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Sliding-Average:
7.75
8.40
6.80
6.10
7.60

the new data can be:

Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Input-Version: 143

Sliding-Average:
7.75
8.40
6.80
6.10
7.60

It's only one line, but it's a big change: the output file now has a record of its provenance. And we can feed this forward using the same trick. For example, if we're averaging results of lots of experiments, we could arrange things so that our final output file—the one we hand over to IDL or MATLAB to generate the graph we need for our paper—would look like:

Input-File: 2010-10-29 / Hornings Mills / 143
Input-File: 2010-10-28 / Hornings Mills / 127
Input-File: 2010-10-28 / Danbury / 31
XY-Data:
0 1243
1 1157
2 1161
3 1104
4 1092

i.e., it knows its own provenance too. No more hunting around in a panic the night before the paper is due trying to figure out which data files we need to re-generate Figure 6 because the journal wants it as an SVG instead of as a PNG; no more nightmares about someone accusing us of faking our results because we can't... quite... reproduce that crucial table from three years ago.
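One way the Input-File lines might be produced is sketched below. The function name write_combined_output and the shape of its arguments are assumptions made for illustration: it takes a list of property dictionaries, such as a parser would return for each input file, and expects each one to contain a Revision entry.

```python
import io

def write_combined_output(writer, inputs, xy_data):
    """Write one Input-File line per source file, then the XY data.

    `inputs` is a list of property dictionaries (hypothetical shape);
    each must hold 'Experiment-Date', 'Experiment-Site' and 'Revision'.
    """
    for props in inputs:
        writer.write('Input-File: %s / %s / %s\n' % (
            props['Experiment-Date'],
            props['Experiment-Site'],
            props['Revision']))
    writer.write('XY-Data:\n')
    for x, y in xy_data:
        writer.write('%d %d\n' % (x, y))


inputs = [
    {'Experiment-Date': '2010-10-29',
     'Experiment-Site': 'Hornings Mills',
     'Revision': '143'},
    {'Experiment-Date': '2010-10-28',
     'Experiment-Site': 'Danbury',
     'Revision': '31'},
]

out = io.StringIO()
write_combined_output(out, inputs, [(0, 1243), (1, 1157)])
print(out.getvalue())
```

Looking up props['Revision'] directly means a data file without a revision keyword raises a KeyError instead of silently producing output with no provenance, which is arguably the safer failure.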

But we're only half done. There's one more thing that we can and should do. Let's go into the program that turns raw combustion data into sliding averages and add the definitions of NAME and VERSION near the top, plus the two lines in store_results that print them:

#!/usr/bin/env python
'''Calculate sliding averages on combustion data.'''

import sys

NAME = 'calc_sliding_average.py'
VERSION = '$Revision: 472$'
CORRECTION_FILTER = [...lots of magic numbers...]

def parse_data_file(reader):
  ...mumble mumble mumble read read read...
  return properties, values

def sliding_average(raw_properties, raw_values, correction_filter):
  ...mumble mumble mumble math math math...
  return result_properties, result_values

def store_results(writer, properties, values):
  print('Program-Name:', NAME, file=writer)
  print('Version:', VERSION[1:-1].split(':')[1].strip(), file=writer)
  ...mumble mumble mumble write properties and values as before...

if __name__ == '__main__':
  reader = open(sys.argv[1], 'r')
  props, vals = parse_data_file(reader)
  reader.close()

  props, vals = sliding_average(props, vals, CORRECTION_FILTER)

  writer = open(sys.argv[2], 'w')
  store_results(writer, props, vals)
  writer.close()

Please don't be frightened by the line that prints out the program's version number—the expression VERSION[1:-1].split(':')[1].strip() is just:

  1. taking everything except the first and last character of VERSION;
  2. splitting it on colons;
  3. taking the second part of the result (remember, Python counts from 0, so the second part's index is 1); and
  4. stripping off leading and trailing whitespace.
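The four steps above can be traced one at a time, with each intermediate value given its own name (the names inner, parts, second, and number are just for illustration):

```python
VERSION = '$Revision: 472$'

inner = VERSION[1:-1]       # step 1: 'Revision: 472'
parts = inner.split(':')    # step 2: ['Revision', ' 472']
second = parts[1]           # step 3: ' 472'
number = second.strip()     # step 4: '472'

print(number)
```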

Look instead at the two lines near the top of the program that define the variables NAME and VERSION. The first is, well, it's the name of the program that's calculating sliding averages. The second includes the revision number that exactly identifies which version of the program is used. Crucially, both pieces of data are embedded in strings, rather than in comments, so that the program has access to them, and in particular so that it can include them in its output. When this program runs, its output doesn't just record the provenance of the data that was manipulated: it also records the identity of the program that did the manipulating.

And we can go even further. Suppose I'm writing a library of data smoothing functions to be used by other people's programs, but not the I/O routines. How do I get my library's version number included in their output? Short answer: I can't—they have to do it. Long answer: I can't, but I can help them do the right thing by doing this:

# In Hamming library
VERSION = '$Revision: 318$'

so that they can do this:

# In program
import hamming

VERSION = '$Revision: 7$'                # The program's version number

ALL_VERSIONS = [VERSION, hamming.VERSION]  # All relevant version numbers

Now, instead of just printing its own version number, the data smoothing program can also record the identities of the libraries it was using. That way, if the program itself doesn't change, but the Hamming library was upgraded, there's a record of the change in the output file.
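Here is a sketch of how a program might write those version numbers into its output. Several things here are assumptions: the hamming module is replaced by a stand-in object so the example runs on its own, each version string is paired with a component name (the original list holds only the strings), and the Component-Version header is a made-up name.

```python
import io
import types

# Stand-in for 'import hamming' (hypothetical; no such module is assumed
# to be installed). A real library would define VERSION in its own source.
hamming = types.SimpleNamespace(VERSION='$Revision: 318$')

VERSION = '$Revision: 7$'    # the program's own version string

# Pair each version string with a name so the output says what it refers to.
ALL_VERSIONS = [
    ('calc_sliding_average.py', VERSION),
    ('hamming', hamming.VERSION),
]

def revision_number(keyword_string):
    """Extract '318' from a string like '$Revision: 318$'."""
    return keyword_string[1:-1].split(':')[1].strip()

def write_versions(writer, versions):
    """Write one Component-Version line (hypothetical header) per entry."""
    for name, keyword_string in versions:
        writer.write('Component-Version: %s %s\n'
                     % (name, revision_number(keyword_string)))


out = io.StringIO()
write_versions(out, ALL_VERSIONS)
print(out.getvalue())
```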

And there's one more thing, which a surprising number of people overlook. Suppose my smoothing program's output depends on a sensitivity parameter that's provided on the command line, like this:

$ calc_sliding_average --sensitivity 0.05 todays_data.dat todays_results.dat

When it writes its output file, it must record the value of sensitivity along with everything else, or it won't be possible to reproduce the calculation reliably. Carrying on with the earlier example, here's the change:

Experiment-Date: 2010-10-29
Experiment-Site: Hornings Mills
Sensitivity: 0.05
Input-Version: 143
Sliding-Average:
7.75
8.40
6.80
6.10
7.60
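A minimal sketch of recording the parameter, using Python's standard argparse module, might look like this. The function names parse_args and write_header are illustrative, and making --sensitivity a required option is an assumption (the essay does not say whether it has a default).

```python
import argparse
import io

def parse_args(argv):
    """Parse the command line shown above (option handling is assumed)."""
    parser = argparse.ArgumentParser(prog='calc_sliding_average')
    parser.add_argument('--sensitivity', type=float, required=True)
    parser.add_argument('infile')
    parser.add_argument('outfile')
    return parser.parse_args(argv)

def write_header(writer, properties, args):
    """Write the properties, then the parameter that shaped the result."""
    for key in sorted(properties):
        writer.write('%s: %s\n' % (key, properties[key]))
    writer.write('Sensitivity: %s\n' % args.sensitivity)


args = parse_args(['--sensitivity', '0.05',
                   'todays_data.dat', 'todays_results.dat'])

out = io.StringIO()
write_header(out, {'Experiment-Date': '2010-10-29'}, args)
print(out.getvalue())
```

Writing the value from the parsed arguments, rather than retyping it, means the header always matches what the calculation actually used.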

At this point, I hope you have realized two things:

  1. Tracking provenance is just a matter of careful bookkeeping.
  2. Doing everything, and doing it yourself, is a lot of work.

This brings us back to the Provenance Challenge and Open Provenance efforts. You shouldn't have to write your own provenance tracking software, any more than you should have to write your own FFT functions. When they're done, tracking provenance in a transferable way—i.e., in the same format as everyone else, so that everyone's tools and libraries can play nicely together—should be as easy as multiplying matrices. We're not there yet, though, so until we are, please:

  1. design or extend data formats and parsers so that you can include $-property strings in data files;
  2. embed $-property strings in software; and
  3. carry at least some of these values forward when doing calculations, so that your calculations are at least partially traceable.