A grad student who is part of the University of Wisconsin – Madison's excellent Hacker Within group asked a question last week that deserves a longer answer than I gave at the time. The question was, "How should I pass configuration parameters into my program?" (Actually, her original question was, "How do I write a configuration file parser in C++?", but that presupposes an answer to the one I'm going to discuss here.) Scientists often need to do this, for example to run a simulation for different reactant concentrations, or to experiment with the effects of different clustering thresholds on phylogenetic tree reconstruction, so let's have a look at some of the options.
Method #1: change the constant or variable definitions in the program, and recompile it for each run. For example, if you're using C++, you can define constants in a header file like this:
//
// params.h : control parameters for simulation
//
#define T_QUENCH 300.0
#define T_EXCITE 450.0
or even better, use proper constants like this:
//
// params.h : control parameters for simulation (definitions)
//
extern const float T_QUENCH;
extern const float T_EXCITE;

//
// params.cpp : control parameters for simulation (values)
//
#include "params.h"
const float T_QUENCH = 300.0;
const float T_EXCITE = 450.0;
Each time you want to change the values, you edit params.cpp, run Make to recompile, and then run your program. If you're really clever, you'll put the command to re-run the program in the Makefile, and arrange your dependencies so that it recompiles the program if necessary.
Software engineering purists will recoil in horror at this, but if you only need to change the program's parameters a few times, it may actually be the simplest choice, since you don't have to write (or test, or debug) any kind of configuration code. The main drawback is that anyone else who wants to use your program has to set up a complete build environment.
Method #2: pass parameters as command-line arguments. If you only have a handful of scalar parameters (i.e., parameters that consist of a single value, like the two in the example above, as opposed to parameters composed of lists of values, like wave forms), then command-line parameters may be the easiest way to go. In C++, these are passed as strings to your main function via argc and argv. If your program is run like this:
$ ./anneal 300.0 450.0
then the following will work:
#include <cstdlib>   // atof

void run_simulation(float t_quench, float t_excite);  // defined elsewhere

int main(
    int argc,      // number of command-line arguments
    char ** argv   // array of pointers to command-line arguments
){
    const char * program_name = argv[0];
    const float t_quench = atof(argv[1]);
    const float t_excite = atof(argv[2]);
    run_simulation(t_quench, t_excite);
    return 0;
}
Of course, purists are spluttering again right now. First, atof returns 0 both when it gets the string "0" and when it fails, so it's never safe to use: you should use sscanf or strtod instead (strtol, its integer counterpart, does the same job for whole numbers). Second, this program doesn't check that parameters have been passed in the right order: someone could pass 450.0 300.0 by mistake. (It's unlikely in this case, since the quenching temperature is always less than the excitement temperature, but in other cases, where the parameters don't have a natural order, transposition mistakes are very easy to make.)
The right way to do this is to use something like the getopt library so that the command line is:
$ ./anneal -q 300.0 -e 450.0
or even better:
$ ./anneal --quench 300.0 --excite 450.0
This provides a little bit of documentation (if you use history to look at recently-run commands, you can easily read off the parameters you've been using). It also means that you can run the program like this:
$ ./anneal --excite 450.0 --quench 300.0
with no ill effects: parameters are picked off by name, not position, which is a lot safer when there are more than two or three.
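In Python, the standard library's argparse module does the same job as getopt. The sketch below is only an illustration: the option names come from the command lines above, while the parse_args wrapper is a name made up for this example. It is handed an explicit argument list so that the behavior is easy to see; a real program would let argparse read sys.argv instead.

```python
import argparse

def parse_args(argv):
    """Pick off the two temperatures by name, in whatever order they appear."""
    parser = argparse.ArgumentParser(prog="anneal")
    parser.add_argument("--quench", "-q", type=float, required=True,
                        help="quenching temperature")
    parser.add_argument("--excite", "-e", type=float, required=True,
                        help="excitement temperature")
    return parser.parse_args(argv)

# Order on the command line does not matter.
first = parse_args(["--quench", "300.0", "--excite", "450.0"])
second = parse_args(["--excite", "450.0", "--quench", "300.0"])
print(first.quench, first.excite)
print(second.quench, second.excite)
```

Because each option is declared with type=float, a value that is not a number makes argparse stop with an error message rather than silently turning into 0, which takes care of the atof problem as well.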
OK, so why wouldn't you do this? For one, you might have so many parameters that this becomes cumbersome. (As a rule of thumb based solely on personal taste, if there are more than half a dozen, you should be thinking about doing it some other way.) Second, if some of those parameters are multi-valued, this approach starts to break down as well. To come back to the example alluded to above, if one of the parameters to your program is a sampled wave form that's used to filter signals, you don't really want to type:
$ ./throttle --waveform 0.0000105 0.0000209 0.0000410 ... 0.0152720
Method #3: put parameters in a plain text file. Almost everyone gets here eventually. Put your parameters in a file like this:
quench 300.0
anneal 450.0
and read it with code like this (written in Python for the sake of brevity and readability):
import sys

# The parameter file name is the program's sole argument.
reader = open(sys.argv[1], 'r')
for line in reader:
    name, value = line.split()
    if name == 'quench':
        t_quench = float(value)
    elif name == 'anneal':
        t_anneal = float(value)
    else:
        print('Bad parameter name "%s"' % name, file=sys.stderr)
        sys.exit(1)
run_simulation(t_quench, t_anneal)
It works, but we can do better: much better. First, let's allow blank lines and comments beginning with '#':
for line in reader:
    line = line.split('#')[0].strip()
    if not line:
        continue
    name, value = line.split()
    ...as before...
The three new statements at the top of the loop body take everything that was before the first '#' on the line and strip off any leading and trailing whitespace. If the result is the empty string, the line was blank, or consisted solely of a comment. Either way, the program continues on to the next line without trying to get a parameter name and value.
This version still doesn't handle multi-valued parameters, but it's pretty easy to change the line.split() call and what follows it to do so. We'll leave that as an exercise for the reader, though, and look at something else instead. Suppose that some values are floats, but others are integers or strings (such as an output file name). Here's a much cleaner way to take care of parsing:
import sys

Handlers = {
    'border' : int,
    'excite' : float,
    'output' : str,
    'quench' : float
}

reader = open(sys.argv[1], 'r')
params = {}
for line in reader:
    line = line.split('#')[0].strip()
    if not line:
        continue
    name, value = line.split()
    if name not in Handlers:
        print('Bad parameter name "%s"' % name, file=sys.stderr)
        sys.exit(1)
    if name in params:
        print('Duplicate parameter name "%s"' % name, file=sys.stderr)
        sys.exit(1)
    conversion_func = Handlers[name]
    params[name] = conversion_func(value)

run_simulation(params)
Handlers is a dictionary that maps parameter names to functions that know how to convert the string representations of those parameters to, well, whatever type they're supposed to be. Each time a name/value pair is read from the file, this program checks that there is a conversion function (which doubles as a check that the parameter's name is one we recognize), then checks that we don't already have a value for that parameter (i.e., that the file doesn't mistakenly include duplicates). It then applies the conversion function to the value's string representation, and stores the result. All the parameter values are then passed into the simulation in one tidy dictionary.
This approach is nice because all the logic for parsing parameters can be re-used in other programs, including future versions of this program. For example, if we want to add α and β for controlling crystallization speed, the only thing we have to change is the "table" of parameter names and conversion functions stored in Handlers:
Handlers = {
    'alpha'  : float,
    'beta'   : float,
    'border' : int,
    'excite' : float,
    'output' : str,
    'quench' : float
}
This also gives us a natural place to put in error checking—we just write our own conversion functions, like this:
def convert_alpha(text):
    raw = float(text)
    if (raw < 0.0) or (raw > 1.0):
        print('alpha out of range: "%f"' % raw, file=sys.stderr)
        sys.exit(1)   # stop, as for the other parameter errors
    return raw
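To put a checking function like this to work, it only has to replace the plain type in the table. The sketch below makes two assumptions of its own: its convert_alpha raises ValueError instead of printing, so that the caller decides how to report the problem, and parse_params is a hypothetical helper that takes any iterable of lines so that it can be tried without a file.

```python
def convert_alpha(text):
    """Convert text to a float and insist that it lies in [0.0, 1.0]."""
    raw = float(text)
    if not 0.0 <= raw <= 1.0:
        raise ValueError('alpha out of range: "%f"' % raw)
    return raw

# Same table as before, except that alpha now gets a checking converter.
Handlers = {
    'alpha': convert_alpha,
    'beta': float,
    'border': int,
    'excite': float,
    'output': str,
    'quench': float,
}

def parse_params(lines):
    """Build a dictionary of parameters from an iterable of text lines."""
    params = {}
    for line in lines:
        line = line.split('#')[0].strip()
        if not line:
            continue
        name, value = line.split()
        if name not in Handlers:
            raise ValueError('Bad parameter name "%s"' % name)
        if name in params:
            raise ValueError('Duplicate parameter name "%s"' % name)
        params[name] = Handlers[name](value)
    return params

print(parse_params(['alpha 0.25  # within range', '', 'border 3']))
```

Nothing in the parsing loop knows that alpha is special; the check lives entirely in the table entry.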
Yes, we can be smarter so that we don't have to write a separate function for each parameter, but let's not go down that path. In fact, let's not go down this path, because there are already lots of configuration file syntaxes and parsers out there. Writing one of our own may be fun, but it's busy-work: if we really need configuration files, we should grab a library and use that. Which brings us to...
Method #4: put parameters in a structured text file that can be parsed by an existing library. Whatever your needs, the odds are good that someone else has met them before, along with others that you haven't yet (but probably will). The odds are also pretty good that code in your favorite language's standard library will be less buggy right off the bat than anything you could write yourself.
But what syntax to use? One option is Windows INI files; another is XML, and of course there's the new hipster on the block, JSON. Of these, INI is the only one designed first and foremost to be written and read by human beings; XML and JSON both defer more to machines' needs, which makes typing them in more painful.
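As a rough illustration of how little code this takes, here is the same pair of temperatures read with Python's standard configparser (INI) and json modules. The section name simulation and the variable names are invented for this example, and the text is held in strings so that the sketch runs without any files.

```python
import configparser
import json

INI_TEXT = """
[simulation]
# annealing temperatures
quench = 300.0
excite = 450.0
"""

JSON_TEXT = '{"quench": 300.0, "excite": 450.0}'

# INI: values come back as strings unless a typed getter is used.
config = configparser.ConfigParser()
config.read_string(INI_TEXT)
ini_params = {
    "quench": config.getfloat("simulation", "quench"),
    "excite": config.getfloat("simulation", "excite"),
}

# JSON: numbers are converted by the parser itself.
json_params = json.loads(JSON_TEXT)

print(ini_params)
print(json_params)
```

The INI version can carry comments alongside the values; standard JSON has no comment syntax, which is one of the ways it defers to the machine rather than the person typing.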
Method #5: put parameters in a dynamically-loaded code module. This is the nerdiest option, and to understand it, we have to step back for a moment and look at how computers run programs. If I type:
import blarg
in a Python program, the Python interpreter finds the file blarg.py, runs the code in it, and stores the resulting module in a variable called blarg. OK, so suppose I create a Python file called config.py that contains:
t_quench = 300.0
t_excite = 450.0
If my program contains:
import config
then config.t_quench has the value 300.0, and config.t_excite has the value 450.0. We didn't have to parse anything: Python did it for us. And hey, we can now use expressions and conditionals in our "configuration file" for free:
t_quench = 300.0
t_excite = 4.0 * t_quench / 5.0
if t_excite < 500.0:
    alpha = 0.0
    beta = 1.0
else:
    alpha = 0.2
    beta = 0.8
But wait: isn't this just method #1 all over again? Don't we have to edit config.py each time we want to change parameter values? Well, no: if you use the importlib library's import_module function, you can specify what you want to import dynamically, i.e., provide a name like "config_low_alpha_binding" as a command-line parameter, and have Python load parameters from config_low_alpha_binding.py.
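A minimal sketch of that dynamic import follows. The load_config function is a name chosen for this example, and, purely so that the sketch is self-contained, it writes a throwaway configuration module into a temporary directory first; in a real program the module would already exist and its name would come from sys.argv.

```python
import importlib
import os
import sys
import tempfile

def load_config(module_name):
    """Import the module whose name is given as a string and return it."""
    return importlib.import_module(module_name)

# For demonstration only: create a throwaway configuration module
# in a temporary directory and make that directory importable.
config_dir = tempfile.mkdtemp()
with open(os.path.join(config_dir, "config_low_alpha_binding.py"), "w") as writer:
    writer.write("t_quench = 300.0\nt_excite = 4.0 * t_quench / 5.0\n")
sys.path.insert(0, config_dir)
importlib.invalidate_caches()  # make sure the new file is noticed

# In a real program the name would come from sys.argv[1].
config = load_config("config_low_alpha_binding")
print(config.t_quench, config.t_excite)
```

Note that the name passed in is a module name, not a file name: there is no ".py" on the end, and the file has to live somewhere on Python's import path.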
Nerds like me tend to like this option a lot: simple things remain simple, but we have the full power of a programming language when we need it (i.e., we don't have to invent our own clumsy syntax for expressions and conditionals in configuration files). The downsides are:
In practice, #1 and #2 matter most: if speed is important, you've probably written your code in a compiled language, which means that loading other bits of code on the fly and pulling out values is hard both technically and intellectually. Some people try to get around this by writing the configuration and control parts of their program in a dynamic language, while leaving the computational core in a compiled language, but that's just trading one problem for another: building, debugging, and maintaining multi-language programs is not something to be undertaken lightly.
Method #6: build a configuration GUI. I've included this option for the sake of completeness, and because it's usually essential if you want your program to be widely used. Building a desktop or web-based GUI takes a lot of time, but if done well, can make your program much more accessible, particularly if the same GUI allows people to visualize the program's output. Be warned, though: GUI construction is a very easy way to procrastinate...
No matter how you get parameters into your program, there's one rule you should never break: always (always!) include those values in your program's output, so that when you come back and re-examine old output files a year later, you know exactly what parameters were used. This practice is a small but crucial part of tracking the provenance of your data, a topic we'll return to in a future post.
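One simple way to follow that rule, sketched here under assumptions of my own (the write_results name, the '#' comment convention, and the one-value-per-line layout are all just for illustration), is to write the parameters as a header at the top of every output file:

```python
import io

def write_results(writer, params, results):
    """Write the parameters as '#' comment lines, then one result per line."""
    for name in sorted(params):
        writer.write("# %s = %r\n" % (name, params[name]))
    for value in results:
        writer.write("%s\n" % value)

params = {"quench": 300.0, "excite": 450.0}
buffer = io.StringIO()   # stands in for a real output file
write_results(buffer, params, [0.125, 0.25])
print(buffer.getvalue(), end="")
```

Using the same comment character as the parameter file reader shown earlier means the header can be stripped by the same code that already skips comments.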