Data Management: Bein

Data Management/Bein at YouTube

This screencast will introduce Bein, a library which simplifies using Python for data analysis. Bein was written with three goals in mind:

First, Bein makes it as simple as possible to mix calls to external programs with your own calculations.

Second, Bein tracks every external program that your code runs, along with the output files those programs produce. This tracking is important in data analysis because the parameters used to run a program and the files it produces are as important as the program itself.

Third, Bein minimizes the work required to turn a prototype of a data analysis script into production code, by reducing the task largely to wrapping an interface around your script.

Bein adds three new concepts to Python: MiniLIMS repositories for logging executions and tracking files, execution blocks for isolating chunks of code, and methods for very simply binding external programs into Python functions. Let's go through each in turn.

A MiniLIMS repository consists of an SQLite database where logs are written and a directory to store all the files it tracks. The directory has the same name as the database, but with '.files' appended. For example, if the SQLite database is named 'boris', then the directory is named 'boris.files'. As long as the database and directory are named consistently, they can be moved anywhere, copied anywhere, and treated as a black box.

How do you create a MiniLIMS repository? Just connect to it in your script. If it doesn't exist yet, Bein will create it. Don't be shy about creating MiniLIMS repositories. As a rule of thumb, each project should have its own.
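To make the layout concrete, here is a minimal sketch using only Python's standard library. It is not Bein's code or API: the helper open_repository and the log table are hypothetical, and they only mirror the description above, namely an SQLite database with a sibling '.files' directory that is created the first time you connect.

```python
import os
import sqlite3
import tempfile


def open_repository(path):
    """Connect to a toy repository at `path`, creating it if it is missing.

    Hypothetical helper, not part of Bein: it only mimics the layout of an
    SQLite database plus a directory with the same name and '.files' appended.
    """
    files_dir = path + ".files"
    # exist_ok=True means connecting to an existing repository changes nothing.
    os.makedirs(files_dir, exist_ok=True)
    # sqlite3.connect creates the database file if it does not exist yet.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log (id INTEGER PRIMARY KEY, message TEXT)"
    )
    conn.commit()
    return conn, files_dir


# Work in a throwaway directory so the example leaves no clutter behind.
workdir = tempfile.mkdtemp()
repo_path = os.path.join(workdir, "boris")

# First connection: the database and 'boris.files' are created.
conn, files_dir = open_repository(repo_path)
conn.execute("INSERT INTO log (message) VALUES (?)", ("first connection",))
conn.commit()
conn.close()

# Second connection: the existing database and directory are reused.
conn, files_dir = open_repository(repo_path)
rows = conn.execute("SELECT message FROM log").fetchall()
conn.close()
```

Because the database and the directory share a base name, moving or copying the pair together keeps the toy repository intact, which is the property the naming convention is there to provide.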

Next we come to execution blocks. Execution blocks isolate code. When your program enters an execution block, the block creates a uniquely named subdirectory in the current directory, and all work in the block is done in that subdirectory.

When you reach the end of the block, it logs the details of every external program that was run and any exceptions that were raised but not caught, and adds the files you wish to preserve to a MiniLIMS. Then it deletes the subdirectory it was working in. The only changes to your directory when the execution block finishes are to the MiniLIMS, so you never drown in miscellaneous files.

Many people are surprised that execution blocks don't add all the files created in their working directory to the MiniLIMS. The authors of Bein found that, in practice, usually only a small subset of the files created are worth preserving, so it's easier to add the files you want to keep explicitly and let the rest be deleted than for Bein to slurp them all up indiscriminately.
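The behaviour of an execution block can be imitated with a small context manager. This is a sketch under stated assumptions, not how Bein implements it: toy_execution, keep_dir, and the keep list are hypothetical names, a plain directory stands in for the MiniLIMS, and the logging of programs and exceptions is left out.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def toy_execution(keep_dir):
    """Run the body in a fresh scratch subdirectory, then delete it.

    Hypothetical stand-in for an execution block. File names appended to
    the yielded list are copied into `keep_dir` before the cleanup.
    """
    start = os.getcwd()
    # mkdtemp guarantees a uniquely named subdirectory of the current directory.
    scratch = tempfile.mkdtemp(prefix="exec_", dir=start)
    keep = []
    os.chdir(scratch)
    try:
        yield keep
        # Reached only if the body finished without an uncaught exception.
        for name in keep:
            shutil.copy(os.path.join(scratch, name), keep_dir)
    finally:
        # Always return to the starting directory and remove the scratch space.
        os.chdir(start)
        shutil.rmtree(scratch)


original_dir = os.getcwd()
base = tempfile.mkdtemp()
keep_dir = os.path.join(base, "kept.files")
os.mkdir(keep_dir)
os.chdir(base)

with toy_execution(keep_dir) as keep:
    with open("result.txt", "w") as handle:
        handle.write("42\n")
    with open("junk.tmp", "w") as handle:
        handle.write("scratch data\n")
    keep.append("result.txt")  # only this file is preserved

remaining = sorted(os.listdir(base))
kept = os.listdir(keep_dir)
os.chdir(original_dir)
```

After the block, the only thing left in the starting directory is the keep directory, and it holds only the file that was added explicitly. The junk file disappears along with the scratch subdirectory.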

Finally, we come to external program bindings. The minimum information required to use an external program is the exact command to run, and a way to extract a return value from its output. Bein provides some black magic that takes no more than this and manufactures what looks like a normal Python function from an external program.

Using this black magic works like this: you write a function that returns a dictionary with two keys. The key 'arguments' points to a list of strings which are the exact command to run. The key 'return_value' is either a value to return directly when the program finishes, or a function to run on the program's output to calculate a return value. Then you put @program on the line before the function definition, which transforms the function into an external program binding.
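The idea behind the decorator can be sketched in a few lines of standard Python. This is an illustration of the pattern, not Bein's @program: the name toy_program is hypothetical, it does no logging, and the bindings it produces take only the arguments of the function you wrote. The examples run the Python interpreter itself as the external program so that they do not depend on any particular tool being installed.

```python
import subprocess
import sys
from functools import wraps


def toy_program(describe):
    """Turn a function that describes a command into one that runs it.

    Hypothetical decorator: `describe` must return a dictionary with the
    keys 'arguments' (the exact command) and 'return_value'.
    """
    @wraps(describe)
    def run(*args, **kwargs):
        spec = describe(*args, **kwargs)
        # check=True raises CalledProcessError if the program exits with an error.
        result = subprocess.run(
            spec["arguments"], capture_output=True, text=True, check=True
        )
        value = spec["return_value"]
        # A function is applied to the program's output; any other value
        # is returned directly.
        return value(result.stdout) if callable(value) else value
    return run


@toy_program
def count_chars(text):
    # 'return_value' is a function that parses the program's output.
    return {
        "arguments": [sys.executable, "-c",
                      "import sys; print(len(sys.argv[1]))", text],
        "return_value": lambda output: int(output.strip()),
    }


@toy_program
def do_nothing(label):
    # 'return_value' is a plain value, returned once the program finishes.
    return {
        "arguments": [sys.executable, "-c", "pass"],
        "return_value": label,
    }


length = count_chars("bein")
label = do_nothing("finished")
```

From the caller's side, count_chars and do_nothing look like ordinary Python functions, even though each call starts a separate process.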

Program bindings made in this way are much more than simple functions, though. They handle errors that arise and automatically log their activity to a MiniLIMS. You can redirect their stdout and stderr to files of your choice. And they have methods for seamless parallel processing that run the program in a separate thread or dispatch it over a cluster.
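Running external programs in separate threads can be illustrated with the standard library alone. The sketch below does not use Bein's own methods, whose names are not given here. It assumes a hypothetical helper, run_and_parse, and hands it to concurrent.futures so that several copies of a program run at once while the script collects their results.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_and_parse(n):
    """Run a small external command and parse its output (illustrative only)."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import sys; print(int(sys.argv[1]) ** 2)", str(n)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


# Each submitted call runs in its own worker thread, and each thread simply
# waits for its external process to finish.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_and_parse, n) for n in (1, 2, 3)]
    # result() blocks until that particular program has finished.
    squares = [future.result() for future in futures]
```

Threads suit this job because the Python code spends its time waiting on the external processes rather than computing, so the three programs genuinely overlap.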

Bein requires very few changes in how you write your scripts.

You add 'from bein import *' to the imports at the top of your script.

You wrap the top level of your calculations in an execution block.

And you attach the execution block to a MiniLIMS to track the programs you run and the files you create.

Bein is written in pure Python. You should be able to install it on any system with a functional Python setup with no more than 'pip install bein'. It has extensive documentation, including a tutorial and a section on advanced usage which covers the idioms that make Bein really powerful.

To sum up, Bein adds three concepts to normal Python scripts which drastically simplify writing data analysis software. MiniLIMS repositories track the parameters and outputs of analysis programs; execution blocks isolate each run of the program and prevent clutter; and program bindings make it trivial to call external programs from the middle of your calculations.

Thank you.