Find us on GitHub

Teaching basic lab skills
for research computing

Databases: Missing Data

Databases/Missing Data at YouTube

Welcome to the Software Carpentry lectures on Databases. Today's topic is NULL values.

Here is a table from a previous lecture. It tracks the work done on experiments broken down by the project and scientist.

Notice that we've modified the data slightly for this lecture, so that for some data is missing — that's what the red coloured cells are.

In the real world data is not always complete — there are always holes.

A database uses a special value for these holes: NULL

NULL is not zero, False, or the empty String, it is simply just a different kind of value.

So, for these rows where the hours have a NULL value, does this mean that the hours for these experiments is missing?

Or does it mean that the value is not known?

Or maybe it means something else?

These are questions that we cannot answer just by looking at the data, but instead we have to understand how the data is supposed to be interpreted.

We tackle this question in our lecture on data modeling.

In this lecture we will discuss how to work with NULL values in our queries.

Suppose you wanted to know which experiments are missing Hours data.

You might try to filter the values like so:

SELECT * FROM Experiement
WHERE Hours = NULL ;

We get no results, why?

For many database systems, NULL is a special value that isn't comparable to anything else using the usual equality operators (e.g. =, !=, , …).

Comparing NULL to any other value using these operators always return False.

So, In our query the condition Hours = NULL is always false, so no records returned.

To write a condition that returns true on rows that contain NULL values you must use the IS operator:

SELECT * FROM Experiment

IS is used to compare fields to NULL.

It behaves like the = operator, except that it returns True when comparing two NULL values.

Here you can see that the entries in the experiments table with missing Hours fields are returned.

To find all of the rows that do not have a NULL Value, you can use the IS NOT operator for nulls in place of the != operator

SELECT * FROM Experiment

Because NULL is a different type of value all together, if your data may have NULL values in it, your queries must take this into account.

For example, suppose you wanted to find the all of the Experiments which did not take 7 hours. You might write:

SELECT * FROM Experiment
WHERE Hours != 7;

Notice that the results are missing the records with NULL values in the Hours field. Those records were filtered out because, as we've said, only the IS and IS NOT operator will return True when comparing a NULL to another value.

If we mean for these rows to be included, we need to add to the WHERE clause to explicitly check for NULL values:

SELECT * FROM Experiment
WHERE Hours != 7 OR Hours IS NULL;

Now when we run our query we get back those records with NULL Hours values along with the other records where the Hours field doesn't equal 7.

NULL values are also handled differently by aggregation functions.

Most aggregation functions ignore NULL values in their calculations.

So, for instance, let's look at the SUM function. To calculate the total number of hours spent on experiments we'd start with a query like this,

SELECT Hours FROM Experiment;

add SUM to total the values…

SELECT Sum(Hours) FROM Experiment;

… but this total is actually just the sum of all of the numeric values. The NULL values are skipped.

This more important for functions like AVG, which depend on the total number of records in the aggregation.

SELECT AVG(Hours) FROM Experiment;

NULL values are skipped, and so they don't count towards the average.

the NULL values are not treated as zero, the average function just skips over them as if they weren't in the dataset.

This is also true for Max, Min, and COUNT in SQLite.

In this lecture we've seen that databases use a special value for empty or missing information, NULL. This value has to be taken into account, and handled in a unique way, when you are writing queries.

We will see ways of working with NULL values pop up in many future lectures.