Find us on GitHub

Teaching basic lab skills
for research computing

Databases: Nested Queries

Databases/Nested Queries at YouTube

Welcome to the Software Carpentry screencast on Databases. This screencast is on nesting queries.

How do we find scientists that haven't been experimenting with time travel?

Our first instinct might be to write a query like this:

SELECT DISTINCT LoginID,
FROM Experiment
WHERE Project != 'Time Travel';

Unfortunately, this query doesn't give us what we want, since 'ivan' and 'skol' (Sofia) have each worked both on Time Travel but also on other projects as well.

There are scientists who have worked on Time Travel Projects….. and their are scientists who have worked on other projects…… and, of course, there are scientists who have worked on both Time Travel projects AND other projects as well.

Our query is returning all of the scientists who have worked on projects other than Time Travel… but that includes scientists that have worked on time travel and other projects.

Our original question was to find all of the scientists that had never worked on Time Travel.

To see why, let's look at the experiment table, but we'll sort the records by scientist and project:

SELECT * FROM Experiment ORDER BY LoginID, Project;

Our original query reject rows where the project was 'Time Travel'.

But for scientists like Ivan, who have worked on other projects, their other projects don't get filtered out, and so their names appear in the final results.

The strategy for answering this question is to start with all of the scientists …

… subtract those scientists which have worked on Time Travel projects

… to get just the scientists that haven't been experimenting with Time Travel.

To do this we'll make use of nested queries.

Finding all of the scientists who have worked on Time Travel projects is easy:

SELECT DISTINCT LoginID FROM Experiment
WHERE Project = 'Time Travel';

Finding all of the scientists is also easy:

SELECT DISTINCT LoginID FROM Experiment;

What we want to do is somehow:

SELECT DISTINCT LoginID FROM Experiment
WHERE

subtract, or filter, from these results the results of our earlier query:

WHERE LoginID NOT IN ('ivan', 'skol');

In SQL we're able to nest one query inside of another to use its results in filter conditions.

So, we create a filter condition that only includes scientists

SELECT DISTINCT LoginID FROM Experiment
WHERE LoginID NOT IN
(SELECT DISTINCT LoginID FROM Experiment
WHERE Project = 'Time Travel');

where their LoginID is not in… and then we put the results of the first query in parentheses.

So, we can read this query as saying, "Fetch all of the scientists who have run experiments, but exclude them if they appear in the list of scientists who have worked on Time Travel".

And this is what we wanted!

The pattern demonstrated by this query, of selecting and then filtering based on the results of nested query, is very useful one.

I should note, that the nested query here is completely separate from the larger query. It just fetches a list of LoginIDs once, and those loginIDs are used by the larger query's filter condition as it inspects each record.

But nested queries can also refer to the larger query they are embedded in.

For instance, in our screencast on JOIN we asked the question: which scientists have worked on more than one project?

We answered the question by joining the Experiment table to itself and then filtering results so that we only had scientists that had been paired with themselves, but working on different projects.

Using nested queries we can answer this same question in a different way.

Let's first remind ourselves of who has worked on which projects:

SELECT LoginID, Project FROM Experiment ORDER BY LoginID, Project;

Best has worked on one project, so has Dian Fossey, dimitri, Banting, Herschel.

But, Ivan Pavlov has worked on two projects, and so has Lomonosov, and so has Kovaleskaya

So, we can write a query that selects the scientists from the experiments table …

SELECT DISTINCT LoginID FROM Experiment e1
WHERE

where… and here we want to filter out the scientist if they haven't worked on any other project. That is, in order to be included they must /have/ worked on another project:

WHERE LoginID IN (

Well, we need a query that will fetch the OTHER projects the scientist has been involved in.

So, we start with fetching the LoginIDs

SELECT DISTINCT LoginID FROM Experiment e2
...

but put a filter condition in so that the loginIDs we fetch are those of the same scientist, but have a different project.

...
WHERE e2.LoginID = e1.LoginID AND e1.Project != e2.Project);

When we run this query we get Pavlov, Lomonosov, Kovaleskaya as expected.

This is a different kind of nested query than we wrote before, because the results of this nested query depend on which record is being processed by the outer query.

SELECT DISTINCT LoginID FROM Experiment e1
WHERE LoginID IN (
SELECT DISTINCT LoginID FROM Experiment e2
WHERE e2.LoginID = e1.LoginID AND e1.Project != e2.Project);

Nested queries can also be use in place of tables. That is, a nested query can provide the datasource from which you select, filter, aggregate, etc.

For example, what if we wanted to know how many different projects each scientist had worked on.

We can begin by finding the distinct list of projects each scientist worked on, which can do like so:

SELECT DISTINCT LoginID, Project FROM Experiment;

We want to count how many projects are listed here for each scientist.

Since counting is an aggregation, we need to use the results of this query as input for a separate query that does the aggregation.

We can nest this query as the source for a larger query by wrapping it in parentheses, and putting it in the FROM clause:

...
FROM (SELECT DISTINCT LoginID, Project FROM Experiment);

We want the scientist, and count of their projects:

SELECT LoginID, COUNT(*)
FROM (SELECT DISTINCT LoginID, Project FROM Experiment)
...

and we want to count over each scientist:

...
GROUP BY LoginID

Nesting queries like this is really useful if the data you want to write a query on isn't present in exactly the right form in the database. You can use one query to get the data in the form you need it in, and then with nesting you're able to use those results as the input for a larger query.

In this screencast we've introduced the idea of nesting queries.

We've seen that you can nest a query to supply elements in a WHERE condition, and we've also seen we can use a nested query in place of a table in the FROM clause.