Teaching basic lab skills
for research computing

The Unix Shell: Pipes and Filters

Hello, and welcome to the fourth episode of the Software Carpentry lecture on the Unix shell. In this episode, we'll look at what makes the shell so powerful: the ease with which it lets you combine existing programs in new ways.

As we saw in previous episodes, a shell is a program that takes commands from the user, tells the computer to run the corresponding programs, and shows the user their output.

We've already seen commands to move around the filesystem, and to create, rename, copy, and delete files and directories.

What we'll see in this episode is that commands like these are much more powerful when they're combined.

We'll start with a directory called molecules that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.

Let's go into that directory with cd

…and run the command wc *.pdb.

The * in *.pdb is a wildcard character.

It matches zero or more characters.

So the shell expands the expression *.pdb to be the complete list of .pdb files.

The shell does this before wc runs, so the actual command is wc cubane.pdb ethane.pdb and so on.
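One way to see the expansion for ourselves is to put echo in front of the command, since echo simply prints whatever arguments the shell hands it. The sketch below assumes a scratch directory holding a few empty stand-in files (methane.pdb and notes.txt are made-up names for illustration), not the six real files from the lecture.

```shell
# Scratch directory with empty stand-in files (hypothetical, not real PDB data).
workdir=$(mktemp -d)
cd "$workdir"
touch cubane.pdb ethane.pdb methane.pdb notes.txt

# The shell replaces *.pdb with the matching filenames before echo ever runs,
# so echo shows the command line that wc would actually receive.
expanded=$(echo wc *.pdb)
echo "$expanded"
# prints: wc cubane.pdb ethane.pdb methane.pdb
```

Notice that notes.txt is left out, because its name does not end in .pdb.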

wc stands for "word count".

It counts the number of lines, words, and characters in files.

Its output, shown here, prints these values in columns: lines, words, characters, and the filename, one line per file, with a line for the total at the end.

If we run wc -l instead, our output shows only the number of lines per file.

We can use -w to get only the number of words, or -c to get only the number of characters (strictly speaking, bytes, which is the same thing for plain ASCII text like these files).
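Here is a small sketch of those flags on a file whose contents we know in advance. The file sample.txt is a made-up example rather than one of the molecule files, and the exact spacing of wc's output differs a little from one system to another.

```shell
# Scratch directory and a tiny hypothetical file: 2 lines, 3 words, 14 bytes.
workdir=$(mktemp -d)
cd "$workdir"
printf 'one two\nthree\n' > sample.txt

wc sample.txt      # lines, words, bytes, then the filename
wc -l sample.txt   # lines only
wc -w sample.txt   # words only
wc -c sample.txt   # bytes only
```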

Now, which of these files is shortest?

It's an easy question to answer when there are only six files…

…but what if there were 6000? That's the kind of job we want a computer to do.

Our first step toward a solution is to run the command wc -l *.pdb > lengths.

> tells the shell to redirect the output to a file instead of printing it to the screen.

The shell will create the file if it doesn't exist…

…or overwrite its contents if it does.

Notice that there is no screen output: everything that wc would have printed has gone into the file lengths instead.

ls lengths confirms that the file exists.

And we can print its contents to the screen using cat lengths.

cat stands for "concatenate": it prints the contents of files one after another.

In this case, there's only one file, so cat just shows us what's in it.
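The redirection steps so far can be tried out roughly like this. The three .pdb files are short stand-ins created on the spot (their contents are invented), so the counts will not match the ones on the slide.

```shell
# Scratch directory with stand-in files of 3, 2, and 1 lines (hypothetical contents).
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > cubane.pdb
printf 'a\nb\n' > ethane.pdb
printf 'a\n' > methane.pdb

wc -l *.pdb > lengths      # nothing appears on screen; the output goes into lengths
ls lengths                 # confirms that the file now exists
cat lengths                # one line per file, plus a total line

wc -l cubane.pdb > lengths # running > again overwrites the old contents
cat lengths                # now holds only the line for cubane.pdb
```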

Now let's use the sort command to sort its contents. This does not change the file: instead, it prints the sorted lines to the screen as shown here.

We can put the sorted list of lines in another temporary file called sorted-lengths by putting > sorted-lengths after the command, just as we used > lengths to put the output of wc into lengths.

And now, we can run another command called head to get the first few lines in sorted-lengths.

Giving head the argument -1 tells it that we only want the first line of the file; -20 would get the first 20, and so on.

This must be the file with the fewest lines, since sorted-lengths holds files and their line counts in order from the least to the most.
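Put together, the three steps with their intermediate files look something like the sketch below. It again uses invented stand-in files. It also spells the commands as sort -n (compare as numbers) and head -n 1 (the more portable spelling of head -1) as a precaution; those spellings are our choice for this sketch, not what the lecture types.

```shell
# Scratch directory with stand-in files of 3, 2, and 1 lines (hypothetical contents).
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > cubane.pdb
printf 'a\nb\n' > ethane.pdb
printf 'a\n' > methane.pdb

wc -l *.pdb > lengths              # step 1: count lines, save the counts
sort -n lengths > sorted-lengths   # step 2: sort the counts, save them again
shortest=$(head -n 1 sorted-lengths)   # step 3: take the first sorted line
echo "$shortest"                   # the line for methane.pdb, which has 1 line
```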

If you think this is confusing, you're in good company: even once you understand what wc, sort, and head do, all those intermediate files make it hard to follow what's going on. How can we make it easier to understand?

Let's start by getting rid of the sorted-lengths file by running the sort and head commands together.

That vertical bar between them is called a pipe.

It tells the shell that we want to take the output of the command on the left…

…and use it as the input to the command on the right…

…without explicitly creating a temporary file. The computer can create such a file itself if it wants to, or run the two programs simultaneously and pass data from one to the other through memory without ever putting it on disk: we don't have to know or care.

Well, if we don't need to create the temporary file sorted-lengths, can we get rid of the lengths file too? The answer is "yes": we can use another pipe to send the output of wc directly to sort, which then sends its output to head. This is exactly like a mathematician nesting functions and saying "the square of the sine of x times π": in our case, the calculation is "head of sort of word count of *.pdb".
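As a sketch, here is that three-stage pipe run on invented stand-in files. As a precaution it uses sort -n and head -n 1 where the lecture types sort and head -1; everything else follows the narration.

```shell
# Scratch directory with stand-in files of 3, 2, and 1 lines (hypothetical contents).
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > cubane.pdb
printf 'a\nb\n' > ethane.pdb
printf 'a\n' > methane.pdb

# "head of sort of word count of *.pdb", with no temporary files in between.
shortest=$(wc -l *.pdb | sort -n | head -n 1)
echo "$shortest"   # the line for methane.pdb, which has 1 line

ls                 # still just the three .pdb files: no lengths, no sorted-lengths
```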

This simple idea is why Unix has been so successful.

Instead of creating enormous programs that try to do many different things, Unix programmers focus on creating lots of simple tools that:

each do one job well, and

work well with each other.

Ten such tools can be combined in 100 ways, and that's only looking at pairings: when we start to look at pipes with multiple stages, the possibilities are almost endless.

Here's what actually happens behind the scenes when we create a pipe. We'll use an octagon to show a running program.

The technical term for this is a process: it's a program that's actually loaded into memory and "live".

Every process has an input channel called standard input. By this point, you may be surprised that the name is so memorable, but don't worry:

most Unix programmers call it stdin, just to be safe.

Every process also has a default output channel called standard output, or stdout.

When we run a program normally, the shell temporarily sends whatever we type on our keyboard to the process's stdin, and sends whatever the process prints to stdout to our computer's screen.

For example, if we run wc -l *.pdb > lengths

…the shell starts by telling the computer to create a new process to run the wc program.

Since we've provided some filenames as arguments, wc reads from them instead of from standard input.

And since we've used > to redirect output to a file, the shell connects the process's standard output to that file.
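We can see the difference between reading named files and reading standard input in wc's own output. In this sketch (one invented stand-in file), wc prints a filename when it opened the file itself, and prints no name when the data arrived on stdin, because it has no way of knowing where that data came from.

```shell
# Scratch directory with one stand-in file of 3 lines (hypothetical contents).
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > cubane.pdb

from_file=$(wc -l cubane.pdb)         # filename argument: wc opens the file itself
from_stdin=$(cat cubane.pdb | wc -l)  # no argument: wc reads its standard input

echo "$from_file"    # the count followed by cubane.pdb
echo "$from_stdin"   # just the count, with no filename
```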

Here's what happens when we run wc -l *.pdb | sort instead. The shell creates two processes, one for each component of the pipe, so that wc and sort run simultaneously. The standard output of wc is fed directly to the standard input of sort; since there's no redirection with >, sort's output goes to the screen.
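A rough way to convince ourselves that both sides of a pipe run at the same time is to time a pipe made of two slow commands. The sketch below uses sleep purely as a stand-in for "a program that takes a while"; it neither reads stdin nor writes stdout, so no data actually moves through this pipe.

```shell
# Time a pipe whose two stages each take 2 seconds.
start=$(date +%s)
sleep 2 | sleep 2
end=$(date +%s)

elapsed=$((end - start))
echo "pipe finished in about $elapsed seconds"
# If the stages ran one after the other this would be about 4 seconds;
# because they run side by side, it is about 2.
```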

And if we run wc -l *.pdb | sort | head -1, we get the three processes shown here, with data flowing from the files, through wc to sort, and from sort through head to the screen.

This programming model is called pipes and filters.

A filter is a program that transforms a stream of input into a stream of output. Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from stdin, do something to what they've read, and write to stdout.

A pipe is just a connection between two filters. Behind the scenes, the computer may do some clever things to move data around, but from the user's point of view, all a pipe does is move bytes from one process to another.

The key is that any program that reads lines of text from standard input, and writes lines of text to standard output, can work with every other program that behaves this way as well.

You can and should write your programs this way, so that you and other people can put those programs into pipes to multiply their power.
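As a minimal sketch of what "writing your programs this way" can mean, here is a tiny filter written as a shell function. The name names_only is made up for this example. It reads lines of the form "count filename" from standard input and writes just the filename to standard output, so it slots onto the end of the pipe we built earlier. The stand-in files are invented, and sort -n and head -n 1 are used as a precaution.

```shell
# Hypothetical filter: read "count name" lines from stdin, write only the name to stdout.
names_only() {
    while read -r count name; do
        echo "$name"
    done
}

# Scratch directory with stand-in files of 3, 2, and 1 lines (hypothetical contents).
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > cubane.pdb
printf 'a\nb\n' > ethane.pdb
printf 'a\n' > methane.pdb

# Because names_only reads stdin and writes stdout, it works in a pipe like any other tool.
shortest_name=$(wc -l *.pdb | sort -n | head -n 1 | names_only)
echo "$shortest_name"
# prints: methane.pdb
```

The function knows nothing about wc, sort, or head; it only agrees with them on "lines of text in, lines of text out", which is all a pipe requires.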

To summarize, we now have a bunch of commands for moving around the file system…

…and three for working with text: wc to count things, sort to sort them, and head to select lines from the front of a file.

After this episode is over, please go and explore a few other simple text processing commands, such as tail, split, cut, and uniq. Remember, each tool you learn multiplies the power of the tools you already know.

We've also met three more special characters: the pattern-matching wildcard *, redirection with >, and most important of all, the pipe |, which allows us to connect processes together.

Again, once this episode is over, please take a moment to find out what two other characters do: <, which redirects input, and ?, a wildcard that matches a single character instead of any number.

In our next episode, we'll have a look at how Unix controls who can do what to files and directories.