Hello, and welcome to the fourth episode of the Software Carpentry lecture on the Unix shell. In this episode, we'll look at what makes the shell so powerful: the ease with which it lets you combine existing programs in new ways.
As we saw in previous episodes, a shell is a program that takes commands from the user, tells the computer to run the corresponding programs, and shows the user their output.
We've already seen commands to move around the filesystem, and to create, rename, copy, and delete files and directories.
What we'll see in this episode is that commands like these are much more powerful when they're combined.
We'll start with a directory called
molecules that contains six files describing some simple organic molecules. The
.pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
Let's go into that directory with cd molecules…
…and run the command wc *.pdb.
The * in *.pdb is a wildcard character.
It matches zero or more characters.
So the shell expands the expression
*.pdb into the complete list of .pdb filenames in the current directory.
The shell does this before
wc runs, so the actual command is
wc cubane.pdb ethane.pdb and so on.
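You can watch this expansion happen by giving the wildcard to echo, which simply prints its arguments. Here is a small sketch in a scratch directory with made-up filenames:

```shell
cd "$(mktemp -d)"   # a throwaway directory for the demonstration
touch cubane.pdb ethane.pdb methane.pdb notes.txt

# echo prints its arguments, so it shows exactly what the shell
# expanded *.pdb into before any command ran. notes.txt is not listed.
echo *.pdb
```

The shell sorts the matches alphabetically before handing them to the command, which is why wc's per-file output appears in alphabetical order.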
wc stands for "word count".
It counts the number of lines, words, and characters in files.
Its output, shown here, prints these values in columns: lines, words, characters, and the filename, one line per file, with a line for the total at the end.
If we run
wc -l instead, our output shows only the number of lines per file.
We can use
-w to get only the number of words, or
-c to get only the number of characters.
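The three flags can be compared side by side on a small sample file (the filename and contents below are made up for illustration):

```shell
cd "$(mktemp -d)"
printf 'one two\nthree\n' > sample.txt   # 2 lines, 3 words, 14 characters

wc    sample.txt   # all three counts, then the filename
wc -l sample.txt   # line count only (plus the filename): 2
wc -w sample.txt   # word count only: 3
wc -c sample.txt   # character (byte) count only: 14
```

Note that -c actually counts bytes; for plain ASCII text the two are the same.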
Now, which of these files is shortest?
It's an easy question to answer when there are only six files…
…but what if there were 6000? That's the kind of job we want a computer to do.
Our first step toward a solution is to run the command
wc -l *.pdb > lengths.
> tells the shell to redirect the output to a file instead of printing it to the screen.
The shell will create the file if it doesn't exist…
…or overwrite its contents if it does.
Notice that there is no screen output: everything that
wc would have printed has gone into the file lengths.
ls lengths confirms that the file exists.
And we can print its contents to the screen using cat.
cat stands for "concatenate": it prints the contents of files one after another.
In this case, there's only one file, so
cat just shows us what's in it.
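A minimal sketch of the whole create-overwrite-display cycle, using a throwaway filename rather than the lecture's lengths file:

```shell
cd "$(mktemp -d)"
echo hello   > greeting.txt   # > creates greeting.txt; nothing appears on screen
echo goodbye > greeting.txt   # running > again silently overwrites the contents
cat greeting.txt              # prints: goodbye
```

The silent overwrite is worth remembering: > never asks for confirmation before replacing a file's contents.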
Now let's use the
sort command to sort its contents. This does not change the file: instead, it prints the sorted lines to the screen as shown here.
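One caution worth knowing: by default, sort orders lines as text, not as numbers. (It works on wc's output here because wc right-aligns its counts in a fixed-width column.) A small sketch with made-up numbers shows the difference, and the -n flag that asks for numeric order:

```shell
cd "$(mktemp -d)"
printf '10\n2\n1\n' > nums.txt

sort nums.txt      # textual order:  1, 10, 2
sort -n nums.txt   # numeric order:  1, 2, 10
```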
We can put the sorted list of lines in another temporary file called
sorted-lengths by putting
> sorted-lengths after the command, just as we used
> lengths to put the output of wc into lengths.
And now, we can run another command called
head to get the first few lines of sorted-lengths.
When we run head, the argument
-1 tells it we only want the first line of the file;
-20 would get the first 20, and so on.
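A quick sketch of head on a throwaway three-line file:

```shell
cd "$(mktemp -d)"
printf 'first\nsecond\nthird\n' > lines.txt

head -1 lines.txt   # prints: first
head -2 lines.txt   # prints: first, then second
```

The bare -1 form shown in this lecture is the traditional spelling; modern POSIX prefers the equivalent head -n 1.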
That line must name the file with the fewest lines, since
sorted-lengths holds the files and their line counts in order from the least to the most.
If you think this is confusing, you're in good company: even once you understand what
wc, sort, and head do, all those intermediate files make it hard to follow what's going on. How can we make it easier to understand?
Let's start by getting rid of the
sorted-lengths file by running the
sort and head commands together.
That vertical bar between them is called a pipe.
It tells the shell that we want to take the output of the command on the left…
…and use it as the input to the command on the right…
…without explicitly creating a temporary file. The computer can create such a file itself if it wants to, or run the two programs simultaneously and pass data from one to the other through memory without ever putting it on disk: we don't have to know or care.
Well, if we don't need to create the temporary file
sorted-lengths, can we get rid of the
lengths file too? The answer is "yes": we can use another pipe to send the output of
wc directly to
sort, which then sends its output to
head. This is exactly like a mathematician nesting functions and saying "the square of the sine of x times π": in our case, the calculation is "head of sort of word count of *.pdb".
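The whole pipeline can be tried on scratch files (the filenames and contents below are invented for the demonstration; wc's fixed-width counts make its default textual sort safe here):

```shell
cd "$(mktemp -d)"
printf 'ATOM 1\n'                 > small.pdb   # 1 line
printf 'ATOM 1\nATOM 2\nATOM 3\n' > big.pdb    # 3 lines

# No temporary files: wc's output flows into sort, and sort's into head.
# Prints the line for small.pdb, the file with the fewest lines.
wc -l *.pdb | sort | head -1
```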
This simple idea is why Unix has been so successful.
Instead of creating enormous programs that try to do many different things, Unix programmers focus on creating lots of simple tools that:
each do one job well, and
work well with each other.
Ten such tools can be combined in 100 ways, and that's only looking at pairings: when we start to look at pipes with multiple stages, the possibilities are almost endless.
Here's what actually happens behind the scenes when we create a pipe. We'll use an octagon to show a running program.
The technical term for this is a process: it's a program that's actually loaded into memory and "live".
Every process has an input channel called standard input. By this point, you may be surprised that the name is so memorable, but don't worry:
most Unix programmers call it stdin, just to be safe.
Every process also has a default output channel called standard output, or stdout.
When we run a program normally, the shell temporarily sends whatever we type on our keyboard to the process's stdin, and sends whatever the process prints to stdout to our computer's screen.
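You can watch a program fall back to standard input by giving it no filenames at all. Here, piped data takes the place of the keyboard:

```shell
# With no filename arguments, wc reads from its standard input,
# so it counts whatever is piped into it and prints no filename.
printf 'one\ntwo\nthree\n' | wc -l   # prints 3 (padding varies by system)
```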
For example, if we run
wc -l *.pdb > lengths…
…the shell starts by telling the computer to create a new process to run the wc program.
Since we've provided some filenames as arguments,
wc reads from them instead of from standard input.
And since we've used
> to redirect output to a file, the shell connects the process's standard output to that file.
Here's what happens when we run
wc -l *.pdb | sort instead. The shell creates two processes, one for each component of the pipe, so that
wc and sort run simultaneously. The standard output of
wc is fed directly to the standard input of
sort; since there's no redirection with >,
sort's output goes to the screen.
And if we run
wc -l *.pdb | sort | head -1, we get the three processes shown here, with data flowing from the files, through
wc and sort, and from
head to the screen.
This programming model is called pipes and filters.
A filter is a program that transforms a stream of input into a stream of output. Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from stdin, do something to what they've read, and write to stdout.
A pipe is just a connection between two filters. Behind the scenes, the computer may do some clever things to move data around, but from the user's point of view, all a pipe does is move bytes from one process to another.
The key is that any program that reads lines of text from standard input, and writes lines of text to standard output, can work with every other program that behaves this way as well.
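Writing a filter of your own can be a one-liner. The function below is a hypothetical example, not part of the lecture's toolset: because it reads stdin and writes stdout, it drops into any pipe alongside the standard tools.

```shell
# A tiny homemade filter: prefix each input line with its line number
# (awk's NR variable counts the records, i.e. lines, it has read).
number_lines() { awk '{ print NR, $0 }'; }

printf 'alpha\nbeta\n' | number_lines   # prints: 1 alpha / 2 beta
```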
You can and should write your programs this way, so that you and other people can put those programs into pipes to multiply their power.
To summarize, we now have a bunch of commands for moving around the file system…
…and three for working with text:
wc to count things,
sort to sort them, and
head to select lines from the front of a file.
After this episode is over, please go and explore a few other simple text processing commands, such as
uniq. Remember, each tool you learn multiplies the power of the tools you already know.
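When you do try uniq, note a detail that surprises many newcomers: it only removes adjacent duplicate lines, which is why it is so often paired with sort in a pipe:

```shell
printf 'b\na\nb\n' | uniq          # the two b's are not adjacent: prints b, a, b
printf 'b\na\nb\n' | sort | uniq   # sorting makes duplicates adjacent: prints a, b
```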
We've also met three more special characters: the pattern-matching wildcard
*, redirection with
>, and most important of all, the pipe
|, which allows us to connect processes together.
Again, once this episode is over, please take a moment to find out what two other characters do:
<, which redirects input, and
?, a wildcard that matches a single character instead of any number.
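As a head start on that exploration, here is a small sketch of both characters in a scratch directory with invented filenames:

```shell
cd "$(mktemp -d)"
printf 'x\ny\n' > data.txt
touch a.pdb ab.pdb

# < feeds data.txt to wc's standard input, so wc never sees a filename
# and prints only the count: 2.
wc -l < data.txt

# ? matches exactly one character, so ?.pdb matches a.pdb but not ab.pdb.
echo ?.pdb
```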
In our next episode, we'll have a look at how Unix controls who can do what to files and directories.