Hello, and welcome to the sixth episode of the Software Carpentry lecture on the Unix shell. This short episode will show you how to find things in files, and how to find files themselves.
We're looking at how to interact with a computer using a command-line shell.
Listening to how people talk about search, you can often guess their age.
Just as young people use "Google" as a verb…
…crusty old Unix programmers use the word "grep".
"grep" is a contraction of "global/regular expression/print", which was a common sequence of operations in early Unix text editors.
What the grep program does is find and print lines in files that match a pattern.
Here's the file we'll use for our examples. It contains three computer haikus taken from a competition that Salon magazine ran in 1998.
Let's run the command
grep not haiku.txt.
Here, "not" is the pattern we're searching for.
It's a pretty simple pattern: every alphanumeric character matches against itself.
After the pattern comes the name or names of the files we're searching in.
As you can see, the output is the three lines in the file that contain the letters "not".
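As a self-contained sketch, here's that first search run end-to-end. The haiku text below is reconstructed from the matches this episode describes, so treat the exact wording of the file as an assumption:

```shell
# Work in a scratch directory and recreate haiku.txt
# (contents assumed from the matches described in this episode).
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Find and print every line containing the pattern "not".
grep not haiku.txt
# Is not the true Tao, until
# "My Thesis" not found.
# Today it is not working
```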
Let's try a different pattern: "day". This time, the output is lines containing the words "Yesterday" and "Today", which both have the letters "day".
If we give grep the
-w flag, it restricts matches to word boundaries, so that only lines with the word "day" will be printed, not lines with "Today" or "daytime". In this case, there aren't any, so
grep's output is empty.
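Here's a minimal sketch of the difference, recreating the same haiku.txt so the example stands alone (the file contents are assumed from the matches described in this episode):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Substring match: "day" occurs inside other words.
grep day haiku.txt
# Yesterday it worked
# Today it is not working

# Whole-word match: nothing matches, so grep prints nothing
# and exits with status 1 ("|| true" keeps the script going).
grep -w day haiku.txt || true
```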
Another useful option is
-n, which numbers the lines that match.
Here, we can see that lines 5, 9, and 10 in the file contain the word "it" (or a word that contains "it").
As with other Unix commands, we can combine flags…
…to get only whole-word matches, with line numbers.
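For instance, combining -n and -w with the pattern "it" reports only the lines where "it" appears as a whole word, together with their line numbers (again recreating the assumed haiku.txt so the sketch runs on its own):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# -n numbers matching lines; -w keeps "With" from matching.
grep -n -w it haiku.txt
# 9:Yesterday it worked
# 10:Today it is not working
```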
Here's another example.
-i makes matching case-insensitive…
-v inverts the match, so that it only prints lines that don't match the pattern.
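A quick sketch of both flags against the same assumed haiku.txt (the pattern "the" is my choice for illustration, not from the lesson):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Case-insensitive: matches "The", "the", and the "The" in "Thesis".
grep -i the haiku.txt

# Inverted: prints every line that does NOT contain a lowercase "the".
grep -v the haiku.txt
```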
grep has lots and lots of options.
To find out what they are, we can type man grep.
man is the Unix "manual" command; it prints a description of a command and its options, and (if you're lucky) provides a few examples of how to use it.
grep's real power doesn't come from its options, though; it comes from the fact that its patterns can be regular expressions.
(That's what the "re" in "grep" stands for.)
Regular expressions are complex enough that we've devoted an entire lecture to them; if you want to do complex searches, please take a few minutes to watch its first few episodes.
grep's regular expressions use a slightly different syntax than what's used in most programming languages.
However, the basic ideas and rules are exactly the same.
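To give a small taste, here's a sketch using a regular expression instead of a fixed string; the pattern below is my own illustration, not one from the lesson (haiku.txt is recreated as assumed above):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# '^' anchors the match at the start of the line, and '.' matches any
# single character, so '^.o' finds lines whose second character is 'o'.
grep '^.o' haiku.txt
# You bring fresh toner.
# Today it is not working
# Software is like that.
```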
Just as grep finds lines in files, the
find command finds files themselves.
Again, it has a lot of options—too many to cover here.
To show how its basic features work, we'll use this directory tree. Under Vlad's home directory is one file,
notes.txt, and three subdirectories:
thesis (which is sadly empty),
data (which contains two files,
first.txt and second.txt), and a
tools directory that contains some programs, including
stats, and an empty subdirectory.
Here's a textual representation of that same tree, created using the Unix command
ls -F: trailing
/'s show directories, while trailing
*'s show files we could run as programs.
For our first command, let's run
find . -type d.
Here, . is the root directory of our search:
find will only look in it, and the things it contains.
-type d means "things that are directories".
find's output is the names of the five directories in our little tree (including
., the current working directory).
If we change
-type d to
-type f, we get a listing of all the files instead.
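Here's a self-contained sketch of both searches. It rebuilds a small tree like the one in this episode; notes.txt, first.txt, second.txt, and stats are named in the lesson, while the empty subdirectory's name (scratch) is a placeholder of mine, and the lesson's tools directory holds more programs than the single one used here:

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch      # 'scratch' is a placeholder name
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# All five directories, including '.' itself.
find . -type d | sort
# .
# ./data
# ./thesis
# ./tools
# ./tools/scratch

# All the files, at every depth.
find . -type f | sort
# ./data/first.txt
# ./data/second.txt
# ./notes.txt
# ./tools/stats
```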
find automatically goes into subdirectories, their subdirectories, and so on to find everything that matches the pattern we've given it.
If we don't want to go that deep, we can use
-maxdepth to restrict the depth of search. Here,
-maxdepth 1 tells
find to only look at this level, so the only file it finds is notes.txt.
The opposite of
-maxdepth is -mindepth, which tells
find to only report things that are at or below a certain depth.
-mindepth 2 therefore finds all the files that are two or more levels below us.
And here's another option:
-empty. This restricts matching to empty files and directories, of which we have two.
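In the sketch tree rebuilt below (names per the lesson where given; scratch is a placeholder), the two empty things are the thesis directory and the empty subdirectory under tools:

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# Match only empty files and directories.
find . -empty | sort
# ./thesis
# ./tools/scratch
```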
We can search by permissions, too. Here, for example, we can use
-perm -u=x to find both files and directories for which the user has 'x' permission.
Combine this with
-type f to exclude directories, and voila: a list of runnable program files.
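A sketch of that combination against the rebuilt tree (note that directories also carry the user's 'x' permission, which is why -type f is needed to narrow the result to runnable files):

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# Everything the user can execute or search: all five directories
# plus the one executable program.
find . -perm -u=x | sort

# Restrict to files: just the runnable program(s).
find . -perm -u=x -type f
# ./tools/stats
```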
Let's try matching by name with
find . -name *.txt. We expect it to find all the text files, but it only prints out
./notes.txt. What's gone wrong?
If you recall, the shell expands wildcard characters like
* before commands run.
Since *.txt in the current directory expands to
notes.txt, the command we actually ran was
find . -name notes.txt.
find did what we asked; we just asked for the wrong thing.
Let's try again, but this time we'll put
*.txt in single quotes to prevent the shell from expanding the *, so that
find actually gets the pattern '*.txt', not the expanded filename notes.txt.
Sure enough, this time the output is the names of all three text files.
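Here are both runs side by side in the rebuilt sketch tree (names per the lesson where given; scratch is a placeholder):

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# Unquoted: the shell expands *.txt to notes.txt before find runs.
find . -name *.txt
# ./notes.txt

# Quoted: find receives the pattern itself and matches at every depth.
find . -name '*.txt' | sort
# ./data/first.txt
# ./data/second.txt
# ./notes.txt
```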
As we said in previous episodes, the command line's power lies in combining tools. We've seen how to do that with pipes; let's look at another technique.
As we just saw,
find . -name '*.txt' gives us a list of all text files in or below the current directory.
Here's how to combine that with
wc -l to count the lines in all those files.
The trick here is to put the
find command inside back quotes.
This tells the shell to run
find and then replace what's in the back quotes with the command's output.
This is exactly what the shell does when it expands *,
?, and other built-in wildcards, but it's more flexible, since we can use any command we want as our own "wildcard".
So, when the shell executes this line, the first thing it does is run the command that's inside the back quotes. Its output is the three filenames ./data/first.txt, ./data/second.txt, and ./notes.txt.
The shell then replaces the back quotes with that output to construct the command
wc -l ./data/first.txt ./data/second.txt ./notes.txt.
And as you can see, that does what we originally wanted.
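Here's the whole thing as a runnable sketch against the rebuilt tree (each data file here holds one line, so the counts are small):

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt

# The back-quoted find runs first; its output becomes wc's arguments.
wc -l `find . -name '*.txt'`
# one count per .txt file, plus a "total" line
```

The $(...) form of command substitution does the same job as back quotes and is easier to nest, so most modern scripts prefer it.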
It's very common to use
find and grep together. The first finds files that match a pattern; the second looks for lines inside those files.
Here, for example, we can find PDB files that contain iron atoms by looking for the string "FE" in all the
.pdb files below the current directory.
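A self-contained sketch of the idea; the two .pdb files, their names, and their ATOM lines are made up for illustration rather than taken from real Protein Data Bank files:

```shell
cd "$(mktemp -d)"
mkdir -p data
# Toy files: only heme.pdb mentions an iron (FE) atom.
printf 'ATOM  1  FE  HEM\n' > data/heme.pdb
printf 'ATOM  1  C   ALA\n' > data/ala.pdb

# find produces the file list; grep searches inside each file and
# prefixes each match with the file it came from.
grep FE `find . -name '*.pdb'`
# ./data/heme.pdb:ATOM  1  FE  HEM
```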
So far, we have focused exclusively on finding things in text files. What if your data isn't text?
What if we have images, databases, spreadsheets, or some other format? There are basically three options.
The first is to extend tools like
grep to handle those formats.
This hasn't happened, and probably won't, because there are too many formats to support.
The second option is to convert the data to text, or extract the text-y bits from the data. This is probably the most common approach, since it only requires people to build one tool per data format (to extract information).
On the positive side, this makes simple things easy to do.
On the negative side, complex things are usually impossible. For example, it's easy enough to write a program that will extract X and Y dimensions from image files for
grep to play with, but how would you write something to find values in a spreadsheet whose cells contained formulas?
The third choice is to recognize that the shell and text processing have their limits, and to use a programming language such as Python instead.
When the time comes to do this, don't be too hard on the shell: many modern programming languages, Python included, have borrowed a lot of ideas from it.
And imitation is the sincerest form of praise.