Hello, and welcome to the sixth episode of the Software Carpentry lecture on the Unix shell. This short episode will show you how to find things in files, and how to find files themselves.
We're looking at how to interact with a computer using a command-line shell.
Listening to how people talk about search, you can often guess their age.
Just as young people use "Google" as a verb…
…crusty old Unix programmers use the word "grep".
"grep" is a contraction of "global/regular expression/print", which was a common sequence of operations in early Unix text editors.
What the grep program does is find and print lines in files that match a pattern.
Here's the file we'll use for our examples. It contains three computer haikus taken from a competition that Salon magazine ran in 1998.
Let's run the command
grep not haiku.txt.
Here, "not" is the pattern we're searching for.
It's a pretty simple pattern: every alphanumeric character matches against itself.
After the pattern comes the name or names of the files we're searching in.
As you can see, the output is the three lines in the file that contain the letters "not".
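As a self-contained sketch, here's that first search run end-to-end. The haiku text below is reconstructed from the matches this episode describes, so treat the exact wording of the file as an assumption:

```shell
# Work in a scratch directory and recreate haiku.txt
# (contents assumed from the matches described in this episode).
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Find and print every line containing the pattern "not".
grep not haiku.txt
# Is not the true Tao, until
# "My Thesis" not found.
# Today it is not working
```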
Let's try a different pattern: "day". This time, the output is lines containing the words "Yesterday" and "Today", which both have the letters "day".
If we give grep the
-w flag, it restricts matches to word boundaries, so that only lines with the word "day" will be printed, not lines with "Today" or "daytime". In this case, there aren't any, so
grep's output is empty.
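Here's a minimal sketch of the difference, recreating the same haiku.txt so the example stands alone (the file contents are assumed from the matches described in this episode):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Substring match: "day" occurs inside other words.
grep day haiku.txt
# Yesterday it worked
# Today it is not working

# Whole-word match: nothing matches, so grep prints nothing
# and exits with status 1 ("|| true" keeps the script going).
grep -w day haiku.txt || true
```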
Another useful option is
-n, which numbers the lines that match.
Here, we can see that lines 5, 9, and 10 in the file contain the word "it" (or a word that contains "it").
As with other Unix commands, we can combine flags…
…to get only whole-word matches, with line numbers.
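For instance, combining -n and -w with the pattern "it" reports only the lines where "it" appears as a whole word, together with their line numbers (again recreating the assumed haiku.txt so the sketch runs on its own):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# -n numbers matching lines; -w keeps "With" from matching.
grep -n -w it haiku.txt
# 9:Yesterday it worked
# 10:Today it is not working
```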
Here's another example.
-i makes matching case-insensitive…
-v inverts the match, so that it only prints lines that don't match the pattern.
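A quick sketch of both flags against the same assumed haiku.txt (the pattern "the" is my choice for illustration, not from the lesson):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Case-insensitive: matches "The", "the", and the "The" in "Thesis".
grep -i the haiku.txt

# Inverted: prints every line that does NOT contain a lowercase "the".
grep -v the haiku.txt
```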
grep has lots and lots of options.
To find out what they are, we can type man grep.
man is the Unix "manual" command; it prints a description of a command and its options, and (if you're lucky) provides a few examples of how to use it.
grep's real power doesn't come from its options, though; it comes from the fact that its patterns can be regular expressions.
(That's what the "re" in "grep" stands for.)
Regular expressions are complex enough that we've devoted an entire lecture to them; if you want to do complex searches, please take a few minutes to watch its first few episodes.
grep's regular expressions use a slightly different syntax than what's used in most programming languages.
However, the basic ideas and rules are exactly the same.
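To give a small taste, here's a sketch using a regular expression instead of a fixed string; the pattern below is my own illustration, not one from the lesson (haiku.txt is recreated as assumed above):

```shell
cd "$(mktemp -d)"
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# '^' anchors the match at the start of the line, and '.' matches any
# single character, so '^.o' finds lines whose second character is 'o'.
grep '^.o' haiku.txt
# You bring fresh toner.
# Today it is not working
# Software is like that.
```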
Just as grep finds lines in files, the
find command finds files themselves.
Again, it has a lot of options—too many to cover here.
To show how its basic features work, we'll use this directory tree. Under Vlad's home directory is one file,
notes.txt, and three subdirectories:
thesis (which is sadly empty),
data (which contains two files,
first.txt and second.txt), and a
tools directory that contains some programs, including
stats, and an empty subdirectory.
Here's a textual representation of that same tree, created using the Unix command
ls -F: trailing
/'s show directories, while trailing
*'s show files we could run as programs.
For our first command, let's run
find . -type d.
Here, . is the root directory of our search:
find will only look in it, and the things it contains.
-type d means "things that are directories".
find's output is the names of the five directories in our little tree (including
., the current working directory).
If we change
-type d to
-type f, we get a listing of all the files instead.
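Here's a self-contained sketch of both searches. It rebuilds a small tree like the one in this episode; notes.txt, first.txt, second.txt, and stats are named in the lesson, while the empty subdirectory's name (scratch) is a placeholder of mine, and the lesson's tools directory holds more programs than the single one used here:

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch      # 'scratch' is a placeholder name
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# All five directories, including '.' itself.
find . -type d | sort
# .
# ./data
# ./thesis
# ./tools
# ./tools/scratch

# All the files, at every depth.
find . -type f | sort
# ./data/first.txt
# ./data/second.txt
# ./notes.txt
# ./tools/stats
```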
find automatically goes into subdirectories, their subdirectories, and so on to find everything that matches the pattern we've given it.
If we don't want to go that deep, we can use
-maxdepth to restrict the depth of search. Here,
-maxdepth 1 tells
find to only look at this level, so the only file it finds is notes.txt.
The opposite of
-maxdepth is -mindepth, which tells
find to only report things that are at or below a certain depth.
-mindepth 2 therefore finds all the files that are two or more levels below us.
And here's another option:
-empty. This restricts matching to empty files and directories, of which we have two.
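In the sketch tree rebuilt below (names per the lesson where given; scratch is a placeholder), the two empty things are the thesis directory and the empty subdirectory under tools:

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# Match only empty files and directories.
find . -empty | sort
# ./thesis
# ./tools/scratch
```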
We can search by permissions, too. Here, for example, we can use
-perm -u=x to find both files and directories for which the user has 'x' permission.
Combine this with
-type f to exclude directories, and voila: a list of runnable program files.
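A sketch of that combination against the rebuilt tree (note that directories also carry the user's 'x' permission, which is why -type f is needed to narrow the result to runnable files):

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# Everything the user can execute or search: all five directories
# plus the one executable program.
find . -perm -u=x | sort

# Restrict to files: just the runnable program(s).
find . -perm -u=x -type f
# ./tools/stats
```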
Let's try matching by name with
find . -name *.txt. We expect it to find all the text files, but it only prints out
./notes.txt. What's gone wrong?
If you recall, the shell expands wildcard characters like
* before commands run.
Since *.txt in the current directory expands to
notes.txt, the command we actually ran was
find . -name notes.txt.
find did what we asked; we just asked for the wrong thing.
Let's try again, but this time we'll put
*.txt in single quotes to prevent the shell from expanding the *, so that
find actually gets the pattern '*.txt', not the expanded filename notes.txt.
Sure enough, this time the output is the names of all three text files.
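Here are both runs side by side in the rebuilt sketch tree (names per the lesson where given; scratch is a placeholder):

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/stats

# Unquoted: the shell expands *.txt to notes.txt before find runs.
find . -name *.txt
# ./notes.txt

# Quoted: find receives the pattern itself and matches at every depth.
find . -name '*.txt' | sort
# ./data/first.txt
# ./data/second.txt
# ./notes.txt
```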
As we said in previous episodes, the command line's power lies in combining tools. We've seen how to do that with pipes; let's look at another technique.
As we just saw,
find . -name '*.txt' gives us a list of all text files in or below the current directory.
Here's how to combine that with
wc -l to count the lines in all those files.
The trick here is to put the
find command inside back quotes.
This tells the shell to run
find and then replace what's in the back quotes with the command's output.
This is exactly what the shell does when it expands *,
?, and other built-in wildcards, but it's more flexible, since we can use any command we want as our own "wildcard".
So, when the shell executes this line, the first thing it does is run the command that's inside the back quotes. Its output is the three filenames ./data/first.txt, ./data/second.txt, and ./notes.txt.
The shell then replaces the back quotes with that output to construct the command
wc -l ./data/first.txt ./data/second.txt ./notes.txt.
And as you can see, that does what we originally wanted.
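Here's the whole thing as a runnable sketch against the rebuilt tree (each data file here holds one line, so the counts are small):

```shell
cd "$(mktemp -d)"
mkdir -p thesis data tools/scratch
printf 'notes\n'  > notes.txt
printf 'first\n'  > data/first.txt
printf 'second\n' > data/second.txt

# The back-quoted find runs first; its output becomes wc's arguments.
wc -l `find . -name '*.txt'`
# one count per .txt file, plus a "total" line
```

The $(...) form of command substitution does the same job as back quotes and is easier to nest, so most modern scripts prefer it.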
It's very common to use
find and grep together. The first finds files that match a pattern; the second looks for lines inside those files.
Here, for example, we can find PDB files that contain iron atoms by looking for the string "FE" in all the
.pdb files below the current directory.
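A self-contained sketch of the idea; the two .pdb files, their names, and their ATOM lines are made up for illustration rather than taken from real Protein Data Bank files:

```shell
cd "$(mktemp -d)"
mkdir -p data
# Toy files: only heme.pdb mentions an iron (FE) atom.
printf 'ATOM  1  FE  HEM\n' > data/heme.pdb
printf 'ATOM  1  C   ALA\n' > data/ala.pdb

# find produces the file list; grep searches inside each file and
# prefixes each match with the file it came from.
grep FE `find . -name '*.pdb'`
# ./data/heme.pdb:ATOM  1  FE  HEM
```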
So far, we have focused exclusively on finding things in text files. What if your data isn't text?
What if we have images, databases, spreadsheets, or some other format? There are basically three options.
The first is to extend tools like
grep to handle those formats.
This hasn't happened, and probably won't, because there are too many formats to support.
The second option is to convert the data to text, or extract the text-y bits from the data. This is probably the most common approach, since it only requires people to build one tool per data format (to extract information).
On the positive side, this makes simple things easy to do.
On the negative side, complex things are usually impossible. For example, it's easy enough to write a program that will extract X and Y dimensions from image files for
grep to play with, but how would you write something to find values in a spreadsheet whose cells contained formulas?
The third choice is to recognize that the shell and text processing have their limits, and to use a programming language such as Python instead.
When the time comes to do this, don't be too hard on the shell: many modern programming languages, Python included, have borrowed a lot of ideas from it.
And imitation is the sincerest form of praise.