

The Unix Shell: Advanced Tricks

The Unix Shell/Advanced Tricks at YouTube

Hello and welcome to the Advanced Shell Tricks episode of the Software Carpentry lectures on the Unix shell. In this episode, we'll look at some handy advanced shell techniques that can save you time.

So, in general, you encounter a technical problem and are wondering how to solve it…

For example, on your iPhone or Android smartphone you may hear, "There's an app for that… check this out!"

Whereas Unix shell programmers will say "There's a shell trick for that… check this out"… whilst perhaps recommending you upgrade to an Android smartphone.

In previous episodes, we've seen how to do a number of things. Combine existing programs using pipes and filters. For example, counting the number of lines in all PDB files, then sorting those results and picking the top one (i.e. the result with the greatest number of lines).
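A pipeline along those lines might look something like this (just a sketch; note that wc also prints a "total" line when given several files, which you may want to filter out first):

    wc -l *.pdb | sort -n | tail -1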

Redirect output from programs to files. For example, counting the number of lines in all pdb files and storing the results in a file.
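For example (the output filename here is just an illustration):

    wc -l *.pdb > lengths.txt    # "lengths.txt" is an illustrative name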

Use variables to control program operation. For example, creating a new variable called SECRET_IDENTITY and assigning the value "Dracula" to it.
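For example:

    SECRET_IDENTITY="Dracula"
    echo $SECRET_IDENTITY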

And this is one of the true strengths of the Unix shell; the way you can compose all these techniques together. We'll be seeing more of this in this episode!

We've covered some very handy techniques already, but of course we can go further. Redirection for example has some other very useful tricks you can easily learn and use which we'll look at.

We've already seen how we can redirect program output to a file, but what else can we do with redirection? Let's revisit our pdb files that we've seen in a previous episode. As a reminder, PDB files are Protein Data Bank format files.

So with this command, we can list all the files with a pdb filename extension in the current directory, redirecting the results to a file we call "files".
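That command looks like this:

    ls *.pdb > files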

This is all made possible by the ">" redirection operator.

But what about adding this together with other results generated later? In general, this would be very useful for any time we just want to add things to an existing file. In this example, let us consider a further set of protein data files in the older "ent" format… how do we perform the same ls operation and add these results to the previous results in "files"?

We can first get a list of the files with an ent extension…

…and put that list in a separate file.

Then, we can use our concatenate "cat" command to create a new file which has the contents of both files in it. But it's a bit long-winded; couldn't we just "append" the results to our existing file "files"? The answer is yes!
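Spelled out, that long-winded approach would be something like this (the intermediate and combined filenames are just illustrations):

    ls *.ent > ent_files          # "ent_files" is an illustrative name
    cat files ent_files > all_files   # a new file with the contents of both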

This command does exactly that — the output from this command is redirected to a file as before, but in an "append" sense.
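That is, something like:

    ls *.ent >> files

The ">>" operator appends the new results to "files" instead of overwriting it.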

Also note that if the file doesn't exist beforehand, it is created.

Now let's look at something a little different…

So we know ls operates within its own process, and all output is normally directed through its parent "shell" process to the display.

But in the case of redirection, ls still operates within its own process, but instead of its output being directed through its parent shell process to the display, it is redirected to a file.

Now here's something to ponder… what happens with error messages?

For example, running ls on a non-existent path will give an error.
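For instance, something like this (the directory name is just an illustration of a path that doesn't exist):

    ls nonexistent_dir > files    # "nonexistent_dir" is an illustrative name

Even though we redirect the output to "files", the error message about the missing path still appears on the screen.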

Ok, no problem so far…

But why isn't the error message in "files"? This is the big question! Why, and how, are we seeing it?

In essence, this is because the standard output and standard error are separate "channels".

So what was happening with the previous example? This is how we looked at it before, with standard output.

So let's expand on this a little, by adding in the standard error.

Now we see that the standard error is not being redirected, like the standard output, to a file.

Perhaps unsurprisingly, there is a way to capture standard error as well using the Unix shell.

So this "2>" operator deals with the redirection of standard error only.

As you might expect, error messages end up in our error-log file.

We can redirect both standard output and standard error like this. Also, the order in which we add the two redirections to the command doesn't matter.
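Something like this, for example:

    ls nonexistent_dir *.pdb > files 2> error-log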

So we can redirect both standard error and standard output simultaneously.

Of course, we could add other directories into this list too, perhaps one that does exist.

So what's this number 2 all about?

The "2" refers to the standard error channel, whilst "1" refers to the standard output.

By default, ">" on its own refers to standard output, so we could remove the "1" before the first greater-than sign for the same effect.
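In other words, these two commands have the same effect:

    ls nonexistent_dir *.pdb 1> files 2> error-log
    ls nonexistent_dir *.pdb  > files 2> error-log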

"&" is a useful shorthand if you want a single log of everything.

We can even use append here as well.
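As a sketch (the log filename is just an illustration, and note that the "2>&1" part has to come after the file redirection):

    ls nonexistent_dir *.pdb > all.log 2>&1     # everything in one log
    ls nonexistent_dir *.pdb >> all.log 2>&1    # the same, but appending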

To summarise part 1, we've looked in more depth at redirection. We've looked at redirecting standard output and standard error to a file, overwriting anything in the file previously, or creating it if it doesn't exist.

We also looked at redirecting standard output and standard error to a file while appending to its existing contents (again, if the file doesn't exist it is created).

Now for something completely different… We've already seen how pipes and filters work with using a single program on some input data.

i.e. you have a program which takes some arguments, the program processes these arguments, and some results are output.

But what about running the same program separately, for each input? i.e. doing each of these program runs in sequence, one after the other.

So instead, we want to run the program separately on each argument. We could of course do this manually, but what if we need to do this with a great many arguments? This wouldn't be terribly efficient!

The good news is that there is a well known programming concept which you can use, called loops. Loops are very useful—it is difficult to overstate just how useful these can be…

So how can they help us with this situation?

You've probably encountered compressed files before (like .zip files); compression is a common technique for reducing the size of a number of files whilst packaging them into a single, easy-to-manage file. If we consider these pdb files as large files, perhaps we want to email each of them to different individuals.

We can use the zip command very easily to compress our cubane.pdb into a zip file.
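The command looks like this:

    zip cubane.zip cubane.pdb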

We can see that it has compressed the file by 73%. Zip is a handy tool in itself which can also work with directories and their contents. As an aside, you can use e.g. "unzip cubane.zip" to decompress the zip file and extract the cubane.pdb file.

The first argument is the zip filename we wish to create.

The second is a list of files (just one in this case) which we want to add to the zip file.

This would obviously take too long if we were looking at, say, a hundred files. So how can we automate this using loops?

Using a loop, we can iterate over each file, and run zip on each of them. So how does this work?
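Written as a single line, the loop looks something like this:

    for file in *.pdb; do zip $file.zip $file; done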

The first part says we wish to iterate over our pdb files.

And on each iteration, let us run our zip command.

The "done" part shows the end of our loop, and instructs the loop to do the next file in our pdb list if one exists.

The semicolons separate each part of this command. But how does it pick up and use each separate pdb file?

This *.pdb would generate a list of the 6 pdb files, so we would expect this loop to run 6 times.

This "file" is a variable which we can use to reference each file within the loop.

e.g. if $file is "cubane.pdb", then $file.zip becomes cubane.pdb.zip.

So here we are just using the "file" variable to specify the file we wish to put in the zip file.

The zip command thus runs 6 times, once for each pdb file, generating a new zip file for each of them.

So, with a hundred such files, this would be much more efficient than running zip individually each time.

If we look at just the zip files, we can see the new ones that have just been created.
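For example:

    ls *.zip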

Let's look at a slightly different problem… What if we wanted to output the first line of each pdb file?
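The obvious first attempt is to give head all of the files at once, something like:

    head -1 *.pdb    # -1 (or -n 1) means: show only the first line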

This isn't really what we want…

Each first line in this list is prefixed with the name of the file it came from.

So when run on multiple files, the head command inserts a header line showing the filename before each file's output.

Perhaps we only want the actual first lines. In which case, we really want to omit the filename headers you get from using the head command over multiple files.

The good news is that we can use loops to help here.

Using a loop, we can run the head command separately on each file.

We can do this in a very similar way to how we used a loop for zipping multiple files separately.
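That loop would be something like:

    for file in *.pdb; do head -1 $file; done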

We get these results. So how does this fit in with what we have learned already with pipes and filters?

We can take this further… what if we wanted this list sorted in reverse afterwards?

We simply pipe the output from our loop to another command. Although not strictly necessary, we can surround the loop part with parentheses for clarity, so we can clearly see where the pipe is applied.

So we just add the "sort" command to the end of this via a pipe. The "-r" argument just means "sort the list in reverse".
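Putting it together:

    (for file in *.pdb; do head -1 $file; done) | sort -r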

And we get our previous list sorted in reverse. So we can happily use this technique within the pipes and filters model we've already learned!

In summary, what new things have we looked at in part 2? We've used the zip command to create a zip file, and used loops to repeat a command many times. This is only the beginning of what you can do with loops. As an exercise, why not take some time to find out what else you can do with them?

Thank you.