
Teaching basic lab skills
for research computing

Python: Input and Output

Python/Input and Output at YouTube

Hello, and welcome to the fifth episode of the Software Carpentry lecture on Python. In this episode, we'll have a look at input and output.

In the previous episodes, we've used print to see what our programs are doing. This is fine for tutorials…

…but in real programs, we'll want to save data to files.

And read data from them.

Python's tools for input and output are pretty simple, and owe a lot to those invented for C. Basically:

A file is actually a sequence of bytes.

But it's often more useful to treat it as a sequence of lines.

For our examples, we'll work with a file containing several haikus taken from an online competition run by Salon magazine in 1998.

Let's start by asking, "How many characters are in the file?"

What we'll actually find out is how many bytes are in the file.

We'll assume right now that each character is stored in one byte.

But we'll revisit the subject later.

Here's our program.

It starts by opening the file using the built-in function open. This creates a file object that keeps track of the program's connection to the file.

The first argument to open is the name of the file the program wants to work with.

The second is the letter 'r', which signals that it wants to read from the file.

If all goes well, the program assigns the result of open to the variable reader, which will be the program's connection to the file.

The program can now call the method read to read in the entire content of the file…

…and assign it to the variable data.

Since the program is done with the file, it calls close.

This isn't strictly necessary in small programs, but it's a good habit to get into, since most operating systems limit the number of files a program can have open at one time.

Finally, we use len to find out how many characters are in the variable data, and print that out.

Again, what we're really reporting is the number of bytes, not the number of characters, but we'll assume for now that there's a one-to-one match.

When we run our program, it tells us the file is 293 characters long.
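The program described above might look like this. The filename 'haiku.txt' and its contents are stand-ins, since the lecture's actual haiku file isn't reproduced here (so the count printed here differs from the 293 in the lecture):

```python
# Create a small stand-in for the haiku file so the sketch is self-contained.
with open('haiku.txt', 'w') as tmp:
    tmp.write('The first haiku line\nAnd then a second one\n')

reader = open('haiku.txt', 'r')   # connect to the file for reading
data = reader.read()              # read the entire contents at once
reader.close()                    # release the connection to the file
print(len(data))                  # report how many characters we read
```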

Our first program read the entire file into memory, then calculated its length. If the file might be very large—say, a few terabytes—it would be better to read it in chunks, so that we don't overflow memory.

Here's a program that does that. After opening the file as before…

…we pass the value 64 to the read method to indicate that we only want the next 64 bytes of data.

This method will return an empty string if there is nothing left to read.

The program then goes into a loop. As long as its last attempt to read from the file returned some data…

…it prints out how much data it read…

…then tries to read some more data.

As a check, we print the length of data after the loop is over. This should be zero, since the program should stay in the loop as long as it's actually getting data from the file.

Sure enough, the output is four full blocks of 64 characters, one partial block of 37, and then 0 at the end of the file.
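A sketch of that block-reading loop, again using a stand-in file (here 150 bytes, so the output is two full blocks of 64, one partial block of 22, and then 0):

```python
# Create a 150-byte stand-in file: two full 64-byte blocks plus a partial one.
with open('haiku.txt', 'w') as tmp:
    tmp.write('x' * 150)

reader = open('haiku.txt', 'r')
data = reader.read(64)            # ask for at most the next 64 bytes
while data:                       # loop as long as the last read got something
    print(len(data))              # report how much this read returned
    data = reader.read(64)        # try to read the next block
reader.close()
print(len(data))                  # 0: the final read returned an empty string
```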

You might think from this example that reading in blocks is the right way to do it, but the extra complexity is really only warranted if…

…the file really might be very large (or infinitely long, like a stream of readings from a lab instrument). Remember, premature optimization is the root of much evil.

read is the most fundamental way to get data from a file, but it's more common to read data a line at a time.

To show how this works, here's a program that calculates the average length of the lines in a file.

After opening the file, it uses readline to read the next line of text from the file.

As with read, this will return an empty string when there's nothing left in the file, so our program loops until it gets an empty string.

And inside the loop, it uses another readline call to try to get the next line.

When we run this program on our haiku file, it tells us that the average line is a little over nineteen and a half characters long.
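Here is a sketch of that averaging program. The stand-in file has three short lines, so the average printed here is smaller than the lecture's 19.5; note that each line's length includes its end-of-line character:

```python
# Stand-in input file with three lines of known length.
with open('haiku.txt', 'w') as tmp:
    tmp.write('First line\nSecond line\nThird\n')

reader = open('haiku.txt', 'r')
total = 0
count = 0
line = reader.readline()          # try to get the first line
while line:                       # an empty string means end of file
    total += len(line)            # length includes the trailing newline
    count += 1
    line = reader.readline()      # try to get the next line
reader.close()
print(total / float(count))       # average characters per line
```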

Memory permitting, it's often more convenient to read all the lines in the file at once.

Here's a program that uses readlines (with an 's' on the end) to do that.

It returns a list of strings, which the program assigns to the variable contents.

The program then loops over that list with for, instead of using a while to read from the file until it's exhausted.

Again, the output is a little over nineteen and a half characters per line.
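The readlines version might look like this, using the same stand-in file:

```python
# Stand-in input file with three lines.
with open('haiku.txt', 'w') as tmp:
    tmp.write('First line\nSecond line\nThird\n')

reader = open('haiku.txt', 'r')
contents = reader.readlines()     # a list of strings, one per line
reader.close()

total = 0
for line in contents:             # loop over the list with for...
    total += len(line)            # ...instead of reading in a while loop
print(total / float(len(contents)))
```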

Reading the file's contents as a list of lines, then looping over that list, is a very common idiom.

So Python allows programs to just loop over a file line by line.

Here's the average line length program done that way.

The for loop assigns the lines in the file to the variable line one after another, halting automatically when the file is exhausted.

And yes, its output is 19.53333…
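A sketch of the loop-over-the-file version, with the same stand-in input:

```python
# Stand-in input file with three lines.
with open('haiku.txt', 'w') as tmp:
    tmp.write('First line\nSecond line\nThird\n')

reader = open('haiku.txt', 'r')
total = 0
count = 0
for line in reader:               # Python hands us one line at a time
    total += len(line)
    count += 1
reader.close()                    # the loop stops by itself at end of file
print(total / float(count))
```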

Of course, we have to put data in files somehow.

In Python, we can do this with two other file methods called write and writelines.

Here's a program that writes to temp.txt using these two methods.

As before, we open the file with open.

The first argument is the file we want to write to. Its previous content will be overwritten if it already exists…

…and it will be created if it doesn't.

The difference between this call and the ones we've seen before is that the second argument is the string 'w' instead of the string 'r', which signals that we want to write to the file.

The program then uses the file object's write method to write a string to the file.

Alternatively, it can use writelines to write each string in a list.

But if we run the program, then look in temp.txt, the output is all crammed together.

The reason is that Python only writes what we tell it to, and we didn't tell it to write any end-of-line characters.

We have to modify the program to add a newline '\n' at the end of each line.

When we run this program, we get the output we want.
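The corrected writing program might look like this; the strings written are placeholder text, but the filename temp.txt matches the lecture:

```python
writer = open('temp.txt', 'w')            # 'w': write, overwriting any old content
writer.write('first line\n')              # write a single string, newline included
writer.writelines(['second line\n',       # write every string in a list
                   'third line\n'])
writer.close()
```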

Rather than putting newline characters everywhere, many programmers find it simpler to write to files using print with a redirect, which is written as a double greater-than sign.

Here's our program rewritten to use this idiom.

The file we want to print to appears on the right of the >>; other than that, it looks like any other print statement.

And just like a regular print, this automatically adds a newline after everything.
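The lecture's `print >> writer, ...` syntax is specific to Python 2; in Python 3, the same redirection is spelled with print's `file` argument. A sketch in the Python 3 spelling, with placeholder strings:

```python
# Python 2 would write these as: print >> writer, 'first line'
writer = open('temp.txt', 'w')
print('first line', file=writer)          # the newline is added automatically
print('second line', file=writer)
writer.close()
```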

So let's use what we've learned so far to copy one file's contents to another.

Here's the first version:

The first three lines read in everything from the source file…

…while the second three write it all out to the destination file.

As before, this probably won't work with a terabyte of data…

…but in almost all cases, that doesn't matter.
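That first copying program might look like this; the filenames are placeholders, and the sketch creates its own small source file so it is self-contained:

```python
# Create a small stand-in source file to copy.
with open('source.txt', 'w') as tmp:
    tmp.write('some\nsample\ntext\n')

reader = open('source.txt', 'r')
data = reader.read()              # read everything from the source...
reader.close()

writer = open('dest.txt', 'w')
writer.write(data)                # ...then write it all to the destination
writer.close()
```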

Here's another version that will work for a terabyte, provided it's a terabyte of text.

After opening both files, we read the input a line at a time, writing data out as we go, then close both files.

This does assume the file is text, though.

Or at least that the end-of-line character appears fairly frequently. If it doesn't, readline may be asked to read an enormous block of data into memory, which gets us back to our "can't read a terabyte" problem.
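A sketch of the line-at-a-time copy, again with placeholder filenames:

```python
# Create a small stand-in source file to copy.
with open('source.txt', 'w') as tmp:
    tmp.write('some\nsample\ntext\n')

reader = open('source.txt', 'r')
writer = open('dest.txt', 'w')
for line in reader:               # read one line at a time...
    writer.write(line)            # ...and write it out, newline and all
reader.close()
writer.close()
```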

This version looks similar to the previous one, but doesn't make an exact copy of the original file.

Instead of using write, it prints to the file.

The problem is that Python keeps the newline character at the end of the input line when it's reading.

And print automatically adds a newline at the end of what it outputs.

So this program actually produces a double-spaced copy of the file.
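A sketch of that double-spacing bug, using Python 3's `file` argument in place of the lecture's Python 2 `>>` syntax:

```python
# Create a small stand-in source file.
with open('source.txt', 'w') as tmp:
    tmp.write('one\ntwo\n')

reader = open('source.txt', 'r')
writer = open('dest.txt', 'w')
for line in reader:
    print(line, file=writer)      # line already ends in '\n'; print adds another
reader.close()
writer.close()
```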

Here's a better alternative.

It defines BLOCKSIZE to be 1024. (Python doesn't actually let us define constants, so by convention, any variable that we want to treat as a constant is spelled in all capitals.) We then use the "read in a loop" idiom we saw earlier to read up to 1024 bytes at a time, writing them out as we go.

This file copying program is efficient, and will handle files of any size, but it's harder to understand than our first two: needlessly so, unless we expect very large input files.
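The block-copying program might look like this; the filenames are placeholders, and the sketch makes its own 3000-byte source file so the loop runs more than once:

```python
BLOCKSIZE = 1024                  # all capitals: treat this as a constant

# Create a 3000-byte stand-in source file (two full blocks plus a partial one).
with open('source.txt', 'w') as tmp:
    tmp.write('x' * 3000)

reader = open('source.txt', 'r')
writer = open('dest.txt', 'w')
data = reader.read(BLOCKSIZE)     # read up to 1024 bytes at a time...
while data:
    writer.write(data)            # ...writing them out as we go
    data = reader.read(BLOCKSIZE)
reader.close()
writer.close()
```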

Thank you.