Hello, and welcome to the fifth episode of the Software Carpentry lecture on Python. In this episode, we'll have a look at input and output.
In the previous episodes, we've worked with data held in memory, but in real programs, we'll want to save data to files, and read data back from them.
Python's tools for input and output are pretty simple, and owe a lot to those invented for C. Basically:
A file is actually a sequence of bytes.
But it's often more useful to treat it as a sequence of lines.
For our examples, we'll work with a file containing several haikus taken from an online competition run by Salon magazine in 1998.
Let's start by asking, "How many characters are in the file?"
What we'll actually find out is how many bytes are in the file.
We'll assume right now that each character is stored in one byte.
But we'll revisit the subject later.
Here's our program.
It starts by opening the file using the built-in function
open. This creates a file object that keeps track of the program's connection to the file.
The first argument to
open is the name of the file the program wants to work with.
The second is the letter
'r', which signals that it wants to read from the file.
If all goes well, the program assigns the result of
open to the variable
reader, which will be the program's connection to the file.
The program can now call the method
read to read in the entire content of the file…
…and assign it to the variable data.
Since the program is done with the file, it calls close.
This isn't strictly necessary in small programs, but it's a good habit to get into, since most operating systems limit the number of files a program can have open at one time.
Finally, we use
len to find out how many characters are in the variable
data, and print that out.
Again, what we're really reporting is the number of bytes, not the number of characters, but we'll assume for now that there's a one-to-one match.
When we run our program, it tells us the file is 293 characters long.
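Putting those steps together, the whole program might look like the sketch below. (The file name haiku.txt and its contents are stand-ins, since the lecture's actual haiku file isn't reproduced here; the count will therefore differ from 293.)

```python
# Create a small stand-in for the haiku file used in the lecture.
writer = open('haiku.txt', 'w')
writer.write('first line\nsecond line\n')
writer.close()

# The character-counting program itself:
reader = open('haiku.txt', 'r')   # 'r' signals we want to read
data = reader.read()              # read the entire file into memory
reader.close()                    # good habit: release the connection
print(len(data))                  # number of characters (bytes, really)
```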
Our first program read the entire file into memory, then calculated its length. If the file might be very large—say, a few terabytes—it would be better to read it in chunks, so that we don't overflow memory.
Here's a program that does that. After opening the file as before…
…we pass the value 64 to the
read method to indicate that we only want the next 64 bytes of data.
This method will return an empty string if there is nothing left to read.
The program then goes into a loop. As long as its last attempt to read from the file returned some data…
…it prints out how much data it read…
…then tries to read some more data.
As a check, we print the length of
data after the loop is over. This should be zero, since the program should stay in the loop as long as it's actually getting data from the file.
Sure enough, the output is four full blocks of 64 characters, one partial block of 37, and then 0 at the end of the file.
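A sketch of the block-at-a-time version follows. (Again the input file is a stand-in, here 150 characters long, so the block counts will be 64, 64, and 22 rather than the lecture's four blocks of 64 and one of 37.)

```python
# Stand-in input file.
writer = open('haiku.txt', 'w')
writer.write('x' * 150)
writer.close()

reader = open('haiku.txt', 'r')
data = reader.read(64)            # ask for at most the next 64 bytes
while len(data) > 0:              # an empty string means end of file
    print(len(data))              # report how much we actually got
    data = reader.read(64)        # try to read some more
reader.close()
print(len(data))                  # should always be 0 here
```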
You might think from this example that reading in blocks is the right way to do it, but the extra complexity is really only warranted if…
…the file really might be very large (or infinitely long, like a stream of readings from a lab instrument). Remember, premature optimization is the root of much evil.
read is the most fundamental way to get data from a file, but it's more common to read data a line at a time.
To show how this works, here's a program that calculates the average length of the lines in a file.
After opening the file, it uses
readline to read the next line of text from the file.
Like read, this will return an empty string when there's nothing left in the file, so our program loops until it gets an empty string.
And inside the loop, it uses another
readline call to try to get that next line.
When we run this program on our haiku file, it tells us that the average line is a little over nineteen and a half characters long.
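The readline loop might be sketched like this. (The two-line stand-in file gives an average of 5.0 rather than the haiku file's 19.53; note the count includes each line's end-of-line character.)

```python
# Stand-in input file.
writer = open('haiku.txt', 'w')
writer.write('two\nlines\n')
writer.close()

reader = open('haiku.txt', 'r')
number, total = 0, 0
line = reader.readline()          # get the first line
while line != '':                 # empty string signals end of file
    number += 1
    total += len(line)            # length includes the trailing '\n'
    line = reader.readline()      # try to get the next line
reader.close()
print(total / number)             # average line length
```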
Memory permitting, it's often more convenient to read all the lines in the file at once.
Here's a program that uses
readlines (with an 's' on the end) to do that.
It returns a list of strings, which the program assigns to a variable.
The program then loops over that list with
for, instead of using a
while loop to read from the file until it's exhausted.
Again, the output is a little over nineteen and a half characters per line.
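A sketch of the readlines version, using the same stand-in file as before:

```python
# Stand-in input file.
writer = open('haiku.txt', 'w')
writer.write('two\nlines\n')
writer.close()

reader = open('haiku.txt', 'r')
contents = reader.readlines()     # a list of lines, newlines included
reader.close()
total = 0
for line in contents:             # loop over the list with for
    total += len(line)
print(total / len(contents))      # average line length
```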
Reading the file's contents as a list of lines, then looping over that list, is a very common idiom.
So Python allows programs to just loop over a file line by line.
Here's the average line length program done that way.
The for loop assigns the lines in the file to the variable
line one after another, halting automatically when the file is exhausted.
And yes, its output is 19.53333…
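Looping directly over the file object can be sketched like this (same stand-in file, so the average here is 5.0):

```python
# Stand-in input file.
writer = open('haiku.txt', 'w')
writer.write('two\nlines\n')
writer.close()

reader = open('haiku.txt', 'r')
number, total = 0, 0
for line in reader:               # loop over the file line by line
    number += 1
    total += len(line)
reader.close()
print(total / number)
```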
Of course, we have to put data in files somehow.
In Python, we can do this with two other file methods called write and writelines.
Here's a program that writes to
temp.txt using these two methods.
As before, we open the file with open.
The first argument is the file we want to write to. Its previous content will be overwritten if it already exists…
…and it will be created if it doesn't.
The difference between this call and the ones we've seen before is that the second argument is the string
'w' instead of the string
'r', which signals that we want to write to the file.
The program then uses the file object's
write method to write a string to the file.
Alternatively, it can use
writelines to write each string in a list.
But if we run the program, then look in
temp.txt, the output is all crammed together.
The reason is that Python only writes what we tell it to, and we didn't tell it to write any end-of-line characters.
We have to modify the program to add a newline
'\n' at the end of each line.
When we run this program, we get the output we want.
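The corrected writing program might look like this (the strings written are stand-ins for whatever the lecture's version wrote):

```python
writer = open('temp.txt', 'w')              # 'w': overwrite or create
writer.write('first line\n')                # write one string
writer.writelines(['second\n', 'third\n'])  # write each string in a list
writer.close()
```

Without the explicit '\n' characters, all three strings would run together on a single line.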
Rather than putting newline characters everywhere, many programmers find it simpler to write to files using print.
Here's our program rewritten to use this idiom.
The file we want to print to appears on the right of the
>>; other than that, it looks like any other print statement.
And just like a regular print statement, it automatically adds a newline at the end of each line.
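The transcript uses Python 2's print >> idiom, which no longer runs in Python 3; the equivalent in Python 3 is print's file argument, sketched here:

```python
writer = open('temp.txt', 'w')
print('first line', file=writer)   # print adds the newline for us
print('second line', file=writer)
writer.close()
```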
So let's use what we've learned so far to copy one file's contents to another.
Here's the first version:
The first three lines read in everything from the source file…
…while the second three write it all out to the destination file.
As before, this probably won't work with a terabyte of data…
…but in almost all cases, that doesn't matter.
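As a sketch, the read-everything-then-write-everything copy looks like this (the file names source.txt and copy.txt are illustrative):

```python
# Stand-in source file.
writer = open('source.txt', 'w')
writer.write('some\ncontent\n')
writer.close()

# Three lines to read everything in...
reader = open('source.txt', 'r')
data = reader.read()
reader.close()
# ...and three to write it all out.
writer = open('copy.txt', 'w')
writer.write(data)
writer.close()
```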
Here's another version that will work for a terabyte, provided it's a terabyte of text.
After opening both files, we read the input a line at a time, writing data out as we go, then close both files.
This does assume the file is text, though.
Or at least that the end-of-line character appears fairly frequently. If it doesn't,
readline may be asked to read an enormous block of data into memory, which gets us back to our "can't read a terabyte" problem.
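The line-at-a-time copy can be sketched as follows (same illustrative file names):

```python
# Stand-in source file.
writer = open('source.txt', 'w')
writer.write('some\ncontent\n')
writer.close()

reader = open('source.txt', 'r')
writer = open('copy.txt', 'w')
line = reader.readline()
while line != '':
    writer.write(line)            # the line already ends in '\n'
    line = reader.readline()
reader.close()
writer.close()
```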
This version looks similar to the previous one, but doesn't make an exact copy of the original file.
Instead of using
write, it prints to the file.
The problem is that Python keeps the newline character at the end of the input line when it's reading.
So this program actually produces a double-spaced copy of the file.
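A sketch of the buggy version makes the double-spacing visible. (It is written here with Python 3's print(..., file=...) rather than the transcript's Python 2 print >> idiom, but the bug is the same: each line keeps its own '\n', and print adds another.)

```python
# Stand-in source file.
writer = open('source.txt', 'w')
writer.write('some\ncontent\n')
writer.close()

reader = open('source.txt', 'r')
writer = open('copy.txt', 'w')
for line in reader:
    print(line, file=writer)      # line keeps its '\n'; print adds another
reader.close()
writer.close()
```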
Here's a better alternative.
It starts by defining BLOCKSIZE to be 1024. (Python doesn't actually let us define constants, so by convention, any variable that we want to treat as a constant is spelled in all capitals.) We then use the "read in a loop" idiom we saw earlier to read up to 1024 bytes at a time, writing them out as we go.
This file copying program is efficient, and will handle files of any size, but it's harder to understand than our first two: needlessly so, unless we expect very large input files.
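A sketch of the block-based copy (file names again illustrative):

```python
BLOCKSIZE = 1024                  # all capitals: treat as a constant

# Stand-in source file.
writer = open('source.txt', 'w')
writer.write('some\ncontent\n')
writer.close()

reader = open('source.txt', 'r')
writer = open('copy.txt', 'w')
data = reader.read(BLOCKSIZE)     # up to 1024 bytes at a time
while len(data) > 0:
    writer.write(data)
    data = reader.read(BLOCKSIZE)
reader.close()
writer.close()
```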