
Python: Text

Python/Text at YouTube

Hello, and welcome to the thirteenth episode of the Software Carpentry lecture on Python. In this episode, we'll step back from the language itself and ask, what is text data?

Let's start with a simple question: how should a computer represent single characters?

For American English in the 1960s, the solution was simple:

there are 26 letters, each with an upper-case and a lower-case form…

…ten digits…

…some punctuation…

…and a few special "characters" for controlling the teletype terminals of the period (meaning "go to a new line", "move back to the start of the line", "start a new page", "ring the bell", and so on).

There were fewer than 128 of these, so the ASCII committee standardized on an encoding that used 7 bits per character.
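
To make this concrete, Python's built-in ord and chr functions convert between a single character and its numeric value (the sample characters here are just illustrations):

    print ord('A')        # 65: every ASCII value fits in 7 bits
    print ord('z')        # 122
    print repr(chr(10))   # '\n', the control character for "go to a new line"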

Next question: how should text containing many characters be represented?

The first choice, which was dictated by the punch card technology of the 1940s and 1950s, was to use fixed-width records, in which each line was exactly the same length.

For example, a computer would lay out this haiku…

…in three records as shown here (where the dot character means "unused").

This representation makes it easy to skip forward or backward by N lines, since each is exactly the same size…

…but it may waste space…

…and no matter what maximum length we choose, we'll eventually have to deal with lines that are longer.
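
Here is a minimal sketch of the idea in Python; the record width is an arbitrary choice, and the haiku is a stand-in, not necessarily the one on the slide:

    RECORD_LEN = 20   # an assumed fixed record width

    lines = ['old pond', 'a frog jumps in', 'sound of water']
    records = [line.ljust(RECORD_LEN) for line in lines]   # pad with spaces
    data = ''.join(records)

    # Skipping to line N is simple arithmetic...
    n = 2
    print repr(data[n * RECORD_LEN : (n + 1) * RECORD_LEN])

    # ...but every record shorter than RECORD_LEN wastes space,
    # and a line longer than RECORD_LEN simply would not fit.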

Over time, most programmers switched to a different representation, in which text is just a stream of bytes, some of which mean "the current line ends here".

With this representation, our haiku would be stored like this, where the gray cells mean "end of line".

This is more flexible…

…and wastes less space…

…but skipping forward or backward by N lines is harder, since each one might be a different length…

…and of course, we have to decide what to use to mark the ends of lines.
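
Here is the same stand-in haiku in the stream-of-bytes representation:

    data = 'old pond\na frog jumps in\nsound of water\n'

    # No padding is wasted, but finding line N means scanning past
    # N end-of-line markers, which takes longer as N grows.
    print data.split('\n')[2]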

Unfortunately, different groups picked different things. On Unix, the end of line is marked by a single newline character, which is written '\n'.

On Windows, the end of line is marked with a carriage return followed by a newline, which is written '\r\n'.

Most editors can detect and handle the difference, but it's still annoying for programmers, who need to be able to handle both.
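
One practical consequence: if you split text into lines yourself, it is safer to use the string method splitlines, which recognizes both conventions, than to split on '\n' by hand:

    unix_text = 'one\ntwo\n'
    windows_text = 'one\r\ntwo\r\n'

    print unix_text.splitlines()      # ['one', 'two']
    print windows_text.splitlines()   # ['one', 'two']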

Python tries to help by converting '\r\n' to '\n' when it's reading data from a file on Windows, and converting the other way when it's writing. This is the right behavior for text…

…but if you're reading an image, an audio file, or some other binary file that might just happen to have the numbers representing '\r' and '\n' after each other, you definitely don't want this conversion to happen. To prevent it, you must open the file in binary mode.

To do this, put the letter 'b' after the 'r' or 'w' when you call open, as shown here.
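
For example (the filenames here are just placeholders):

    reader = open('picture.png', 'rb')   # 'b' disables newline conversion
    data = reader.read()
    reader.close()

    writer = open('copy.png', 'wb')      # write the bytes back untouched
    writer.write(data)
    writer.close()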

Now, back to characters…

ASCII is fine for the digit 2, the letter 'q', or a circumflex '^', but how should we store 'ê', 'β', or other characters?

Well, 7 bits gives us the numbers from 0 to 127…

…but an 8-bit byte can represent numbers up to 255, so why not extend the ASCII standard to define meanings for those "extra" 128 numbers?

Unfortunately, everyone did, but in different and incompatible ways.

The result was a mess: if a program assumed characters were encoded using Spanish rules when they were actually encoded in Bulgarian, what it got was gibberish.

And setting that aside, many languages—particularly those of East Asia—use a lot more than 256 distinct symbols.

The solution that emerged in the 1990s is called the Unicode standard.

It defines integer values, called code points, to represent tens of thousands of different characters and symbols…

…but does not define how to store those integers in a file, or as a string in memory.
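
For example, applying ord to a single Unicode character gives its code point, which is just an abstract integer, not a recipe for bytes on disk:

    print ord(u'\u00ea')   # 234, the code point Unicode assigns to 'ê'
    print ord(u'\u03b2')   # 946, the code point for the Greek letter 'β'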

The simplest choice would be to switch from using an 8-bit byte for each character to using a 32-bit integer…

…but that would waste a lot of space for alphabetic languages like English, Estonian, and Brazilian Portuguese.

Despite this, 32 bits per character is actually used in memory, where access speed is important…

…but most programs and programmers use something else when saving data to a file or sending it over the Internet.

That something else is (almost) always an encoding called UTF-8, which uses a variable number of bytes per character.

For backward compatibility's sake, the first 128 characters (i.e., the old ASCII character set) are stored in one byte each.

The next 1920 characters are stored using two bytes each, the next 61,000-odd in three bytes each, and so on.
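
We can see the variable width directly by encoding characters from different ranges (the sample characters are arbitrary):

    print len(u'a'.encode('utf-8'))        # 1 byte: in the old ASCII range
    print len(u'\u00ea'.encode('utf-8'))   # 2 bytes: 'ê'
    print len(u'\u4e2d'.encode('utf-8'))   # 3 bytes: the CJK character '中'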

If you're curious, the way this works is shown in this table… but you don't have to know or care.

What you do have to know these days is that Python 2.* provides two kinds of strings.

A "classic" string uses one byte per character, just as it always did.

While a "Unicode" string uses enough memory per character to store any kind of text.

Unicode strings are indicated by putting a lower-case 'u' in front of the opening quote.
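
For example:

    plain = 'hello'     # a classic byte string
    fancy = u'hello'    # a Unicode string
    print type(plain)   # <type 'str'>
    print type(fancy)   # <type 'unicode'>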

If we want to convert a Unicode string to a string of bytes, we must specify an encoding.
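
The conversion is done with the string method encode; decode goes the other way:

    symbol = u'\u03b2'                 # the Greek letter beta
    encoded = symbol.encode('utf-8')   # a byte string: '\xce\xb2'
    decoded = encoded.decode('utf-8')  # back to a Unicode string
    print decoded == symbol            # True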

You should always use UTF-8 unless you have a very, very good reason to do something else.

And even then, you should think twice.