Hello, and welcome to the sixth episode of the Software Carpentry lecture on Python. This episode will show you more about how to work with strings.
We've been using strings since our very first episode, and as you've probably guessed by now, a string is just a sequence of characters.
There is actually no separate data type for individual characters: a character is just a string of length 1.
Strings are indexed exactly like lists.
If name holds the string 'Darwin', then name[0] is the first character, 'D', and name[-1] is the last one, 'n'.
for loops work the same way they do for lists too.
If the program contains for c in name, Python assigns each character to the variable c in turn.
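Here's a minimal sketch of the indexing and looping just described. The names first, last, and characters are introduced only for this illustration.

```python
name = 'Darwin'

first = name[0]    # the first character, 'D'
last = name[-1]    # the last character, 'n'

# A for loop visits each character in turn, just as it visits list items.
characters = []
for c in name:
    characters.append(c)

print(first, last)
print(characters)
```

Each value placed in c is itself a string of length 1, since there is no separate character type.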
Strings can be wrapped in either single quotes or double quotes, as long as the quotes match.
Here, the first string is in single quotes, and the second in double quotes.
It doesn't matter which form is used: the string's value is the same.
We can use == to test this.
And speaking of comparison, we can use "less than" or "greater than or equal to" to compare strings. When Python compares strings, it works character by character from left to right.
As you'd expect, 'a' is less than 'b'.
And 'ab' is less than 'abc' (since 'ab' runs out of characters first).
The digit characters are ordered in the natural way too.
But if you put these rules together, the string '100' is less than the string '9', because '1' is less than '9'.
It may also surprise you to discover that an upper-case 'A' is less than a lower-case 'a'. In fact, every upper case letter comes before every lower case letter.
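The quoting and comparison rules above can be checked directly. This is just a small sketch; the dictionary comparisons is a hypothetical helper used to label each test.

```python
# Single and double quotes produce the same value.
same_value = ('Darwin' == "Darwin")
print(same_value)

# Comparison works character by character from left to right.
comparisons = {
    "'a' < 'b'": 'a' < 'b',
    "'ab' < 'abc'": 'ab' < 'abc',      # 'ab' runs out of characters first
    "'1' < '9'": '1' < '9',
    "'100' < '9'": '100' < '9',        # decided by '1' < '9'
    "'A' < 'a'": 'A' < 'a',
    "'Z' < 'a'": 'Z' < 'a',            # upper case sorts before lower case
}
for label, result in comparisons.items():
    print(label, result)
```

Every one of these comparisons is true, including the two that tend to surprise people.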
One more surprise is that strings are immutable, i.e., they cannot be changed in place.
For example, if we try to overwrite the 'D' in 'Darwin' with a 'C', Python gives us an error. This is different from languages such as C, which allow strings to be changed in place.
Python strings are immutable because immutability improves performance: it allows Python to do some internal optimization that wouldn't be possible if strings could be changed arbitrarily.
It also helps make programmers more productive by making some kinds of errors impossible—we'll explore this in more detail in the next episode.
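Here's a short sketch of what that error looks like in practice. The try/except wrapper is only there so that the example runs to completion; the variable message is introduced for illustration.

```python
name = 'Darwin'

try:
    name[0] = 'C'              # attempt to change the string in place
except TypeError as error:
    message = str(error)       # Python refuses and explains why
    print('TypeError:', message)

print(name)                    # the string is unchanged
```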
But hang on a second: we've already seen that we can use + to concatenate strings.
For example, we can "add" the strings 'Charles', space, and 'Darwin' to produce 'Charles Darwin'.
What happens is that concatenation always produces a new string.
If the variable original refers to the string 'Charles'…
…and the variable name refers to the same string…
…then when we add a space and 'Darwin' to name, Python actually creates a new string and assigns that to name, leaving original pointing at the original string. It does not modify the string 'Charles' in place.
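The same sequence of steps can be sketched in code. The variables same_before and same_after are assumptions added here to make the rebinding visible; they are not part of the original example.

```python
original = 'Charles'
name = original                  # both variables refer to the same string
same_before = name is original

name = name + ' ' + 'Darwin'     # builds a new string and rebinds name
same_after = name is original

print(original)                  # Charles
print(name)                      # Charles Darwin
print(same_before, same_after)   # True False
```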
Novices often use string concatenation to format output.
Here's an example: we concatenate three constant strings and the string representations of two numbers to produce a single string of output.
It works, but there's a much better way.
In Python, we can use the % operator to format output. On the left, we have a format string with placeholders where we want to insert values. On the right, we have the values we want to insert.
Here's a simple example: the format string is 'reagent: %d', and the value we're inserting is 123. The format specifier '%d' in the format string means "decimal integer", so Python creates a new string with the value 123 in place of the '%d'.
We can control the width and precision of values too: in this example, '%6.2f' means "floating point number, six characters wide, two digits after the decimal point".
If we want to format multiple values at once, we have to put them in parentheses after the %.
Here's our earlier example re-done with string formatting: we have used '%d' to format an integer, '%f' to format a floating-point number…
…and '%%' to format an actual percentage sign. We have to do this because when Python applies % to a string, it expects something after every actual percentage sign in that string. We'll come back to this idea in a few moments.
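Here's a rough sketch that puts concatenation and the % operator side by side. The values reagent_id and concentration and the wording of the output are hypothetical, chosen only for illustration; they are not the numbers from the original example.

```python
# Hypothetical values, chosen only for illustration.
reagent_id = 123
concentration = 45.6789

# Concatenation: every number has to be converted with str() first.
by_concatenation = ('reagent ' + str(reagent_id) + ' is at '
                    + str(concentration) + '%')

# The % operator: format string on the left, values on the right.
single = 'reagent: %d' % 123             # 'reagent: 123'
fixed_width = '%6.2f' % 3.14159          # '  3.14' (six characters wide)
by_formatting = 'reagent %d is at %.2f%%' % (reagent_id, concentration)

print(by_concatenation)
print(single)
print(fixed_width)
print(by_formatting)
```

Note that the two values on the right of the last % are grouped in parentheses, and that '%%' in the format string becomes a single percent sign in the result.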
Even without the percentage operator, we sometimes use two characters in a program to put one character into a string. The most common example is probably \n, which means "a newline character".
We can also use \' to insert a literal single quote, or \" to insert a literal double quote.
Here, for example, we have a single-quoted string that contains both a newline and a single quote.
And here, we have a double-quoted string that contains a newline and a double quote.
So if \ is used to start special two-character sequences, how do we represent an actual backslash? The answer is, with two backslashes.
Here, for example, is a string that includes a single literal backslash character. It is written with two backslashes, but when Python reads the program, it only puts one in the string.
This doubling up is a common pattern with so-called escape sequences.
We use some character to mean, "What follows is special."
And then double up that character to mean, "The character itself."
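The escape sequences just described can be sketched like this. The string contents are made up for illustration; only the escapes themselves come from the discussion above.

```python
# \' is one single quote and \n is one newline inside a single-quoted string.
single_quoted = 'It\'s on\ntwo lines'

# \" is one double quote inside a double-quoted string.
double_quoted = "She said \"hi\"\nand left"

# Written with two backslashes, stored as one.
backslash = 'C:\\temp'

print(single_quoted)
print(double_quoted)
print(backslash, len(backslash))   # C:\temp 7
```

The length of 7 confirms that the doubled backslash in the program text becomes a single character in the string.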
There's another way to get newline characters in strings. If we use three quotes of either kind to start and end a string, it can span multiple lines.
Here, for example, we have a four-line string.
There's nothing magical about this: Python just puts the newline characters at the end of the first three lines into the string data it stores in memory.
We could just as well write this as two "normal" strings, with embedded newline characters, and then concatenate them.
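As a small sketch of that equivalence, here is a four-line string written both ways. The text of the lines is a placeholder, not the string from the original example.

```python
# A triple-quoted string that spans four lines.
four_lines = """first line
second line
third line
fourth line"""

# The same text as two "normal" strings with embedded newlines, concatenated.
same_text = "first line\nsecond line\n" + "third line\nfourth line"

print(four_lines == same_text)     # True
print(four_lines.count('\n'))      # 3 newlines, one after each of the first three lines
```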
Like lists, strings have methods.
For example, the capitalize, upper, and lower methods return new strings that translate some or all of the characters of the original. These methods don't modify the original string, though, because strings are immutable.
Another method, count, returns the number of times a character occurs in the string.
And find returns the index of the first occurrence of a character, or -1 if the character can't be found.
Another useful method is replace, which creates a new string with every occurrence of one character replaced with another.
In fact, count, find, and replace all work on entire strings, not just single characters.
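Here's a quick sketch of these methods in action. The sample string and the variable names are chosen only for illustration.

```python
name = 'charles darwin'

capitalized = name.capitalize()     # 'Charles darwin'
shouted = name.upper()              # 'CHARLES DARWIN'
quiet = 'DARWIN'.lower()            # 'darwin'

how_many_a = name.count('a')        # 2
first_a = name.find('a')            # 2, the index of the first 'a'
missing = name.find('z')            # -1, because there is no 'z'
where_darwin = name.find('darwin')  # 8, works for whole substrings too

swapped = name.replace('a', 'A')                # 'chArles dArwin'
renamed = name.replace('charles', 'erasmus')    # 'erasmus darwin'

print(capitalized, shouted, quiet)
print(how_many_a, first_a, missing, where_darwin)
print(swapped, renamed)
print(name)                         # still 'charles darwin'
```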
One common idiom in Python and other languages is to chain method calls together.
Here's a rather contrived example.
The first method call, the one that is invoked directly on the variable element, returns a string that is the upper-case version of the string 'cesium'.
We then call center on this string to create yet another one that has the upper-case copy of 'cesium' centered in a field 10 characters wide.
The result is shown here; the technique is no different from a mathematician writing f(g(x)).
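Written out as code, the chained call and its step-by-step equivalent might look like this. The names shouted and stepwise are introduced here only to show the intermediate string.

```python
element = 'cesium'

# Chained: upper() runs first, then center() runs on its result.
centered = element.upper().center(10)

# The same thing in two steps.
shouted = element.upper()        # 'CESIUM'
stepwise = shouted.center(10)    # '  CESIUM  '

print(repr(centered))
print(centered == stepwise)      # True
```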
Thank you.