Find us on GitHub

Teaching basic lab skills
for research computing

Matrix Programming: Basic Operations

Matrix Programming/Basic Operations at YouTube

Hello, and welcome to this episode of the Software Carpentry lecture on matrix programming. In the next few minutes, we'll show you a few basic things that libraries like NumPy can do to make your life easier. These same tools are also available in languages like MATLAB and R, or in libraries for Java, Perl, and so on: there may be minor differences in how they work, but the basic ideas are universal.

Numerical Python, or NumPy, provides MATLAB-style arrays for Python… …and many other things as well. More importantly, it provides a data-parallel programming model. You write 'x*A*x.T'… …and the computer takes care of the loops. All of this functionality is encapsulated in special objects called arrays.

Let's start by creating an array from a list. We import numpy… …and then called 'numpy.array' with a list of initial values as an argument. The resulting array is three elements long.

Unlike Python lists, NumPy arrays are homogeneous… …i.e., all values must have exactly the same type. This allows values to be packed together as shown here… …which saves memory… …and is much faster to process.

If we give NumPy initial values of different types… …it finds the most general type—in this case, float—and uses that.

If you want a specific type, rather than letting NumPy decide, you can pass an optional argument to 'array' called 'dtype' (for 'data type'). In this case, we have told NumPy that we want 32-bit floating point values.

As an exercise, think of a case where you would want this kind of control.

You can also choose a data type later using array's 'astype' method. Please do not simply assign a type to the array's 'dtype' unless you are really, really sure you know what you are doing. This causes NumPy to reinterpret the bits in the array according to the new data type, which is almost never what you want.

NumPy provides many basic numerical data types. Note that 'int', 'float', and 'complex' are whatever the underlying hardware uses as its native type: this will usually be 32 or 64 bit, but 128-bit machines are becoming common. Note also that 'uint' is an unsigned integer, i.e., all the bits are used for magnitude, but negative values cannot be represented.

There are many other ways to create arrays besides calling 'array'. For example, the 'zeros' function takes a tuple specifying array dimensions as an argument, and returns an array of zeros of that size. The array's data type is 'float' unless something else is specified using 'dtype'.

As an exercise, see if you can guess what 'ones' and 'identity' do, and then run a Python interpreter and check your guesses.

It's also possible to create NumPy arrays without filling them with data using the 'empty' function. This function does not initialize the values, so they are whatever bits were lying around in memory when it was called.

As an exercise, see if you can think of a case where this is useful.

As with everything else in Python, assigning an array to a variable does not copy its data: it creates an alias for the original data. For example, let's create an array of ones and assign it to a variable 'first'… …then assign the value of 'first' to 'second'. If we modify an element of 'second'… …the change shows up in 'first', since the two variables are pointing at the same chunk of memory.

Notice, by the way, that when we subscript a NumPy array to select a single element, we put all the indices inside one set of square brackets, separated with commas, rather than bracketing each index separately. Also notice that, like lists, arrays are indexed from 0, not from 1.

If we really want a copy of the array so that we can make changes without affecting the original data, we can use the 'copy' method. As this example shows, we can now overwrite the data that 'second' points to without affecting the data pointed to by 'first'.

Arrays have properties as well as methods. One of the most useful is 'shape', which is a tuple of the array's size along each dimension.

Note the lack of parentheses: 'shape' is data, not a method call.

Also note that the tuple in 'shape' is exactly what we pass into functions like 'zeros' to create new arrays, which makes it easy to reproduce the shape of existing data.

Another data member is 'size', which is the total number of elements in the array. As this example shows, it is simply the product of the array's lengths along its dimensions.

We can rearrange the data in an array using the 'transpose' method, which flips the array on all its axes. But this doesn't actually move values around in memory: instead, it creates an alias that appears to have the values stored differently.

The 'ravel' method does something similar: it creates a one-dimensional alias for the original data. As you'd expect, its shape has a single value, which is the number of elements we started with.

What order do raveled values appear in? Let's start by thinking about a 2×4 array A. It looks two-dimensional… …but the computer's memory is 1-dimensional: each location is identified by a single integer address. Any program that works with multi-dimensional data must therefore decide how to lay out those values.

One possibility is row-major order, which concatenates the rows. This is what C uses, and since Python was originally written in C, it uses the same convention.

In contrast, column-major order concatenates the columns. Fortran does this, and MATLAB follows along.

There's no real difference in performance or usability… …but these differences cause headaches when data has to be moved from one programming language to another. For example, if your Python code wants to call an image processing function written in Fortran, you have to be careful about how data is ordered… …just as you have to be careful about 0-based versus 1-based indexing.

As an exercise, see if you can guess how the two types of languages store 3-dimensional or higher-dimensional data.

There are many other ways to reshape arrays. The most common is unsurprisingly called 'reshape'. Its arguments are the array's new dimensions, not a tuple of those dimensions. Once again, this aliases the data.

When we are reshaping, the new shape must have the same size as the original: we cannot add or drop elements. NumPy has this rule because it is just aliasing the existing data.

If we really want to change the physical size of the data, we have to use 'array.resize'. For example, if we resize a 3×3 array to be 2×2, we get the values that were in the first two rows and columns. This works in place, i.e., it modifies the array, rather than returning a new alias.

As an exercise, try to guess what happens when the size of the array increases, and then try out your guess.

To review: Arrays are blocks of homogeneous data. They are less flexible than lists, but operations on them are much faster. Most operations create aliases, rather than copying values. They can be reshaped in various ways (all of which leave the overall size the same)… …or resized.

In the next episode, we'll have a look at how we can select values from arrays.