Find us on GitHub

# Essays: Counting Things

The starting point for this essay is:

We have a bunch of things, and we want to know how many there are. How should we count them?

The answer depends on (a) what constitutes a "thing", and (b) how they are stored (or whether they are stored at all).

## What if my things are in a list in memory?

Easy: use `len(list_of_things)` or whatever equivalent your language provides. After all, your program must know how many things it's managing, right? So just ask it for the answer.

## But what if the program doesn't know?

How could it not know? First, in some really old languages, there is no way to ask an array at runtime for its length, even though the program knows. For example, suppose you declare:

```CONSTANT INTEGER Width 70
REAL values(Width)```

Since you know the array's size, you can loop over its elements quite easily:

```INTEGER i
REAL sum
sum = 0
DO i=1, Width
sum = sum + values(i)
CONTINUE```

But there's nothing equivalent to Python's `len` function that will let you ask an arbitrary array how big it is. In particular, if you're writing a function, and are getting arrays as parameters, you must know their length—which means you cannot write something that handles arbitrary arrays like this:

```FUNCTION total(REAL values(*))
total = 0
DO i=1, LENGTH(values)            !! does not work
total = total + values(i)
CONTINUE
END```

Why not? Because these languages only allow functions to return scalars (i.e., single values), not arrays or other multi-value structures, so there's no way to make this work:

```FUNCTION total(REAL grid(*, *)) !! sum values in two-dimensional array
INTEGER width, height, i, j
width, height = LENGTH(grid)     !! does not work!
total = 0
DO i=1, width
DO j=1, height
total = total + grid(i, j)
CONTINUE
CONTINUE
END```

This is such a pain that every language invented since the early 1960s does something smarter. One possibility is a `SIZE` function that returns an array's size along some axis:

```INTEGER width = SIZE(grid, 1)
INTEGER height = SIZE(grid, 2)```

Another is an `EXTENT` function that takes a second array as a parameter and fills it with sizes:

```INTEGER dims(2)
EXTENT(grid, dims)
!! dims(1) is now the X extent
!! dims(2) is now the Y extent```

No matter how it's presented to programmers, the underlying implementation is the same: store the array's actual size in memory at runtime along with the data it contains, rather than throwing the size away after compilation and trusting the programmer to get things right. We have to do this anyway if we're going to check that indices are in bounds (i.e., that we're not asking for element 10000 of a 10-element array), so it isn't actually an added cost.

The second situation in which a program might not know the size of something comes up in C, where a text string is stored as a sequence of bytes ending in a byte with the value 0 like this:

 h e l l o 104 101 108 108 111 0

All the program has is a pointer to the start of the string: if it wants to know the length, it has to walk through memory from that location until it finds the terminating 0. Of course, if the 0 isn't there, the answer will be wrong (and the program will probably blow up in some hard-to-find way). This design decision made (some) sense back in 1970, but has been the cause of more security bugs in the last 20 years than any other single thing, which is why all modern programming languages explicitly store the lengths of things instead.

The third case in which the program might not know the size at runtime is when values are stored in a linked list. In a linked list, every entry is a pair: one part is the value being stored, and the second is a pointer to the next element of the list. This representation makes inserting and deleting values fast and easy, but counting (and sorting) are slower and more expensive than they are with contiguous representations that store items side by side in one block of memory. As with C strings, the only way to find out how many things are in the list is to start at the front and count them. We can speed this up using a "manager" structure that stores pointers to the head and tail of the list, and a count of the number of items. The cost is a few words of memory, and the extra complexity of keeping the count in the manager in sync with the number of values actually in the list. Both are small enough that most linked list implementations do it, which makes the linked-vs-contiguous choice a straight engineering tradeoff between two desirable but mutually-exclusive sets of properties.

## How many things do I have that satisfy some condition?

Back to counting... What if we don't want to count how many things we have in total, but how many we have that satisfy some condition? The straightforward answer is a loop:

```def count_odd(values):
result = 0
for v in values:
if is_odd(v):
result += 1
return result```

It's simple, and it's easy to test, but it's slow: every time we want the answer, we have to step through all the values again. It's also inflexible: if we want to know how many values are greater than some threshold, we have to write a whole new function.

Let's tackle those problems in order. If we know the condition (or conditions) in advance, we can keep a tally as we insert and remove values:

```def change(values, location, new_value):
values[location] = new_value
if is_odd(new_value):
return 1
else:
return -1

numbers = [SIZE]
num_odd = 0
num_odd += change(numbers, 0, 31)
num_odd += change(numbers, 1, 22)
...```

This is a bad idea—a really bad idea. Sooner or later, someone is going to forget that we're tracking the number of odd values in `numbers` and overwrite an entry directly:

`numbers[5] = 55555`

Poof: our count of how many entries are odd now has a 50/50 chance of being wrong (since `numbers[5]` might have been either even or odd to start with). A better solution is to create a class that does the counting for us:

```class OddCountList(object):

def __init__(self):
self.data = []
self.num_odd = 0

def __setitem__(self, i, v):
if is_odd(self.data[i]):
self.num_odd -= 1
if is_odd(v):
self.num_odd += 1
self.data[i] = v```

(If I wanted to write this class properly, I'd derive it from Python's built-in `list` class, but this hack is enough to get hte idea across.) The `__setitem__` method is called whenever the program assigns to `X[i]`, where `X` is an instance of this class. It subtracts one from the odd count if we're overwriting an odd number, adds one to the count if the new value is odd, and then saves the new value. Nobody has to remember anything: once they've created an `OddCountList` instead of a plain old list, the object does the right thing every time.

All right, but what if we want to count the number of values greater than some threshold, rather than the number that are odd? It would be easy enough to define another class, and then another to count the number of negative values, and so on, but there's a better way. As we explained in the lecture on first-class functions, most modern languages make it easy to pass a function as a parameter to another function, or store a reference to a function in a data structure. So let's modify our class:

```class CountingList(object):

def __init__(self, condition):
self.data = []
self.num = 0
self.condition = condition

def __setitem__(self, i, v):
if self.condition(self.data[i]):
self.num -= 1
if self.condition(v):
self.num += 1
self.data[i] = v```

If we want to keep track of odd numbers, it's just:

```def is_odd(v): return (v % 2) == 1

numbers = CountingList(is_odd)```

If we want to keep track of numbers greater than 100, it's:

```def is_greater_than_100(v): return v > 100

numbers = CountingList(is_greater_than_100)```

and so on. Separating the logic that does the counting from the test to see what should be counted or not lets us recycle the former: we only ever have to write it once.

## What if my things are in a file instead of in memory?

Easy: read the file into a list and count that.

## What if I have so many things that the won't fit into memory?

Good question. Data sets are getting larger every day (literally); gigabytes are commonplace, and terabytes are no longer even newsworthy, so scientists increasingly have to think about what to do when things won't fit into RAM. In this situation, the answer is straightforward: count items one by one as they're read:

```def count_in_stream(stream, condition):
result = 0
for value in stream:
if condition(value):
result += 1
return result

def count_in_file(filename, condition):
return result```

This is just like keeping a tally of the number of items that satisfy a condition as we write them into a list, except we're never storing the items: we just keep the tally.

## What if I'm counting the number of items of each of a set of fixed types?

Reading from a disk is tens of thousands of times slower than reading from memory, and if my data stream is coming live from a satellite or some other piece of experimental apparatus, I may not actually be able to read it a second time. If I want to count multiple things, I therefore have to count them all at once. If I know what they are in advance, I can do this:

```def count_in_stream(stream, conditions):
results = [0] * len(conditions)
for value in stream:
for i in range(len(conditions)):
c = conditions[i]
if c(value):
results[i] += 1
return results```

Here, `conditions` is a list of functions, each of which tests for one thing. `len(conditions) is the number of conditions we're given, so results is initialized to a list of that many zeroes. Each time we read a new value from the stream, we pass it to each checking function in turn, and increment the corresponding count if the checking function says "yes". Here's an example of how we'd use it:`

```def is_odd(v): return (v % 2) == 1

def is_greater_than_100(v): return v > 100

def is_negative(v): return v < 0

tests = [is_odd, is_greater_than_100, is_negative]

odds, greaters, negatives = count_in_stream(reader, tests)```

If we have written this, should we keep the earlier version (that only handles one test at a time) in our program? It's redundant: for any given test function, we can get the same effect with:

`num = count_in_stream(reader, [some_single_test])[0]`

Or should we keep it around as a convenience, like this:

`num = count_in_stream_single(reader, some_single_test)`

If we do the latter (which I would, because it's a common case that's more readable in its original form), it's really important to write the function's body so that it's just a wrapper around the general case:

```def count_in_stream_single(reader, single_test):

Remember, duplicated code is bad code: anything that appears in two or more places in a program will eventually be wrong in at least one.

Note for nerds: in Python, Ruby, and other dynamically-typed languages, we don't actually have to write a separate function to handle the special case. Instead, we can test the type of the function's second parameter and wrap it up if necessary:

```def count_in_stream(reader, test_or_tests):

# We have been passed a single function.
if is_callable(test_or_tests):

# We have been passed a list, so do exactly what we did before.
elif is_list(test_or_tests):
results = [0] * len(test_or_tests)
for value in stream:
for i in range(len(test_or_tests)):
c = test_or_tests[i]
if c(value):
results[i] += 1
return results

# Shouldn't ever get here.
else:
assert False, "Don't know what to do with test_or_tests"```

If `test_or_tests` is a single function, we wrap it up in a list, pass that list to `count_in_stream`, and return the zero'th element of the result to our caller. If it's already a list, we do what we did in the previous version: create a list of counts, then loop over the conditions for each value we read. And if it's any other type, we complain.

And yes, this means that `count_in_stream` is calling `count_in_stream`. It's called recursion, and it's really a lot simpler than some people make it out to be. What we've done is squeeze two logically separate things—handling a single test, and handling multiple tests at once—into a single function. If we're asked to do the special case, we just ask another runtime copy of ourself to treat it like the general case. We could do this:

```def count_in_stream(reader, test_or_tests):

# Do we have a single function?
if is_callable(test_or_tests):
wrapped_up = [test_or_tests]
unwrap = True
# Or do we have a list of functions?
else:
wrapped_up = test_or_tests # already a list!
unwrap = False

intermediate_result = ...all the general-case code goes here...

if unwrap:
return intermediate_result[0] # extract single value from list
else:
return intermediate_result    # give back all results```

This works, but I think the recursive version is a lot easier to read. It's certainly a lot less fragile: setting a condition at the start of the function in order to do something at the end requires us to keep long-distance interactions straight in our head while we're programming, and human beings are really bad at that...

There are still lots of interesting common cases to discuss—counting up to a marker, only counting within regions, counting in trees and graphs, counting in databases—but this seems like a good point to wind down what has turned into a rather long post. We'll come back and look at those other cases next time.

The first part of this series on counting looked at:

1. lists in memory,
2. things that satisfy some condition,
3. things in files,
4. streams, and
5. counting by type.

Let's explore a few more complicated cases.

## What if I only want to count things up to some stop sign?

Suppose you want to count how many people you talked to before you found one who was left-handed. The first (inefficient) version practically writes itself:

```def count_before(values):
num = 0
before = True
for v in values:
if v == THE_STOP_SIGN:
before = False
elif before:
num += 1
return num```

The variable `before` is a one-way switch. It starts in one state (`True`), and its value is only ever changed once. While it is true, we're counting; when it's not, we're not.

This code does the right thing, but there are two things wrong with it. First, it only looks for one stop sign (the one with the value `THE_STOP_SIGN`). We can generalize this function by passing in the value that signals the end of interesting data—the change is highlighted below:

```def count_before(values, stop_at):
num = 0
before = True
for v in values:
if v == stop_at:
before = False
elif before:
num += 1
return num```

Second, this function is inefficient: if the stop sign happens to be the third value out of a million, it will still look at the other 999,996. A purist would solve this by changing the loop condition so that we only keep looking at values while we haven't seen the stop sign. Let's do that in stages. First, we'll change the `for` loop to a `while` loop:

```def count_before(values, stop_at):
num = 0
before = True
i = 0
while i < len(values):
v = values[i]
if v == stop_at:
before = False
elif before:
num += 1
i += 1
return num```

Next, we'll change the condition on the loop so that the loop will stop executing once we have seen the stop sign:

```def count_before(values, stop_at):
num = 0
before = True
i = 0
while before and i < len(values):
v = values[i]
if v == stop_at:
before = False
elif before:
num += 1
i += 1
return num```

All right, the function isn't doing any unnecessary work. But can we have the best of both worlds? Can we use a `for` loop (which is easier to read, and also stop at the right time? The answer is "yes", and here's the code:

```def count_before(values, stop_at):
num = 0
for v in values:
if v == stop_at:
break

elif before:
num += 1
return num```

The trick is to use `break` to jump out of the loop at the right time. If we do this, we don't need the `before` variable any longer: as soon as we see the stop sign, we jump out of the loop and go down to the `return` statement. It's an elegant solution...

...but it can easily be abused. When someone reads a program, the condition that controls a loop is obvious: it's right there at the start of the loop. If it's a `while` loop, the condition is some logical expression; if it's a `for` loop, the control is "for every value in some collection". If there are `break` statements inside the loop, though, those conditions are only part of the story: there are other ways for the loop to stop, or equivalently, the data that's actually processed may only be a subset of the data that's apparently processed looking only at the loop control. The larger the body of the loop, and the more deeply nested the `break` is, the easier it will be for people (including you) to misunderstand what the code actually does.

In this case, I think that using `break` actually makes the program easier to understand. In other cases, I would use something like the `before` variable, and change its value inside the loop body, so that the entire stopping condition was stated explicitly in the loop control.

## What if my test depends on context?

An obvious generalization of the problem above is to count the number of things that occur in regions of the data when the data may contain any number of regions. (The case above is just a single region running from the start of data to the stop sign.) We'll assume for the moment that the regions of interest do not overlap, i.e., any item is either in a single region or not, and no item can ever be in two or more regions. We'll also assume that there are explicit "start of region" and "end of region" markers. In the case above, the region implicitly started at the beginning of the data, and ended either when we saw an explicit stop sign, or implicitly when we ran out of data and there hadn't been a stop sign at all. (I almost always forget to test for the second case.) Here's some code:

```def count_within(values, start, stop):
num = 0
counting = False
for v in values:
if v == start:
assert not counting, 'Saw a start marker while already counting'
counting = True
elif v == stop:
assert counting, 'Saw a stop marker outside a region'
counting = False
elif counting:
num += 1
return num```

Here, `counting` can switch back and forth between `True` and `False` any number of times. When it's `True`, the only legal possibilities are something to be counted, or the end-of-region marker. When it's `False`, all we can legally see is a start-of-region marker or something we don't care about (which is why the `if` has three branches rather than four: since we don't care about it, we don't need to include an `elif` or `else` to handle it).

Here's a variation on the same function that contains a common bug:

```def wrong_bad_buggy_count_within(values, start, stop):
num = 0
counting = False
for v in values:
if counting:
num += 1
elif v == start:
assert not counting, 'Saw a start marker while already counting'
counting = True
elif v == stop:
assert counting, 'Saw a stop marker outside a region'
counting = False
return num```

All we've done is move the check to see if we're in the region to the front of the `if`/`elif` tests. It seems innocuous, but look at what happens when we count values between "start" and "stop" in a list of strings:

```>>> test = ["no", "no", "start", "yes", "yes", "yes", "stop", "no"]
5```

Why did we get 5 instead of 3? Because we never stop counting: on the seventh trip around the loop, `counting` is `True`, so we increment `num` without checking to see if the value `v` is our stop sign.

## What if my things are nested?

Let's back up a bit and go down a different rabbit hole. Suppose we want to represent the following files and directories inside a program (where a trailing "/" indicates that something is a directory):

```/home/aturing/
+- myparams.cfg
+- program.py
+- data/
+- sample_1.dat
+- sample_2.dat
+- papers/
+- nature/
+- nature.doc
+- fig1.png
+- science/
+- science.doc
+- fig1.png
+- fig1.png```

We could do it with a list of lists (of lists, of lists...) like this:

```file_tree = [
"/home/aturing/",
"myparams.cfg",
"program.py",
["data/",
"sample_1.dat",
"sample_2.dat"],
["papers/",
["nature/",
"nature.doc",
"fig1.png"],
["science/",
"science.doc",
"fig1.png",
"fig2.png"]
]
]```

Directories are represented as lists, with the directory's name as the first item, and files are represented as strings. Given this, how can we count the number of DOC or PNG files to be found at all levels? Here's a simple solution:

```def count_files(files_and_directories, suffix):
count = 0
for entry in files_and_directories:

# If this is a file...
if is_string(entry):

# ...check whether it ends with the right thing.
if entry.endswith(suffix):
count += 1

# Otherwise, if this is a directory...
elif is_list(entry):

# FIXME: figure out what to do

# If it's neither a string or a list, something has gone wrong.
else:
assert False, 'Unrecognized entry'

# Report
return count```

Files are easy enough: we just check to see if they have the right ending (like `'.doc'` or `'.png'`). But what about lists? Wouldn't it be nice if we had a function that could process them.

Well, we do: it's called `count_files`. When we find a sub-list inside `files_and_directories`, we can call `count_files` again `count_files`. It doesn't matter that we're calling it from within `count_files`: that is no weirder than sending yourself email. Here's the modified code:

```def count_files(files_and_directories, suffix):
count = 0
for entry in files_and_directories:

# If this is a file...
if is_string(entry):

# ...check whether it ends with the right thing.
if entry.endswith(suffix):
count += 1

# Otherwise, if this is a directory...
elif is_list(entry):

# ...recurse.
temp = count_files(entry, suffix)
count += temp

# If it's neither a string or a list, something has gone wrong.
else:
assert False, 'Unrecognized entry'

# Report
return count```

That's it: we're done. Or almost done—as it's written right now, this function checks the first element of each list it looks at to see if it matches `suffix`, even though the first element is always the name of a directory rather than the name of a file. We'll leave it as an exercise to modify this function so that it skips the first element of each list it looks at, but you might also want to think about what `count_files([...], '/')` does, and whether it's (a) useful and (b) morally defensible.

## What if my data is in an irregular structure?

Via Wikipedia, here's a genealogical chart showing the ancestry of Charles II of Spain:

Here's one of many ways to represent this in a program:

```parentage = {
'Charles II of Spain' : ('Mariana of Austria',
'Philip IV of Spain'),

'Mariana of Austria'  : ('Maria Anna of Spain',
'Ferdinand III'),

'Philip IV of Spain'  : ('Margarita of Austria',
'Philip III of Spain'),

'Maria Ana of Spain'  : ('Margarita of Austria',
'Philip III of Spain'),

...
}```

This is a dictionary: the keys are people's names, and the values are the names of their mothers and fathers. How many distinct male ancestors does Ferdinand II have? (The word "distinct" is important because Ferdinand I is both his grandfather and his great-grandfather.) Here's our first pass:

```to_search = {'Ferdinand II'}
seen = {}
num = 0
while to_search:
current = to_search.pop()
mother, father = parentage[current]

if mother not in seen:
This code maintains two sets: one of people that haven't been looked at yet (`to_search`), and another of people that have been looked at (`seen`). Each time around the loop, it takes one person at random from `to_search` (that's the `pop` call at the start of the loop body) and fetches that person's mother and father from the `parentage` dictionary. If the mother's ancestry hasn't already been examined, she goes into `to_search`; if it has, we don't add her, because we don't want to over-count. We treat fathers the same way, except the first time we see one, we increment our count of the number of distinct males we've encountered.
We've really only just scratched the surface in this essay. We could look at counting things in a database using `WHERE`, `GROUP BY`, and `COUNT`, or using the Iterator and Visitor patterns to count things, but we'll stop here and save those for later.