Find us on GitHub

Teaching basic lab skills
for research computing

Essays: Counting Things

The starting point for this essay is:

We have a bunch of things, and we want to know how many there are. How should we count them?

The answer depends on (a) what constitutes a "thing", and (b) how they are stored (or whether they are stored at all).

What if my things are in a list in memory?

Easy: use len(list_of_things) or whatever equivalent your language provides. After all, your program must know how many things it's managing, right? So just ask it for the answer.

But what if the program doesn't know?

How could it not know? First, in some really old languages, there is no way to ask an array at runtime for its length, even though the program knows. For example, suppose you declare:

CONSTANT INTEGER Width 70
REAL values(Width)

Since you know the array's size, you can loop over its elements quite easily:

INTEGER i
REAL sum
sum = 0
DO i=1, Width
  sum = sum + values(i)
CONTINUE

But there's nothing equivalent to Python's len function that will let you ask an arbitrary array how big it is. In particular, if you're writing a function, and are getting arrays as parameters, you must know their length—which means you cannot write something that handles arbitrary arrays like this:

FUNCTION total(REAL values(*))
  total = 0
  DO i=1, LENGTH(values)            !! does not work
      total = total + values(i)
  CONTINUE
END

Why not? Because these languages only allow functions to return scalars (i.e., single values), not arrays or other multi-value structures, so there's no way to make this work:

FUNCTION total(REAL grid(*, *)) !! sum values in two-dimensional array
  INTEGER width, height, i, j
  width, height = LENGTH(grid)     !! does not work!
  total = 0
  DO i=1, width
      DO j=1, height
          total = total + grid(i, j)
      CONTINUE
  CONTINUE
END

This is such a pain that every language invented since the early 1960s does something smarter. One possibility is a SIZE function that returns an array's size along some axis:

INTEGER width = SIZE(grid, 1)
INTEGER height = SIZE(grid, 2)

Another is an EXTENT function that takes a second array as a parameter and fills it with sizes:

INTEGER dims(2)
EXTENT(grid, dims)
!! dims(1) is now the X extent
!! dims(2) is now the Y extent

No matter how it's presented to programmers, the underlying implementation is the same: store the array's actual size in memory at runtime along with the data it contains, rather than throwing the size away after compilation and trusting the programmer to get things right. We have to do this anyway if we're going to check that indices are in bounds (i.e., that we're not asking for element 10000 of a 10-element array), so it isn't actually an added cost.

The second situation in which a program might not know the size of something comes up in C, where a text string is stored as a sequence of bytes ending in a byte with the value 0 like this:

h e l l o
104 101 108 108 111 0

All the program has is a pointer to the start of the string: if it wants to know the length, it has to walk through memory from that location until it finds the terminating 0. Of course, if the 0 isn't there, the answer will be wrong (and the program will probably blow up in some hard-to-find way). This design decision made (some) sense back in 1970, but has been the cause of more security bugs in the last 20 years than any other single thing, which is why all modern programming languages explicitly store the lengths of things instead.

The third case in which the program might not know the size at runtime is when values are stored in a linked list. In a linked list, every entry is a pair: one part is the value being stored, and the second is a pointer to the next element of the list. This representation makes inserting and deleting values fast and easy, but counting (and sorting) are slower and more expensive than they are with contiguous representations that store items side by side in one block of memory. As with C strings, the only way to find out how many things are in the list is to start at the front and count them. We can speed this up using a "manager" structure that stores pointers to the head and tail of the list, and a count of the number of items. The cost is a few words of memory, and the extra complexity of keeping the count in the manager in sync with the number of values actually in the list. Both are small enough that most linked list implementations do it, which makes the linked-vs-contiguous choice a straight engineering tradeoff between two desirable but mutually-exclusive sets of properties.

How many things do I have that satisfy some condition?

Back to counting... What if we don't want to count how many things we have in total, but how many we have that satisfy some condition? The straightforward answer is a loop:

def count_odd(values):
  result = 0
  for v in values:
      if is_odd(v):
          result += 1
  return result

It's simple, and it's easy to test, but it's slow: every time we want the answer, we have to step through all the values again. It's also inflexible: if we want to know how many values are greater than some threshold, we have to write a whole new function.

Let's tackle those problems in order. If we know the condition (or conditions) in advance, we can keep a tally as we insert and remove values:

def change(values, location, new_value):
  values[location] = new_value
  if is_odd(new_value):
      return 1
  else:
      return -1

numbers = [SIZE]
num_odd = 0
num_odd += change(numbers, 0, 31)
num_odd += change(numbers, 1, 22)
...

This is a bad idea—a really bad idea. Sooner or later, someone is going to forget that we're tracking the number of odd values in numbers and overwrite an entry directly:

numbers[5] = 55555

Poof: our count of how many entries are odd now has a 50/50 chance of being wrong (since numbers[5] might have been either even or odd to start with). A better solution is to create a class that does the counting for us:

class OddCountList(object):

  def __init__(self):
      self.data = []
      self.num_odd = 0

  def __setitem__(self, i, v):
      if is_odd(self.data[i]):
          self.num_odd -= 1
      if is_odd(v):
          self.num_odd += 1
      self.data[i] = v

(If I wanted to write this class properly, I'd derive it from Python's built-in list class, but this hack is enough to get hte idea across.) The __setitem__ method is called whenever the program assigns to X[i], where X is an instance of this class. It subtracts one from the odd count if we're overwriting an odd number, adds one to the count if the new value is odd, and then saves the new value. Nobody has to remember anything: once they've created an OddCountList instead of a plain old list, the object does the right thing every time.

All right, but what if we want to count the number of values greater than some threshold, rather than the number that are odd? It would be easy enough to define another class, and then another to count the number of negative values, and so on, but there's a better way. As we explained in the lecture on first-class functions, most modern languages make it easy to pass a function as a parameter to another function, or store a reference to a function in a data structure. So let's modify our class:

class CountingList(object):

  def __init__(self, condition):
      self.data = []
      self.num = 0
      self.condition = condition

  def __setitem__(self, i, v):
      if self.condition(self.data[i]):
          self.num -= 1
      if self.condition(v):
          self.num += 1
      self.data[i] = v

If we want to keep track of odd numbers, it's just:

def is_odd(v): return (v % 2) == 1

numbers = CountingList(is_odd)

If we want to keep track of numbers greater than 100, it's:

def is_greater_than_100(v): return v > 100

numbers = CountingList(is_greater_than_100)

and so on. Separating the logic that does the counting from the test to see what should be counted or not lets us recycle the former: we only ever have to write it once.

What if my things are in a file instead of in memory?

Easy: read the file into a list and count that.

What if I have so many things that the won't fit into memory?

Good question. Data sets are getting larger every day (literally); gigabytes are commonplace, and terabytes are no longer even newsworthy, so scientists increasingly have to think about what to do when things won't fit into RAM. In this situation, the answer is straightforward: count items one by one as they're read:

def count_in_stream(stream, condition):
  result = 0
  for value in stream:
      if condition(value):
          result += 1
  return result

def count_in_file(filename, condition):
  reader = open(filename, 'r')
  result = count_in_stream(reader)
  reader.close()
  return result

This is just like keeping a tally of the number of items that satisfy a condition as we write them into a list, except we're never storing the items: we just keep the tally.

What if I'm counting the number of items of each of a set of fixed types?

Reading from a disk is tens of thousands of times slower than reading from memory, and if my data stream is coming live from a satellite or some other piece of experimental apparatus, I may not actually be able to read it a second time. If I want to count multiple things, I therefore have to count them all at once. If I know what they are in advance, I can do this:

def count_in_stream(stream, conditions):
  results = [0] * len(conditions)
  for value in stream:
      for i in range(len(conditions)):
          c = conditions[i]
          if c(value):
              results[i] += 1
  return results

Here, conditions is a list of functions, each of which tests for one thing. len(conditions) is the number of conditions we're given, so results is initialized to a list of that many zeroes. Each time we read a new value from the stream, we pass it to each checking function in turn, and increment the corresponding count if the checking function says "yes". Here's an example of how we'd use it:

def is_odd(v): return (v % 2) == 1

def is_greater_than_100(v): return v > 100

def is_negative(v): return v < 0

tests = [is_odd, is_greater_than_100, is_negative]

reader = space_probe_reader('XR211') # connect to space probe data stream
odds, greaters, negatives = count_in_stream(reader, tests)

If we have written this, should we keep the earlier version (that only handles one test at a time) in our program? It's redundant: for any given test function, we can get the same effect with:

num = count_in_stream(reader, [some_single_test])[0]

Or should we keep it around as a convenience, like this:

num = count_in_stream_single(reader, some_single_test)

If we do the latter (which I would, because it's a common case that's more readable in its original form), it's really important to write the function's body so that it's just a wrapper around the general case:

def count_in_stream_single(reader, single_test):
  return count_in_stream(reader, [single_test])[0]

Remember, duplicated code is bad code: anything that appears in two or more places in a program will eventually be wrong in at least one.

Note for nerds: in Python, Ruby, and other dynamically-typed languages, we don't actually have to write a separate function to handle the special case. Instead, we can test the type of the function's second parameter and wrap it up if necessary:

def count_in_stream(reader, test_or_tests):

  # We have been passed a single function.
  if is_callable(test_or_tests):
      return count_in_stream(reader, [test_or_tests])[0]

  # We have been passed a list, so do exactly what we did before.
  elif is_list(test_or_tests):
      results = [0] * len(test_or_tests)
      for value in stream:
          for i in range(len(test_or_tests)):
              c = test_or_tests[i]
              if c(value):
                  results[i] += 1
      return results

  # Shouldn't ever get here.
  else:
      assert False, "Don't know what to do with test_or_tests"

If test_or_tests is a single function, we wrap it up in a list, pass that list to count_in_stream, and return the zero'th element of the result to our caller. If it's already a list, we do what we did in the previous version: create a list of counts, then loop over the conditions for each value we read. And if it's any other type, we complain.

And yes, this means that count_in_stream is calling count_in_stream. It's called recursion, and it's really a lot simpler than some people make it out to be. What we've done is squeeze two logically separate things—handling a single test, and handling multiple tests at once—into a single function. If we're asked to do the special case, we just ask another runtime copy of ourself to treat it like the general case. We could do this:

def count_in_stream(reader, test_or_tests):

  # Do we have a single function?
  if is_callable(test_or_tests):
      wrapped_up = [test_or_tests]
      unwrap = True
  # Or do we have a list of functions?
  else:
      wrapped_up = test_or_tests # already a list!
      unwrap = False

  intermediate_result = ...all the general-case code goes here...

  if unwrap:
      return intermediate_result[0] # extract single value from list
  else:
      return intermediate_result    # give back all results

This works, but I think the recursive version is a lot easier to read. It's certainly a lot less fragile: setting a condition at the start of the function in order to do something at the end requires us to keep long-distance interactions straight in our head while we're programming, and human beings are really bad at that...

There are still lots of interesting common cases to discuss—counting up to a marker, only counting within regions, counting in trees and graphs, counting in databases—but this seems like a good point to wind down what has turned into a rather long post. We'll come back and look at those other cases next time.


The first part of this series on counting looked at:

  1. lists in memory,
  2. things that satisfy some condition,
  3. things in files,
  4. streams, and
  5. counting by type.

Let's explore a few more complicated cases.

What if I only want to count things up to some stop sign?

Suppose you want to count how many people you talked to before you found one who was left-handed. The first (inefficient) version practically writes itself:

def count_before(values):
  num = 0
  before = True
  for v in values:
      if v == THE_STOP_SIGN:
          before = False
      elif before:
          num += 1
  return num

The variable before is a one-way switch. It starts in one state (True), and its value is only ever changed once. While it is true, we're counting; when it's not, we're not.

This code does the right thing, but there are two things wrong with it. First, it only looks for one stop sign (the one with the value THE_STOP_SIGN). We can generalize this function by passing in the value that signals the end of interesting data—the change is highlighted below:

def count_before(values, stop_at):
  num = 0
  before = True
  for v in values:
      if v == stop_at:
          before = False
      elif before:
          num += 1
  return num

Second, this function is inefficient: if the stop sign happens to be the third value out of a million, it will still look at the other 999,996. A purist would solve this by changing the loop condition so that we only keep looking at values while we haven't seen the stop sign. Let's do that in stages. First, we'll change the for loop to a while loop:

def count_before(values, stop_at):
  num = 0
  before = True
    i = 0
  while i < len(values):
      v = values[i]
      if v == stop_at:
          before = False
      elif before:
          num += 1
      i += 1
  return num

Next, we'll change the condition on the loop so that the loop will stop executing once we have seen the stop sign:

def count_before(values, stop_at):
  num = 0
  before = True
  i = 0
  while before and i < len(values):
      v = values[i]
      if v == stop_at:
          before = False
      elif before:
          num += 1
      i += 1
  return num

All right, the function isn't doing any unnecessary work. But can we have the best of both worlds? Can we use a for loop (which is easier to read, and also stop at the right time? The answer is "yes", and here's the code:

def count_before(values, stop_at):
  num = 0
  for v in values:
      if v == stop_at:
          break

      elif before:
          num += 1
  return num

The trick is to use break to jump out of the loop at the right time. If we do this, we don't need the before variable any longer: as soon as we see the stop sign, we jump out of the loop and go down to the return statement. It's an elegant solution...

...but it can easily be abused. When someone reads a program, the condition that controls a loop is obvious: it's right there at the start of the loop. If it's a while loop, the condition is some logical expression; if it's a for loop, the control is "for every value in some collection". If there are break statements inside the loop, though, those conditions are only part of the story: there are other ways for the loop to stop, or equivalently, the data that's actually processed may only be a subset of the data that's apparently processed looking only at the loop control. The larger the body of the loop, and the more deeply nested the break is, the easier it will be for people (including you) to misunderstand what the code actually does.

In this case, I think that using break actually makes the program easier to understand. In other cases, I would use something like the before variable, and change its value inside the loop body, so that the entire stopping condition was stated explicitly in the loop control.

What if my test depends on context?

An obvious generalization of the problem above is to count the number of things that occur in regions of the data when the data may contain any number of regions. (The case above is just a single region running from the start of data to the stop sign.) We'll assume for the moment that the regions of interest do not overlap, i.e., any item is either in a single region or not, and no item can ever be in two or more regions. We'll also assume that there are explicit "start of region" and "end of region" markers. In the case above, the region implicitly started at the beginning of the data, and ended either when we saw an explicit stop sign, or implicitly when we ran out of data and there hadn't been a stop sign at all. (I almost always forget to test for the second case.) Here's some code:

def count_within(values, start, stop):
  num = 0
  counting = False
  for v in values:
      if v == start:
          assert not counting, 'Saw a start marker while already counting'
          counting = True
      elif v == stop:
          assert counting, 'Saw a stop marker outside a region'
          counting = False
      elif counting:
          num += 1
  return num

Here, counting can switch back and forth between True and False any number of times. When it's True, the only legal possibilities are something to be counted, or the end-of-region marker. When it's False, all we can legally see is a start-of-region marker or something we don't care about (which is why the if has three branches rather than four: since we don't care about it, we don't need to include an elif or else to handle it).

Here's a variation on the same function that contains a common bug:

def wrong_bad_buggy_count_within(values, start, stop):
  num = 0
  counting = False
  for v in values:
        if counting:
          num += 1
      elif v == start:
          assert not counting, 'Saw a start marker while already counting'
          counting = True
      elif v == stop:
          assert counting, 'Saw a stop marker outside a region'
          counting = False
  return num

All we've done is move the check to see if we're in the region to the front of the if/elif tests. It seems innocuous, but look at what happens when we count values between "start" and "stop" in a list of strings:

>>> test = ["no", "no", "start", "yes", "yes", "yes", "stop", "no"]
>>> print wrong_bad_buggy_count_within(test, "start", "stop")
5

Why did we get 5 instead of 3? Because we never stop counting: on the seventh trip around the loop, counting is True, so we increment num without checking to see if the value v is our stop sign.

What if my things are nested?

Let's back up a bit and go down a different rabbit hole. Suppose we want to represent the following files and directories inside a program (where a trailing "/" indicates that something is a directory):

/home/aturing/
+- myparams.cfg
+- program.py
+- data/
 +- sample_1.dat
 +- sample_2.dat
+- papers/
 +- nature/
    +- nature.doc
    +- fig1.png
 +- science/
    +- science.doc
    +- fig1.png
    +- fig1.png

We could do it with a list of lists (of lists, of lists...) like this:

file_tree = [
  "/home/aturing/",
  "myparams.cfg",
  "program.py",
  ["data/",
   "sample_1.dat",
   "sample_2.dat"],
  ["papers/",
   ["nature/",
    "nature.doc",
    "fig1.png"],
   ["science/",
    "science.doc",
    "fig1.png",
    "fig2.png"]
  ]
]

Directories are represented as lists, with the directory's name as the first item, and files are represented as strings. Given this, how can we count the number of DOC or PNG files to be found at all levels? Here's a simple solution:

def count_files(files_and_directories, suffix):
  count = 0
  for entry in files_and_directories:

      # If this is a file...
      if is_string(entry):

          # ...check whether it ends with the right thing.
          if entry.endswith(suffix):
              count += 1

      # Otherwise, if this is a directory...
      elif is_list(entry):

          # FIXME: figure out what to do

      # If it's neither a string or a list, something has gone wrong.
      else:
          assert False, 'Unrecognized entry'

  # Report
  return count

Files are easy enough: we just check to see if they have the right ending (like '.doc' or '.png'). But what about lists? Wouldn't it be nice if we had a function that could process them.

Well, we do: it's called count_files. When we find a sub-list inside files_and_directories, we can call count_files again count_files. It doesn't matter that we're calling it from within count_files: that is no weirder than sending yourself email. Here's the modified code:

def count_files(files_and_directories, suffix):
  count = 0
  for entry in files_and_directories:

      # If this is a file...
      if is_string(entry):

          # ...check whether it ends with the right thing.
          if entry.endswith(suffix):
              count += 1

      # Otherwise, if this is a directory...
      elif is_list(entry):

            # ...recurse.
          temp = count_files(entry, suffix)
          count += temp

      # If it's neither a string or a list, something has gone wrong.
      else:
          assert False, 'Unrecognized entry'

  # Report
  return count

That's it: we're done. Or almost done—as it's written right now, this function checks the first element of each list it looks at to see if it matches suffix, even though the first element is always the name of a directory rather than the name of a file. We'll leave it as an exercise to modify this function so that it skips the first element of each list it looks at, but you might also want to think about what count_files([...], '/') does, and whether it's (a) useful and (b) morally defensible.

What if my data is in an irregular structure?

Via Wikipedia, here's a genealogical chart showing the ancestry of Charles II of Spain:

Ancestry of Charles II of Spain

Here's one of many ways to represent this in a program:

parentage = {
  'Charles II of Spain' : ('Mariana of Austria',
                           'Philip IV of Spain'),

  'Mariana of Austria'  : ('Maria Anna of Spain',
                           'Ferdinand III'),

  'Philip IV of Spain'  : ('Margarita of Austria',
                           'Philip III of Spain'),

  'Maria Ana of Spain'  : ('Margarita of Austria',
                           'Philip III of Spain'),

  ...
}

This is a dictionary: the keys are people's names, and the values are the names of their mothers and fathers. How many distinct male ancestors does Ferdinand II have? (The word "distinct" is important because Ferdinand I is both his grandfather and his great-grandfather.) Here's our first pass:

to_search = {'Ferdinand II'}
seen = {}
num = 0
while to_search:
  current = to_search.pop()
  mother, father = parentage[current]

  if mother not in seen:
      to_search.add(mother)
  seen.add(mother)

  if father not in seen:
      to_search.add(father)
      num += 1
  seen.add(father)

This code maintains two sets: one of people that haven't been looked at yet (to_search), and another of people that have been looked at (seen). Each time around the loop, it takes one person at random from to_search (that's the pop call at the start of the loop body) and fetches that person's mother and father from the parentage dictionary. If the mother's ancestry hasn't already been examined, she goes into to_search; if it has, we don't add her, because we don't want to over-count. We treat fathers the same way, except the first time we see one, we increment our count of the number of distinct males we've encountered.

How else could we count?

We've really only just scratched the surface in this essay. We could look at counting things in a database using WHERE, GROUP BY, and COUNT, or using the Iterator and Visitor patterns to count things, but we'll stop here and save those for later.