Tuesday, June 25, 2013

Much ado about monoids

I was inspired by Tim Docker's recent illuminating talk on Data analysis with monoids at Syd-FP, to look closer at his approach. He used Haskell for his implementation, how would things work out if I used Scala for mine?
Note: I gave a presentation on this subject at ScalaSyd; slides are available here.
Here's the idea. We're given a CSV file of stock exchange data, looking something like this:

20091230,GOOG,618.07,622.73,618.01,622.73,14663
20091231,GOOG,624.86,625.4,619.98,619.98,12202
20100104,GOOG,627.46,629.51,624.24,626.75,19579
20100105,GOOG,627.84,627.84,621.54,623.99,30078
20100106,GOOG,625.08,625.86,606.36,608.26,39806
...


The columns here are date, ticker symbol, opening price, highest price in the course of the day, lowest price, closing price, and volume of shares traded in the day.

We want to perform some analysis on this data, say filter by ticker symbol, and output traded price range and count of shares traded, month by month. And it's a big file, so we want to do this in a single pass.

Lets have a look at how we might approach this using a functional style in Scala, stopping in for a scenic tour of some useful concepts along the way, to give a hint of the possibilities inherent in this approach.

Monoids (and semigroups)


We're going to need the concept of a monoid for what follows. Let's take a look.
Look at the following examples, and notice what's similar in each case.
  1. scala> (1 + 2) + 3 == 1 + (2 + 3)
    res0: Boolean = true
    
    scala> 9 + 0
    res1: Int = 9
    
    scala> 0 + 9
    res2: Int = 9
    

  2. scala> ("philip" + "k") + "dick" == "philip" + ("k" + "dick")
    res1: Boolean = true
    
    scala> "something" + ""
    res2: Boolean = "something"
    
    scala> "" + "something"
    res3: Boolean = "something"
    

  3. scala> BigInt(30) gcd (BigInt(42) gcd BigInt(70))
    res0: scala.math.BigInt = 2
    
    scala> (BigInt(30) gcd (BigInt(42)) gcd BigInt(70)
    res1: scala.math.BigInt = 2
    
    scala> BigInt(345) gcd BigInt(0)
    res2: scala.math.BigInt = 345
    
    scala> BigInt(0) gcd BigInt(345)
    res3: scala.math.BigInt = 345
    
What's similar in each case? In each case we've got
  1. a way of sticking two things of the same type together to get something else of that type: this is the operation of addition of integers, concatenation of strings, or greatest common divisor,
  2. we've got an element that doesn't do anything when you stick it onto anything (from either side),
  3. and we have the property that if we're sticking together a few things, it doesn't matter where we put the brackets, we still end up with the same result.
    In other words, the final result is independent of the way we "associate" the elements together, providing we keep them in the same order. This property is called "associativity".
Monoids are an abstraction of this common pattern, and the above paragraph is the definition of a monoid. The neutral element is usually called the "unit" of the monoid. If we don't require a unit, then we say we just have a "semigroup" rather than a monoid.

So, where does the name come from? Well, "mono" stands for "one", because it's a gadget with only one operation. And Nicholas Bourbaki, who came up with the name and popularised the concept, liked names that end in "-oid", I guess.

Monoids and semigroups are ubiquitous


Once you grasp the general idea, you start seeing them everywhere. Let's think of some more examples!
  • + for Int, Short, Long Double, Float, BigInt and BigDecimal with 0 as the unit
  • * for Int, Short, Long, Double, Float, BigInt and BigDecimal with 1 as the unit
  • || for Boolean with false as the unit
  • && for Boolean with true as the unit
  • greatest common divisor (for Int, Short, Long or BigInt), with 0 as the unit
  • lowest common multiple (for Int, Short, Long or BigInt), with 1 as the unit
  • maximum of two numbers.
  • minimum of two numbers.These last two aren't naturally a monoid, because it's not obvious what the unit should be. What's the maximum of zero numbers? We'll see below what possible workarounds we have up our sleeves if we do want a monoid.
  • set union, with the empty set as unit
  • set intersection on subsets of a Set S, with S as unit
  • functions from A to A, for some type A. This is what people like on call "endofunctions on A". "Endo" is just Greek for "within", because the function never takes values outside of A. The operation is function composition, and the identity function a =>a provides the unit. This is in a sense the archetypal monoid. Especially for functional programmers.
  • for any monoid M, we can make a monoid of out functions A => M by borrowing the monoid structure on M, i.e.
    implicit def pointwiseMonoid[A, M](implicit m: Monoid[M]) = new Monoid[A => M] {
        def zero = a => m.zero
    
        def append(f1: A => M, f2: A => M): A => M = a => m.append(f1(a), f2(a))
      }
    
    Here I'm using scalaz's Monoid type class, which calls the unit "zero" and the operation "append".
  • we can "compose" monoids by doing them side by side. That is, by tupling them together. Here's an example:
    scala> ("hi", List(3, 4, 5)) |+| (" Mum", List(7, 8, 9))
    res0: (String, List[Int]) = (hi Mum,List(3, 4, 5, 7, 8, 9))
    
    The |+| function here is just scalaz syntactic sugar for append.
  • convolution of functions. There may not be a unit, depending on the definition of convolution you use
  • concatentation of lists, with the empty list as unit
  • joining images of the same height together to make a new image, with the empty image as the unit. Or the same thing vertically.
  • an operation on numbers that works like this: 123 ⊹ 456 = 123456. This seems a little silly at first, but the equivalent in base 16 version is handy for chunking together binary data, and the binary version is useful when dealing with bit flags. Note that we can't define a unit for this operation, so this is only a semigroup.
  • Scalaz defines a monoid instance for its Ordering class, which consists of the values EQ, LT and GT. append is defined so that x |+| y == x when x is LT or GT, but EQ |+| y == y. This allows us to write chained comparisons with immaculate ease. For example, if we wanted to sort Twitter users by number of followers, then by number of tweets, then by name, we could use
    def compareUsers(a: User, b: User): Ordering = 
      (a.followers ?|? b.followers) |+| (a.tweets ?|? b.tweets) |+| (a.name ?|? b.name)
    
    Note: ?|? is just a little more scalaz syntactic sugar:
    scala> 10 ?|? 12
    res0: scalaz.Ordering = LT
    
As an aside, you might observe that List doesn't actually do anything with the values that have been appended - they just sit in there, and you can get them out again if you want. Intuitively, this is what qualifies List to be the "free monoid". We can go in the other direction here, too, and deliberately clobber the arguments to append. This leads us to
  • this odd operation: x ⊹ y = y, ie. we throw away the left operand. Clearly, if the type has more than one member we won't be able to define a unit, so this is again only a semigroup. We might call this the "retain last" semigroup. In parallel there's a "retain first" semigroup.
  • the trivial monoid on type T. Choose any z of type T and define x ⊹ y = z, eg. on Int we might have x ⊹ y = 0. This operation discards both operands.
Phew! That's a lot of monoids. Hopefully your brain is now abuzz with possibilities, and you've come up with another half a dozen more examples yourself.

What's the point?


The advantage of recognising and naming a recurring pattern, apart from clarifying for ourselves and others the essence of what we're doing, is we can reduce the amount of boilerplate we need to write. In this case the grunt work of actually doing the accumulation: we just specify the monoid we want, and let it perform the accumulation. In addition we can take advantage of library code written for any monoid, and by writing code that's parametrized by monoid we get a lot of power for free.

Put another way, monoids and semigroups abstract the idea of accumulation, and find a natural home any time there's some kind of combining or reduction operation going on. They allow us to clump together any number of values into one:
  • if we've got no values, use the unit.
  • if we've got one only value, great, we're done.
  • if we've got two values, we combine them with append
  • if we've got more than two, reduce in steps using append. Associativity gives us some leeway in how we choose to do this.
You can probably see there's a natural parallel here between non empty structures, semigroups, and reduce on the one hand, and possibly empty structures, monoids, and fold. These considerations lead us on to...

Foldable


Consider the fold method exposed by Scala's collection classes (for our purposes, either foldLeft or foldRight will do). Here's the signature of foldLeft defined on TraversableOnce[A], the root of Scala's collection hierarchy:

def foldLeft[B](z: B)(op: (B, A) => B): B

Now in the majority of use cases there's a monoid waiting to be discovered here. z is not any old value: it's the unit of a monoid, and op is not any old function, it's our monoid's operation in disguise. That is, we could refactor to use this method instead:

def foldMap[B](f: A => B)(implicit F: Monoid[B]) =
    foldLeft[B](F.zero){ case (b, a) => F.append(b, f(a)) }

Now our code that uses foldMap is a little cleaner, as we only need one parameter, and don't have to manage the accumulation ourselves.

Explaining the name


Why did we call our new operation "foldMap"? Well, think of the flatMap method: it's equivalent to first performing a map, and then a flatten on the resulting structure.

In much the same way, foldMap is conceptually equivalent to first performing a map, and then a fold to combine the resulting collection of monoid values into one.

The Scala standard library doesn't provide foldMap, but we can find an implementation in scalaz's Foldable type class, along with a number of other methods. Scalaz provides instances of Foldable for subclasses of Iterable, as well as for many scalaz data structures.

Aside: somewhat perplexingly, it doesn't provide instances of Foldable for all subclasses of TraversableOnce (the root of Scala's collection classes).This means in particular we'll have to write our own Foldable instance to be able to use it with views (lazy versions of collections).


A problem


One obvious thing we might like to do is to average our prices, but we face a problem. Taking the mean does not form a monoid - if you give me two averages, I can't combine them unless I know the respective sizes of the datasets they were calculated from.

But the mean is the ratio of total and count, which are both monoidal.

So I could accumulate values using something along the lines of
case class Mean[N: Fractional](sum: N, count: Int)
which has a natural monoid structure as the product (i.e. tuple) of two monoids, and then finally extract the mean using
def mean = sum / fromInt(count)
I've parametrised over the exact type N here, so I don't have to write a separate version for Double, BigDecimal, etc, but I need N to belong to the Fractional type class so I can make use of division when calculating the mean.

In Part II I'll look at how we can take these ideas further, and come back to tackling our original problem. Or, if you can't wait, check out the slides.