I was inspired by Tim Docker's recent illuminating talk on
Data analysis with monoids
at Syd-FP, to look closer at his approach. He used Haskell for his implementation,
how would things work out if I used Scala for mine?

Here's the idea. We're given a CSV file of stock exchange data, looking something like this:

The columns here are date, ticker symbol, opening price, highest price in the course of the day, lowest price, closing price, and volume of shares traded in the day.

We want to perform some analysis on this data, say filter by ticker symbol, and output traded price range and count of shares traded, month by month. And it's a big file, so we want to do this in a single pass.

Lets have a look at how we might approach this using a functional style in Scala, stopping in for a scenic tour of some useful concepts along the way, to give a hint of the possibilities inherent in this approach.

We're going to need the concept of a

Look at the following examples, and notice what's similar in each case.

So, where does the name come from? Well, "mono" stands for "one", because it's a gadget with only one operation. And Nicholas Bourbaki, who came up with the name and popularised the concept, liked names that end in "-oid", I guess.

Once you grasp the general idea, you start seeing them everywhere. Let's think of some more examples!

The advantage of recognising and naming a recurring pattern, apart from clarifying for ourselves and others the essence of what we're doing, is we can reduce the amount of boilerplate we need to write. In this case the grunt work of actually doing the accumulation: we just specify the monoid we want, and let it perform the accumulation. In addition we can take advantage of library code written for any monoid, and by writing code that's parametrized by monoid we get a lot of power for free.

Put another way, monoids and semigroups abstract the idea of accumulation, and find a natural home any time there's some kind of combining or reduction operation going on. They allow us to clump together any number of values into one:

Consider the fold method exposed by Scala's collection classes (for our purposes, either foldLeft or foldRight will do). Here's the signature of foldLeft defined on TraversableOnce[A], the root of Scala's collection hierarchy:

Now in the majority of use cases there's a monoid waiting to be discovered here.

Now our code that uses foldMap is a little cleaner, as we only need one parameter, and don't have to manage the accumulation ourselves.

Why did we call our new operation "

In much the same way,

The Scala standard library doesn't provide

Aside: somewhat perplexingly, it doesn't provide instances of

####

One obvious thing we might like to do is to average our prices, but we face a problem. Taking the mean does not form a monoid - if you give me two averages, I can't combine them unless I know the respective sizes of the datasets they were calculated from.

But the mean is the ratio of total and count, which

So I could accumulate values using something along the lines of

In Part II I'll look at how we can take these ideas further, and come back to tackling our original problem. Or, if you can't wait, check out the slides.
Here's the idea. We're given a CSV file of stock exchange data, looking something like this:

20091230,GOOG,618.07,622.73,618.01,622.73,14663 20091231,GOOG,624.86,625.4,619.98,619.98,12202 20100104,GOOG,627.46,629.51,624.24,626.75,19579 20100105,GOOG,627.84,627.84,621.54,623.99,30078 20100106,GOOG,625.08,625.86,606.36,608.26,39806 ...

The columns here are date, ticker symbol, opening price, highest price in the course of the day, lowest price, closing price, and volume of shares traded in the day.

We want to perform some analysis on this data, say filter by ticker symbol, and output traded price range and count of shares traded, month by month. And it's a big file, so we want to do this in a single pass.

Lets have a look at how we might approach this using a functional style in Scala, stopping in for a scenic tour of some useful concepts along the way, to give a hint of the possibilities inherent in this approach.

#### Monoids (and semigroups)

We're going to need the concept of a

**monoid**for what follows. Let's take a look.Look at the following examples, and notice what's similar in each case.

scala> (1 + 2) + 3 == 1 + (2 + 3) res0: Boolean = true scala> 9 + 0 res1: Int = 9 scala> 0 + 9 res2: Int = 9

scala> ("philip" + "k") + "dick" == "philip" + ("k" + "dick") res1: Boolean = true scala> "something" + "" res2: Boolean = "something" scala> "" + "something" res3: Boolean = "something"

scala> BigInt(30) gcd (BigInt(42) gcd BigInt(70)) res0: scala.math.BigInt = 2 scala> (BigInt(30) gcd (BigInt(42)) gcd BigInt(70) res1: scala.math.BigInt = 2 scala> BigInt(345) gcd BigInt(0) res2: scala.math.BigInt = 345 scala> BigInt(0) gcd BigInt(345) res3: scala.math.BigInt = 345

- a way of sticking two things of the same type together to get something else of that type: this is the operation of addition of integers, concatenation of strings, or greatest common divisor,
- we've got an element that doesn't do anything when you stick it onto anything (from either side),
- and we have the property that if we're sticking together a few things, it doesn't matter where we put the brackets, we still end up with the same result.

In other words, the final result is independent of the way we "associate" the elements together, providing we keep them in the same order. This property is called "associativity".

So, where does the name come from? Well, "mono" stands for "one", because it's a gadget with only one operation. And Nicholas Bourbaki, who came up with the name and popularised the concept, liked names that end in "-oid", I guess.

#### Monoids and semigroups are ubiquitous

Once you grasp the general idea, you start seeing them everywhere. Let's think of some more examples!

`+`

for Int, Short, Long Double, Float, BigInt and BigDecimal with 0 as the unit`*`

for Int, Short, Long, Double, Float, BigInt and BigDecimal with 1 as the unit`||`

for Boolean with`false`

as the unit`&&`

for Boolean with`true`

as the unit- greatest common divisor (for Int, Short, Long or BigInt), with 0 as the unit
- lowest common multiple (for Int, Short, Long or BigInt), with 1 as the unit
- maximum of two numbers.
- minimum of two numbers.These last two aren't naturally a monoid, because it's not obvious what the unit should be. What's the maximum of zero numbers? We'll see below what possible workarounds we have up our sleeves if we do want a monoid.

- set union, with the empty set as unit
- set intersection on subsets of a Set S, with S as unit
- functions from A to A, for some type A. This is what people like on call "endofunctions on A". "Endo" is just Greek for "within", because the function never takes values outside of A. The operation is function composition, and the identity function a =>a provides the unit. This is in a sense the archetypal monoid. Especially for functional programmers.
- for any monoid M, we can make a monoid of out functions A => M by borrowing the monoid structure on M, i.e.
implicit def pointwiseMonoid[A, M](implicit m: Monoid[M]) = new Monoid[A => M] { def zero = a => m.zero def append(f1: A => M, f2: A => M): A => M = a => m.append(f1(a), f2(a)) }

Here I'm using scalaz's Monoid type class, which calls the unit "zero" and the operation "append". - we can "compose" monoids by doing them side by side. That is, by tupling them together. Here's an example:
scala> ("hi", List(3, 4, 5)) |+| (" Mum", List(7, 8, 9)) res0: (String, List[Int]) = (hi Mum,List(3, 4, 5, 7, 8, 9))

The`|+|`

function here is just scalaz syntactic sugar for`append`

. - convolution of functions. There may not be a unit, depending on the definition of convolution you use
- concatentation of lists, with the empty list as unit
- joining images of the same height together to make a new image, with the empty image as the unit. Or the same thing vertically.
- an operation on numbers that works like this: 123 ⊹ 456 = 123456. This seems a little silly at first, but the equivalent in base 16 version is handy for chunking together binary data, and the binary version is useful when dealing with bit flags. Note that we can't define a unit for this operation, so this is only a semigroup.
- Scalaz defines a monoid instance for its
`Ordering`

class, which consists of the values`EQ`

,`LT`

and`GT`

.`append`

is defined so that`x |+| y == x`

when`x`

is`LT`

or`GT`

, but`EQ |+| y == y`

. This allows us to write chained comparisons with immaculate ease. For example, if we wanted to sort Twitter users by number of followers, then by number of tweets, then by name, we could usedef compareUsers(a: User, b: User): Ordering = (a.followers ?|? b.followers) |+| (a.tweets ?|? b.tweets) |+| (a.name ?|? b.name)

Note:`?|?`

is just a little more scalaz syntactic sugar:scala> 10 ?|? 12 res0: scalaz.Ordering = LT

`append`

. This leads us to
- this odd operation: x ⊹ y = y, ie. we throw away the left operand. Clearly, if the type has more than one member we won't be able to define a unit, so this is again only a semigroup. We might call this the "retain last" semigroup. In parallel there's a "retain first" semigroup.
- the trivial monoid on type T. Choose any z of type T and define x ⊹ y = z, eg. on Int we might have x ⊹ y = 0. This operation discards both operands.

#### What's the point?

The advantage of recognising and naming a recurring pattern, apart from clarifying for ourselves and others the essence of what we're doing, is we can reduce the amount of boilerplate we need to write. In this case the grunt work of actually doing the accumulation: we just specify the monoid we want, and let it perform the accumulation. In addition we can take advantage of library code written for any monoid, and by writing code that's parametrized by monoid we get a lot of power for free.

Put another way, monoids and semigroups abstract the idea of accumulation, and find a natural home any time there's some kind of combining or reduction operation going on. They allow us to clump together any number of values into one:

- if we've got no values, use the unit.
- if we've got one only value, great, we're done.
- if we've got two values, we combine them with append
- if we've got more than two, reduce in steps using append. Associativity gives us some leeway in how we choose to do this.

`reduce`

on the one hand, and
possibly empty structures, monoids, and `fold`

.
These considerations lead us on to...### Foldable

Consider the fold method exposed by Scala's collection classes (for our purposes, either foldLeft or foldRight will do). Here's the signature of foldLeft defined on TraversableOnce[A], the root of Scala's collection hierarchy:

def foldLeft[B](z: B)(op: (B, A) => B): B

Now in the majority of use cases there's a monoid waiting to be discovered here.

`z`

is not any old value: it's the unit of a monoid, and `op`

is not any old function, it's our monoid's operation in disguise. That is, we could refactor to use this method instead:
def foldMap[B](f: A => B)(implicit F: Monoid[B]) = foldLeft[B](F.zero){ case (b, a) => F.append(b, f(a)) }

Now our code that uses foldMap is a little cleaner, as we only need one parameter, and don't have to manage the accumulation ourselves.

#### Explaining the name

Why did we call our new operation "

`foldMap`

"? Well, think of the `flatMap`

method: it's equivalent to first performing a `map`

, and then a `flatten`

on the resulting structure.In much the same way,

`foldMap`

is conceptually equivalent to first performing a `map`

, and then a
`fold`

to combine the resulting collection of monoid values into one.The Scala standard library doesn't provide

`foldMap`

, but we can find an implementation in scalaz's `Foldable`

type class, along with a number of other methods. Scalaz provides instances of `Foldable`

for subclasses of `Iterable`

, as well as for many scalaz data structures.Aside: somewhat perplexingly, it doesn't provide instances of

`Foldable`

for all subclasses of `TraversableOnce`

(the root of Scala's collection classes).This means in particular we'll have to write our own Foldable instance to be able to use it with views (lazy versions of collections).#### A problem

One obvious thing we might like to do is to average our prices, but we face a problem. Taking the mean does not form a monoid - if you give me two averages, I can't combine them unless I know the respective sizes of the datasets they were calculated from.

But the mean is the ratio of total and count, which

**are**both monoidal.So I could accumulate values using something along the lines of

case class Mean[N: Fractional](sum: N, count: Int)which has a natural monoid structure as the product (i.e. tuple) of two monoids, and then finally extract the mean using

def mean = sum / fromInt(count)I've parametrised over the exact type

`N`

here, so I don't have to write a separate version for Double, BigDecimal, etc, but I need `N`

to belong to the `Fractional`

type class so I can make use of division when calculating the mean.