MongoDB: MapReduce Functions for Grouping

SQL GROUP BY allows you to perform aggregate functions on data sets; To count all of the stores in each state, to average a series of related numbers, etc. MongoDB has some aggregate functions but they are fairly limited in scope. The MongoDB group function also suffers from the fact that it does not work on sharded configurations. So how do you perform grouped queries using MongoDB? By using MapReduce functions of course (you read the title right?)

Understanding MapReduce

Understanding MapReduce requires, or at least is made much easier by, understanding functional programming concepts. map and reduce (fold, inject) are functions that come from Lisp and have been inherited by a lot of languages (Scheme, Smalltalk, Ruby, Python).

map
A higher-order function which transforms a list by applying a function to each of its elements. Its return value is the transformed list. In MongoDB terms, the map is a function that is run for each Document in a collection and can return a value for that row to be included in the transformed list.
reduce
A higher-order function that iterates an arbitrary function over a data structure and builds up a return value. The reduce function takes the values returned by map and allows you to run a function to manipulate those values in some way.

Some Examples

Let’s start with some sample data:

db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );
db.factories.find()

Lets say I want to count the number of factories in each of the cities (ignore the fact that I could have the same city in more than one state, I don’t in my data). For a count, I write a function that “emits” the group by key and a value that you can count. It can be any value, but for simplicity I’ll make it 1. emit() is a MongoDB server-side function that you use to identify a value in a row that should be added to the transformed list. If emit() is not called then the values for that row will be excluded from the results.

mapCity = function () {
emit(this.metro.city, 1);
}

The next piece is the reduce() function. The reduce function will be passed a key and an array of values that were collected by the map() function. I know my map function returns a 1 for each row keyed by city. So the reduce function will be called with a key “Golden Springs” and a single-element array containing a 1. For “Milwaukee” it will be passed an 4-element array of 1s.

reduceCount = function (k, vals) {
var sum = 0;
for (var i in vals) {
sum += vals[i];
}
return sum;
}

With those 2 functions I can call the mapReduce function to perform my Query.

res = db.factories.mapReduce(mapCity, reduceCount)
db[res.result].find()

This results in:

{ "_id" : "Golden Springs", "value" : 1 }
{ "_id" : "Milwaukee", "value" : 4 }
{ "_id" : "Steven's Point", "value" : 1 }

Counting is not the only thing I can do of course. Anything can be returned by the map function including complex JSON objects. In this example I combine the names of all of the Factories in a given City into a simple comma-separated list.

mapCity = function () {
emit(this.metro.city, this.name);
}
reduceNames = function (k, vals) {
return vals.join(",");
}
res = db.factories.mapReduce(mapCity, reduceNames)
db[res.result].find()

Give you:

{ "_id" : "Golden Springs", "value" : "Coors" }
{ "_id" : "Milwaukee", "value" : "Miller,Lakefront,Pabst,Blatz" }
{ "_id" : "Steven's Point", "value" : "Point" }

Conclusion

These are fairly simple examples, but I think it helps to work through this kind of simple thing to fully understand a new technique before you have to work with harder examples.

For more on MongoDB check out these books:

6 thoughts on “MongoDB: MapReduce Functions for Grouping”

  1. Geoff,

    A fine post: thanks!

    In your last example, you define the reduceNames function but don’t use it in your mapReduce. Perhaps it’d be useful to show those results as well.

    I think for me — as a SQL-lovin’ kind of dude, one of the great MongoDB hurdles has been aggregation. SQL makes it so easy to aggregate and especially sort on aggregated values (ORDER BY count(blah)…).

    And without really embracing M/R in Mongo, you can’t get that. The Gold, for me, with MongoDB’s M/R output is that you can get the collection it returns, and then perform the other important cursor operations on that result, especially sort. So if your M/R adds computed fields, you can then sort on those fields.

    Something I’ve not investigated fully, but I’m interested in your thoughts/advice on this: using these functions, it’s kind of hard to debug. When you’re writing, say, javascript, with firebug, you can use console.log() to print results as you go, which can be a real help while you’re developing js functions. How do you go about building up / debugging M/R functions in Mongo?

  2. @Marc,
    I fixed the typo you found and added in the output for the second example.

    I still find SQL much easier as well. It’s partially the familiarity of doing it for 10+ years. M/R, in the end, is more flexible as it’s a way of writing your own aggregate functions.

    As for debugging, I don’t have a good answer. I’ve never really done that much, but you can use print() to write things to stdout or the log file.

    Thanks for the comment!

  3. Nice piece of work, thanks for posting Geoff. I also think that in later versions (im using 1.8.2) you will need a third arg in the mapReduce method to keep the output. Hence the line:
    res = db.factories.mapReduce(mapCity, reduceNames)
    should now be something like
    res = db.factories.mapReduce(mapCity, reduceNames, {out : “OutputCollection”})
    More info can be found here:
    http://www.mongodb.org/display/DOCS/MapReduce

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>