MongoDB: MapReduce Functions for Grouping

SQL GROUP BY allows you to perform aggregate functions on data sets; To count all of the stores in each state, to average a series of related numbers, etc. MongoDB has some aggregate functions but they are fairly limited in scope. The MongoDB group function also suffers from the fact that it does not work on sharded configurations. So how do you perform grouped queries using MongoDB? By using MapReduce functions of course (you read the title right?)

Understanding MapReduce

Understanding MapReduce requires, or at least is made much easier by, understanding functional programming concepts. map and reduce (fold, inject) are functions that come from Lisp and have been inherited by a lot of languages (Scheme, Smalltalk, Ruby, Python).

map
A higher-order function which transforms a list by applying a function to each of its elements. Its return value is the transformed list. In MongoDB terms, the map is a function that is run for each Document in a collection and can return a value for that row to be included in the transformed list.
reduce
A higher-order function that iterates an arbitrary function over a data structure and builds up a return value. The reduce function takes the values returned by map and allows you to run a function to manipulate those values in some way.

Some Examples

Let’s start with some sample data:

db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );
db.factories.find()

Lets say I want to count the number of factories in each of the cities (ignore the fact that I could have the same city in more than one state, I don’t in my data). For a count, I write a function that “emits” the group by key and a value that you can count. It can be any value, but for simplicity I’ll make it 1. emit() is a MongoDB server-side function that you use to identify a value in a row that should be added to the transformed list. If emit() is not called then the values for that row will be excluded from the results.

mapCity = function () {
    emit(this.metro.city, 1);
}

The next piece is the reduce() function. The reduce function will be passed a key and an array of values that were collected by the map() function. I know my map function returns a 1 for each row keyed by city. So the reduce function will be called with a key “Golden Springs” and a single-element array containing a 1. For “Milwaukee” it will be passed an 4-element array of 1s.

reduceCount = function (k, vals) {
    var sum = 0;
    for (var i in vals) {
        sum += vals[i];
    }
    return sum;
}

With those 2 functions I can call the mapReduce function to perform my Query.

res = db.factories.mapReduce(mapCity, reduceCount)
db[res.result].find()

This results in:

{ "_id" : "Golden Springs", "value" : 1 }
{ "_id" : "Milwaukee", "value" : 4 }
{ "_id" : "Steven's Point", "value" : 1 }

Counting is not the only thing I can do of course. Anything can be returned by the map function including complex JSON objects. In this example I combine the names of all of the Factories in a given City into a simple comma-separated list.

mapCity = function () {
    emit(this.metro.city, this.name);
}
reduceNames = function (k, vals) {
    return vals.join(",");
}
res = db.factories.mapReduce(mapCity, reduceNames)
db[res.result].find()

Give you:

{ "_id" : "Golden Springs", "value" : "Coors" }
{ "_id" : "Milwaukee", "value" : "Miller,Lakefront,Pabst,Blatz" }
{ "_id" : "Steven's Point", "value" : "Point" }

Conclusion

These are fairly simple examples, but I think it helps to work through this kind of simple thing to fully understand a new technique before you have to work with harder examples.

For more on MongoDB check out these books:

All Database Backed Web Apps are “Just CRUD Apps”

It’s time to end the debate between “Just simple CRUD Apps” and “more complex” apps.

Very complex behavior can be expressed through a small set of fairly simple rules. This is often described as emergent behavior or just emergence. The general idea is that simple building blocks can construct complex systems – atoms form molecules, molecules form proteins form cells, forms a human. Each of the building blocks is simple, but the end result is an amazingly complex organism.

The same use of sets of simple commands often is expressed in software development. In object-oriented development, groups of simple methods are used to create more complex behavior. Likewise simple objects collaborate to form more complex behavior. Likewise all of the databases in the world answering questions about all kinds of things, aggregating and filtering, are being manipulated by four simple statements. To access data using SQL, you use four simple commands:

SQL

  • SELECT
  • INSERT
  • UPDATE
  • DELETE

What other SQL operations are there? None. The only other things you can do are DDL operations to modify a database schema. Every operation that can be done to manipulate data in a schema can be expressed by some combination of these four simple statements.

REST/HTTP

  • GET
  • POST
  • PUT
  • DELETE

HTTP happens to map the same concepts as SQL to simple Verbs. While HTTP supports a few other Verbs – OPTIONS, HEAD, TRACE, and CONNECT – these methods are generally for diagnostics, discovery and proxy support. Two independent protocols have defined the same set of general purpose actions and have decided that they are the only ones needed. I would hazard a guess that this is not a coincidence.

The input from the user and the storage of the system can only be manipulated using four simple commands that correspond to Create, Read, Update and Delete. These simple, constrained sets of operations allows you to build systems that solve complex problems. As a software developer you’re doing something wrong if the way that you are building an application doesn’t allow you to start simple and only add complexity as needed. Software is at its best when it is only as complex as it needs to be to solve the problem at hand. Software is also at its best when it can easily be modified to add complexity as needed without drastic rework. Intentionally imposing some constraints on a design can help enforce a consistent, simple design through the entire system.

Using SQL Compact Edition Under ASP.NET

What is SQLCE?

SQL Compact Edition is the “low end” version of a SQL database solution from Microsoft. It is a single-file, application managed, database implementation. It doesn’t have all of the bells and whistles of the high end database solutions. This is great when you realize the next lowest version, SQL Express is over a 100MB install.

The beta of this software was called SQL Everywhere Edition (SQLEV). Microsoft decided that they didn’t like the name so they went with SQL Compact Edition (SQLCE ). The name Compact Edition is a bit of a misnomer. It can be used on the Compact Framework, but it can also be used on the full-framework anywhere a single-file, zero-install database might be needed. Well almost Everywhere at least. There is an explicit check when running on the Full Framework to make sure that you are not using it in an ASP.NET application. SQLCE is not a very scalable solution. It has some inherent limitations with concurrent connections for example. This is fine though if you go back to “what are you using this for”? An embedded, single-file database. Well I ran into a case where I need a small web-service to be running where an embedded database makes a lot of sense. I’m using the HttpListener class to run my own Http.sys server without using IIS. This still counts as ASP.NET to the SQLCE code though.

Force SQLCE to run under ASP.NET

Steve Lasker posted blog entry on how to use SQLCE under ASP.NET using the pre-release version of the SQLEV. Under SQLEV you set a flag that would tell the SQL Connection, “yes I know this isn’t supported, but let me do it anyway”:

AppDomain.CurrentDomain.SetData("SQLServerEverywhereUnderWebHosting", true)

As you can see the name of the product is right there in the key. Well they changed the name of the product and so they changed the name of the key. So, if you were using the beta for development and are now switching over to the release version of SQLCE , you will need to change the key:

AppDomain.CurrentDomain.SetData("SQLServerCompactEditionUnderWebHosting", true)

That should allow you to use the database under ASP.NET. Now you can revel in the fact that you are swimming in unsupported waters!

Special Thanks
I found this using the great .NET disassembler Lutz Roeder’s .NET Reflector. You should check it out. It can be a great way to track down details of an API implementation when the abstraction is a bit leaky.