NoSQL with MongoDB and Ruby Presentation

I presented at the Milwaukee Ruby User’s Group tonight on NoSQL using MongoDB and Ruby.

Code Snippets for the Presentation

Basic Operations

// insert data
db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );

// simple queries
db.factories.find()
db.factories.findOne()
db.factories.find( { "metro.city" : "Milwaukee" } )
db.factories.find( { "metro.state": {$in : ["WI", "CO"] } } )

// update data
db.factories.update( { name: "Lakefront"}, { $set : { thebest : true } } );
db.factories.find()

// delete data
db.factories.remove({name:"Coors"})
db.factories.remove({})  // an empty query matches (and removes) every document

Ruby Example


require 'rubygems'
require 'mongo'
include Mongo

db = Connection.new.db('sample-db')
coll = db.collection('factories')

coll.remove

coll.insert( { :name => "Miller", :metro => { :city => "Milwaukee", :state => "WI" } } )
coll.insert( { :name => "Lakefront", :metro => { :city => "Milwaukee", :state => "WI" } } )
coll.insert( { :name => "Point", :metro => { :city => "Steven's Point", :state => "WI" } } )
coll.insert( { :name => "Pabst", :metro => { :city => "Milwaukee", :state => "WI" } } )
coll.insert( { :name => "Blatz", :metro => { :city => "Milwaukee", :state => "WI" } } )
coll.insert( { :name => "Coors", :metro => { :city => "Golden Springs", :state => "CO" } } )

puts "There are #{coll.count()} factories. Here they are:"
coll.find().each { |doc| puts doc.inspect }
# Single quotes inside the JavaScript keep the Ruby string literal intact
coll.map_reduce("function () { emit(this.metro.city, this.name); }", "function (k, vals) { return vals.join(','); }").each { |r| puts r.inspect }

Map Reduce Example


db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );

var fmap = function () {
    emit(this.metro.city, this.name);
}
var fred = function (k, vals) {
    return vals.join(",");
}
res = db.factories.mapReduce(fmap, fred)
db[res.result].find()
db[res.result].drop()

The Presentation

Download NoSQL with MongoDB and Ruby Slides

Thanks to Meghan at 10Gen for sending stickers and a copy of MongoDB: The Definitive Guide that I gave out as a door prize. I read the book quickly this weekend before the talk and found it quite good, so I recommend it if you want to get started with MongoDB.

MongoDB: MapReduce Functions for Grouping

SQL GROUP BY allows you to perform aggregate functions on data sets: to count all of the stores in each state, to average a series of related numbers, and so on. MongoDB has some aggregate functions, but they are fairly limited in scope. The MongoDB group function also does not work on sharded configurations. So how do you perform grouped queries using MongoDB? By using MapReduce functions, of course (you read the title, right?).

Understanding MapReduce

Understanding MapReduce requires, or at least is made much easier by, understanding functional programming concepts. map and reduce (fold, inject) are functions that come from Lisp and have been inherited by a lot of languages (Scheme, Smalltalk, Ruby, Python).

map
A higher-order function which transforms a list by applying a function to each of its elements. Its return value is the transformed list. In MongoDB terms, the map is a function that is run for each Document in a collection and can return a value for that row to be included in the transformed list.
reduce
A higher-order function that iterates an arbitrary function over a data structure and builds up a return value. The reduce function takes the values returned by map and allows you to run a function to manipulate those values in some way.
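
As a rough illustration outside of MongoDB, plain JavaScript arrays have built-in map and reduce methods that behave the way these definitions describe. The list of cities below is made up for the sketch; nothing here touches a database.

```javascript
// Plain JavaScript (no MongoDB needed): Array.prototype.map and reduce.
const cities = ["Milwaukee", "Milwaukee", "Golden Springs"];

// map: apply a function to each element, producing a new list of the same length.
const ones = cities.map(function (city) { return 1; });

// reduce: fold the list into a single value, starting from 0.
const total = ones.reduce(function (sum, n) { return sum + n; }, 0);

console.log(ones);
console.log(total);
```

The MongoDB versions differ in one important way: the mapped values are grouped by key before reduce sees them.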

Some Examples

Let’s start with some sample data:

db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );
db.factories.find()

Let’s say I want to count the number of factories in each of the cities (ignore the fact that I could have the same city in more than one state; I don’t in my data). For a count, I write a function that “emits” the group-by key and a value that I can count. It can be any value, but for simplicity I’ll make it 1. emit() is a MongoDB server-side function that you use to identify a value in a row that should be added to the transformed list. If emit() is not called, then the values for that row will be excluded from the results.

mapCity = function () {
    emit(this.metro.city, 1);
}

The next piece is the reduce() function. The reduce function is passed a key and an array of the values that the map() function emitted for that key. I know my map function emits a 1 for each row, keyed by city. So for “Milwaukee” the reduce function will be passed a 4-element array of 1s. “Golden Springs” has only a single 1; as far as I can tell from the MongoDB documentation, reduce may not be called at all for a key with a single value, in which case that value is used as-is.

reduceCount = function (k, vals) {
    var sum = 0;
    for (var i in vals) {
        sum += vals[i];
    }
    return sum;
}
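
One detail that is easy to check by hand: as I understand the MongoDB documentation, reduce can be invoked more than once for the same key, with the result of an earlier reduce call mixed in among the values. Summing the values (rather than counting the length of the array) keeps the count correct in that case. The sketch below calls a copy of reduceCount directly in plain JavaScript; the way the values are split into batches is made up for illustration and is not how the server necessarily batches them.

```javascript
// Copy of reduceCount, runnable outside the mongo shell.
function reduceCount(k, vals) {
  var sum = 0;
  for (var i = 0; i < vals.length; i++) {
    sum += vals[i];
  }
  return sum;
}

// All four Milwaukee values reduced in one pass.
var onePass = reduceCount("Milwaukee", [1, 1, 1, 1]);

// The same values reduced in two batches: the partial result is fed
// back in as just another value (a "re-reduce").
var partial = reduceCount("Milwaukee", [1, 1, 1]);
var twoPass = reduceCount("Milwaukee", [partial, 1]);

console.log(onePass, twoPass); // both are 4
```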

With those two functions I can call mapReduce to perform my query.

res = db.factories.mapReduce(mapCity, reduceCount)
db[res.result].find()

This results in:

{ "_id" : "Golden Springs", "value" : 1 }
{ "_id" : "Milwaukee", "value" : 4 }
{ "_id" : "Steven's Point", "value" : 1 }
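
To see where those numbers come from, here is a minimal in-memory sketch of the same flow in plain JavaScript: run map for every document, group the emitted values by key, then reduce each group. The emit and simulateMapReduce functions are hypothetical stand-ins written for this sketch; they are not MongoDB APIs and ignore everything the server does around sharding and output collections.

```javascript
// In-memory stand-in for the factories collection (same sample data).
var factories = [
  { name: "Miller", metro: { city: "Milwaukee", state: "WI" } },
  { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } },
  { name: "Point", metro: { city: "Steven's Point", state: "WI" } },
  { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } },
  { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } },
  { name: "Coors", metro: { city: "Golden Springs", state: "CO" } }
];

// Hypothetical emit(): collects each value under its key.
var groups = {};
function emit(key, value) {
  if (!Object.prototype.hasOwnProperty.call(groups, key)) {
    groups[key] = [];
  }
  groups[key].push(value);
}

function mapCity() {
  emit(this.metro.city, 1);
}

function reduceCount(k, vals) {
  var sum = 0;
  for (var i = 0; i < vals.length; i++) {
    sum += vals[i];
  }
  return sum;
}

// Hypothetical helper: map every document, then reduce each group.
function simulateMapReduce(docs, map, reduce) {
  groups = {};
  docs.forEach(function (doc) { map.call(doc); }); // "this" is the document
  return Object.keys(groups).sort().map(function (key) {
    return { _id: key, value: reduce(key, groups[key]) };
  });
}

var results = simulateMapReduce(factories, mapCity, reduceCount);
results.forEach(function (r) { console.log(JSON.stringify(r)); });
```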

Counting is not the only thing I can do, of course. Anything can be emitted by the map function, including complex JSON objects. In this example I combine the names of all of the factories in a given city into a simple comma-separated list.

mapCity = function () {
    emit(this.metro.city, this.name);
}
reduceNames = function (k, vals) {
    return vals.join(",");
}
res = db.factories.mapReduce(mapCity, reduceNames)
db[res.result].find()

This gives you:

{ "_id" : "Golden Springs", "value" : "Coors" }
{ "_id" : "Milwaukee", "value" : "Miller,Lakefront,Pabst,Blatz" }
{ "_id" : "Steven's Point", "value" : "Point" }
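
The map function can also emit an object rather than a scalar. The sketch below is plain JavaScript run in memory, not against MongoDB; emit is a hypothetical stand-in, the names mapCityDetails and reduceDetails are made up, and the data is trimmed to three factories to keep it short. The one design point it tries to show is that the reduce function returns the same shape that map emits, so a result could safely be passed back into reduce.

```javascript
// Sample data trimmed to three factories for brevity.
var factories = [
  { name: "Miller", metro: { city: "Milwaukee", state: "WI" } },
  { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } },
  { name: "Coors", metro: { city: "Golden Springs", state: "CO" } }
];

// Hypothetical emit(): key -> array of emitted values.
var emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// Emit an object instead of a scalar.
function mapCityDetails() {
  emit(this.metro.city, { count: 1, names: [this.name] });
}

// Merge the objects, returning the same shape that map emits.
function reduceDetails(k, vals) {
  var out = { count: 0, names: [] };
  vals.forEach(function (v) {
    out.count += v.count;
    out.names = out.names.concat(v.names);
  });
  return out;
}

factories.forEach(function (doc) { mapCityDetails.call(doc); });
var milwaukee = reduceDetails("Milwaukee", emitted["Milwaukee"]);
console.log(JSON.stringify(milwaukee));
```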

Conclusion

These are fairly simple examples, but I think it helps to work through this kind of simple thing to fully understand a new technique before you have to work with harder examples.


MongoDB Replication is Easy

Database replication with MongoDB is easy to set up. Replication duplicates all of the data from a master to one or more slave instances, which gives you safety and quick recovery if something goes wrong with your master database. Here is an example of how quick and easy it is to test out replication in MongoDB.

Create a couple of directories for holding your mongo databases.
mkdir master slave
Start by running an instance of the “master” database.
cd master
mongod --master --dbpath .
Start a new terminal window and continue by running an instance of a “slave” database. This example is running on the same machine as master which is great for testing, but wouldn’t be a good setup if you were really trying to implement replication in a production environment since you would still have a single-point-of-failure in the single server case.
cd slave
mongod --slave --port 27018 --dbpath . --source localhost
Then start another terminal window to use as the client.
mongo
db.person.save( {name:'Geoff Lane'} )
db.person.save( {name:'Joe Smith'} )
db.person.find()
db.person.save( {name:'Jim Johnson', age: 65} )
db.person.find()
Now kill the master instance in your terminal with Control+C. This simulates the master server dying. Lastly, connect to the slave instance with a mongo client by specifying the port.
mongo --port 27018
db.person.find()
As you can see, db.person.find() on the slave returns all of the values that were saved to the master, which shows that replication is working.

Another interesting fact is that you can start a slave instance even after the mongod master is already running and has data, and all of that data will be replicated over to the slave as well. This works without ever shutting down your mongod master instance, so you can add replication after the fact with no downtime.

For more on MongoDB check out these books:
* MongoDB: The Definitive Guide
* The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing
* MongoDB for Web Development (Developer’s Library)

MongoDB and Java: Find an item by Id

MongoDB is one of a number of new databases that have cropped up lately eschewing SQL. These NoSQL databases provide non-relational models that are suitable for solving different kinds of problems. This camp includes document oriented, tabular and key/value oriented models among others. These non-relational databases are supposed to excel at scalability through parallelization and replication but sometimes (although not always) at the expense of some of the transactional guarantees of SQL databases.

Why would you care about any of this? Document oriented databases allow for each document to store arbitrary pieces of data. This could allow for much easier customization of data storage such as when you want to store custom fields. Many of these databases also make horizontal scaling quite simple as well as providing high performance for write heavy applications.

With this in mind I figured I should look and see what’s there. So I started looking at MongoDB.

Start by creating an object to add to the database

With MongoDB, a collection is conceptually similar to a table in a SQL database. It holds a collection of related documents. A DBObject represents a document that you want to add to a collection. MongoDB automatically creates an id for each document that you add. That id is set in the DBObject after you pass it to the save method of the collection. In a real world application you might need that id to later access the document.


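import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

// Build the document to store
DBObject obj = new BasicDBObject();
obj.put("title", getTitle());
obj.put("body", getBody());

// db is an existing com.mongodb.DB instance
DBCollection coll = db.getCollection("note");
coll.save(obj);

// After save, the generated _id is available on obj
String idString = obj.get("_id").toString();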

Retrieve an object previously added to a collection

To get a document from MongoDB you again use a DBObject. It does double duty in this case, acting as the parameters you want to use to identify a matching document. (There are ways you can do comparisons other than equality, of course, but I’ll leave that for a later post.) Using this as a “query by example” model, we can set the _id property that we previously retrieved. The one catch is that the id is not just a string; it’s actually an instance of an ObjectId. Fortunately, once we know that, it’s quite easy to construct an instance from the string value.


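import org.bson.types.ObjectId;

// Placeholder value; a real ObjectId string is 24 hexadecimal characters
String idString = "a456fd23ac56d";
DBCollection coll = db.getCollection(getCollectionName());
// Wrap the string in an ObjectId so it matches the stored _id type
DBObject searchById = new BasicDBObject("_id", new ObjectId(idString));
DBObject found = coll.findOne(searchById);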
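
Since the id is not just any string, it can help to check its form before building the query. An ObjectId's string form is 24 hexadecimal characters, so a short placeholder like the one above would be rejected by the ObjectId constructor. Here is a minimal sketch of that check, written in plain JavaScript to match the other sketches on this blog; looksLikeObjectId is a hypothetical helper name, and I believe the Java driver offers ObjectId.isValid for the same purpose.

```javascript
// Hypothetical helper: true when the string has the 24-hex-character
// form used by ObjectId strings.
function looksLikeObjectId(s) {
  return typeof s === "string" && /^[0-9a-fA-F]{24}$/.test(s);
}

var shortId = looksLikeObjectId("a456fd23ac56d");            // 13 characters
var fullId = looksLikeObjectId("507f1f77bcf86cd799439011");  // 24 hex characters

console.log(shortId, fullId);
```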

A couple of easy examples, but it wasn’t obvious to me when I started how to get the id of a document that I just added to the database. More to come in the future.
