Mapping SQL Joins using Anorm

Anorm is a Scala framework, a fairly thin veneer over JDBC, that allows you to write SQL queries and map the results into Scala objects. The examples easily found on the web tend to be fairly simple. One of the first problems I ran into was mapping a parent-child hierarchy where the parent has a collection of values from a different table.

For this post, I’m using a simple, contrived schema that looks like the following:

CREATE TABLE user (
    id SERIAL,
    user_name varchar(100),
    CONSTRAINT pk_user PRIMARY KEY (id)
);

CREATE TABLE email (
    user_id BIGINT,
    email varchar(100),
    CONSTRAINT pk_email PRIMARY KEY (user_id, email)
);
ALTER TABLE email ADD CONSTRAINT fk_email_user FOREIGN KEY (user_id) REFERENCES user (id);

CREATE TABLE phone (
    user_id BIGINT,
    number varchar(11),
    CONSTRAINT pk_phone PRIMARY KEY (user_id, number)
);
ALTER TABLE phone ADD CONSTRAINT fk_phone_user FOREIGN KEY (user_id) REFERENCES user (id);

The Simple Case

In the simplest case, Anorm allows you to map the results of a query to a Scala case class like the following:

import anorm._
import anorm.SqlParser._

case class User(id: Long, name: String)

object User {
  def rowMapper = {
    long("id") ~
    str("user_name") map {
      case id ~ name => User(id, name)
    }
  }
}

def getUser(id: String): Option[User] = {
  DB.withConnection { implicit conn =>
    SQL("SELECT id, user_name FROM user WHERE user.id = {id}")
      .on("id" -> id)
      .as(User.rowMapper singleOpt)
  }
}

The query is executed and the results are mapped to the User using a RowMapper, which converts the result columns into Scala types and ultimately into the Scala User object that you’ve defined.

Joins

But what if you want a more complex object, such as adding phone numbers and email addresses to your user? Let’s say you want something more like the following:

case class User(id: Long, name: String, emails: List[String], phones: List[String])

object User {
  def rowMapper = {
    long("id") ~
    str("user_name") ~
    (str("email") ?) ~
    (str("number") ?) map {
      case id ~ name ~ email ~ number => ((id, name), email, number)
    }
  }
}

This row mapper doesn’t return a User object directly, but rather the columns grouped into a Triple (with id and name as the first part of the Triple).

Anorm doesn’t have a lot of support for that out of the box, but Scala’s built-in functions for dealing with Lists and Maps have the tools that you need. Take a look at the following example. If you’re new to Scala, good luck wrapping your brain around it.

def getUser(id: String): Option[User] = {
  DB.withConnection { implicit conn =>
    SQL(
      """SELECT user.id, user_name, email.email, phone.number
        |FROM user
        |LEFT JOIN email ON email.user_id = user.id
        |LEFT JOIN phone ON phone.user_id = user.id
        |WHERE user.id = {id}""".stripMargin)
      .on("id" -> id)
      .as(User.rowMapper *)
      .groupBy(_._1)
      .map {
        case ((dbId, name), rest) =>
          User(dbId, name, rest.unzip3._2.flatten.distinct, rest.unzip3._3.flatten.distinct)
      }.headOption
  }
}

But we can break down those steps a little bit, include the type declarations of what happens at each step to make it more clear as to what’s being done. Using those type declarations you end up with something like the following.

def getUser(id: String): Option[User] = {
  DB.withConnection { implicit conn =>
    val result: List[((Long, String), Option[String], Option[String])] = SQL(
      """SELECT user.id, user_name, email.email, phone.number
        |FROM user
        |LEFT JOIN email ON email.user_id = user.id
        |LEFT JOIN phone ON phone.user_id = user.id
        |WHERE user.id = {id}""".stripMargin)
      .on("id" -> id)
      .as(User.rowMapper *)

    val queryGroupedByUser: Map[(Long, String), List[((Long, String), Option[String], Option[String])]] =
      result.groupBy(_._1)

    val listOfUser: Iterable[User] = queryGroupedByUser.map {
      case ((dbId, name), rest) => {
        val emails: List[String] = rest.unzip3._2.flatten.distinct // keep the Some values; drop duplicates from the join
        val phones: List[String] = rest.unzip3._3.flatten.distinct
        User(dbId, name, emails, phones)
      }
    }

    listOfUser.headOption
  }
}

Let’s break that down a little more:

val result: List[((Long, String), Option[String], Option[String])] = SQL(
  """SELECT user.id, user_name, email.email, phone.number
    |FROM user
    |LEFT JOIN email ON email.user_id = user.id
    |LEFT JOIN phone ON phone.user_id = user.id
    |WHERE user.id = {id}""".stripMargin)
  .on("id" -> id)
  .as(User.rowMapper *)

This code creates a List as you can see from the type declaration. The List contains an entry for each row returned in the result set. Because we used JOIN clauses, we might have gotten back many rows. For example, if a user had 2 emails the results might have looked like:

id, name, email, number
1, Geoff, geoff@example.com, 15135551212
1, Geoff, geoff2@example.com, 15135551212

The resulting Scala List directly contains the data from that result set, but we take the extra step of grouping the basic User data (the parent) into its own Tuple, which we’ll use later to identify the unique Users. The Scala List for the above result set would contain:

List(((1, "Geoff"), Some("geoff@example.com"), Some("15135551212")),
((1, "Geoff"), Some("geoff2@example.com"), Some("15135551212")))

Next we create a map of the results where the key to the map is the unique users:

val queryGroupedByUser: Map[(Long, String), List[((Long, String), Option[String], Option[String])]] =
  result.groupBy(_._1)

From that list, we create a map, where the keys of the map are the unique parent objects. This turns the list shown above into a map like:

Map((1, "Geoff") ->
  List(((1, "Geoff"), Some("geoff@example.com"), Some("15135551212")),
       ((1, "Geoff"), Some("geoff2@example.com"), Some("15135551212")))
)

This mapping will work if there are many keys returned as well (assuming you were querying by something non-unique). In that case your map will contain one entry for each of the unique parents.

Finally, we need to take apart the Map and turn it into our domain object:

val listOfUser: Iterable[User] = queryGroupedByUser.map {
  case ((dbId, name), rest) => {
    val emails: List[String] = rest.unzip3._2.flatten.distinct // keep the Some values; drop duplicates from the join
    val phones: List[String] = rest.unzip3._3.flatten.distinct
    User(dbId, name, emails, phones)
  }
}

The case statement destructures the Map entry back into the key containing the basic user information and the list of all the other data associated with that user. rest.unzip3 turns the List[(A, B, C)] into (List[A], List[B], List[C]). _._2 takes the second element out of the Triple, in this case the List[Option[String]] containing the emails. Calling flatten on that list keeps only the Some values, and distinct removes the duplicates introduced by the cross join (each email row is repeated once per phone row). The same process produces the phones. Those values, along with the key from the map, are used to create the instances of our User. Since we only expect one user for a given id, we use listOfUser.headOption to get the first element of the list (or None if the list is empty).

Hopefully breaking down the Scala into smaller chunks will help some people understand how this stuff works.

MongoDB: MapReduce Functions for Grouping

SQL GROUP BY allows you to perform aggregate functions on data sets: to count all of the stores in each state, to average a series of related numbers, and so on. MongoDB has some aggregate functions, but they are fairly limited in scope. The MongoDB group function also suffers from the fact that it does not work on sharded configurations. So how do you perform grouped queries using MongoDB? By using MapReduce functions of course (you read the title, right?)

Understanding MapReduce

Understanding MapReduce requires, or at least is made much easier by, understanding functional programming concepts. map and reduce (fold, inject) are functions that come from Lisp and have been inherited by a lot of languages (Scheme, Smalltalk, Ruby, Python).

map
A higher-order function which transforms a list by applying a function to each of its elements. Its return value is the transformed list. In MongoDB terms, the map is a function that is run for each Document in a collection and can return a value for that document to be included in the transformed list.
reduce
A higher-order function that iterates an arbitrary function over a data structure and builds up a return value. The reduce function takes the values returned by map and allows you to run a function to manipulate those values in some way.
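Outside of MongoDB, these same two ideas exist as ordinary JavaScript array methods, so a quick sketch may help if the functional vocabulary is new (the data here is made up for illustration):

```javascript
// map: transform each element of a list with a function;
// the return value is the transformed list
var lengths = ["Miller", "Point", "Coors"].map(function (name) {
    return name.length;
});
// lengths is [6, 5, 5]

// reduce: iterate a function over a list, building up a single value
var total = lengths.reduce(function (sum, n) {
    return sum + n;
}, 0);
// total is 16
```

MongoDB's mapReduce follows the same shape, except that map is run once per document and reduce is run once per emitted key.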

Some Examples

Let’s start with some sample data:

db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );
db.factories.find()

Let’s say I want to count the number of factories in each of the cities (ignore the fact that I could have the same city in more than one state; I don’t in my data). For a count, I write a function that “emits” the group-by key and a value that you can count. It can be any value, but for simplicity I’ll make it 1. emit() is a MongoDB server-side function that you use to identify a value in a document that should be added to the transformed list. If emit() is not called, the values for that document are excluded from the results.

mapCity = function () {
    emit(this.metro.city, 1);
};

The next piece is the reduce() function. The reduce function will be passed a key and an array of values that were collected by the map() function. I know my map function emits a 1 for each document, keyed by city. So the reduce function will be called with the key “Golden Springs” and a single-element array containing a 1. For “Milwaukee” it will be passed a 4-element array of 1s.

reduceCount = function (k, vals) {
    var sum = 0;
    for (var i in vals) {
        sum += vals[i];
    }
    return sum;
};

With those two functions I can call the mapReduce function to perform my query.

res = db.factories.mapReduce(mapCity, reduceCount)
db[res.result].find()

This results in:

{ "_id" : "Golden Springs", "value" : 1 }
{ "_id" : "Milwaukee", "value" : 4 }
{ "_id" : "Steven's Point", "value" : 1 }

Counting is not the only thing I can do of course. Anything can be returned by the map function including complex JSON objects. In this example I combine the names of all of the Factories in a given City into a simple comma-separated list.

mapCity = function () {
    emit(this.metro.city, this.name);
};

reduceNames = function (k, vals) {
    return vals.join(",");
};

res = db.factories.mapReduce(mapCity, reduceNames)
db[res.result].find()

Gives you:

{ "_id" : "Golden Springs", "value" : "Coors" }
{ "_id" : "Milwaukee", "value" : "Miller,Lakefront,Pabst,Blatz" }
{ "_id" : "Steven's Point", "value" : "Point" }

Conclusion

These are fairly simple examples, but I think it helps to work through this kind of simple thing to fully understand a new technique before you have to work with harder examples.


MongoDB Replication is Easy

Database replication with MongoDB is easy to set up. Replication duplicates all of the data from a master to one or more slave instances and allows for safety and quick recovery in case of a problem with your master database. Here is an example of how quick and easy it is to test out replication in MongoDB. First, create a couple of directories for holding your mongo databases.
mkdir master slave
Start by running an instance of the “master” database.
cd master
mongod --master --dbpath .
Start a new terminal window and continue by running an instance of a “slave” database. This example runs on the same machine as the master, which is great for testing, but wouldn’t be a good setup if you were really trying to implement replication in a production environment, since you would still have a single point of failure.
cd slave
mongod --slave --port 27018 --dbpath . --source localhost
And start another terminal window to use as the client:
mongo
db.person.save( {name:'Geoff Lane'} )
db.person.save( {name:'Joe Smith'} )
db.person.find()
db.person.save( {name:'Jim Johnson', age: 65} )
db.person.find()
Now kill the master instance in your terminal with Control+C. This simulates the master server dying. Lastly, connect to the slave instance with a mongo client by specifying the port.
mongo --port 27018
db.person.find()
As you can see, db.person.find() returns all of the values that were saved to the master, which shows that replication is working. Another interesting fact is that you can start a slave instance even after the master is already running and has data, and all of the data will be replicated over to the slave as well, without ever shutting down your mongod master instance. This allows you to add replication after the fact with no downtime. For more on MongoDB check out these books:
* MongoDB: The Definitive Guide
* The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing
* MongoDB for Web Development (Developer’s Library)

Scala and Adding New Syntax

One interesting thing about some languages is their support for adding new syntax. While all languages have the ability to add new functions or types some have specific properties that make it easy to add what looks like new built-in syntax.

Scala is an Object Oriented language. You can declare classes and objects, do inheritance and composition, and all the other things you might expect from an OO language. Scala is also a Functional language because functions are first-class citizens. And when I say Scala is an OO language I really mean it: everything is an Object. Even functions are Objects. (Chew on that one for a bit.)

Scala also supports optional parentheses for method calls that take a single argument. (Note: this applies to method calls on objects only, not to functions.) This exists for a very practical reason. Take the following example:

1 + 2

This is a very nice way to write an addition operation. In reality what’s happening is:

1.+(2)

1 is an object and + is a method on that object that takes a single parameter. Applying the previous rule, we get to remove the dot and the parentheses, which allows us to write our previous example as 1 + 2.

The good news is that this consistency applies to the language as a whole: any call to a method that takes a single parameter can omit the dot and the parentheses around its argument. These features make it pretty easy to emulate the built-in syntax of a language.

Your Own While Loop

Let’s say I want to write my own while loop:

def mywhile(condition: => Boolean)(command: => Unit) {
  if (condition) {
    command
    mywhile(condition)(command)
  }
}

var x = 1
mywhile(x < 100000) {
  println(x)
  x += 1
}

As you can see, I end up calling mywhile the same way I would call the built-in while. It is implemented as a tail-recursive function: if the condition is met, the command is executed, and the function then recurses to continue the loop. x < 100000 is passed as a by-name parameter, so the expression is re-evaluated each time condition is referenced.

Your Own Do...While Loop

A while loop can be built using just a single function. What if you want to create a do...while loop instead? In this case you can make use of the OO/functional hybrid.


class Repeater(command: => Unit) {
  final def aslongas(condition: => Boolean) {
    command
    if (condition) aslongas(condition)
  }
}

def mydo(command: => Unit): Repeater = {
  new Repeater(command)
}

var x = 0
mydo {
x += 1
println(x)
} aslongas (x < 100000)

In this case I use recursion again to do the looping. But I use an Object to bind the command to and an aslongas method to run that command and check the looping condition. I use a function mydo to bootstrap an instance of the Repeater class. Scala gives us the ability to use functions and objects when they make sense.

Why Should You Care?

Ok, so you're not going to write your own while loops. The language has them built-in already. But what this allows you to see is how you can add new "syntax". That ability makes it quite convenient and easy to write higher-order syntax to solve application specific problems or to create DSLs.

Update: Changed the until name to 'aslongas' since it really wasn't until the condition was met.

Scheme/HtDP Unit Testing Functions

How to Design Programs (HtDP) provides a series of unit testing functions that allow you to test the output of any kind of function. The current version talks about testing your code, but doesn’t offer a lot of guidance on how to do that. Everything you need is there though, so in case you are doing this on your own and don’t have a teacher to tell you to do this, here’s some explanation that I hope helps.

check-expect

check-expect takes the results of a function and the expected value. This is the simplest of the testing constructs which answers the question “Does the expected result of a function call equal the result returned by the function?”. It handles all of the details about types of values returned, so it can work with a simple value like a number, a complex value like a structure or a list, or even a picture.


;; add: number number -> number
;; add two numbers together
(define (add n m)
  (+ n m))

;; Tests
(check-expect (add 1 1) 2)

check-within

check-within is similar to check-expect, but it takes a third argument that says how close the result must be to the expected value. This is good for certain functions that might produce random results, but the most common case is dealing with floating-point numbers. Floating-point numbers are not always represented precisely, so there needs to be some “room for error,” so to speak.


(define PI 3.14)
(check-within (/ 22 7) PI .003)

(check-within (random 10) 0 9)

check-error

Finally, you can use check-error to test cases where your function throws an error. Those error conditions are an important part of a function contract, so testing them should be done just like any other possible conditions.

One of the things that threw me at first was what to use as the expected value of a check-error test.

An error in HtDP is “thrown” by a function like:

(error 'function-name "some message")

After a bit of trial and error, I found out the expected result is a string like:

; "function-name: some message"
(check-error (error 'list-pick "list too short") "list-pick: list too short")

More Realistic Example


;; list-pick : list-of-symbols N[>= 0] -> symbol
;; to determine the nth symbol from alos, counting from 0;
;; signals an error if there is no nth item
(define (list-pick alos n)
  (cond
    [(empty? alos) (error 'list-pick "list too short")]
    [(= n 0) (first alos)]
    [(> n 0) (list-pick (rest alos) (sub1 n))]))

;; Tests
(check-error (list-pick empty 1) "list-pick: list too short")
(check-expect (list-pick (list 'a 'b 'c 'd) 3) 'd)
(check-error (list-pick (list 'a 'b 'c 'd) 4) "list-pick: list too short")

Studying HtDP

I haven’t posted in a while, but I’ve recently been studying How to Design Programs (HtDP). It’s really a book about teaching beginners how to program by following a series of recipes that tell you how to identify and solve different classes of problems. The techniques are taught in a very basic subset of Scheme. (There are actually multiple learning languages that slowly build up the available functionality.) You don’t have to know any Scheme to get started though.

Basic Scheme

I’m sure you can figure out what the following does without knowing any Scheme:

(+ 1 3)
(* 10 10)

They explain a bit about equality and conditionals:

(= 1 1) ; true
(and true true) ; true
(or true false) ; false
(and (>= 1 1) (= 2 2)) ; true

Then building on basic arithmetic and boolean logic, they add a bit of algebra to teach how to define your own methods:

(define (add1 n)
  (+ n 1))

(define (sqr n)
  (* n n))

;; Tests
(= (add1 2) 3)
(= (sqr 10) 100)

It grows from those simple constructs into how to build structured and list data and how to process those kinds of data.

Why Might You Care?

It’s very hard for me to judge what it would be like to learn to program using this book. I personally have 10 years of programming experience and learned first with BASIC, then really with Pascal and C/C++. Professionally I first programmed in Java, then C#. From there I got interested in dynamic languages, Python then Ruby, and looked a bit at Objective-C along the way. By and large these are all Imperative Programming Languages. I’m sure my experience isn’t unique among those of us who didn’t go to MIT. So if, like me, you are new to Functional Programming, then it is an interesting book because it teaches you how to think in terms of functional decomposition.

Multi-core processing is the future and functional programming, due to the ability to much more easily create side-effect free programs, is well suited for parallelization. Understanding functional programming is going to be very important for programmers in the future. We might never go to all functional programming, but we might go to more mixed-mode languages that allow us to easily write parallelized code for portions of a program that need it.

Erlang First Post

Some linguists and philosophers posit the idea that you can not have a thought for which you do not have language.

“The limits of my language mean the limits of my world.”
— Ludwig Wittgenstein

I’ve been looking at Erlang a bit. Erlang is a functional programming language that is very unlike the imperative languages with which I, and many others, are most familiar. Learning a new programming language, and especially a fundamentally new programming paradigm, really changes the way you think about solving problems. If the linguists are to be believed, it fundamentally allows you to have new thoughts and ideas. This is what makes it so valuable for software developers to look at new languages. I thought I would share some tidbits.

Erlang Variables Aren’t Variable

Erlang only allows you to set the value of a “variable” once. If you try to set it more than once the language actually throws an error.


Eshell V5.6.2 (abort with ^G)
1> X = "foo".
"foo"
2> X = "bar".
** exception error: no match of right hand side value "bar"

At first this sounds like a huge limitation to my imperative mind, and in many imperative languages it probably would be. In Erlang it ends up not being much of an issue; in many ways it’s one of the things that forces you to write more idiomatic, functional Erlang.

Side Effect Free

Not being able to modify variables is one of the things that helps keep Erlang programs side effect free. Side effects make programs harder to understand and can make them more error prone. A side-effect-free function is “pure”: every time you call it with a given value, it will return the same result.

Thinking of side effects and how they can be reduced or removed from imperative programs can make those programs easier to understand and test.

Pattern Matching as Polymorphism

To my imperative brain, which grew up mostly on Object Oriented programming languages, polymorphism and the related abstractions are notions of classes and types. Erlang changes this abstraction into one of pattern matching. Erlang’s pattern matching is almost a declarative construct like you would find in XSL: when a call to a function matches this pattern, use it; otherwise try the next clause until one matches.

To compute a Factorial, you can use 2 function clauses. The first, *fac(0)*, is called whenever the function is called with the integer value zero. If the value is not zero, then that pattern is not matched and the next clause is tried. In that case *fac(N)*, where N is any value other than 0, is evaluated.

fac(0) ->
    1;
fac(N) ->
    N * fac(N-1).

In a slightly more complex example, you can actually pass a keyed Tuple. The key, in Erlang speak, is an atom, very similar to a Ruby symbol. Those atoms are used as part of the pattern matching to determine which function to execute.

area({square, Side}) ->
    Side * Side;
area({circle, Radius}) ->
    3.14 * Radius * Radius.

area({circle, 10}).
area({triangle, 2, 2, 2.82}). %% error, no match

Thinking about abstractions beyond types or classes in your Object Oriented programs could open you up to some interesting new solutions.

Distribution Should be Explicit, Not Necessarily Hard

Distributed computing takes more work than assuming all your code will be running on a single machine, in the same process. The Fallacies of Distributed Computing are all about the assumptions we make when we try to hide the complexity of distribution from the caller. With Java and .NET, for example, remote calls can become so hidden that they look like just another method call within the local VM. While this is convenient, it can also lead to serious problems when users don’t take into account the overhead and the extra things that can go wrong with remote calls.

Erlang makes concurrent programming and spawning lots of processes to do work a very natural part of the language. Part of the language then is how to deal with the problems that arise when you run into issues talking to a remote process. The language has exception handling, it can be set up to receive a response, only wait for a timeout, etc.

The biggest thing that Erlang does is not try to hide the fact that you are communicating with a remote process (whether that process is in the same node, a different node, or a different machine). It gives you, the programmer, the tools to decide what conditions you need to handle based on how your program is built and deployed. The language makes all of those scenarios straightforward, but it is still explicit that you are talking to a different process.


ping(N, PongPid) ->
    PongPid ! {ping, self()},
    receive
        pong ->
            io:format("ping received ~B~n", [N])
    after 1000 ->
        io:format("I don't think we're gonna get a response~n")
    end,
    ping(N - 1, PongPid).

pong() ->
    receive
        finished ->
            io:format("pong finished~n", []);
        {ping, PingPid} ->
            io:format("pong received~n", []),
            PingPid ! pong,
            pong()
    end.

Think about how you can make your distribution explicit and not hidden from callers who could make bad assumptions.

I’m finding Erlang and really trying to think in a functional programming language very interesting right now, so I’ll probably post some more about it. I think it will allow me to talk more intelligently to Grant at least.