Choosing an open source NoSQL database

I’m currently evaluating which NoSQL we could use in a new project and thought I’d document the considered options are and the relevant criteria here.

First let’s see what a NoSQL database is at all. NoSQL doesn’t necessarily stand for “No SQL” but rather for “Not only SQL”. So this means it’s a database which can be worked with without using SQL but it doesn’t mean that none of them actually supports an SQL syntax. But the goal is of course not to get rid of SQL but rather to support use cases which are currently not well supported by classical relational database management system (e.g. Oracle, Sybase ASE, MS SQL Server, MySQL…).

Shortcomings of relational databases

So what are the main shortcomings of relational databases:

  1. Effort to setup and maintain.
  2. Scalability.
  3. Performance.

Effort to setup and maintain

In a relational database, all entities stored in the database must be defined with a schema known in advance. The relationships between entities have to be modeled. Whenever you need to be able to store new types of data or whenever you need to store additional attributes, you need to update your schema and make sure that existing data are made compatible to the new schema.

So the effort to setup comes from the fact that to park a car in there, you need to disassemble the car to be able to store it in the garage. This means that the whole structure of the data stored in the database needs to be known in order to create a schema which is required in order to store data. The only way to be able to store different types of data not known in advance is to store them as BLOBs which makes it impossible to use any of the advantages of relational databases later on.

The effort to maintain comes from the fact that if you want to start parking trucks additionally to cars, you need to figure out what are the parts making up a truck and what’s common between cars and trucks before you can store trucks.

Scalability

The main scalability issue with relational databases is that in order to keep data integrity and support transactions, a relational database needs to handle transactions and synchronize writes across multiple related entities which if stored on different servers requires much more effort especially to handle deadlocks. Making sure that all data related to a single entity on a single machine becomes increasingly complex. The synchronization costs tend to quickly increase as the complexity of stored data increase.

A big issue regarding scalability arises from the fact that relational databases usually require much more expensive hardware in order to scale. Scaling a relational database with commodity hardware becomes a very difficult task because of the need to support a global lock manager and distributed synchronized writes.

So basically relational database scale very well on a single server but the problem arise when you need to scale it beyond the single server deployment.

Performance

Since the car was split into individual parts in order to park it in the garage, retrieving the car from the garage means reassembling it from its parts. This is where the performance issue of relational databases comes from. Whenever you need to retrieve an entity and related data, it becomes less efficient if they are stored separately. If you stored them all together, you’d be able to retrieve the whole car much faster.

Since a relational database also needs to make sure that the integrity of the model is maintained and ensure atomicity when storing the different parts of the car, storing the car in the garage is slower than if you just stored the whole car at once.

Both points above are of course also related to the scalability issue since the cost to store and retrieve is increased even more when the different parts are physically stored on different machines.

Different types of NoSQL databases

In order to overcome the shortcoming of relational databases in some scenario, different types of NoSQL database came to life. There are basically 4 big groups of NoSQL databases:

  1. Column-oriented databases.
  2. Key-Value stores.
  3. Graph databases.
  4. Document databases.

In the sections below, I’ve listed the characteristics of the different NoSQL database and database products in each category. I’ve only considered products matching the following criteria:

  1. License: Open Source
  2. Supports disk storage
  3. Deployable on Linux, Mac OS X and Windows
  4. Deployable on an own server

The OS requirement is important for me since the final deployment will be on a Linux server (or servers) but development will be done on Mac and Windows. It is also important that the database software doesn’t put unnecessary restrictions on the operating system we’ll use in development and deploy it on in the end.

We also plan to deploy the solution on our own servers so databases which can only be used in combination with a specific cloud offering are not considered.

Column-oriented databases

A column-oriented databases stores data tables as a set of columns rather than a set of rows. They are mostly used for data warehouses and CRM systems where it’s important to be able to aggregate data over large numbers of similar data items.

Cassandra by the Apache Software Foundation seems to be the only database product in this category which seems to match the above criteria. It is released under the Apache license.

Key-Value stores

They data are stored by the application in a schema-less way. It’s value is associated to a key which uniquely identifies it.

Unfortunately, although there are many key-value stores on the market, I couldn’t find a single one matching all the criteria above. If you happen to know of such a database please let me know in the comments and I’ll update this post.

Graph databases

They are able to store elements interconnected with an undetermined number of relations between them. They are mostly appropriate for modelling social relationships, maps and transportation.

I could find two databases which match the above criteria.

The first one is Neo4j by Neo Technology. It is released under the GPL. One disadvantage of Neo4j is about scalability. It doesn’t seem to be part of the main design of Neo4j though Neo4j Enterprise seems to have some support for replication allowing performing a online backup. Without Neo4j you will need to shutdown the database and copy the database files. Neo4j also seems to lack partitioning support.

The second one is OrientDB by Orient Technologies. It is released under the Apache license.

Both products support ACID transactions. Only OrientDB supports partitioning. OrientDB supports replication out of the box. Also the license of OrientDB is more developer friendly. So if you do need a graph database but do not have time to evaluate both products, I’d recommend checking OrientDB.

Document databases

Document databases (also called document stores) store documents encoding data using e.g. XML, YAML, and JSON/BSON (or also as PDF or Microsoft Office files). The documents are stored as collections of documents. These collections are similar to tables in relational databases (the documents being the records in those tables) but unlike relational databases, the document in these collections do not need to have the same schema. They can actually have completely different fields.

Each document in the database has a unique key used to identify it. But unlike key-value stores, document stores also provide functionality to retrieve documents based on their contents (even though all documents do not have the same attributes/fields).

There are quite a few document databases matching my above requirements.

MongoDB by 10gen

It is released under the AGPL. It’s the most well-known document database on the market. It is used by Craigslist, Foursquare and Shutterfly.
It comes with a lot of functionality:

  • predefined datatypes
  • indexes
  • JavaScript server-side scripting
  • partitioning
  • master-slave replication
  • MapReduce
  • eventual and immediate consistency
  • atomic operations within one document

MongoDB supports ad-hoc queries pretty well and its query tools support a lot of what can be done in SQL (of course with the exception of joins). So if you have experience working with an SQL relational database, you should be able to get used to it pretty quickly.

CouchDB by the Apache Software Foundation

It is released (as expected) under the Apache license. It’s a document store inspired by Lotus Notes. It is used by quite a few organizations no big names like MongoDB or Couchbase.

Compared to MongoDB, it does support a few more operating systems (e.g. Android and BSD). But it does not support the following:

  • predefined datatypes
  • immediate consistency

But it does support the following which is not supported by MongoDB:

  • triggers
  • master-master replication

CouchDB is a single node solution with peer-to-peer replication technology and is better suited for decentralized systems. So if you do not need immediate consistency and need master-master replication to build a decentralized system, CouchDB might be a better fit than MongoDB.

Couchbase by Couchbase, Inc.

It is a JSON-based document store derived from CouchDB with a Memcached-compatible interface and is released under an Apache license. It is used by many companies including Adidas, Adobe, Aol, BMW, Cisco, Ebay, Intel, Mozilla, Nokia, Vodafone and Zynga.

Compared to MongoDB it doesn’t support deployment on Solaris and also lacks predefined datatypes.

Being based on CouchDB, it also supports triggers and master-master replication but also supports immediate consistency like MongoDB (and which isn’t supported by CouchDB).

Couchbase additionally has a built-in clustering system and can spread data automatically across multiple nodes.

Also since Couchbase provides built-in Memcached-based caching, it is usually better suited for use cases where low latency or high throughput is a requirement.

If easy scalability and high throughput are important to you but you do not want to sacrifice immediate consistency, then Couchbase might be the right solution for you.

But you should keep in mind that Couchbase is not entirely open-source. There are two versions: Community Edition (free but no latest bug fixes) and Enterprise Edition (with additional restrictions). If you plan to use the Enterprise Edition, you should carefully read the license terms.

ArangoDB

ArangoDB is released under an Apache license. It supports both disk and RAM-based storage of JSON data.

Compared to MongoDB, it does not provide support for MapReduce but supports the ArangoDB query language which allows using aggregation, graph queries, grouping, joins, list iteration, results filtering, results projection, sorting and variables. It also supports ACID transactions.

An advantage of ArangoDB is that it supports database models based on graphs, key-values and documents.

Summary

Since our goal is to store multiple XML document types which might have different schemas and need to be able to generate reports based on their contents, our obvious choice is to go for a document oriented database. This leaves us with 4 database products to choose from:

  1. MongoDB
  2. CouchDB
  3. Couchbase
  4. ArangoDB

Since not all non-functional requirements are yet available, we need to try and define sets of possible requirements and for each of them define the corresponding product which would be our favorite.

First, ArangoDB seems to be a very good product from it’s supported functionality and architecture. But it’s quite a new product (initial release in 2011) and doesn’t have the same kind of user community the other 3 products have. Looking at Google Trend, you will also get the following figures for March 2014:

  • MongoDB: 100
  • CouchDB: 8
  • Couchbase: 7
  • ArangoDB: 0

Of course, if you read this article in a few years from now, the situation will most probably be different. Looking into my crystal ball, I’d say that in a year or two from now, ArangoDB will be up a little bit, interest for CouchDB will further move to Couchbase and MongoDB will still be number one but by not as much as now.

Also since scalability is very important, we would rather tend to use Couchbase than CouchDB. Of course we need to further analyze the differences between the Community and Enterprise editions and also check the exact terms of the license for the enterprise edition.

The only two thing speaking against MongoDB seem to be:

  1. The AGPL license which I’ve always found scary.
  2. Scaling with MongoDB seems to be more complex than with Couchbase.

Right now, I am not 100% sure whether to go for Couchbase or MongoDB. Both seem to meet all our requirements and we probably need to give them both a try and see which one is the perfect fit.