When we first thought about writing this post we were rather tempted to title it “MongoDB – 10 things I hate about you”, but that wouldn’t be fair after all.
We learned the hard way how painful it can be to replace a working MySQL installation with a mongoDB, especially in a large scale project environment, but blaming mongoDB for that would be like blaming a microwave for not being able to prepare an exquisite 3-course menu.
So we won’t (or at least we do not intend to) write your common mongoDB rent, especially because in the end we were able to make it work. But this won’t be a praise on the new hip database either.
Instead we want to give you some hands-on advice on how to actually make mongo work in large IT projects and which common mistakes and misunderstandings to avoid.
The more, the better
When we started off planning the migration from MySQL to mongoDB we assumed that with a lot of powerful hardware we would be on the save side. So for the new mongoDB setup we planned several times the computing power and memory than we had available for our mysql servers.
Thus, whilst planning the migration of the MySQL data to our new awesome mongoDB environment, we naturally were certain that our old and less powerful MySQL servers would be the bottleneck of the migration process.
So much for the theory… And actually when we started to migrate our data to the mongoDB it looked very promising at first. The migration was only limited by the rate we were able to read the data from the MySQL servers and everything was running smoothly, for about a day. Once we hit a certain quantity of data in our mongoDB things began to look rather bleak.
We really were taken unprepared – how on earth could the inserts on our 48 core machines and hundreds of GB of ram be that slow? In the end the explanation was quite easy, first of all we should have rather invested in more normal sized hardware rather than a few very powerful machines – and we didn’t take disk space into consideration. Especially in a sharded mongoDB setup the i/o load increases rapidly with growing data, so fast disks are even more important than computing power or even memory for scaling your mongoDB servers.
Only because you can use map-reduce, doesn’t mean you should
I think one thing that seems very appealing at first glance is the support of map-reduce operations (http://docs.mongodb.org/manual/applications/map-reduce/) in mongoDB.
Especially compared to MySQL we are very limited when it comes to aggregating data with mongoDB, for mongoDB beginners it might seem obvious that using map-reduce functionality to mimic some of MySQLs aggregating operations is a good idea.
Again we had to find out that with mongoDB things are not as intuitive as one might hope, especially if you are coming from a relational database background we can only advice you to do the research rather than trusting your instincts.
So again we had to learn the hard way that MongoDb’s map-reduce functionality just isn’t meant for real time computing, it is extremely slow, especially when you have a large amount of data in a sharded environment.
This issue shouldn’t be as sever anymore as in the mean time, as mentioned, the so called mongoDB Aggregation Framework was introduced and in addition to that the infamous global locks were eliminated.But still keep in mind that map-reduce is expensive.
One of mongoDB major selling points is that it provides an auto-sharding (partitioning) architecture and thus scales horizontally out of the box.
And indeed, the prospect of being able to easily add more servers to handle increasing data volume without having to take care about rebalancing or querying the right shard does sound very appealing. But again reality had a number of surprises in store for us. In order to understand the limitations and specifics of mongoDBs sharding let us try to give you a short, simplified overview on how sharding is actually realized.
MongoDBs documents are stored in so called chunks. For each chunk the maximum and minimum shardKey as well as the location (on which shard it is stored) and its size are known. Now when another document is inserted it will be stored within the chunk with the fitting id (shardKey) range. Now several things can happen. If you are lucky and the shards are balanced already, and the document doesn’t increase the chunks size over a given limit, thats just it, the document is inserted and you are fine. But if the document exceeds the chunks size limit, the chunk will be split into two different chunks. And if the shards are unbalanced, on top of that the new chunk will be moved to another shard. The split is a rather inexpensive operation, but moving the data to a different shard can be very expensive.
Due to this sharding logic the following things should be kept in mind when introducing sharding.
The shard key, as described above, is the field in a collection which determines where the document will be stored. This field has to be unique. Now coming from a SQL background, one might be tempted to use a typical sequential ID as a shard key. However, this is a bad choice especially when you have a lot of new documents. Because what obviously happens with when you consider the descirbed sharding logic, is that documents with sequential ids will be written to the same chunk, alas to the same shard. So new documents aren’t evenly distributed, which can lead to so called “hot shard”, meaning the load on one shard is extremely high, and expansive rebalancing will be required.
According to mongoDBs Documentation the ideal shard key:
- is easily divisible which makes it easy for MongoDB to distribute content among the shards. Shard keys that have a limited number of possible values are not ideal as they can result in some chunks that are “unsplitable.”
- will distribute write operations among the cluster, to prevent any single shard from becoming a bottleneck. Shard keys that have a high correlation with insert time are poor choices for this reason; however, shard keys that have higher “randomness” satisfy this requirement better.
- will make it possible for the mongos to return most query operations directly from a single specific mongod nstance. Your shard key should be the primary field used by your queries, and fields with a high degree of “randomness” are poor choices for this reason.
When sharding keep in mind, that if you want to have unique fields in your collection, they have to be part of the shard key.
As handy as the balancer is, you have to keep in mind that moving chunks to another shard is a very expensive operation, which can even slow down your complete application.So put new shards into production as early as possible: the longer you wait, the more data will have to be moved.
Another thing which we found helpful was the so called pre-splitting, where you basically tell mongo where to put which chunk in advance – if you can predict the data this might be a good option to avoid expensive balancing. Additionally it might be a good idea to stop the balancer during daytime / the time your application is used most, and only balance e.g. during the night.
The bottom line is that mongoDB is not the silver bullet which will magically solve all your problems, and might even be the wrong choice for your purpose.
But it’s a very interesting concept and it is remarkable how fast bugs are fixed and improvements are realized, so its definitely worth having a look at. We personally wouldn’t use it for another large project just yet, but we will keep a close look at it and are excited about what the future will bring