These days we all want to build the next big thing which will be deployed across the world. This of course is all fun and games, but there are also some technical difficulties you have to overcome when creating a software platform which has to be available from everywhere in the world with a responsive interface.
One of these difficulties is getting the required data close to your customers. Most of the time we use a database to store this data. In its traditional form, this will likely be an on-premises database somewhere in a datacenter (with all disaster recovery aspects in place, of course). Note that I’m talking about traditional relational databases here, but most of this also applies to non-relational databases.
In order to deploy your software globally, the data has to move with it. It doesn’t make much sense to deploy your software in a datacenter on the other side of the world when all the data remains far away, because the latency will slow down the experience. There are of course multiple solutions you can think of to solve this problem. One of them is to use sharding in your database.
Sharding is a concept which is widely used in today’s NoSQL database systems, although most of the time the sharding is done locally. It means you split your data across multiple databases.
Wikipedia on sharding
Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
As you can imagine, this is a great solution if you want to make data available in a datacenter near a customer. You can put the data of your European customers in a database which is deployed in Europe and the data of an American customer somewhere in the USA.
I’ve found a nice, simple picture to make the concept a bit more visual.
In the above image you see one massive database which holds all of the data on the left. On the right, this data is sharded across multiple databases. This is what sharding is all about.
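To make the idea a bit more concrete, here is a minimal sketch of routing customers to a regional shard. Everything in it is an assumption made for illustration: the `Shard` and `ShardMap` classes, the `save_customer` helper, the shard names and the choice of region as the sharding key are hypothetical, and the "databases" are plain in-memory dictionaries rather than real connections.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One database in the sharded setup; rows live in memory for this sketch."""
    name: str
    rows: dict = field(default_factory=dict)  # customer id -> customer record


class ShardMap:
    """Maps a sharding key (here: the customer's region) to a shard."""

    def __init__(self):
        self._by_region = {}

    def register(self, region, shard):
        self._by_region[region] = shard

    def shard_for(self, region):
        try:
            return self._by_region[region]
        except KeyError:
            raise LookupError(f"no shard registered for region {region!r}") from None


def save_customer(shard_map, customer):
    """Store the customer in the shard that belongs to its region."""
    shard = shard_map.shard_for(customer["region"])
    shard.rows[customer["id"]] = customer
    return shard.name


# Hypothetical setup: one shard per region.
shard_map = ShardMap()
shard_map.register("EU", Shard("customers-eu"))
shard_map.register("US", Shard("customers-us"))

print(save_customer(shard_map, {"id": 1, "name": "Contoso BV", "region": "EU"}))
print(save_customer(shard_map, {"id": 2, "name": "Fabrikam Inc", "region": "US"}))
```

The application only ever asks the shard map where a customer lives, so moving a region to a different database later means changing one registration instead of every query.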
The main advantage of this principle is that you get the smallest possible latency between the customer and the data. Another advantage is that you can create ‘premium’ databases and ‘standard’ databases, depending on the customer and the load it produces. You can even split customer data into several smaller databases (shards) and micromanage these even better.
As always, there are also several disadvantages which you will need to consider. One of them is the increased latency when querying data from multiple shards (databases). The query will have to run on all shards, so the combined result will be as slow as the slowest connection/database in the system.
Another disadvantage, which is more important to the development team, is the complexity of a proper sharding solution. When implementing a sharding solution you have to make sure all data is stored consistently, create your own transactions, copy data of reference tables across all shards, solve failure scenarios, migrate shardlets (data) between shards, etc.
Managing all of this yourself is doable, but not something you or your team want to do unless it’s explicitly necessary.
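The latency point can be illustrated with a small fan-out sketch. It is only a simulation under stated assumptions: the `SHARDS` list, the `query_shard` and `fan_out` functions and the latency numbers are hypothetical, and `time.sleep` stands in for the network round trip to each database.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: a name, a simulated round-trip time in seconds, and some rows.
SHARDS = [
    {"name": "customers-eu", "latency": 0.02, "orders": [("Contoso", 120), ("Tailspin", 80)]},
    {"name": "customers-us", "latency": 0.10, "orders": [("Fabrikam", 200)]},
]


def query_shard(shard, min_total):
    """Pretend to run the same query on one shard; the sleep simulates the network."""
    time.sleep(shard["latency"])
    return [row for row in shard["orders"] if row[1] >= min_total]


def fan_out(shards, min_total):
    """Run the query on every shard in parallel and merge the partial results."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: query_shard(s, min_total), shards))
    elapsed = time.perf_counter() - started
    merged = sorted(row for part in partials for row in part)
    return merged, elapsed


rows, elapsed = fan_out(SHARDS, min_total=100)
slowest = max(s["latency"] for s in SHARDS)
print(rows)
print(f"elapsed {elapsed:.2f}s, slowest shard {slowest:.2f}s")
```

Even with all shards queried in parallel, the merged result cannot be returned before the slowest shard has answered, so the elapsed time never drops below that shard's latency.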
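To give an impression of just one item on that list, below is a rough sketch of moving a shardlet between two shards. It is an assumption-laden illustration and not how any particular library does it: the `shards` and `shard_map` dictionaries, the `ShardletOffline` exception and the `lookup` and `move_shardlet` functions are all hypothetical, and a real implementation would also need transactions, retries and handling for a crash halfway through.

```python
class ShardletOffline(Exception):
    """Raised when a request hits a shardlet that is being moved."""


# Hypothetical in-memory stand-ins: two shards and a map from customer id to shard.
shards = {
    "shard-a": {42: [{"order": 1}, {"order": 2}]},
    "shard-b": {},
}
shard_map = {42: {"shard": "shard-a", "online": True}}


def lookup(customer_id):
    """Return the shard name for a customer, refusing requests during a move."""
    entry = shard_map[customer_id]
    if not entry["online"]:
        raise ShardletOffline(customer_id)
    return entry["shard"]


def move_shardlet(customer_id, target):
    """Move one customer's rows to another shard: offline, copy, remap, clean up."""
    entry = shard_map[customer_id]
    source = entry["shard"]
    if source == target:
        return
    entry["online"] = False  # 1. stop routing requests to the shardlet
    try:
        # 2. copy the rows to the target shard
        shards[target][customer_id] = list(shards[source][customer_id])
        # 3. point the shard map at the new location
        entry["shard"] = target
        # 4. remove the old copy from the source shard
        del shards[source][customer_id]
    finally:
        # 5. resume routing; if the copy failed, the map still points at the source
        entry["online"] = True


move_shardlet(42, "shard-b")
print(lookup(42))
```

Even this toy version has to decide what happens to requests during the move and what state to leave behind on failure, which is exactly the kind of work you would rather not own yourself.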
Lucky for those of us who live in the Microsoft Azure space, Microsoft has been releasing versions of their Elastic Scale library since October 2014, and I have had the pleasure of using this library for a customer of mine. As of April 2015 (BUILD 2015) the library has reached version 1.0, so it’s ‘safe’ for everyone to use.
The Elastic Scale library saves us developers the hassle of creating our own sharding implementation. It works great with SQL Azure databases, and with the Split-Merge Service package you also get a web application which is able to move, merge & split shardlets (data) between your shards. This Split-Merge service uses an eventual consistency technique, so you don’t have to worry (much) about corrupt data when moving shardlets around.
There is quite a bit of documentation (and samples) available from the documentation map. The getting started guide is also a nice place to start checking out this library. Keep in mind, you do need a Microsoft Azure subscription in order to use this solution.
I’ll describe some of the implementation details in the upcoming posts about this subject.