There are many reasons for which databases must be scaled. The majority of the time they must be scaled to accommodate for performance issues as the product grows. Though NoSQL is making a lot of noise these days, it is to no one’s surprise that SQL is still extremely popular. In general, the same principles are followed while scaling out any SQL product, be it MySQL, MsSQL, Oracle or even DB2. Scaling is often done to overcome performance issues as the product grows. However, when dealing with big data, scaling is often done to balance the data across multiple hardware nodes or clusters.
Most SQL products are scaled in clusters called shards. Each shard contains one or two masters and several slaves. Master servers are responsible for writing data whereas the slaves are responsible for reading data. NoSQL has become more popular over the years as it doesn’t require the complex infrastructures we see in SQL. In order to reduce calls made to the databases, a caching layer is added. These are often easy to put up and are cheap to run. As the product grows, the infrastructure can end up looking like this picture below.
I had to learn this graphic the hard way. Not expecting the product to be popular immediately after launch, we delayed scaling. To our surprise we hit 60k users after the second day. After learning the hard way, the game was scaled and was able to hit over 250k a few days later. Foresight is a great thing to have though often it is best to be ready to scale first. You must determine how you expect to grow. Is content going to be created constantly and the database is going to grow or will it remain stable over time?
A product design that doesn’t require a lot of data to be added to the databases will often benefit from a system that replicates databases and makes use of caching. Write-heavy applications will take the approach of growing their infrastructure vertically by splitting content up between shards and adding shards over time.
When dealing with partitioning, you will need to determine the key on which the data will be partitioned. With that key, an algorithm can be created that will be used to determine which shard the data will go to upon a read or write. For example: dealing with user registrations you have 3 shards. The 1st user is saved on the 1st shard, the 2nd user on the 2nd shard and so on. I often recommend storing values to identified shards based on a hash value of the key (like uid). One thing to keep in mind is that you will be growing and you likely will increase the number of shards, you will need to rebalance data when that happens.
There are a great many alternatives when trying to scale out your databases. Some are complex and take a lot of time to plan while others are as simple as setting up replication. I’ve been researching alternatives to scaling out SQL for years. With NoSQL coming up as a great solution, it did not meet my goals. There is a reason why relational databases are popular today. Joining tables, subqueries, stats and various other functions are often required. NoSQL attempts to accomplish several of those features through map-reduce, but it isn’t the same. This is why I have been working more closely with NewSQL solutions like VoltDB. They allow many SQL / relational database features all while being built from the ground up to scale.
__________________________________________________________________________________________
Francis Pelland is a born innovator and is experienced in building end-to-end technological solutions. He thrives on solving complex problems with elegant technical or product solutions, all the while improving user experience and building deeply embedded analytic solutions with Big Data.
One of our candidates here at P4Digital, if you want to contact him send a message to Archana Ravinder at aravinder@planet4it.com