
General principles

Distributing data across shards is a powerful way to maintain huge databases on commodity hardware. This, of course, comes with its own trade-offs.

First of all, shard routing is based on the values of certain fields and strives to balance the distribution of rows among the shards. These fields should satisfy some conditions. Because the data is placed in different tables, the database engine treats it as unrelated, and thus cannot process complex queries across shards, in particular joins, limit/offset, or nested queries.

Sharding algorithm

Sharding partitions data storage into different tables based on each row's contents. We want our algorithm to:

  • Uniquely define where to put each new row
  • Query as few shards as possible
  • Determine the unique shard where a row is placed from its primary key alone
  • Keep a row on the same shard when it is modified, rather than moving rows around

To satisfy these requirements, the shard is determined by the primary key value. Each PK field is hashed with some hashing method and the hash is truncated to a specified length. Two hashing methods are implemented: "md5" and "normalize". "md5" ensures a balanced distribution of data but does not allow querying data ranges. "normalize" allows querying data ranges, but the distribution depends on the field values, which places the burden on the architect to make sure a proper data distribution is maintained. But don't worry: when a shard gets too full, it can be split into two or more shards!

Shard partitions are labeled by the minimal hash values contained in the shard (points). For example, for fields 'alpha' and 'beta' some shard may contain database rows where alpha >= '0000000' and beta >= '1234567'. Such a shard is denoted as '0000000.1234567'.
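
The following is a minimal sketch of how a row's point might be computed. The helper name and the "normalize" transform shown here are illustrative assumptions, not the actual Qbix implementation, which lives in the platform's Db classes:

// Hash one primary-key field value; $method is "md5" or "normalize".
// Illustrative sketch only, not the actual Qbix code.
function hashField($value, $method, $len = 7)
{
    if ($method === 'md5') {
        // balanced distribution, but range queries become impossible
        return substr(md5($value), 0, $len);
    }
    // "normalize" is assumed here to be an order-preserving transform,
    // so that data ranges can still be queried
    $normalized = preg_replace('/[^a-z0-9]/', '', strtolower($value));
    return str_pad(substr($normalized, 0, $len), $len, '0');
}

// The row's point joins the hashes of all PK fields with '.',
// e.g. "a3f41c9.1234567" for fields 'alpha' and 'beta'
$point = hashField($row['alpha'], 'md5') . '.' . hashField($row['beta'], 'normalize');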

Using shards

The Qbix Platform database layer splits data storage into shards based on the database connection config. Since we want to be flexible in defining shards and able to modify the configuration while keeping the application online, the shards configuration is placed in the variable config. The "Q/configFiles" array in the "Q/config/bootstrap.json" config file shall contain some file, let's say "Db/config/shards.json".
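
For example, the relevant fragment of "Q/config/bootstrap.json" could look like this (the rest of the file is omitted):

{
  "Q": {
    "configFiles": [
      "Db/config/shards.json"
    ]
  }
}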

The file "Db/config/shards.json" has the following format:

{
  "Db": {
    "connections": {
      "CONNECTIONNAME": {
        "indexes": {
          "TABLENAME": {
            "partition": {
              "0000000.0000000...": "SHARDNAME",
              "XXXXXXX.XXXXXXX...": "SHARDNAME",
              ...
            },
            "fields": {
              "FIELDNAME": "md5",
              "FIELDNAME": "normalize%10",
              ...
            }
          }
        },
        "shards": {
          "SHARDNAME": {
            "prefix": "PREFIX",
            "dsn": "DSN",
            "username": "USERNAME",
            "password": "PASSWORD",
            "driver_options": {
              "3": 2
            }
          },
          "SHARDNAME": {
            "prefix": "PREFIX",
            "dsn": "DSN",
            ...
          },
          ...
        }
      }
    }
  }
}

"Db/connections/CONNECTIONNAME/shards" config section define the shards in the same way as connections to the main database are defined. It's not necessary to define all data for each shard - only that fields which differ from "Db/connections/CONNECTIONNAME" section.

"Db/connections/CONNECTIONNAME/indexes" config section provide mapping of actual data to the shards. "fields" object list all fields from the primary key of "TABLENAME" and hashes. If hash code is followed with "%" and number, the number is used as hash length. If "%" is skipped, lenth defaults to 7.

The "partition" section contains either array of partitioning points or object with keys as "SHARDNAME"s and values as partitioning points. If "partition" is an array, "SHARDNAME"s are considered equal to array values and "Db/connections/CONNECTIONNAME/shards" sertion shall be defined appropriately.

Naming "SHARDNAME"s is simpler however is more confusing if different sharded tables use the same connection. It's strongly advised to use some meaningful names for the shards not really related to partitioning point to avoid potential confusion.

Splitting shards

Whenever a table or shard becomes too large, it can be split into parts. To identify the shard to split, one shall provide information about the plugin, connection, table, and shard point (part). Also, the class name is needed to handle the table contents properly. For initial sharding, it's necessary to know which hash to use for each primary key field. The last thing to know is where to put the new shards.

It's not necessary to calculate splitting points and write config files by hand. A script is available which helps to determine the optimal parameters, move the original data to its new locations, and generate the applicable config. The script can be run while the application is online and does not really disturb application performance. It may briefly block the shard being split, but only for writing.

The splitting process is divided into several stages to optimize performance, memory use, and server load. It is initiated by a PHP script which verifies the parameters, calculates additional data, and passes control to a node.js process. With all the data ready, node.js does the rest in a much more efficient manner.

Prepare the split

The splitting PHP script is located in "QP/scripts/shards.php" and can be run either directly or with the "APP_DIR/scripts/Q/shards.php" launcher (if the file "APP_DIR/scripts/Q/shards.php" does not exist, just copy "APP_DIR/scripts/Q/install.php" to this name).
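
For example, on a Unix-like system:

cp APP_DIR/scripts/Q/install.php APP_DIR/scripts/Q/shards.php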

In the simple case, when the connection name is the same as the plugin name and the class name is a combination of the plugin and table names, it is enough to run:

shards.php --part PLUGIN/TABLE/PART --parts '{SHARDS_CONFIG}'

or, if the table is not sharded yet:

shards.php --part PLUGIN/TABLE --parts '{SHARDS_CONFIG}'

"SHARDS_CONFIG" is the JSON string which defines new shards:

{
  "SHARDNAME": {
    "prefix": "PREFIX",
    "dsn": "DSN",
    "username": "USERNAME",
    "password": "PASSWORD",
    "driver_options": {
      "3": 2
    }
  },
  "SHARDNAME": {
    "prefix": "PREFIX",
    "dsn": "DSN",
    ...
  },
  ...
}

and shall contain at least one definition. The original shard will be split into as many pieces as there are shards listed in "SHARDS_CONFIG".

"SHARDS_CONFIG" may be also plain array:

[
  {
    "prefix": "PREFIX",
    "dsn": "DSN",
    "username": "USERNAME",
    "password": "PASSWORD",
    "driver_options": {
      "3": 2
    }
  },
  {
    "prefix": "PREFIX",
    "dsn": "DSN",
    ...
  },
  ...
]

In the first case the shards are named; in the second, each is identified by its partition point.

It's enough to provide only those parameters which differ from the ones set in "Db/connections/CONNECTION". For example:

shards.php --part PLUGIN/TABLE --parts '[{"prefix": "shard1_"}, {"prefix": "shard2_"}]'

will split TABLE into two parts with different prefixes, located on the same server as the original table.

If extra parameters are provided, the script uses a different connection or class, or different hashes.

The script makes the necessary checks and passes the full split configuration to node.js. It reports whether node has received the message, and terminates. All further processing is done by the node.js server. The node.js instance is selected by the balancer based on its current state and algorithm. To avoid uncertainty, add the --node parameter with the IP address of the particular node.js instance you want to use.
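
For example (the IP address here is just an illustration):

shards.php --part PLUGIN/TABLE/PART --parts '{SHARDS_CONFIG}' --node 10.0.0.5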

Copy data

We don't want to stop the application while a shard is being split; moreover, we want it to process data inserts/updates/deletes as usual, without losing queries. After receiving the shard split message from the PHP process, node.js must address several issues:

  • Where should queries issued by the application run while the data is being processed?
  • How can the data from the original table be processed in a non-blocking way, so the node.js server can still perform its usual tasks?
  • How can we make sure that all application processes (potentially running on different servers) are properly notified about any change in the configuration?

To handle the necessary modifications to the application config while the split process runs, an additional config file is created. The name of the file is defined by the "Q/internal/sharding/upcoming" config key and defaults to "Db/config/upcoming.json".

The "upcoming" config contains the data necessary to keep track of queries run during split process and block writing to the shard which is being split when it's necessary.

When the Qbix Platform executes any non-SELECT query, it checks the "Db/upcoming" config key, and if the query touches the table/shard being split, it sends a notification to the node process which is handling the split.
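
A simplified sketch of this check, with illustrative helper and variable names rather than the actual Qbix API (the real check is performed inside the platform's Db classes):

// $connectionName and $isSelect are assumed to be known at this point
$upcoming = Q_Config::get('Db', 'upcoming', $connectionName, false);
if ($upcoming && !$isSelect) {
    // the shard is being split: notify the node.js process handling
    // the split, so that the query also ends up in its log
    notifySplitHandler($upcoming, $query); // hypothetical helper
}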

When "upcoming" config is ready node.js waits for necessary timeout to make sure that all running processes "know" about new config values (see: variable config).

When logging is set up, node.js forks a new process which reads the shard table as a stream and passes the data back to node.js for processing. The main node.js process combines the rows read from the original table into batches, one per new shard, and saves them to the new tables.

When copying is finished, the snapshot of the original data taken at the beginning of copying has been distributed to the new tables according to the sharding algorithm, and all queries run after the snapshot have been recorded in the log file.

Process logs

The log file may be huge, and processing it may take quite a while. To follow our non-blocking policy, we create another log file and log queries to it while processing the first one. We can repeat this as many times as needed to keep the last log file reasonably small. The number of logs created for a split is defined by the "Q/internal/sharding/iterations" config value and defaults to 2.
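
For example, to raise the number of iterations to 3, one could set:

{
  "Q": {
    "internal": {
      "sharding": {
        "iterations": 3
      }
    }
  }
}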

When processing the last log, it is important to make sure that no new queries are run. To ensure this, we add "Db/upcoming/CONNECTIONNAME/block": true and wait for the config update. After all processes receive the update, any attempt to run a non-SELECT query on this shard raises the exception Db_Exception_Blocked. Now it's safe to quickly process the last log, knowing that no more logging is necessary.
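
Application code that writes to the shard during this brief window might guard against it like this. This assumes a query object with an execute() method; the retry strategy is just an illustration:

try {
    $query->execute();
} catch (Db_Exception_Blocked $e) {
    // the shard is blocked for writing only while the last log is
    // processed, so wait a little and retry
    usleep(500000); // 0.5 seconds
    $query->execute();
}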

Reset config

After the last log is processed, the new indexes and shard connection information shall be provided to the application. The sharding information is stored in files named after the "Q/internal/sharding/upcoming" config key. At the end of the file name, right before the extension, the script adds the current timestamp in the format "yyyymmddThhmmss", so a typical file name would be "Db/config/shards20150228T184321.json".
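
The naming scheme can be reproduced with a couple of lines of PHP (the base name here is an assumption):

$base = 'Db/config/shards.json'; // assumed base name
// insert the timestamp right before the extension,
// e.g. "Db/config/shards20150228T184321.json"
$stamped = preg_replace('/\.json$/', date('Ymd\THis') . '.json', $base);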

The new shards information, as well as the new partitions, are merged into the previous config and written under the new name. Be careful, as some shard names can be overwritten by the new shards config. In any case, the old shards config file remains untouched.

The old shards config file is removed from the "Q/configFiles" config key in "Q/config/bootstrap.json" and the new one is added. Also, "Db/config/upcoming.json" is removed from the "Q/configFiles" config key and cleared. This last action also removes the blocks from the shards and lets the application operate with the new shards.

After a short timeout, the application is ready to work with the new shards, and the old one can safely be dropped in the database.

Partition schemes

We developed two basic schemes to organize shards: the direct product of arrays of points, and linear order.

The direct product allows us to work with shards in a very efficient manner. We can separate the fields, determine the applicable ranges of points per field, and then take the direct product of the points found for each field. Different implementations of this method work very fast and provide exact results for unique-row, array, and range queries. On the other hand, if some shard is overloaded and you want to split it into two new shards, you need to split the full "layer", including shards which may not be full. In other words, if you organize the shards as a matrix, then to split a single shard you have to split all the shards in its entire matrix column.

The linear sharding algorithm orders all points linearly in lexicographical order, treating each hash as a single letter. This method is slower, and in rare cases when querying ranges it touches more shards than necessary. However, whenever a single shard is overloaded, you can easily split it into two or more without worrying about the other shards.
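
A minimal sketch of the linear lookup, comparing joined hash strings lexicographically. This is an illustration of the idea, not the actual implementation, which lives in the platform's Db classes:

// given the "partition" map (point => SHARDNAME), pick the shard
// with the greatest point that is lexicographically <= the row's point
function findShard(array $partition, $rowPoint)
{
    $points = array_keys($partition);
    sort($points, SORT_STRING);
    $found = $points[0];
    foreach ($points as $p) {
        if (strcmp($p, $rowPoint) > 0) {
            break;
        }
        $found = $p;
    }
    return $partition[$found];
}

// e.g. with points '0000000' and '8000000', a row whose point is
// 'a3f41c9' lands on the shard labeled '8000000'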

The speed of the sharding algorithm is negligible compared to the speed of the actual query, which requires I/O. Also, the number of shards used in production should be kept minimal to save on hardware. For these reasons, the linear sharding algorithm was chosen.

The different methods and implementations can be examined in the Query.php.xxx files in the QP/classes folder of the Qbix Platform.