An Introduction to Sharding
Sharding is a method of data distribution across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations. This article discusses how MongoDB sharding handles chunk splitting and migration: when each is triggered, how chunkSize affects them, and how to reduce their impact.
Note: Content in this article is based on MongoDB 3.2.
Let us understand a primary shard in detail.
Primary Shard
Once a collection is sharded, MongoDB distributes its data across one or more shards as chunks (64 MB by default) according to the shard key (shardKey).
Each database has a primary shard allocated at database creation. Let us now discuss the procedures that a shard follows to migrate and write data.
● For a collection in the database that is sharded (that is, one on which the shardCollection command has been invoked), MongoDB first generates a single [minKey, maxKey] chunk stored on the primary shard. The chunk then splits and migrates continually as data is written. The figure below is a depiction of the process.
● For a collection in the database that is not sharded, all the data is stored on the primary shard.
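The life cycle of that first chunk can be sketched with a small in-memory model. This is only an illustration, not MongoDB code: the ChunkMap type, its member names, and the use of a 64-bit integer as the shard key are all assumptions made for the example, with the smallest long long value standing in for minKey.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>

// Hypothetical model: the shard key is a 64-bit integer and every chunk covers
// the range from its lower bound up to the next chunk's lower bound.
struct ChunkMap {
    // Lower bound of each chunk -> name of the shard that holds it.
    std::map<long long, std::string> chunks;

    // A newly sharded collection starts as one chunk on the primary shard.
    explicit ChunkMap(const std::string& primaryShard) {
        chunks[std::numeric_limits<long long>::min()] = primaryShard;
    }

    // Find the shard that owns the chunk containing `key`.
    std::string shardFor(long long key) const {
        auto it = chunks.upper_bound(key);  // first chunk that starts after key
        --it;                               // step back to the chunk containing key
        return it->second;
    }

    // Split the chunk containing `splitPoint`; both halves stay on the same shard.
    void split(long long splitPoint) {
        const std::string owner = shardFor(splitPoint);
        chunks[splitPoint] = owner;
    }

    // Move the chunk that starts at `lowerBound` to another shard.
    void migrate(long long lowerBound, const std::string& toShard) {
        auto it = chunks.find(lowerBound);
        assert(it != chunks.end());
        it->second = toShard;
    }
};
```

For example, after ChunkMap m("shard0"), every key routes to shard0. Calling m.split(100) and then m.migrate(100, "shard1") routes keys from 100 upward to shard1, while smaller keys stay on the primary shard.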
When is Chunk Split Triggered?
Mongos has a sharding.autoSplit configuration item that controls automatic chunk splitting. It is enabled by default. We strongly recommend that you do not disable autoSplit without professional guidance. A safer way to limit splitting during writes is to use "pre-sharding" to split the chunks in advance.
MongoDB only splits chunks automatically when Mongos writes data. When the amount of data written to a chunk exceeds a certain value, a chunk split is triggered. The rules that the splitting process follows can be seen in the following source code:
    int ChunkManager::getCurrentDesiredChunkSize() const {
        // split faster in early chunks helps spread out an initial load better
        const int minChunkSize = 1 << 20;  // 1 MBytes
        int splitThreshold = Chunk::MaxChunkSize;  // default 64MB
        int nc = numChunks();
        if (nc <= 1) {
            return 1024;
        } else if (nc < 3) {
            return minChunkSize / 2;
        } else if (nc < 10) {
            splitThreshold = max(splitThreshold / 4, minChunkSize);
        } else if (nc < 20) {
            splitThreshold = max(splitThreshold / 2, minChunkSize);
        }
        return splitThreshold;
    }

    bool Chunk::splitIfShould(OperationContext* txn, long dataWritten) const {
        dassert(ShouldAutoSplit);
        LastError::Disabled d(&LastError::get(cc()));
        try {
            _dataWritten += dataWritten;
            int splitThreshold = getManager()->getCurrentDesiredChunkSize();
            if (_minIsInf() || _maxIsInf()) {
                splitThreshold = (int)((double)splitThreshold * .9);
            }
            if (_dataWritten < splitThreshold / ChunkManager::SplitHeuristics::splitTestFactor)
                return false;
            if (!getManager()->_splitHeuristics._splitTickets.tryAcquire()) {
                LOG(1) << "won't auto split because not enough tickets: " << getManager()->getns();
                return false;
            }
            ......
    }
With the default chunkSize of 64 MB, the split threshold is as follows:
Number of chunks in a collection | Split threshold value |
---|---|
1 | 1024B |
2 | 0.5MB |
[3, 10) | 16MB |
[10, 20) | 32MB |
[20, max) | 64MB |
During data writes, when the data size written to a chunk exceeds the split threshold value, a chunk split is triggered. In addition, if the chunks are unevenly distributed across the shards after a split, chunk migration may be triggered as well.
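To check the table against the quoted source, the threshold rule can be restated as a standalone function. This is a minimal sketch rather than the MongoDB implementation: the function names and parameters are hypothetical, and the default of 5 for splitTestFactor is an assumption, since the excerpt above does not show its value.

```cpp
#include <algorithm>
#include <cassert>

// Standalone restatement of getCurrentDesiredChunkSize() from the excerpt above.
// maxChunkSizeBytes plays the role of Chunk::MaxChunkSize (64 MB by default).
long long desiredChunkSize(int numChunks, long long maxChunkSizeBytes = 64LL << 20) {
    assert(numChunks >= 0);
    const long long minChunkSize = 1LL << 20;  // 1 MB
    if (numChunks <= 1) return 1024;
    if (numChunks < 3) return minChunkSize / 2;
    if (numChunks < 10) return std::max(maxChunkSizeBytes / 4, minChunkSize);
    if (numChunks < 20) return std::max(maxChunkSizeBytes / 2, minChunkSize);
    return maxChunkSizeBytes;
}

// Mirrors the early-exit checks of splitIfShould(): returns true when enough
// data has been written for a split attempt to be worth making.
// splitTestFactor = 5 is an assumed value, not taken from the excerpt.
bool shouldAttemptSplit(long long dataWritten, int numChunks,
                        bool chunkTouchesMinOrMaxKey, int splitTestFactor = 5) {
    long long threshold = desiredChunkSize(numChunks);
    if (chunkTouchesMinOrMaxKey) {
        // Chunks at either end of the key space use 90% of the threshold.
        threshold = static_cast<long long>(threshold * 0.9);
    }
    return dataWritten >= threshold / splitTestFactor;
}
```

Note that shouldAttemptSplit only models the gate before a split attempt. The code that decides whether and where to split is elided in the excerpt and is not reproduced here.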
When is Chunk Migration Triggered?
By default, MongoDB enables the balancer, which migrates chunks between shards to balance their load. You can also call the moveChunk command manually to migrate data between shards. When the balancer is working, it decides whether a chunk needs to migrate. The decision depends on the shard tags, the number of chunks in a collection, and the difference in the number of chunks between shards.
The three triggers for migration are as follows:
(1) Migration Based on Shard Tag
MongoDB sharding supports the shard tag feature, which lets you tag a shard and a range in the collection. The balancer then migrates data to ensure that a range with a tag is allocated to a shard with the same tag.
(2) Migration Based on Numbers of Chunks on Different Shards
The balancer chooses the migration threshold from the total number of chunks in the collection, as the following source code shows:
    int threshold = 8;
    if (balancedLastTime || distribution.totalChunks() < 20)
        threshold = 2;
    else if (distribution.totalChunks() < 80)
        threshold = 4;
Number of chunks in a collection | Migration threshold value |
---|---|
[1, 20) | 2 |
[20, 80) | 4 |
[80, max) | 8 |
This rule applies to all sharded collections. If the difference between the number of chunks on the “shard with the most chunks” and the “shard with the fewest chunks” exceeds the threshold value, chunk migration is triggered. Owing to this mechanism, the balancer automatically rebalances the data distribution when you call addShard to add a new shard, or when data is written to the shards unevenly.
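The threshold table and the trigger condition can be combined into a short sketch. The function names are hypothetical, and the strict "greater than" comparison follows the wording of this article ("exceeds"). Whether an imbalance exactly equal to the threshold also triggers a migration is not shown in the excerpt, so treat that boundary as an assumption.

```cpp
#include <cassert>

// Restates the threshold selection from the excerpt above.
int migrationThreshold(int totalChunks, bool balancedLastTime) {
    assert(totalChunks >= 0);
    if (balancedLastTime || totalChunks < 20) return 2;
    if (totalChunks < 80) return 4;
    return 8;
}

// True when the gap between the fullest and the emptiest shard exceeds the
// threshold for a collection with `totalChunks` chunks.
// Assumption: equality does not trigger (strict comparison).
bool needsMigration(int mostChunks, int fewestChunks, int totalChunks,
                    bool balancedLastTime = false) {
    assert(mostChunks >= fewestChunks);
    const int imbalance = mostChunks - fewestChunks;
    return imbalance > migrationThreshold(totalChunks, balancedLastTime);
}
```

For instance, a collection with 12 chunks split 10 and 2 between two shards has an imbalance of 8 against a threshold of 2, so it qualifies for migration. A 100-chunk collection whose fullest and emptiest shards differ by 5 chunks does not.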
(3) Trigger Migration with removeShard
Migration is also triggered when you call the removeShard command to remove a shard from a cluster: the balancer automatically migrates the chunks held by that shard to other shards. Now that we have looked into the triggers of chunk migration, let us discuss the impact of chunkSize on chunk split and migration.
Impact of chunkSize on Chunk Split and Migration
The default chunkSize in MongoDB is 64 MB. We recommend the default value unless you have specific requirements. The chunkSize value directly affects chunk split and migration actions, as the following points describe.
● The smaller the chunkSize, the more chunk split and migration actions occur and the more balanced the data distribution is. Conversely, the greater the chunkSize, the fewer chunk split and migration actions occur, but this may result in uneven data distribution.
● When the chunkSize value is too small, jumbo chunks appear easily. A jumbo chunk occurs when a certain shardKey value appears so frequently that all of its documents must be stored in one chunk, which can be neither split nor migrated. The larger the chunkSize value, the more likely it is that a chunk contains too many documents to migrate (the number of documents in a chunk cannot exceed 250,000).
● Automatic chunk splitting is only triggered by data writes. If you change the chunkSize to a smaller value, the system needs some time to split the chunks down to the specified size.
● Chunks are only split automatically, never merged automatically. Even if you increase the chunkSize value, the number of existing chunks will not decrease. However, existing chunks will grow with new data writes until they reach the new target size.
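A quick back-of-the-envelope calculation makes these trade-offs concrete. The helpers below are hypothetical and deliberately simple. The chunk estimate assumes every chunk is filled exactly to chunkSize, so it is a lower bound, and the document limit is the 250,000 figure stated above.

```cpp
#include <cassert>

// Lower-bound estimate of the chunk count, assuming every chunk is completely full.
long long estimatedChunks(long long dataBytes, long long chunkSizeBytes) {
    assert(dataBytes >= 0 && chunkSizeBytes > 0);
    return (dataBytes + chunkSizeBytes - 1) / chunkSizeBytes;  // ceiling division
}

// Approximate number of documents in a full chunk for a given average document size.
long long docsPerFullChunk(long long chunkSizeBytes, long long avgDocBytes) {
    assert(chunkSizeBytes > 0 && avgDocBytes > 0);
    return chunkSizeBytes / avgDocBytes;
}

// Checks a chunk against the document-count limit for migration quoted in this article.
bool withinMigrationDocLimit(long long docsInChunk, long long maxDocs = 250000) {
    return docsInChunk <= maxDocs;
}
```

Under these assumptions, 100 GB of data needs at least 1,600 chunks at 64 MB and at least 3,200 chunks at 32 MB, so halving the chunkSize roughly doubles the number of chunks to split and move. A full 64 MB chunk of 100-byte documents holds about 671,000 documents, which is above the limit, while the same chunk filled with 1 KB documents holds 65,536 and stays below it.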
How to Reduce the Impact of Chunk Split and Migration?
In some cases, automatic chunk split and migration can affect the service during MongoDB sharding. To reduce this impact, you can consider the following measures.
(1) Pre-sharding to Split the Chunk in Advance
● You can perform "pre-sharding" on a collection to create a specified number of chunks directly and scatter them across the backend shards. This is possible when you use hash sharding for the collection in shardCollection.
● Use the optional numInitialChunks parameter of shardCollection to specify the number of chunks to create initially when sharding an empty collection with a hashed shard key. MongoDB then creates the chunks and balances them across the cluster. The numInitialChunks value must be less than 8192 per shard. If the collection is not empty, numInitialChunks has no effect.
If you use range sharding, pre-sharding with numInitialChunks is not effective: the system cannot know in advance how the shardKey values will be distributed, so it would easily create empty chunks. For this reason, numInitialChunks only supports hash sharding.
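Pre-sharding works for hashed keys because hash values are spread over a known, fixed range that can be divided before any data arrives. The sketch below shows one plausible way to place evenly spaced split points over the signed 64-bit range. It assumes hashed key values are 64-bit integers and is not necessarily how MongoDB computes its own split points, and the function name is hypothetical.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Returns numInitialChunks - 1 ascending split points that divide the signed
// 64-bit range into roughly equal intervals.
std::vector<long long> evenSplitPoints(int numInitialChunks) {
    assert(numInitialChunks >= 1);
    std::vector<long long> splits;
    if (numInitialChunks == 1) return splits;  // a single chunk needs no split point

    // Width of one interval: about 2^64 / numInitialChunks, computed without overflow.
    const long long interval =
        (std::numeric_limits<long long>::max() / numInitialChunks) * 2;

    long long current = std::numeric_limits<long long>::min();
    for (int i = 1; i < numInitialChunks; ++i) {
        current += interval;  // stays within range for every i < numInitialChunks
        splits.push_back(current);
    }
    return splits;
}
```

With a range shard key there is no equivalent fixed range to divide, which is why split points chosen in advance tend to leave chunks empty.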
(2) Configure Balancer Reasonably
The MongoDB balancer supports flexible configuration policies to adapt to various user requirements:
● Turn the balancer on and off dynamically.
● Turn the balancer on and off for a specified collection.
● Configure a time window so that the balancer migrates chunks only during a specified period.
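The time-window policy comes down to checking whether the current time of day lies between a start and a stop time, including windows that wrap past midnight (for example, 23:00 to 06:00). A minimal sketch of that check follows. The function names are hypothetical, and treating the stop time as exclusive is an assumption.

```cpp
#include <cassert>

// Converts an HH:MM time of day into minutes since midnight.
int minutesOfDay(int hour, int minute) {
    assert(hour >= 0 && hour < 24 && minute >= 0 && minute < 60);
    return hour * 60 + minute;
}

// True if `now` falls inside the window from `start` (inclusive) to `stop`
// (exclusive). All arguments are minutes since midnight. When start is later
// than stop, the window is taken to wrap past midnight.
bool inBalancingWindow(int now, int start, int stop) {
    if (start <= stop) {
        return now >= start && now < stop;
    }
    return now >= start || now < stop;
}
```

With a 23:00 to 06:00 window, a check at 02:00 returns true and a check at 12:00 returns false, so chunk migration would be confined to off-peak hours.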
Conclusion
MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. You can scale both read and write workloads horizontally across the cluster by adding more shards.
Sharded cluster infrastructure requirements and complexity require careful planning, execution, and maintenance. Careful consideration in choosing the shard key is necessary to ensure cluster performance and efficiency. You cannot alter the shard key after sharding, nor can you unshard a sharded collection. Therefore, by taking proper steps while sharding and reducing the impact of automatic chunk split and migration, you can ensure that MongoDB's sharding meets the demands of data growth.
References
● Manage Sharded Cluster Balancer
● shardCollection command
● sharding
● Migration Thresholds
● shard tag
Original article: https://yq.aliyun.com/articles/91625