An Insight into MongoDB Sharding Chunk Splitting and Migration

本文涉及的产品
云数据库 MongoDB,独享型 2核8GB
推荐场景:
构建全方位客户视图
简介: Sharding is a method of data distribution across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

Alibaba_Cloud_Whitepaper_Securing_the_Data_Center_in_a_Cloud_First_World_v2

An Introduction to Sharding

Sharding is a method of data distribution across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations. This article comprehensively discusses various methods of MongoDB sharding in relation to chunk splitting and migration.

1

Note: Content in this article is based on MongoDB 3.2.
Let us understand a primary shard in detail.

Primary Shard

After using MongoDB sharding, data will scatter into one or more shards as chunks (64 MB by default) based on the shardKey.

Each database has a primary shard allocated at database creation. Let us now discuss the procedures that a shard follows to migrate and write data.

● The collection that initiates the shard (that is, invoked the shardCollectionshardCollection command) in the database, first generates a [minKey, maxKey] chunk stored on the primary shard. The chunk then constantly splits and migrates with data writes. The figure below is a depiction of the process.
● For the collection that does not initiate the shard in the database, all the data is stored on the primary shard.

2

When is Chunk Split Triggered?

There is a sharding.autoSplit configuration item on Mongos that automatically triggers chunk splitting. It is generally enabled by default. We strongly recommend you to refrain from disabling autoSplit without the guidance of professionals. A safer method would be to use "pre-sharding" to split the chunk in advance.

MongoDB's automatic splitting of chunks only occurs when Mongos writes data. When the data size written to the chunk exceeds a certain value, the chunk split will trigger. There are specific rules that the splitting process must follow which are stated as follows:

int ChunkManager::getCurrentDesiredChunkSize() const {
    // split faster in early chunks helps spread out an initial load better
    const int minChunkSize = 1 << 20;  // 1 MBytes

    int splitThreshold = Chunk::MaxChunkSize;  // default 64MB

    int nc = numChunks();

    if (nc <= 1) {
        return 1024;
    } else if (nc < 3) {
        return minChunkSize / 2;
    } else if (nc < 10) {
        splitThreshold = max(splitThreshold / 4, minChunkSize);
    } else if (nc < 20) {
        splitThreshold = max(splitThreshold / 2, minChunkSize);
    }

    return splitThreshold;
} 

bool Chunk::splitIfShould(OperationContext* txn, long dataWritten) const {
    dassert(ShouldAutoSplit);
    LastError::Disabled d(&LastError::get(cc()));

    try {
        _dataWritten += dataWritten;
        int splitThreshold = getManager()->getCurrentDesiredChunkSize();
        if (_minIsInf() || _maxIsInf()) {
            splitThreshold = (int)((double)splitThreshold * .9);
        }

        if (_dataWritten < splitThreshold / ChunkManager::SplitHeuristics::splitTestFactor)
            return false;

        if (!getManager()->_splitHeuristics._splitTickets.tryAcquire()) {
            LOG(1) << "won't auto split because not enough tickets: " << getManager()->getns();
            return false;
        }
        ......
}

The chunkSize is 64 MB by default and the split threshold is as follows:

Number of chunks in a collection Split threshold value
1 1024B
[1, 3) 0.5MB
[3, 10) 16MB
[10, 20) 32MB
[20, max) 64MB


During data writes, when the data size written to the chunk exceeds the split threshold value, the chunk split will trigger. Further, it may also trigger when the chunks unevenly distribute on the shard after the chunk split.

When is Chunk Migration Triggered?

Under a default scenario, MongoDB will enable the balancer to migrate chunks between various shards to balance the load of these shards. You can also manually call the moveChunk command to migrate data between shards.
When the balancer is working, it decides whether there is need to migrate a chunk. Such decision depends on the shard tag, the number of chunks in a collection and the difference in numbers of chunks between shards.
The three basic aspects of migration are given as follows:

(1) Migration Based on Shard Tag

MongoDB sharding supports the shard tag feature which enables tagging of a shard and a range in the collection. It will then ensure that "the range with a tag is allocated to the shard with the same tag" through data migration via balancer.

(2) Migration Based on Numbers of Chunks on Different Shards

You can execute the following code for migrating based on the number of chunks on different shards:

int threshold = 8;
if (balancedLastTime || distribution.totalChunks() < 20)
    threshold = 2;
else if (distribution.totalChunks() < 80)
    threshold = 4;



Number of chunks in a collection Migration threshold value
[1, 20) 2
[20, 80) 4
[80, max) 8


This feature targets all the collections that have initiated shards. If the difference in the number of chunks on the “shard with the most chunks” and the “shard with the least chunks” exceeds the threshold value, it triggers chunk migration. Owing to this mechanism, the balancer will automatically balance the data distribution when you call addShard to add a new shard, or write data to shards unevenly.

(3) Trigger Migration with removeShard

Another case that triggers migration is when you call the removeShard command to remove a shard from a cluster, the balancer will automatically migrate the chunks in charge of the shard to other nodes. Now that, we have looked into the triggers of chunk migration, let us now discuss the impact of chunkSize on Chunk split and migration.

Impact of chunkSize on Chunk Split and Migration

The default chunkSize in MongoDB is 64 MB. We recommend the default value unless you have specific requirements. The chunkSize value will directly affect the chunk split and migration actions. Below are pointers which describe in brief, the impact of chunkSize based on its size.

● The smaller the chunkSize, the more the chunk split and migration actions and the more balanced is the data distribution. On the other hand, the greater the chunkSize, the fewer the chunk split and migration actions. However, this may result in uneven data distribution.

● When the chunkSize value is too small, a jumbo chunk happens easily (that is, a certain shardKey value appears so frequently that it is only possible to store these documents in one chunk and cannot be split) and impossible to migrate as well. The larger the chunkSize value, the more likely it is that you may encounter the case where a chunk contains too many documents (the number of documents in a chunk cannot exceed 250,000) for migration.

● The chunk automatic split only triggers at data writes. If you change the chunkSize to a smaller value, the system will need some time to split the chunk to the specified size.

● A chunk can only be split but not merged. Even if you increase the chunkSize value, the number of existing chunks will not reduce. However, the chunk size will grow with new data writes until the chunk size reaches the target value.

How to Reduce the Impact of Chunk Split and Migration?

There might be cases where the automatic chunk split and migration impact the service during MongoDB sharding. To reduce the impact of chunk split and migration, you can consider the following measures.

(1) Pre-sharding to Split the Chunk in Advance

● You can perform "pre-sharding" on the collection to create a specified number of chunks directly. You can achieve this when you use hash sharding for a collection during shardCollection. You scatter them to various backend shards.

● You can specify the numInitialChunks parameter to specify the number of chunks for initialization in shardCollection. However, the value cannot exceed 8192.

● As an optional measure, you can specify the number of chunks to create initially, when sharding an empty collection with a hashed shard key. MongoDB will then create and balance chunks across the cluster. The numInitialChunks must be less than 8192 per shard. If the collection is not empty, numInitialChunks has no effect.

If you use range sharding, pre-sharding is not effective as the system will not have determined the value of shardKey. Hence, it is easier to have empty chunks. Due to this reason, range sharding only supports hash sharding.

(2) Configure Balancer Reasonably

MonogDB balancer can support flexible configuration policies to adapt to various user requirements.

● Dynamically turn on and off balancer;

● Turn on and off balancer for a specified collection.

● Balancer supports the configuring of time window to migrate chunks only during a specified period.

Conclusion

MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. You can scale both read and write workloads horizontally across the cluster by adding more shards.Sharded cluster infrastructure requirements and complexity require careful planning, execution, and maintenance. Careful consideration in choosing the shard key is necessary to ensure cluster performance and efficiency. You cannot alter the shard key after sharding, nor can you unshard a sharded collection. Therefore, taking proper steps while sharding and reducing the impact of automatic chunk split and migration MongoDB's sharding can meet the demands of data growth.

References

Manage Sharded Cluster Balancer

shardCollection command

sharding

Migration Thresholds

shard tag



Original article: https://yq.aliyun.com/articles/91625
相关实践学习
MongoDB数据库入门
MongoDB数据库入门实验。
快速掌握 MongoDB 数据库
本课程主要讲解MongoDB数据库的基本知识,包括MongoDB数据库的安装、配置、服务的启动、数据的CRUD操作函数使用、MongoDB索引的使用(唯一索引、地理索引、过期索引、全文索引等)、MapReduce操作实现、用户管理、Java对MongoDB的操作支持(基于2.x驱动与3.x驱动的完全讲解)。 通过学习此课程,读者将具备MongoDB数据库的开发能力,并且能够使用MongoDB进行项目开发。 &nbsp; 相关的阿里云产品:云数据库 MongoDB版 云数据库MongoDB版支持ReplicaSet和Sharding两种部署架构,具备安全审计,时间点备份等多项企业能力。在互联网、物联网、游戏、金融等领域被广泛采用。 云数据库MongoDB版(ApsaraDB for MongoDB)完全兼容MongoDB协议,基于飞天分布式系统和高可靠存储引擎,提供多节点高可用架构、弹性扩容、容灾、备份回滚、性能优化等解决方案。 产品详情: https://www.aliyun.com/product/mongodb
目录
相关文章
|
8月前
|
关系型数据库 MySQL 数据处理
TiDB Data Migration (DM):高效数据迁移的实战应用
【2月更文挑战第28天】随着企业对数据处理需求的不断增长,数据库迁移成为一项关键任务。TiDB Data Migration (DM) 作为一款专为TiDB设计的数据迁移工具,在实际应用中表现出色。本文将结合具体案例,详细介绍TiDB DM的应用场景、操作过程及最佳实践,帮助读者更好地理解和运用这一工具,实现高效的数据迁移。
|
SQL 存储 关系型数据库
TiDB Data Migration 术语表
本文档介绍 TiDB Data Migration (TiDB DM) 相关术语。 B Binlog 在 TiDB DM 中,Binlog 通常指 MySQL/MariaDB 生成的 binary log 文件,具体请参考 MySQL Binary Log 与 MariaDB Binary Log。 Binlog event MySQL/MariaDB 生成的 Binlog 文件中的数据变更信息,具体请参考 MySQL Binlog Event 与 MariaDB Binlog Event。 Binlog event filter 比 Black & white table list 更
153 0
|
NoSQL
How to Create Highly Available MongoDB Databases with Replica Sets
Find out how you can create MongoDB databases with high availability by backing up data through replica set elections.
4135 0
How to Create Highly Available MongoDB Databases with Replica Sets
|
NoSQL 安全
Using ApsaraDB to Build Scalable MongoDB Instances
MongoDB is a star in the non-SQL (NoSQL) database world.
2522 0
|
NoSQL MongoDB 数据安全/隐私保护