An Insight into MongoDB Sharding Chunk Splitting and Migration


An Introduction to Sharding

Sharding is a method of distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high-throughput operations. This article examines how MongoDB sharding splits and migrates chunks.


Note: The content in this article is based on MongoDB 3.2.
Let us first look at the primary shard in detail.

Primary Shard

Once sharding is enabled, MongoDB partitions data into chunks (64 MB by default) based on the shardKey and scatters them across one or more shards.

Each database is allocated a primary shard at database creation. Let us now discuss how data is written and migrated for each kind of collection.

● For a sharded collection in the database (that is, one on which the shardCollection command has been invoked), a single [minKey, maxKey] chunk is first created on the primary shard; the chunk then continuously splits and migrates as data is written. A minimal shell sketch of enabling sharding follows this list.
● For collections in the database that are not sharded, all data is stored on the primary shard.

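Here is that sketch; the database name, collection name, and shard key are hypothetical:

// Enable sharding on the database, then shard a collection by a range key.
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", { uid: 1 })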

When is Chunk Split Triggered?

Mongos has a sharding.autoSplit configuration item that automatically triggers chunk splitting; it is enabled by default. We strongly recommend that you do not disable autoSplit without professional guidance. A safer approach is to use "pre-sharding" to split chunks in advance.

Automatic chunk splitting occurs only when Mongos writes data: when the amount of data written to a chunk exceeds a threshold, a split is triggered. The splitting thresholds follow the rules in the MongoDB source below:

int ChunkManager::getCurrentDesiredChunkSize() const {
    // Splitting faster in early chunks helps spread out an initial load better.
    const int minChunkSize = 1 << 20;  // 1 MB

    int splitThreshold = Chunk::MaxChunkSize;  // 64 MB by default

    int nc = numChunks();

    if (nc <= 1) {
        return 1024;  // 1 KB: split the very first chunk almost immediately
    } else if (nc < 3) {
        return minChunkSize / 2;  // 0.5 MB
    } else if (nc < 10) {
        splitThreshold = max(splitThreshold / 4, minChunkSize);  // 16 MB
    } else if (nc < 20) {
        splitThreshold = max(splitThreshold / 2, minChunkSize);  // 32 MB
    }

    return splitThreshold;  // 64 MB once the collection has 20 or more chunks
}

bool Chunk::splitIfShould(OperationContext* txn, long dataWritten) const {
    dassert(ShouldAutoSplit);
    LastError::Disabled d(&LastError::get(cc()));

    try {
        _dataWritten += dataWritten;
        int splitThreshold = getManager()->getCurrentDesiredChunkSize();
        // Chunks at either extreme of the key range split slightly earlier.
        if (_minIsInf() || _maxIsInf()) {
            splitThreshold = (int)((double)splitThreshold * .9);
        }

        // Only attempt a split once a fraction of the threshold has been written.
        if (_dataWritten < splitThreshold / ChunkManager::SplitHeuristics::splitTestFactor)
            return false;

        // Throttle concurrent auto-splits with a ticket pool.
        if (!getManager()->_splitHeuristics._splitTickets.tryAcquire()) {
            LOG(1) << "won't auto split because not enough tickets: " << getManager()->getns();
            return false;
        }
        // ...
}

The chunkSize is 64 MB by default, and the split thresholds are as follows:

Number of chunks in a collection    Split threshold value
1                                   1 KB (1024 B)
(1, 3)                              0.5 MB
[3, 10)                             16 MB
[10, 20)                            32 MB
[20, max)                           64 MB


During writes, once the data written to a chunk exceeds the split threshold, the chunk splits. A split may in turn trigger a chunk migration if it leaves the chunks unevenly distributed across the shards.
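
Besides automatic splitting, you can split a chunk manually from the mongo shell. A minimal sketch, assuming a hypothetical namespace and shard-key values:

// Split the chunk containing { uid: 1000 } at its median point.
sh.splitFind("mydb.users", { uid: 1000 })
// Split so that { uid: 5000 } becomes a chunk boundary.
sh.splitAt("mydb.users", { uid: 5000 })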

When is Chunk Migration Triggered?

By default, MongoDB enables the balancer, which migrates chunks between shards to balance their load. You can also call the moveChunk command manually to migrate data between shards.
When the balancer runs, it decides whether a chunk needs to be migrated based on the shard tag, the number of chunks in the collection, and the difference in chunk counts between shards.
The three migration triggers are as follows:

(1) Migration Based on Shard Tag

MongoDB sharding supports the shard tag feature, which lets you tag a shard and a range of the collection. The balancer then ensures, through data migration, that a tagged range is allocated to the shard carrying the same tag.
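
A minimal shell sketch of tagging, assuming hypothetical shard and tag names:

// Tag a shard, then pin a shard-key range to that tag.
sh.addShardTag("shard0000", "hot")
sh.addTagRange("mydb.users", { uid: MinKey }, { uid: 1000 }, "hot")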

(2) Migration Based on the Number of Chunks on Different Shards

The migration threshold is computed from the total number of chunks in the collection, as this excerpt from the MongoDB source shows:

// Migration triggers when the chunk-count difference between the most-loaded
// and least-loaded shard exceeds this threshold.
int threshold = 8;
if (balancedLastTime || distribution.totalChunks() < 20)
    threshold = 2;
else if (distribution.totalChunks() < 80)
    threshold = 4;



Number of chunks in a collection    Migration threshold value
[1, 20)                             2
[20, 80)                            4
[80, max)                           8


This check applies to every sharded collection. If the difference in chunk counts between the shard with the most chunks and the shard with the fewest exceeds the threshold, chunk migration is triggered. Owing to this mechanism, the balancer automatically rebalances the data when you call addShard to add a new shard or when writes land on the shards unevenly.

(3) Trigger Migration with removeShard

Migration is also triggered when you call the removeShard command to remove a shard from the cluster: the balancer automatically migrates the chunks owned by that shard to the other nodes. Having looked at the triggers of chunk migration, let us now discuss the impact of chunkSize on chunk split and migration.
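
Before moving on, here is a minimal sketch of a manual migration and a shard removal; the shard names and namespace are hypothetical:

// Move the chunk containing { uid: 1000 } to a specific shard.
db.adminCommand({ moveChunk: "mydb.users", find: { uid: 1000 }, to: "shard0001" })
// Start draining a shard; re-run the command to check the remaining chunk count.
db.adminCommand({ removeShard: "shard0000" })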

Impact of chunkSize on Chunk Split and Migration

The default chunkSize in MongoDB is 64 MB, and we recommend keeping the default unless you have specific requirements. The chunkSize value directly affects chunk split and migration behavior, as the following points briefly describe.

● The smaller the chunkSize, the more chunk splits and migrations occur and the more balanced the data distribution becomes. Conversely, the larger the chunkSize, the fewer splits and migrations, at the risk of uneven data distribution.

● When the chunkSize value is too small, jumbo chunks arise easily (a certain shardKey value appears so frequently that its documents can only be stored in one chunk, which can be neither split nor migrated). The larger the chunkSize value, the more likely a chunk is to hold too many documents to be migrated (a chunk with more than 250,000 documents cannot be moved).

● Automatic chunk splitting is triggered only by data writes. If you change chunkSize to a smaller value, the system needs some time to split existing chunks down to the specified size.

● A chunk can be split but never merged. Even if you increase the chunkSize value, the number of existing chunks will not decrease; existing chunks simply grow with new writes until they reach the new target size. A sketch of changing chunkSize follows this list.
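
The chunk size is stored in the config database; this sketch changes it from a mongos connection (the 32 MB value is illustrative):

// Set the global chunk size to 32 MB.
db.getSiblingDB("config").settings.save({ _id: "chunksize", value: 32 })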

How to Reduce the Impact of Chunk Split and Migration?

Automatic chunk splits and migrations can sometimes impact the service in a MongoDB sharded cluster. To reduce this impact, consider the following measures.

(1) Pre-sharding to Split the Chunk in Advance

● You can "pre-shard" a collection, that is, create a specified number of chunks directly during shardCollection and scatter them across the backend shards. This is available when you use hashed sharding for the collection.

● You can set the numInitialChunks parameter of shardCollection to specify the number of chunks to create at initialization; the value cannot exceed 8192 (a sketch appears at the end of this subsection).

● Pre-splitting applies only when sharding an empty collection with a hashed shard key: MongoDB then creates the initial chunks and balances them across the cluster, keeping fewer than 8192 chunks per shard. If the collection is not empty, numInitialChunks has no effect.

Pre-sharding is not effective with range sharding, because the system cannot know the distribution of shardKey values in advance and would easily produce empty chunks. For this reason, pre-sharding only supports hashed sharding.
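
A minimal sketch of pre-sharding with a hashed shard key; the namespace and chunk count are illustrative:

// Create 1024 chunks up front when sharding an empty collection.
db.adminCommand({
    shardCollection: "mydb.events",
    key: { uid: "hashed" },
    numInitialChunks: 1024
})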

(2) Configure the Balancer Reasonably

The MongoDB balancer supports flexible configuration policies to adapt to various user requirements. You can:

● Turn the balancer on and off dynamically;

● Turn the balancer on and off for a specified collection;

● Configure a time window so that the balancer migrates chunks only during a specified period.
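
A minimal shell sketch of these controls; the namespace and time window are illustrative:

// Turn the balancer off and back on for the whole cluster.
sh.stopBalancer()
sh.startBalancer()
// Disable and re-enable balancing for one collection.
sh.disableBalancing("mydb.users")
sh.enableBalancing("mydb.users")
// Allow migrations only between 02:00 and 06:00 (run against mongos).
db.getSiblingDB("config").settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },
    { upsert: true }
)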

Conclusion

MongoDB distributes the read and write workload across the shards in a sharded cluster, allowing each shard to process a subset of the cluster's operations, and you can scale both read and write workloads horizontally by adding more shards. The infrastructure requirements and complexity of a sharded cluster demand careful planning, execution, and maintenance, and the shard key deserves careful consideration because it determines cluster performance and efficiency: you cannot alter the shard key after sharding, nor can you unshard a sharded collection. With proper steps taken while sharding, and with the impact of automatic chunk split and migration reduced, MongoDB's sharding can meet the demands of data growth.

References

Manage Sharded Cluster Balancer

shardCollection command

sharding

Migration Thresholds

shard tag



Original article: https://yq.aliyun.com/articles/91625