How to Create Highly Available MongoDB Databases with Replica Sets

Summary: Learn how to create highly available MongoDB databases with replica sets, which keep redundant copies of your data and elect a new primary node when the current one fails.

One of the most common ways to build a high-availability MongoDB database, such as ApsaraDB for MongoDB, is to use replica sets. A MongoDB replica set is a group of MongoDB instances (processes) made up of one primary node and multiple secondary nodes. The MongoDB driver (client) writes all data to the primary node, and the secondary nodes synchronize the written data from the primary so that every member of the replica set stores the same dataset, which makes the data highly available. Alibaba Cloud ApsaraDB for MongoDB also provides a convenient way to download backup data for replica set instances.

In this article, we will explore in detail the various configurations of MongoDB replica sets.

Primary Election

The replica set is initialized through the replSetInitiate command (or rs.initiate() in the mongo shell). After initialization, the members start to exchange heartbeat messages and initiate the primary node election. The node that wins a majority of the votes is elected as the primary node, and the remaining nodes become secondary nodes.

Initialize the replica set

config = {
    _id : "my_replica_set",
    members : [
         {_id : 0, host : "rs1.example.net:27017"},
         {_id : 1, host : "rs2.example.net:27017"},
         {_id : 2, host : "rs3.example.net:27017"},
   ]
}
 
rs.initiate(config)

Definition of Majority

Suppose the replica set has N voting members (voting members are introduced later). The majority is then floor(N/2) + 1, which equals N/2 + 1 when N is even and (N + 1)/2 when N is odd. If fewer members than the majority are alive, the replica set cannot elect a primary node and cannot provide write services; it is in read-only status.

Number of voting members | Majority | Number of tolerable failures
1 | 1 | 0
2 | 2 | 0
3 | 2 | 1
4 | 3 | 1
5 | 3 | 2
6 | 4 | 2
7 | 4 | 3

We recommend an odd number of members in a replica set for better failure tolerance. The table above shows that replica sets with two nodes and with three nodes both need a majority of two, but only the three-node set can tolerate a failed node. Adding a member to make the count even raises the majority without raising the number of tolerable failures, so an odd-numbered set gives you the same availability with fewer nodes.
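The majority rule can be checked with a few lines of JavaScript. This is only a sketch of the arithmetic; the function names majority and tolerableFailures are made up for illustration and are not part of MongoDB.

```javascript
// Majority of N voting members: floor(N / 2) + 1.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Number of voting members that can fail while a primary can still be elected.
function tolerableFailures(votingMembers) {
  return votingMembers - majority(votingMembers);
}

// Print one row per replica set size, in the same order as the table columns.
for (let n = 1; n <= 7; n++) {
  console.log(n, majority(n), tolerableFailures(n));
}
```

Going from three to four voting members raises the majority from two to three, while the number of tolerable failures stays at one.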

Special Secondary Node

Normally, a secondary node of the replica set participates in the primary node election (and may itself be elected as the primary node), and it synchronizes the most recently written data from the primary node to stay consistent with it.

Secondary nodes can serve read requests. Adding secondary nodes increases the read capacity of the replica set and improves its availability. In addition, MongoDB supports flexible configuration of the secondary nodes in a replica set to meet the demands of a variety of scenarios.

Arbiter

The arbiter node only participates in the voting. It cannot be elected as the primary node and does not synchronize data from the primary node.

For example, if you deploy a replica set with two nodes, you have one primary node and one secondary node. If either node fails, the surviving node no longer has a majority, so the replica set cannot elect a primary node and cannot accept writes. If you add an arbiter node to the replica set, a primary node can still be elected even if one node fails.

The arbiter node itself does not store data and is a very lightweight service. When the number of members in the replica set is an even number, you should add an arbiter node to improve the availability of the replica set.
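As an illustration, a two-node set extended with an arbiter might be described by a configuration like the one below. The host names are placeholders, and the voter count is computed here only to show why the arbiter helps; arbiterOnly is the member field that marks an arbiter.

```javascript
// Hypothetical two data-bearing members plus one arbiter (host names are placeholders).
const config = {
  _id: "my_replica_set",
  members: [
    { _id: 0, host: "rs1.example.net:27017" },
    { _id: 1, host: "rs2.example.net:27017" },
    { _id: 2, host: "arbiter.example.net:27017", arbiterOnly: true },
  ],
};

// Members have one vote unless votes is set explicitly, so this set has three voters.
const voters = config.members.filter(
  (m) => (m.votes === undefined ? 1 : m.votes) > 0
).length;
const majorityNeeded = Math.floor(voters / 2) + 1;
console.log(voters, majorityNeeded);

// In the mongo shell, the set would then be initiated with rs.initiate(config).
```

With three voters and a majority of two, the set can lose either data-bearing node and still elect a primary.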

Priority0

The election priority of a Priority0 node is 0 and the Priority0 node will not be selected as the primary node.

For example, if you deploy a replica set across server rooms A and B, you can specify that the primary node must be in server room A. You can do this by setting the priority of replica set members in server room B to 0, so that the primary node must be a member in server room A. If you deploy the replica set like this, you should deploy the majority of nodes in server room A. Otherwise, the primary node may fail to be elected during network partitioning.
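The server room example could be sketched as follows. The roomOf mapping and the pinPrimaryToRoom helper are hypothetical names used only for illustration; only the priority field is part of the replica set configuration.

```javascript
// Hypothetical host-to-room mapping; not part of any MongoDB API.
const roomOf = {
  "a1.example.net:27017": "A",
  "a2.example.net:27017": "A",
  "b1.example.net:27017": "B",
};

// Return a copy of the config in which only members in `room` can become primary.
function pinPrimaryToRoom(cfg, room) {
  const members = cfg.members.map((m) =>
    roomOf[m.host] === room ? m : { ...m, priority: 0 }
  );
  return { ...cfg, members };
}

const cfg = {
  _id: "my_replica_set",
  members: [
    { _id: 0, host: "a1.example.net:27017" },
    { _id: 1, host: "a2.example.net:27017" },
    { _id: 2, host: "b1.example.net:27017" },
  ],
};

const pinned = pinPrimaryToRoom(cfg, "A");
console.log(JSON.stringify(pinned.members, null, 2));

// In the mongo shell, cfg would come from rs.conf() and be applied with rs.reconfig(pinned).
```

Two of the three members sit in room A, so room A still holds a majority if the link to room B is cut.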

Vote0

In MongoDB 3.0, a replica set can have at most 50 members, of which at most 7 can vote in the primary node election. The votes attribute of all other members (Vote0 members) must be set to 0, which means they do not participate in the voting.

Hidden

The hidden node cannot be selected as the primary node (its Priority is 0) and is invisible to the Driver.

Because the hidden node will not accept requests from the Driver, you can use the hidden node for data backup, offline computing, and other tasks without affecting the service provided by the replica set.

Delayed

The delayed node must be a hidden node, and its data lags behind the data on the primary node by a configurable amount of time, such as one hour. Because of this lag, if incorrect or invalid data is written to the primary node by mistake, you can recover the earlier state by using the historical data on the delayed node.
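The constraints described in this section can be collected into a small check. The sketch below is not a MongoDB API: checkMembers is a made-up helper that encodes only the rules stated above (at most 50 members, at most 7 voters, hidden members with priority 0, delayed members hidden). It assumes the slaveDelay field name (in seconds) used by MongoDB versions before 5.0; newer versions call it secondaryDelaySecs.

```javascript
// Return a list of rule violations for a replica set member array.
function checkMembers(members) {
  const problems = [];
  if (members.length > 50) problems.push("more than 50 members");

  // A member votes unless votes is explicitly set to 0.
  const voters = members.filter((m) => (m.votes === undefined ? 1 : m.votes) > 0);
  if (voters.length > 7) problems.push("more than 7 voting members");

  for (const m of members) {
    const priority = m.priority === undefined ? 1 : m.priority;
    if (m.hidden && priority !== 0) {
      problems.push(`member ${m._id}: hidden requires priority 0`);
    }
    if (m.slaveDelay > 0 && !m.hidden) {
      problems.push(`member ${m._id}: delayed member should be hidden`);
    }
  }
  return problems;
}

// Three ordinary members, a hidden backup member, and a member delayed by one hour.
const members = [
  { _id: 0, host: "rs1.example.net:27017" },
  { _id: 1, host: "rs2.example.net:27017" },
  { _id: 2, host: "rs3.example.net:27017" },
  { _id: 3, host: "backup.example.net:27017", priority: 0, hidden: true, votes: 0 },
  { _id: 4, host: "delayed.example.net:27017", priority: 0, hidden: true, votes: 0, slaveDelay: 3600 },
];

console.log(checkMembers(members));
```

The two special members are given votes: 0 here so that the number of voters stays odd.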

Data Synchronization

The primary node and the secondary nodes synchronize data through the oplog. After a write operation completes on the primary node, the primary node writes an oplog entry to the special local.oplog.rs collection, and the secondary nodes keep pulling oplog entries from the primary node and applying them.

Because the oplog keeps growing, local.oplog.rs is a capped collection: when it reaches its configured maximum size, the oldest entries are deleted. In addition, because a secondary node may apply the same oplog entry more than once, oplog entries must be idempotent so that repeated application produces the same result.

For example, the oplog entry below contains the ts, h, v, op, ns, and o fields.

{
"ts" : Timestamp(1446011584, 2),
"h" : NumberLong("1687359108795812092"),
"v" : 2,
"op" : "i",
"ns" : "test.nosql",
"o" : { "_id" : ObjectId("563062c0b085733f34ab4129"), "name" : "mongodb", "score" : "100" }
}

  • ts: Operation time, made up of the current timestamp and a counter. The counter is reset every second.
  • h: The globally unique identifier of the operation.
  • v: The oplog version information.
  • op: Operation type. The possible values are:
      • i: Insert operation.
      • u: Update operation.
      • d: Delete operation.
      • c: Command execution (such as create and dropDatabase).
      • n: Null operation, used for some special purposes.
  • ns: The collection that the operation targets.
  • o: Operation content.
  • o2: The query condition of the operation. Only update operations contain this field.
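To see why idempotence matters, the following sketch replays a short list of oplog-style entries against an in-memory Map. It is a toy model, not MongoDB's replication code: applyEntry is a made-up function, and the update entry is assumed to carry a $set of the final field value. An entry that incremented a value relative to its current state would not survive a replay.

```javascript
// Apply one oplog-style entry to an in-memory collection keyed by _id.
function applyEntry(collection, entry) {
  switch (entry.op) {
    case "i": // insert: treated as an upsert so a replay does not duplicate the document
      collection.set(entry.o._id, { ...entry.o });
      break;
    case "u": // update: o2 selects the document, o.$set holds the final field values
      if (collection.has(entry.o2._id)) {
        const current = collection.get(entry.o2._id);
        collection.set(entry.o2._id, { ...current, ...entry.o.$set });
      }
      break;
    case "d": // delete: removing a document that is already gone is a no-op
      collection.delete(entry.o._id);
      break;
    default: // "c" and "n" entries are ignored in this sketch
      break;
  }
}

const entries = [
  { op: "i", ns: "test.nosql", o: { _id: 1, name: "mongodb", score: 100 } },
  { op: "u", ns: "test.nosql", o2: { _id: 1 }, o: { $set: { score: 101 } } },
  { op: "i", ns: "test.nosql", o: { _id: 2, name: "obsolete" } },
  { op: "d", ns: "test.nosql", o: { _id: 2 } },
];

// Apply the entries once, then apply the whole list a second time.
const appliedOnce = new Map();
entries.forEach((e) => applyEntry(appliedOnce, e));

const appliedTwice = new Map();
entries.forEach((e) => applyEntry(appliedTwice, e));
entries.forEach((e) => applyEntry(appliedTwice, e));

console.log(JSON.stringify([...appliedOnce]) === JSON.stringify([...appliedTwice]));
```

Both maps end up holding the same single document, which is the property that lets a secondary node re-apply entries it may already have seen.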

When a secondary node initializes its data for the first time, it first performs an initial sync (init sync) to copy the full dataset from the primary node (or from another secondary node whose data is up to date). It then uses a tailable cursor on the local.oplog.rs collection of the primary node to keep querying the latest oplog entries and applying them to itself.

The init sync process contains the following steps:

  1. At T1, the secondary node copies the data of all databases on the primary node (except local) through a combination of the listDatabases, listCollections, and cloneCollection commands. Suppose all of these operations are completed at T2.
  2. Apply all the oplog entries written on the primary node during the period [T1, T2]. Some of these operations may already be included in the data copied in Step 1, but because oplog entries are idempotent, they can safely be applied again.
  3. Create indexes for the collections on the secondary node according to the index settings of the corresponding collections on the primary node. (The _id index of every collection has already been created in Step 1.)

The size of the oplog collection should be configured based on the size of the database and the write load of the application. If the oplog is too large, it wastes storage space. If it is too small, the initial sync of a secondary node may fail. For example, if the database holds so much data that Step 1 takes a long time and the oplog is too small, the oplog cannot retain all the entries written during the period [T1, T2]. As a result, the secondary node cannot fully synchronize the dataset from the primary node.
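A rough way to reason about oplog sizing is to compare the time window the oplog covers with the expected duration of the initial sync. The sketch below is a back-of-the-envelope estimate under the assumption of a steady write rate; the function names and the numbers are illustrative only. On a running replica set, rs.printReplicationInfo() in the mongo shell reports the actual time range the oplog currently covers.

```javascript
// Hours of writes the oplog can hold before the oldest entries are overwritten,
// assuming the oplog grows at a steady rate.
function oplogWindowHours(oplogSizeMB, oplogGrowthMBPerHour) {
  return oplogSizeMB / oplogGrowthMBPerHour;
}

// The initial sync can only succeed if the oplog still holds every entry
// written between T1 and T2, that is, for the whole duration of the copy.
function initialSyncFits(oplogSizeMB, oplogGrowthMBPerHour, syncHours) {
  return oplogWindowHours(oplogSizeMB, oplogGrowthMBPerHour) > syncHours;
}

// Illustrative numbers: a 1 GB oplog filling at 256 MB per hour covers 4 hours.
console.log(oplogWindowHours(1024, 256));
console.log(initialSyncFits(1024, 256, 6)); // copy takes 6 hours
console.log(initialSyncFits(8192, 256, 6)); // same copy with an 8 GB oplog
```

In this example the 1 GB oplog is too small for a six-hour copy, while the 8 GB oplog leaves ample headroom.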

Modifying Replica Set Configurations

When you need to modify the replica set, such as adding a member, deleting a member, or modifying the member configuration (such as priority, vote, hidden, and delayed among other attributes), you can use the replSetReconfig command (rs.reconfig()) to re-configure the replica set.

For example, to set the priority of the second member in the replica set to 2, you can execute the following commands:

cfg = rs.conf();
cfg.members[1].priority = 2;
rs.reconfig(cfg);

Details on Primary Node Election

Apart from replica set initialization, a primary node election may also occur in the following scenarios:

  • The replica set is reconfigured.
  • A secondary node detects that the primary node has failed and triggers a new round of primary node election.
  • The primary node performs an active stepDown (it voluntarily downgrades itself to a secondary node), which triggers a new round of primary node election.

The primary node election is affected by multiple factors including the inter-node heartbeats, priority, and the latest oplog time.

Inter-node Heartbeat

Members in a replica set will send a heartbeat message between each other every two seconds by default. If the heartbeat message of a node is not received for 10 seconds, the node is considered to have failed. If the failed node is the primary node, the secondary node (the premise is that it can be voted as the primary node) will initiate a new round of primary node election.
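The failure detection rule can be sketched as a simple timeout check. This is a simplified model of the rule described above, not the actual heartbeat protocol; failedMembers and the host names are made up for illustration.

```javascript
// A member is considered failed when no heartbeat has arrived for 10 seconds.
// (Heartbeats are normally sent every 2 seconds.)
const FAILURE_TIMEOUT_MS = 10000;

// lastHeartbeatAt maps host -> time (ms) at which the last heartbeat was received.
function failedMembers(lastHeartbeatAt, now) {
  return Object.keys(lastHeartbeatAt).filter(
    (host) => now - lastHeartbeatAt[host] >= FAILURE_TIMEOUT_MS
  );
}

const now = 60000;
const lastHeartbeatAt = {
  "rs1.example.net:27017": 48000, // silent for 12 s
  "rs2.example.net:27017": 59000, // heard 1 s ago
  "rs3.example.net:27017": 58000, // heard 2 s ago
};

console.log(failedMembers(lastHeartbeatAt, now));
```

If the silent member is the primary node, an eligible secondary node would start an election at this point.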

Node Priority

  • Every node is inclined to vote for the node with the highest priority as the primary node.
  • A node with a priority of 0 does not trigger a primary node election on its own initiative.
  • When the primary node discovers a secondary node with a higher priority whose data lags by no more than 10 seconds, the primary node performs an active stepDown so that the higher-priority secondary node has the chance to become the primary node.

Optime

Only the node with the latest optime (the timestamp of the most recent oplog record) can be elected as the primary node.
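The priority and optime rules can be combined into a toy model of candidate selection. The sketch below encodes only the rules as stated in this article and is not MongoDB's election protocol; pickCandidate and the alive, priority, and optime fields are hypothetical names for illustration.

```javascript
// Pick the member that would be favored as primary under the simplified rules:
// it must be alive, have a priority above 0, and hold the latest optime;
// among those, the highest priority wins.
function pickCandidate(members) {
  const alive = members.filter((m) => m.alive);
  if (alive.length === 0) return null;

  const latest = Math.max(...alive.map((m) => m.optime));
  const eligible = alive.filter((m) => m.priority > 0 && m.optime === latest);
  eligible.sort((a, b) => b.priority - a.priority);

  return eligible.length > 0 ? eligible[0].host : null;
}

const members = [
  { host: "rs1", alive: false, priority: 3, optime: 105 }, // failed primary
  { host: "rs2", alive: true, priority: 1, optime: 105 },
  { host: "rs3", alive: true, priority: 2, optime: 100 },  // higher priority but stale
  { host: "rs4", alive: true, priority: 0, optime: 105 },  // up to date but priority 0
];

console.log(pickCandidate(members));
```

Here rs2 is chosen: rs3 has a higher priority but lags behind, and rs4 is up to date but can never become primary.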

Network partition

A node is eligible to be elected as the primary node only if it remains connected to a majority of the voting nodes. If the primary node loses its connection to a majority of nodes, it downgrades itself to a secondary node. During a network partition, multiple primary nodes may exist for a short period of time. To guard against this, use a write concern of majority when you write data through the driver. This ensures that even if multiple primary nodes appear, only one of them can successfully write data to a majority of nodes.

Read/Write Settings of the Replica Set

Read Preference

By default, all read requests to the replica set are sent to the primary node. The driver can route read requests to other nodes by setting the Read Preference.

  • Primary: The default rule is that all the read requests are sent to the primary node.
  • PrimaryPreferred: The primary enjoys priority. If the primary node is unreachable, the requests are sent to the secondary nodes.
  • Secondary: All the read requests are sent to the secondary node.
  • SecondaryPreferred: The secondary node enjoys priority. When all the secondary nodes are unreachable, the requests are sent to the primary node.
  • Nearest: The read requests are sent to the nearest reachable node (detected through the ping).
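The five modes can be summarized as a routing function. The sketch below is a toy model rather than driver code: route, the node objects, and the pingMs field are illustrative, and picking the single lowest-ping secondary is a simplification of what real drivers do. In practice the mode is usually set through a driver option or the readPreference connection string option, for example readPreference=secondaryPreferred.

```javascript
// Choose the node that should serve a read for the given read preference mode.
function route(mode, nodes) {
  const up = nodes.filter((n) => n.reachable);
  const primary = up.find((n) => n.role === "primary") || null;
  const secondaries = up.filter((n) => n.role === "secondary");

  // Lowest measured ping wins; returns null for an empty list.
  const fastest = (list) =>
    list.reduce((best, n) => (best === null || n.pingMs < best.pingMs ? n : best), null);

  switch (mode) {
    case "primary":
      return primary;
    case "primaryPreferred":
      return primary || fastest(secondaries);
    case "secondary":
      return fastest(secondaries);
    case "secondaryPreferred":
      return fastest(secondaries) || primary;
    case "nearest":
      return fastest(up);
    default:
      throw new Error(`unknown read preference: ${mode}`);
  }
}

// The primary is unreachable in this example.
const nodes = [
  { host: "rs1", role: "primary", reachable: false, pingMs: 5 },
  { host: "rs2", role: "secondary", reachable: true, pingMs: 12 },
  { host: "rs3", role: "secondary", reachable: true, pingMs: 3 },
];

for (const mode of ["primary", "primaryPreferred", "secondary", "secondaryPreferred", "nearest"]) {
  const target = route(mode, nodes);
  console.log(mode, target ? target.host : "no node available");
}
```

With the primary unreachable, only the strict primary mode fails to find a node; every other mode falls through to the fastest secondary.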

Write Concern

By default, the primary node acknowledges a write as soon as it completes the write operation itself. The driver can change the rule for when a write counts as successful by setting the Write Concern.

For example, the write concern rule below states that the write operation must be successful on a majority of nodes and the timeout value is 5 seconds.

db.products.insert(
    { item: "envelopes", qty : 100, type: "Clasp" },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)

The setting above is for a single request. You can also modify the default write concern of the replica set so that you do not need to set it separately for every single request.

cfg = rs.conf()
// Keep any existing settings instead of overwriting them
cfg.settings = cfg.settings || {}
cfg.settings.getLastErrorDefaults = { w: "majority", wtimeout: 5000 }
rs.reconfig(cfg)
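The acknowledgement rule behind the w option can be sketched as a comparison between the number of members that have applied a write and the number the write concern requires. This is a simplified model; isAcknowledged is a made-up function, and it handles only a numeric w or "majority".

```javascript
// Decide whether a write is acknowledged under write concern w.
// ackCount is the number of members (including the primary) that have applied the write.
function isAcknowledged(w, ackCount, votingMembers) {
  const required = w === "majority" ? Math.floor(votingMembers / 2) + 1 : w;
  return ackCount >= required;
}

// Three-member replica set:
console.log(isAcknowledged(1, 1, 3));          // w: 1, only the primary has the write
console.log(isAcknowledged("majority", 1, 3)); // only the primary has the write
console.log(isAcknowledged("majority", 2, 3)); // the primary and one secondary have it
```

A primary that has been cut off from the other members can never collect a second acknowledgement, so its writes with w set to "majority" are not reported as successful and fail once wtimeout elapses.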

Exception Handling (Rollback)

Suppose the primary node goes down before some of its writes have been synchronized to the secondary nodes. If the old primary node later rejoins the set after write operations have taken place on the new primary node, the old primary node must roll back the unsynchronized operations so that its dataset is consistent with that of the new primary node.

The old primary node writes the rollback data to the separate rollback directory and the database administrator can use mongorestore to recover the data as needed.

ApsaraDB for MongoDB

Alibaba Cloud ApsaraDB for MongoDB is fully compatible with the MongoDB protocol and offers a full range of database solutions for enterprises. It automatically creates a three-node MongoDB replica set for users, with advanced functions such as disaster recovery (DR) switchover and failover built in and handled transparently. In addition to its support for replica sets, ApsaraDB for MongoDB helps you maintain a high-availability MongoDB database through its monitoring and alarm features. You can also conveniently back up and recover data by using its automatic backup feature.

