CockroachDB: an open-source database in the style of Spanner. It uses RocksDB as the underlying storage engine, provides MVCC, supports transactions, uses Raft for consistency, and is licensed under the CockroachDB Community License Agreement.

Introduction:

Excerpted from: https://github.com/cockroachdb/cockroach/blob/master/docs/design.md

CockroachDB is a distributed SQL database. The primary design goals are scalability, strong consistency and survivability (hence the name). CockroachDB aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. CockroachDB nodes are symmetric; a design goal is homogeneous deployment (one binary) with minimal configuration and no required external dependencies.

The entry point for database clients is the SQL interface. Every node in a CockroachDB cluster can act as a client SQL gateway. A SQL gateway transforms and executes client SQL statements as key-value (KV) operations, which the gateway distributes across the cluster as necessary, and returns results to the client. CockroachDB implements a single, monolithic sorted map from key to value where both keys and values are byte strings.
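
As a rough illustration of what the gateway's SQL-to-KV transformation produces, here is a minimal Go sketch; the /table/primary-key layout and the helper name are simplified stand-ins for illustration, not CockroachDB's actual key encoding:

    package main

    import "fmt"

    // encodeRowKey builds a key of the form /<table>/<primary key>,
    // mimicking (in simplified form) how a SQL row address can be
    // flattened into a single sorted keyspace of byte strings.
    func encodeRowKey(table string, pk int64) []byte {
        return []byte(fmt.Sprintf("/%s/%d", table, pk))
    }

    func main() {
        // INSERT INTO users VALUES (42, 'alice') might become one KV write:
        key := encodeRowKey("users", 42)
        value := []byte("alice")
        fmt.Printf("PUT %s -> %s\n", key, value)
    }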

The KV map is logically composed of smaller segments of the keyspace called ranges. Each range is backed by data stored in a local KV storage engine (we use RocksDB, a variant of LevelDB). Range data is replicated to a configurable number of additional CockroachDB nodes. Ranges are merged and split to maintain a target size, 64 MiB by default. The relatively small size facilitates quick repair and rebalancing to address node failures, new capacity and even read/write load. However, the size must be balanced against the pressure on the system from having more ranges to manage.
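
A minimal Go sketch of the split side of that size-maintenance loop; the Range type, split point, and halved sizes are illustrative, not CockroachDB's internals:

    package main

    import "fmt"

    const targetRangeSize = 64 << 20 // 64 MiB default target

    // Range covers the half-open key span [Start, End) and tracks its size.
    type Range struct {
        Start, End string
        SizeBytes  int64
    }

    // maybeSplit divides a range at an approximate midpoint key once it
    // grows past the target size; merging is the symmetric operation.
    func maybeSplit(r Range, midKey string) []Range {
        if r.SizeBytes <= targetRangeSize {
            return []Range{r}
        }
        return []Range{
            {Start: r.Start, End: midKey, SizeBytes: r.SizeBytes / 2},
            {Start: midKey, End: r.End, SizeBytes: r.SizeBytes - r.SizeBytes/2},
        }
    }

    func main() {
        r := Range{Start: "a", End: "z", SizeBytes: 100 << 20}
        fmt.Println(maybeSplit(r, "m"))
    }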

CockroachDB achieves horizontal scalability:

  • adding more nodes increases the capacity of the cluster by the amount of storage on each node (divided by a configurable replication factor), theoretically up to 4 exabytes (4E) of logical data (see the arithmetic sketch after this list);
  • client queries can be sent to any node in the cluster, and queries can operate independently (without conflicts), meaning that overall throughput scales linearly with the number of nodes in the cluster;
  • queries are distributed (ref: distributed SQL) so that the overall throughput of single queries can be increased by adding more nodes.
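
A small arithmetic sketch of the capacity claim in the first bullet; the function and parameter names are illustrative:

    package main

    import "fmt"

    // usableCapacity shows the capacity arithmetic from the first bullet:
    // raw storage added by new nodes is divided by the replication factor.
    func usableCapacity(nodes int, perNodeBytes int64, replicationFactor int64) int64 {
        return int64(nodes) * perNodeBytes / replicationFactor
    }

    func main() {
        // e.g. 10 nodes with 2 TiB each, 3-way replicated:
        fmt.Println(usableCapacity(10, 2<<40, 3)>>40, "TiB of logical data")
    }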

CockroachDB achieves strong consistency:

  • uses a distributed consensus protocol for synchronous replication of data in each key value range. We’ve chosen to use the Raft consensus algorithm; all consensus state is stored in RocksDB.
  • single or batched mutations to a single range are mediated via the range's Raft instance. Raft guarantees ACID semantics.
  • logical mutations which affect multiple ranges employ distributed transactions for ACID semantics. CockroachDB uses an efficient non-locking distributed commit protocol (both paths are sketched after this list).
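
A minimal Go sketch of the two paths just described: single-range mutations go through that range's Raft instance, while multi-range mutations are wrapped in a distributed transaction. All types here are illustrative stand-ins, not CockroachDB's API:

    package main

    import "fmt"

    // raftGroup stands in for one range's Raft instance: proposals are
    // serialized through consensus before they are applied.
    type raftGroup struct{ rangeID int }

    func (g raftGroup) propose(op string) {
        fmt.Printf("range %d: committed %q via Raft\n", g.rangeID, op)
    }

    // applyBatch routes each mutation to the Raft group of the range that
    // owns its key. A batch touching a single range needs no transaction;
    // a batch spanning ranges is wrapped in a distributed transaction so
    // that all of its writes commit atomically.
    func applyBatch(ops map[int][]string) {
        if len(ops) > 1 {
            fmt.Println("begin distributed txn (multi-range batch)")
        }
        for rangeID, muts := range ops {
            g := raftGroup{rangeID: rangeID}
            for _, op := range muts {
                g.propose(op)
            }
        }
        if len(ops) > 1 {
            fmt.Println("commit distributed txn")
        }
    }

    func main() {
        applyBatch(map[int][]string{1: {"put a=1"}, 2: {"put m=2"}})
    }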

CockroachDB achieves survivability:

  • range replicas can be co-located within a single datacenter for low latency replication and survive disk or machine failures. They can be distributed across racks to survive some network switch failures.
  • range replicas can be located in datacenters spanning increasingly disparate geographies to survive ever-greater failure scenarios, from datacenter power or networking loss to regional power failures (e.g. { US-East-1a, US-East-1b, US-East-1c }, { US-East, US-West, Japan }, { Ireland, US-East, US-West }, { Ireland, US-East, US-West, Japan, Australia }).

CockroachDB provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes, both from a historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. CockroachDB implements a limited form of linearizability, providing ordering for any observer or chain of observers.
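
The write-skew anomaly that SI permits and SSI prevents can be shown with the classic two-account example; this is a toy Go walkthrough, not CockroachDB code:

    package main

    import "fmt"

    // Classic write-skew shape that SI permits and SSI rejects: an invariant
    // spans two rows (a + b >= 0), each transaction reads both rows but
    // writes only one, so their write sets never overlap.
    func main() {
        a, b := 50, 50 // invariant: a+b >= 0

        // Txn1 and Txn2 both read the same snapshot {a: 50, b: 50}.
        txn1Withdraw := 100 // checks a+b-100 >= 0 against its snapshot: ok
        txn2Withdraw := 100 // same check against the same snapshot: ok

        // Under SI both commit (disjoint write sets), breaking the invariant.
        a -= txn1Withdraw
        b -= txn2Withdraw
        fmt.Println("a+b =", a+b) // -100: write skew
        // Under SSI, the dependency cycle between the two transactions is
        // detected and one of them is aborted instead.
    }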

Similar to Spanner directories, CockroachDB allows configuration of arbitrary zones of data. This allows replication factor, storage device type, and/or datacenter location to be chosen to optimize performance and/or availability. Unlike Spanner, zones are monolithic and don't allow movement of fine-grained data on the level of entity groups.
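
A sketch of the knobs such a zone configuration expresses; the struct and field names are illustrative, not CockroachDB's exact schema:

    package main

    import "fmt"

    // zoneConfig captures the per-zone knobs the paragraph above lists:
    // replication factor, storage device type, and datacenter placement.
    type zoneConfig struct {
        NumReplicas int
        Constraints []string // e.g. "+ssd", "+datacenter=us-east"
    }

    func main() {
        z := zoneConfig{
            NumReplicas: 5,
            Constraints: []string{"+ssd", "+datacenter=us-east"},
        }
        fmt.Printf("%+v\n", z)
    }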

Architecture

CockroachDB implements a layered architecture. The highest level of abstraction is the SQL layer (currently unspecified in this document), which provides familiar relational concepts such as schemas, tables, columns, and indexes. The SQL layer in turn depends on the distributed key-value store, which handles the details of range addressing to provide the abstraction of a single, monolithic key-value store. The distributed KV store communicates with any number of physical cockroach nodes. Each node contains one or more stores, one per physical device.

(Diagram: CockroachDB's layered architecture)
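
To make the layering concrete, here is a conceptual Go sketch of the stack, each layer depending only on the one below it; the interface names and signatures are invented for illustration, not CockroachDB's actual types:

    package main

    // Lowest layer: a store wraps one physical device's local KV engine.
    type store interface {
        get(key []byte) ([]byte, error)
        put(key, value []byte) error
    }

    // Middle layer: the distributed KV store hides range addressing and
    // presents the cluster as one monolithic sorted map.
    type distributedKV interface {
        get(key []byte) ([]byte, error)
        put(key, value []byte) error
    }

    // Top layer: the SQL layer maps schemas, tables, and indexes onto the
    // distributed KV abstraction.
    type sqlLayer interface {
        exec(statement string) (rows [][]string, err error)
    }

    func main() {} // interfaces only; nothing to run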

Each store contains potentially many ranges, the lowest-level unit of key-value data. Ranges are replicated using the Raft consensus protocol. The diagram below is a blown-up version of the stores from four of the five nodes in the previous diagram. Each range is replicated three ways using Raft. The color coding shows associated range replicas.

(Diagram: Ranges replicated three ways via Raft across stores)

Each physical node exports two RPC-based key value APIs: one for external clients and one for internal clients (exposing sensitive operational features). Both services accept batches of requests and return batches of responses. Nodes are symmetric in capabilities and exported interfaces; each has the same binary and may assume any role.
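
A toy Go sketch of the batched request/response shape shared by both KV services; these are illustrative types, not the real RPC definitions:

    package main

    import "fmt"

    // Both the external and internal KV services accept a batch of
    // requests and return a batch of responses in the same order.
    type request struct {
        Method string // e.g. "Get", "Put"
        Key    []byte
        Value  []byte // used by writes
    }

    type response struct {
        Value []byte
        Err   error
    }

    type kvService interface {
        Batch(reqs []request) []response
    }

    // memKV is a toy in-memory implementation, standing in for a node.
    type memKV struct{ data map[string][]byte }

    func (m *memKV) Batch(reqs []request) []response {
        resps := make([]response, len(reqs))
        for i, r := range reqs {
            switch r.Method {
            case "Put":
                m.data[string(r.Key)] = r.Value
            case "Get":
                resps[i].Value = m.data[string(r.Key)]
            }
        }
        return resps
    }

    func main() {
        var svc kvService = &memKV{data: map[string][]byte{}}
        svc.Batch([]request{{Method: "Put", Key: []byte("a"), Value: []byte("1")}})
        fmt.Printf("%s\n", svc.Batch([]request{{Method: "Get", Key: []byte("a")}})[0].Value)
    }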

Nodes and the ranges they provide access to can be arranged with various physical network topologies to make trade-offs between reliability and performance. For example, a triplicated (3-way replicated) range could have each replica located on different:

  • disks within a server to tolerate disk failures.
  • servers within a rack to tolerate server failures.
  • servers on different racks within a datacenter to tolerate rack power/network failures.
  • servers in different datacenters to tolerate large scale network or power outages.

Up to F failures can be tolerated, where the total number of replicas N = 2F + 1 (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
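
The N = 2F + 1 relationship, inverted as a trivial Go helper:

    package main

    import "fmt"

    // tolerableFailures inverts N = 2F + 1: a Raft group of N replicas
    // stays available as long as a majority survives.
    func tolerableFailures(replicas int) int {
        return (replicas - 1) / 2
    }

    func main() {
        for _, n := range []int{3, 5, 7} {
            fmt.Printf("%d replicas tolerate %d failure(s)\n", n, tolerableFailures(n))
        }
    }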

This article is reposted from the cnblogs blog of 张昺华-sky. Original link: http://www.cnblogs.com/bonelee/p/6999703.html. Please contact the original author for reprint permission.
