HybridTime - Accessible Global Consistency with High Clock Uncertainty

简介:

Amazon’s Dynamo [9] and Facebook’s Cassandra [13], relax the consistency model,and offer only eventual consistency.

Others such as HBase [1] and BigTable [4] offer strong consistency only for operations touching a single partition, but not across the database as a whole.

Dynamo和Cassandra都是基于Vector Clocks和Gossip算法,强调高可用性,可以达到eventual consistency

而HBase,可以提供单分区的强一致性,但对于跨分区,或跨数据库,就无法保证;

for example, in a geo-replicated service, a user may first be routed to a local datacenter and perform some action such as sending an email message. 
If they then reload their browser and are routed to another datacenter backend, they would still expect to see their sent message in their email outbox.

这个例子比较典型,要达到异地的一致性

 

一般要解决分布式问题的一致性问题的最简单的办法就是用synchronized clocks across all servers,但这基本是达不到的

the traditional approach to global consistency has been to use logical clocks. 
Logical clocks, such as Lamport Clocks [15] or Vector Clocks [10, 20].

所以传统的做法就是用逻辑时钟

https://www.zhihu.com/question/19994133

贴个回答,写的不错

当然可以,但是不适合在分布式数据库里面。为什么呢? 首先,需要先确定你这个时间戳取自哪里。1、所有的时间戳都取自一个中心点节点,这样让时间戳得到一个全序关系,这是最简单的实现,如果你的分布式数据库只是部署在一个数据中心的话,但是这样就增加了一个中心节点,影响数据库的扩展性,而且任何一个取时间戳的动作都要增加上网络延迟的开销。如果分布式数据库部署需要跨数据中心,这种方案变的不可行,跨数据中心网络时延太大。从而影响整体性能2、所有的时间戳取自已所在的节点。这就涉及到时钟同步的问题,就像Dongdong说的,如果取的物理时间那么这些节点并没有一个一致的时间,总会有一点误差,这种解决方案就有两种思路: 
一、就是现在以Leslie Lamport 提出的逻辑时钟的方法,重新定义了一种分布式的全序关系。vector clock就是以逻辑时钟为基础上发展而来的。 
二、使用物理时钟,就是google spanner使用gps 加原子钟进行时间同步,不同节点之间的时间是不能精确同步的,只能把不同节点之间的时间同步到一个范围之内,spanner一个节点上获得一个时间戳,不是一个时间值,而是一个时间值加上一个误差范围,也就是一个时间窗,如果两个时间窗有重合,就无法比较大小,从而无法比较出来时间窗代表的事件的先后关系。而时间同步算法就是让分布式各节点之间的时间的误差尽量的小,这样取时间戳这个动作效率是最高的,只在本节点取一下物理时间即可,但是为了同步时间,各个节点肯定在周期与其它节点进行通讯来同步物理时间(google时间同步的算法好像没有开源)。时间同步算法只能让时间尽量的精确。分布式数据库里面是不能直接使用NTP来同步时间的,因为NTP时间同步协议里面允许节点的时间能够回退,而数据库里面要求本地的时钟必须是递增的。 
三、还有一种混合逻辑时钟(Logical Physical Clocks),也是由逻辑时钟演变而来,但是时钟的值比较接近于物理时钟,而且不依赖于下面物理时钟的同步的算法,它的时间戳都可以在本地取,而且取出来是一个时间值,而不像spanner一样,取出来是一个时间窗。详细可以参见Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases这篇论文

关于Lamport或向量时钟,

Lamport时钟就是逻辑时钟,即不依赖于物理时钟,而依赖event之间的因果关系来定义的一种偏序

参考,全序, 分布式一致性的本质

而Vector时钟是,逻辑时钟的一种实现,具体的逻辑看下文,

Why Vector Clock are Easy or Hard?

摘自上面的知乎回答

“dynamo paper是七、八年前写的了,现在的amazon dynamo早已摒弃了version vector,而采用了synchronous replication(类似paxos的protocol),每一个partition会有一个leader负责写入。其实version vector并不scale,因为对于一个key来说,随着writer数量的增加,version vector数量成指数级的增长” 

逻辑时钟的缺点,

first, they require that all clients must propagate clock data to achieve consistent views, 
and second, the assigned timestamps have no relation to physical time.

 

Spanner introduced commit-wait, a way of ensuring physical-time based consistent global state by forcing operations to wait long enough so that all participants agree that the operation’s timestamp has passed based on worst case synchronization error bounds.

While innovative, the system performance becomes highly dependent on the quality of the time synchronization infrastructure, and thus may have unacceptable performance absent specialized hardware such as atomic clocks and GPS receivers.

Spanner利用原子钟和GPS接收器,实现了一个较为精确的时钟,这个时钟叫做TrueTime,每次调用TrueTime API返回的是一个时间区间,而不是一个具体的值,这个TrueTime保证的是真实时间(absolute time/real time)一定在这个区间内,这个区间范围通常大约14ms,甚至更小。

所以只有每个transaction的时间区间,都不重合,那么就可以比较容易的做到全局有序;

所以这里需要commit-wait,

image

可以看到,我在开始transaction的时候,取一次true time,[t1.earliest, t1.latest]

那么什么时候,我可以把这个transaction commit掉,即释放掉锁,即别人可以开始新的transaction

当我再取一次true time,[t2.earliest, t2.latest],当t2.earliest > t1.latest的时候,即可

因为这样两个transaction的时间区间就不可能重叠了,当然代价就是每个transaction都有等待大概2个error bound的时间

当然,spanner的实现依赖硬件的infra设备,普通用户无法复制;

 

HybridTime(HT), a hybrid between physical and logical clocks and we show how HT can be used to achieve the same semantics as Spanner, but with good performance even with commonly available time synchronization.

Like Spanner, our approach relies on physical time measurements with bounded error to assign HybridTime timestamps to events that occur in the system. 
However, unlike Spanner, our approach does not usually require the error to be waited out, thus allowing for usage in common deployment scenarios where clocks are synchronized through common protocols such as the Network Time Protocol, in which clock synchronization error is often higher than with Spanner’s TrueTime.

The trade-off is that, in order to avoid commit-wait, HybridTime requires that timestamps be propagated across machines to achieve the same consistency semantics as Spanner. 
Contrary to vector clocks, which can expand as the number of participants in the cluster grows, HybridTime timestamps have constant and small size.

HybridTime说是physical and logical clocks的结合,和spanner一样,也可以使用带error bound的物理时间 
但由于他没有spanner的硬件设施,所以做不到低延迟的时间同步,所以他依然使用逻辑时钟来达到一致性;并且他解决了vector clocks随着client的数量增加而会size膨胀的问题

HybridTime clocks follow similar update rules to Lamport clocks, but the time values are not purely logical: 
each time value has both a logical component, which helps in guaranteeing the same properties as a Lamport Clocks, and a physical component which allows the event to be associated with a physical point-in-time.

 

One of Spanner’s key benefits is that is externally consistent, which is defined as fully linearizable, even in the presence of hidden channels.

Additionally we use the term globally consistent to describe a system which provides the same linearizability semantics, provided that there are no hidden channels present.

区分为,是否有hidden channels?意思是clients间存在因果或先后关系,比如通过发message,都是这个因果关系对于系统是不可知的,比如A write,然后发送消息给B,然后B write,那么因果关系上,B write一定后于A write,但是对于数据库而言他并不知道,所以并不能保证A write先写入;

spanner是可以保证externally consistent, 由于他通过 commit wait,来保证每个transaction之间一定是全局有序的

相对的globally consistent,更简单些,因为并没有hidden channels

 

Spanner’s key innovation is that timestamps assigned by the system can be used to achieve external consistency, but also have physical meaning.

HybridTime is always globally consistent, and through selective application of commit-wait is externally consistent.

Spanner的主要创新在于实现external一致性的,同时还保留了物理时间的

而HybridTime,默认情况下是可以实现globally consistent,即偏序,因为他本身就是使用lamport时钟,并且当你选择commit-wait时,也可以保证externally consistent;

 

HybridTime Assumptions

1. HybridTime assumes that machines have a reasonably accurate physical clock, represented by the PCi(e) function, which outputs the numeric timestamp returned by the physical clock as read by process i for event e, that is able to provide absolute time measurements (usually in milli- or microseconds since 1 January 1970).

2. keeps the physical clocks across different servers synchronized with regard to a reference server, the “reference” time, represented by the PCref(e) function which outputs the numeric timestamp returned by the “reference” process for event e;

Additionally, we assume that such a substrate is able to provide an error bound along with each time measurement, denoted by the Ei(e) function, which outputs the numeric value ε error of process i at the time e occurred

3. we make no assumptions on the actual accuracy of the clocks, i.e. the physical timestamps returned by server’s clocks may have an arbitrarily large but finite error, as long as this error’s bound is known

说白了,假设

有个相对靠谱的物理时钟,一个理想的参照时钟,以及他们之间相差的,error bound

最后,我们并不假设这个error bound会很小,只要有限就可以

假设1,我们有有限的error bound

image

 

假设2, 进程级别的物理时钟单调递增,注意是进程级别

image

 

基于以上假设,提出HybridTime Clock and Protocol

 

HybridTime Clock and Protocol

HybridTime clock(HTC) is a pair (physical,logical) where the first component is a representation of the physical time at which the event occurred and the second component is a logical sequence number.

定义其实显而易见。。。

Algorithm 2 depicts the HTC algorithm.

image

HTC算法如上,两部分,NOW和UPDATE

now就是取当前的HybridTime clock

upate就是根据in 来更新当前的clock

Algorithm 2 implements a Lamport Clock, with the additional advantage that generated timestamps have physical meaning and are accurate representations of physical time within a bound error.

算法本身,其实就是Lamport Clock,只是增加了physical clock的部分

给出一个例子,

image

 

To order the events timestamped using the HybridTime Clock algorithm we use Definition 1.

Definition 1. HCT(e) < HCT( f ) is defined as the lexicographical ordering of the timestamp two-tuple (physical,logical)

Theorem 1. The HybriTime clock happened-before relation forms a total order of events

Theorem 2. For any event in a causal chain f , the physical component of a HTC timestamp approximates the “real” time the event occurred, with a error defined and bounded by

image

 

Implementation

No Consistency - In this mode there are no external consistency guarantees, transactions are assigned timestamps from each server’s physical clock and no guarantee is made that reads are consistent or repeatable.

直接用local的物理时间,不保证一致性

HybridTime Consistency - In this mode our implementation guarantees the global consistency as Spanner, absent hidden channels, but using HybridTime instead of commit-wait. 
Clients choosing this consistency mode on writes must make sure that the timestamp that is received from the server is propagated to other servers and/or clients. 
Within the same client process, timestamps are automatically propagated on behalf of the user.

这个其实和逻辑时钟,没有区别,就是保证偏序

Commit-wait Consistency - In this mode our implementation guarantees the same external consistency semantics as Spanner by also using commit-wait in the way described in the original paper. 
However instead of using TrueTime, which is a proprietary and private API, we implemented commit-wait on top of the widely used Network Time Protocol (NTP). Hence, in this consistency mode we support hidden channels.

这个估计没啥用,在没有spanner硬件保证的情况下,commit-wait,性能不能忍吧

相关文章
|
7月前
|
Oracle 关系型数据库
Mixed Mode Auditing
Mixed Mode Auditing
26 1
|
前端开发
Warning: This synthetic event is reused for performance reasons.
Warning: This synthetic event is reused for performance reasons.
528 0
Warning: This synthetic event is reused for performance reasons.
PAT (Advanced Level) Practice - 1096 Consecutive Factors(20 分)
PAT (Advanced Level) Practice - 1096 Consecutive Factors(20 分)
147 0
|
C++
PAT (Advanced Level) Practice - 1038 Recover the Smallest Number(30 分)
PAT (Advanced Level) Practice - 1038 Recover the Smallest Number(30 分)
128 0