# 【原创】 熵与反熵

在 Riak 的众多概念中，存在一个称作 Active Anti-Entropy 的，由于从字面上很难理解其真实含义，所以稍微研究了一番。结果发现水还是比较深的~
下面通过一些相关资料的研读，展开这次的知识学习。

---

## Active Anti-Entropy

Amazon Dynamo 论文中给出一种高效的、基于 Merkle Tree's (hash tree) 比较数据是否相同的方法；
BitTorrent 的实现中同样使用了 merkle tree 。

Merkle Tree

merkle tree 的叶子节点值是针对目标 data 计算得到的 hash 值；

Anti-entropy

Active anti-entropy 可以理解成“在后台，连续性的、自动的，在不同副本间进行 merkle tree 比较，并修复不一致的过程”；

Joseph Blomstedt has an excellent video describing how active anti-entropy works in Riak. The video is also a really great example of how merkle tree's work so I suggest anyone using Cassandra or Riak or just interested in distributed systems to give it a watch. Or four.

---

## Why is the word “entropy” present in anti-entropy protocols?

Q: Anti-entropy protocols are a form of gossip protocols. http://en.wikipedia.org/wiki/Gossip_protocol. I was wondering if someone could explain, the the significance of word entropy here.

A: The noise creeping into the messages can be seen as a form of growing entropy in the message content. Cancelling the noise (in this case by comparing multiple replicas of the original message) is a form of entropy lowering process (i.e. an anti-entropy process, since entropy is never decreasing on its own).

Q: This answer is somewhat convincing, but eoht.info/page/Anti-entropy says that term anti-entropy is generally used to define process which are orderly and organized. I think there is still no orderly behavior in replica reconciliation. Do you think this use of anti-entropy has no significance here.

A: entropy itself is not a process, rather is it a statistical quality (usually in the field of physics or informatics), which never decreases (on it's own). I'm reading that page so that the a-e processes are working against the "natural" way entropy does. In the case of replica reconciliation it is not the process of comparing the copies itself but its output (something that seems to be the correct "average" closest to the original) that works against the noise (entropy) added during transmissions/copying earlier.

• 熵本身不是处理过程，而是一种统计量，通常用于物理学和信息学领域；
• 熵自身从不会减少；
• 反熵是指一种降低熵的处理过程；
• 针对多副本协调的场景，反熵处理并非指针对副本本身的比较处理，而是指针对传输或复制过程中附加上熵的输出内容（个人理解为原始数据的副本）的处理；

---

## Gossip protocol

gossip 协议是一种计算机间的通信协议，受启发于社交网络中的流言传播模型；

• 底层网络属于超大型异构网络（inconvenient structure）；
• 基于 gossip 的解决方案最为高效；

Gossip 的通信模型

The concept of gossip communication can be illustrated by the analogy of office workers spreading rumors. Let's say each hour the office workers congregate around the water cooler. Each employee pairs off with another, chosen at random, and shares the latest gossip. At the start of the day, Alice starts a new rumor: she comments to Bob that she believes that Charlie dyes his mustache. At the next meeting, Bob tells Dave, while Alice repeats the idea to Eve. After each water cooler rendezvous, the number of individuals who have heard the rumor roughly doubles (though this doesn't account for gossiping twice to the same person; perhaps Alice tries to tell the story to Frank, only to find that Frank already heard it from Dave). Computer systems typically implement this type of protocol with a form of random "peer selection": with a given frequency, each machine picks another machine at random and shares any hot rumors.

gossip 协议的强悍之处在于强劲的信息传播方式；因为相同的信息很可能会反复接收到多次；

• 协议的核心特性为：周期性，两两配对，进程间交互；
• 交互中的信息交换是大小受限的；
• 只要 agent 间发生了交互，则至少有一个 agent 发生了状态变更，以反应出对端的状态；
• 不假定底层一定采用了可靠通信方式；
• 交互的频率与典型的消息延迟相比仍旧算低的，所以协议本身的附加成本是可以忽略的；
peer 选举算法具有某种形式的随机性。可以从节点的全集中进行 peer 的选择，也可以从子集中进行 peer 的选择；

Gossip protocol 种类

Dissemination 协议（或者称为 rumor-mongering 协议）

Anti-entropy 协议

Tribler, BitTorrent peer to peer client using gossip protocol.

---

## Merkle tree

hash tree 就是 Merkle tree ；

hash tree 的用途：可以高效、安全的验证大数据结构中包含的内容；
hash tree 可以认为是 hash list 和 hash chain 的泛化形式；

hash tree 的概念由 Ralph Merkle 在 1979 年命名；

hash tree 被用于确定不同计算机之间保存的、处理后的，以及经过传输后的，任何类型的数据差异；

hash tree 被建议用在受信计算机系统中；
hash tree 已被用于 IPFS 和 ZFS 文件系统中，BitTorrent 协议中，Apache Wave 协议中，Git 分布式修正控制系统中，Tahoe-LAFS 备份系统中，Bitcoin P2P 网络中，Certificate Transparency 框架中，以及许多 NoSQL 系统（Apache Cassandra 和 Riak）中。

A hash tree is a tree of hashes in which the leaves are hashes of data blocks in, for instance, a file or set of files. Nodes further up in the tree are the hashes of their respective children. For example, in the picture hash 0 is the result of hashing the result of concatenating hash 0-0 and hash 0-1. That is, hash 0 = hash( hash 0-0 + hash 0-1 ) where + denotes concatenation.

Most hash tree implementations are binary (two child nodes under each node) but they can just as well use many more child nodes under each node.

Usually, a cryptographic hash function such as SHA-2 is used for the hashing. If the hash tree only needs to protect against unintentional damage, much less secure checksums such as CRCs can be used.

In the top of a hash tree there is a top hash (or root hash or master hash). Before downloading a file on a p2p network, in most cases the top hash is acquired from a trusted source, for instance a friend or a web site that is known to have good recommendations of files to download. When the top hash is available, the hash tree can be received from any non-trusted source, like any peer in the p2p network. Then, the received hash tree is checked against the trusted top hash, and if the hash tree is damaged or fake, another hash tree from another source will be tried until the program finds one that matches the top hash.

---

+ 订阅