Key-/value-stores have a simple data model in common: a map/dictionary that allows clients to put and get values by key.
Beyond this data model and API, modern key-/value-stores favor high scalability over consistency, and therefore most of them also omit rich ad-hoc querying and analytics features (joins and aggregate operations in particular are set aside).
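To make this data model concrete, here is a minimal sketch of the map/dictionary-style API such stores expose, together with a trivial in-memory implementation. The interface and class names are illustrative and not taken from any particular product.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the map/dictionary-style API described above.
// Names are illustrative and not taken from any particular store.
interface KeyValueStore<K, V> {
    void put(K key, V value);   // store (or overwrite) the value for a key
    Optional<V> get(K key);     // look up the value stored under a key, if any
    void delete(K key);         // remove the entry for a key
}

// Trivial in-memory implementation, just to show how clients use the API.
class InMemoryStore<K, V> implements KeyValueStore<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    public void put(K key, V value) { map.put(key, value); }
    public Optional<V> get(K key)   { return Optional.ofNullable(map.get(key)); }
    public void delete(K key)       { map.remove(key); }
}
```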
Key-/value-stores have existed for a long time (e. g. Berkeley DB [Ora10d]), but a large number of the NoSQL stores in this class that have emerged in the last couple of years have been heavily influenced by Amazon’s Dynamo, which will be investigated thoroughly in this chapter. Among the great variety of free and open-source key-/value-stores, Project Voldemort will be further examined. At the end of the chapter some other notable key-/value-stores will be briefly looked at.
This is probably the simplest kind of NoSQL store: it emphasizes high scalability and is comparatively weak on consistency and on querying/analytics features. Databases of this type have existed for a long time, like Berkeley DB, but back then data volumes were not that huge, so the advantage of such a simple model was not apparent.
It was not until Amazon’s Dynamo that this category really took off...
4.1. Amazon’s Dynamo
Amazon Dynamo is one of several databases used at Amazon for different purposes (others are e. g. SimpleDB or S3, the Simple Storage Service, cf. [Ama10b], [Ama10a]).
DeCandia et al. note their contribution to the research community is that Dynamo as an “eventually consistent storage system can be used in production with demanding applications”.
Just go read the paper directly...
http://www.cnblogs.com/fxjwind/archive/2012/06/26/2563628.html
The two points summarized below are quite good; notes:
4.1.4. Implementation and Optimizations
In addition to the system design, DeCandia et al. make concrete remarks on the implementation of Dynamo. Based on experience at Amazon, they also provide suggestions for its optimization (cf. [DHJ+07, p. 213f]):
- The code running on a Dynamo node consists of a request coordination, a membership and a failure detection component—each of them implemented in Java.
- Dynamo provides pluggable persistence components (e. g. Berkeley Database Transactional Data Store and BDB Java Edition [Ora10d], MySQL [Ora10c], an in-memory buffer with a persistent backing store). Experience from Amazon shows that “[applications] choose Dynamo’s local persistence engine based on their object size distribution”.
- The component in charge of coordinating requests is implemented “on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages”. Internode communication employs Java NIO channels (see [Ora10b]).
- When a client issues a read or write request, the contacted node will become its coordinator. It then creates a state machine that “contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies and packaging the response to the client. Each state machine instance handles exactly one client request.”
- If a request coordinator receives a stale version of data from a node, it will update it with the latest version. This is called read-repair “because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it”.
- To achieve an even load-distribution, write requests can be addressed to “any of the top N nodes in the preference list”.
- As an optimization reducing latency for both read and write requests, Dynamo allows client applications to act as coordinator of these operations.
- As write requests usually follow read requests, Dynamo pursues a further optimization: in a read response, the storage node replying the fastest to the read coordinator is transmitted to the client; in a subsequent write request, the client will contact this node. “This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting “read-your-writes” consistency”.
- A further optimization to reach Amazon’s desired performance in the 99.9th percentile is the introduction of an object buffer in the main memory of storage nodes. In this buffer write requests are queued and persisted to disk periodically by a writer thread. Read operations have to examine both the object buffer and the persisted data of the storage node (a minimal sketch of this mechanism follows this list). Using an in-memory buffer that is persisted asynchronously can result in data loss when a server crashes. To reduce this risk, the coordinator of the write request will choose one particular replica node to perform a durable write for the data item. This durable write will not impact the performance of the request, as the coordinator only waits for W − 1 nodes to answer before responding to the client (cf. [DHJ+07, p. 215]).
- Dynamo nodes do not only serve client requests but also perform a number of background tasks. For resource division between request processing and background tasks, a component named admission controller is in charge that “constantly monitors the behavior of resource accesses while executing a “foreground” put/get operation”. Monitoring includes disk I/O latencies, lock contention and transaction timeouts resulting in failed database accesses, as well as waiting periods in the request queue. Via this monitoring feedback, the admission controller decides on the number of time slices for resource access or consumption to be given to background tasks. It also coordinates the execution of background tasks, which have to explicitly apply for resources with the admission controller (cf. [DHJ+07, p. 218]).
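To make the buffered-write idea above concrete, here is a minimal, hypothetical sketch (Dynamo’s actual code is not public; the class and interface names below are made up): writes go into an in-memory buffer, a writer thread flushes it to the backing store periodically, and reads have to consult the buffer before falling back to the persisted data.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a write-back object buffer as described above.
// "BackingStore" stands for the node's local persistence engine (e.g. BDB).
class BufferedStorageNode {
    interface BackingStore {
        void write(String key, byte[] value);
        byte[] read(String key);
    }

    private final BackingStore disk;
    // In-memory object buffer holding writes that have not been persisted yet.
    private final Map<String, byte[]> buffer = new ConcurrentHashMap<>();
    private final ScheduledExecutorService writer = Executors.newSingleThreadScheduledExecutor();

    BufferedStorageNode(BackingStore disk, long flushIntervalMillis) {
        this.disk = disk;
        // Writer thread periodically persists and drains the buffer.
        writer.scheduleAtFixedRate(this::flush, flushIntervalMillis,
                flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    void put(String key, byte[] value) {
        buffer.put(key, value);            // fast: only touches main memory
    }

    byte[] get(String key) {
        byte[] buffered = buffer.get(key); // reads must check the buffer first ...
        return buffered != null ? buffered : disk.read(key); // ... then the persisted data
    }

    private void flush() {
        for (Map.Entry<String, byte[]> e : buffer.entrySet()) {
            disk.write(e.getKey(), e.getValue());
            // Remove only if the value was not overwritten while flushing.
            buffer.remove(e.getKey(), e.getValue());
        }
    }
}
```

A crash loses whatever still sits in the buffer, which is exactly why Dynamo additionally lets the write coordinator pick one replica to perform a durable write, as described above.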
4.1.5. Evaluation
To conclude the discussion of Amazon’s Dynamo, a brief evaluation based on Bob Ippolito’s talk “Drop ACID and Think about Data” shall be presented in table 4.2.
Advantages | Disadvantages
“No master” | “Proprietary to Amazon”
4.2. Project Voldemort
Project Voldemort is a key-/value-store initially developed for and still used at LinkedIn.
http://project-voldemort.com/design.php
http://www.54chen.com/document/dynamo-based-systems.html (in Chinese)
The design doc does not really contain anything new or substantial; it does not even explain why they built this system in the first place — would using Dynamo directly not have been enough?
In terms of design, a layered architecture is used. Keeping each layer independent means the layers can be mixed and matched to meet different runtime needs. For example, a compression layer could be added that compresses byte values below the serialization level. Likewise, flexible, intelligent routing can be done when routing data to partitions.
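As an illustration of this layered, mix-and-match design, the following sketch shows how an extra compression layer could be slotted in below serialization. It assumes a simplified, generic store interface; it is not Voldemort’s actual Store API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Simplified sketch of the layered design: every layer implements the same
// store interface and wraps the layer below it (decorator style).
// The interface is illustrative; it is not Voldemort's actual Store interface.
interface ByteStore {
    void put(byte[] key, byte[] value);
    byte[] get(byte[] key);
}

// Example of an extra layer: transparently compresses values before handing
// them to the layer below (e.g. the routing or persistence layer).
class CompressionLayer implements ByteStore {
    private final ByteStore inner;
    CompressionLayer(ByteStore inner) { this.inner = inner; }

    public void put(byte[] key, byte[] value) { inner.put(key, compress(value)); }
    public byte[] get(byte[] key)             { return decompress(inner.get(key)); }

    private static byte[] compress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, new Deflater())) {
            dos.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    private static byte[] decompress(byte[] data) {
        if (data == null) return null;
        try (InflaterInputStream iis = new InflaterInputStream(new ByteArrayInputStream(data))) {
            return iis.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since every layer exposes the same interface as the layer below it, compression, routing or caching layers can be added or removed without the neighbouring layers noticing.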
The following figure depicts options for the physical deployment of Project Voldemort, with a focus on routing and load-balancing.
The trade-off depicted in the figure is reducing latency by minimizing network hops vs. stronger coupling as routing logic moves up the software stack towards the frontend.
4.2.3. Versioning and Consistency
Like Amazon’s Dynamo, Project Voldemort is designed to be highly available for write operations, allows concurrent modifications of data and uses vector clocks to allow causal reasoning about different versions (see sections 3.1.3 and 4.1.3). If the datastore itself cannot resolve version conflicts, client applications are asked to resolve them at read time.
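The following is a minimal, illustrative sketch of the vector-clock bookkeeping this implies (not Voldemort’s actual VectorClock class): each stored version carries a map from node id to counter, and two versions are concurrent — and therefore handed to the client for conflict resolution — if neither clock dominates the other.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative vector clock; not Voldemort's actual implementation.
class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // A node increments its own entry whenever it coordinates a write.
    void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // true if this clock is causally equal to or behind 'other' in every entry
    boolean happenedBefore(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) {
                return false;
            }
        }
        return true;
    }

    // Neither clock dominates: the versions were written concurrently and the
    // conflict has to be resolved by the client application at read time.
    boolean concurrentWith(VectorClock other) {
        return !this.happenedBefore(other) && !other.happenedBefore(this);
    }
}
```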
4.2.4. Persistence Layer and Storage Engines
Project Voldemort provides pluggable persistence, as the lowest layer of the logical architecture allows for different storage engines. Out of the box, Berkeley DB (the default storage engine), MySQL as well as in-memory storage are delivered with Project Voldemort.
To use another existing or self-written storage engine, the operations get, put and delete as well as an iterator over the stored values have to be implemented.
Extensible: you only have to implement the three operations get, put and delete (plus an iterator) and you are done.
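A minimal sketch of what such a pluggable engine has to provide could look as follows; the interface is a simplification for illustration, not Voldemort’s actual StorageEngine interface.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified sketch of a pluggable storage engine contract: get, put, delete
// plus an iterator over the stored entries. Not Voldemort's real interface.
interface StorageEngine<K, V> {
    List<V> get(K key);                      // may return several concurrent versions
    void put(K key, V value);
    boolean delete(K key);
    Iterator<Map.Entry<K, V>> entries();     // used e.g. for background/maintenance scans
}
```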
4.2.5. Data Model and Serialization
On the lowest level, keys and values in Project Voldemort are simply byte-arrays. In order to allow applications a more sophisticated notion of keys and values, data models can be configured for each Voldemort store. These data models define serializers that are responsible for converting byte-arrays into the desired data structures and formats. Project Voldemort already contains serializers for the following data structures and formats:
JSON (JavaScript Object Notation), which Project Voldemort stores in a binary and typed representation supporting the data types list, map, date, boolean as well as numbers of different precision (cf. [Cro06]).
String to store uninterpreted strings, which can also be used for XML blobs.
Java Serialization provided by Java classes implementing the java.io.Serializable interface (cf.[Ora10a], [Blo01, p. 213ff]).
Protocol Buffers are “Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data” which also contains an interface description language to generate code for custom data interchange. Protocol Buffers are widely used at Google “for almost all of its internal RPC protocols and file formats” (cf. [Goo10a], [Goo10b]).
Identity does no serialization or deserialization at all but simply hands over byte-arrays.
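Conceptually, each of these options boils down to a serializer that maps between application-level objects and the byte-arrays actually stored. A hypothetical sketch of such a contract (names are illustrative, not Voldemort’s actual Serializer interface):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical serializer contract: converts between application-level objects
// and the raw byte arrays the store works with. Not Voldemort's actual interface.
interface Serializer<T> {
    byte[] toBytes(T object);
    T fromBytes(byte[] bytes);
}

// Example: the "String" data model, storing uninterpreted strings (or XML blobs).
class StringSerializer implements Serializer<String> {
    public byte[] toBytes(String object)  { return object.getBytes(StandardCharsets.UTF_8); }
    public String fromBytes(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8); }
}
```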
4.2.6. Index Precalculation
Project Voldemort allows indexes to be prebuilt offline (e. g. on Hadoop), uploaded to the datastore and transparently swapped in. This is especially useful for batch operations inserting or updating large amounts of data, which would otherwise cause full index rebuilds if the data is uploaded as a whole, or index fragmentation if it is inefficiently inserted in small portions. In Project Voldemort large amounts of data can be inserted as a whole, and the offline index building feature disburdens live systems from full index rebuilding or index fragmentation.
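The “transparent swap” can be pictured as a single atomic pointer change: readers always go through one reference to the current read-only index, and switching to a freshly prebuilt index never exposes a partially built state. The sketch below is purely illustrative and not Voldemort’s actual read-only store implementation.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of transparently swapping in a prebuilt, read-only index.
// Not Voldemort's actual read-only store implementation.
class SwappableIndex<K, V> {
    private final AtomicReference<Map<K, V>> current = new AtomicReference<>(Map.of());

    V get(K key) {
        return current.get().get(key);       // readers always see one consistent index
    }

    // Called after a new index has been built offline (e.g. on Hadoop) and uploaded.
    void swapIn(Map<K, V> prebuiltIndex) {
        current.set(prebuiltIndex);          // single atomic switch; no partial rebuilds
    }
}
```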
4.3. Other Key-/Value-Stores
4.3.1. Tokyo Cabinet and Tokyo Tyrant
Tokyo Cabinet ([FAL10a], http://fallabs.com/tokyocabinet/) is the core library of this datastore persisting data and exposing a key-/value-interface to clients and thereby abstracting from internal data structures such as hash-tables or B+tree-indexes.
Tokyo Tyrant ([FAL10b], the network interface of Tokyo Cabinet, http://fallabs.com/tokyotyrant/) provides network access to this database library via a proprietary binary protocol, via HTTP, as well as through the memcached protocol.
Tokyo Cabinet manages data storage on disk and in memory in a fashion similar to paging / swapping. Memory pages are flushed to disk periodically “which leaves an open data loss hole”, as Hoff comments (cf. [Hof09a]).
And the winner is: MySQL or Memcached or Tokyo Tyrant?, http://highscalability.com/blog/2009/10/28/and-the-winner-is-mysql-or-memcached-or-tokyo-tyrant.html
Tokyo Tyrant is a disk based key-value store. Tyrant was nearly 2x faster than MySQL + memcached and about 20% slower than a memcached only solution. Keep in mind one concern about Tyrant is that it is not distributed, it's not a scale-out solution, so it will face sharding issues similar to MySQL. It also flushes data to disk periodically which leaves an open data loss hole and variable performance during syncs.
Tokyo Cabinet allows pages to be compressed with the LZW algorithm, which can achieve good compression ratios (see e. g. [Lin09]).
Tokyo Cabinet does not partition data automatically and therefore has to be replicated and sharded with a strategy similar to MySQL. In addition to lookups by key, it can match prefixes and ranges if keys are ordered.
Regarding transactional features, it is worth mentioning that Tokyo Cabinet provides write-ahead logging and shadow paging. The Tokyo suite is actively developed, well documented and widely regarded as highly performant: 1 million records can be stored in 0.7 seconds using the hash-table engine and in 1.6 seconds using the B-tree engine, according to North (cf. [Nor09], [Ipp09], [See09], [Lin09], [Hof09a]).
4.3.2. Redis
Redis is a relatively new datastore whose developers, somewhat unspecifically, refer to it as a “data structure store”; it is commonly subsumed under key-/value-stores because of its map/dictionary API. A special feature of Redis is that it allows matching on key ranges, e. g. matching of numeric ranges or regular expressions. In contrast to other key-/value-stores, Redis does not only store bytes as values but also supports lists and sets as value types directly.
A major disadvantage of Redis is that the amount of main memory limits the amount of data that can be stored; this cannot be expanded by the use of hard disks. Ippolito comments that for this reason Redis is probably a good fit as a caching layer (cf. [Ipp09]).
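To illustrate the native list and set support, the following sketch uses the Jedis client, a common Java client for Redis; host, port and key names are made up for the example.

```java
import redis.clients.jedis.Jedis;

// Small sketch of Redis' native list and set value types, using the Jedis client.
// Host/port and key names are illustrative.
public class RedisStructuresExample {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            // Plain key/value:
            redis.set("user:42:name", "alice");

            // A list value: recently viewed items, pushed to the head of the list.
            redis.lpush("user:42:recent", "item-7", "item-3");
            System.out.println(redis.lrange("user:42:recent", 0, -1));

            // A set value: tags, duplicates are ignored automatically.
            redis.sadd("item:7:tags", "book", "sci-fi", "book");
            System.out.println(redis.smembers("item:7:tags"));
        }
    }
}
```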
4.3.3. Memcached and MemcacheDB
Memcached is the popular and fast memory-caching solution widely used among large and ultra-large-scale web sites to reduce database load.
Though memcached is not intended for persistent storage (cf. [F+10b]), there is a solution named MemcacheDB (cf. [Chu09]) that conforms to the memcached protocol (cf. [F+10c]) and adds persistence based on Berkeley DB (cf. [Nor09], [Ipp09]).
As memcached does not provide any replication between nodes and is also not tolerant towards machine failures, simply adding a persistent storage to it is not enough, as blogger Richard Jones of Last.fm remarks. He notes that solutions like repcached can replicate whole memcached servers (in a master slave setup) but without fault-tolerant partitioning they will cause management and maintenance efforts (cf. [Jon09]).
4.3.4. Scalaris
Scalaris is a key-/value-store written in Erlang that benefits from this language’s approach, e. g. in implementing a non-blocking commit protocol for atomic transactions.
Scalaris uses an adapted version of the Chord service (cf. [SMK+01]) to expose a distributed hash table to clients. As it stores keys in lexicographic order, range queries on prefixes are possible.
In contrast to other key-/value-stores Scalaris has a strict consistency model, provides symmetric replication and allows for complex queries (via programming language libraries). It guarantees ACID properties also for concurrent transactions by implementing an adapted version of the Paxos consensus protocol (cf. [Lam98]). Like memcached, Scalaris is a pure in-memory key-/value-store (cf. [Nor09], [Jon09]).
This article was excerpted from cnblogs; original publication date: 2012-06-27.