NoSQL Databases: Key-/Value-Stores


Key-/value-stores have a simple data model in common: a map/dictionary, allowing clients to put and request values per key. 
Beyond this data model and API, modern key-/value-stores favor high scalability over consistency, and therefore most of them also omit rich ad-hoc querying and analytics features (joins and aggregate operations in particular are set aside).
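
The put/get contract can be captured in a tiny interface. Below is a minimal sketch in Java; the interface name and the in-memory implementation are illustrative, not taken from any particular store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal key-/value-store contract: opaque values addressed by key.
interface KeyValueStore<K, V> {
    void put(K key, V value);
    V get(K key);          // returns null if the key is absent
    void delete(K key);
}

// Illustrative in-memory implementation backed by a concurrent map.
class InMemoryStore<K, V> implements KeyValueStore<K, V> {
    private final Map<K, V> data = new ConcurrentHashMap<>();
    public void put(K key, V value) { data.put(key, value); }
    public V get(K key)             { return data.get(key); }
    public void delete(K key)       { data.remove(key); }
}
```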

Key-/value-stores have existed for a long time (e. g. Berkeley DB [Ora10d]), but a large number of stores in this class of NoSQL systems has emerged in the last couple of years, heavily influenced by Amazon’s Dynamo, which will be investigated thoroughly in this chapter. Among the great variety of free and open-source key-/value-stores, Project Voldemort will be further examined. At the end of the chapter some other notable key-/value-stores will be briefly looked at.

This is arguably the simplest kind of NoSQL: it emphasizes high scalability, while consistency and query/analytics features are comparatively weak. Databases of this type appeared very early, e. g. Berkeley DB, but back then data volumes were not that huge, so the advantages of such a simple model were not apparent.

So it was not until Amazon’s Dynamo that this category really took off...

 

4.1. Amazon’s Dynamo

Amazon Dynamo is one of several databases used at Amazon for different purposes (others are e. g. SimpleDB or S3, the Simple Storage Service, cf. [Ama10b], [Ama10a]).

DeCandia et al. note their contribution to the research community is that Dynamo as an “eventually consistent storage system can be used in production with demanding applications”.

Best to just read the paper directly...

http://www.cnblogs.com/fxjwind/archive/2012/06/26/2563628.html

The two subsections below are summarized well; notes follow.

4.1.4. Implementation and Optimizations

In addition to the system design, DeCandia et al. make concrete remarks on the implementation of Dynamo. Based on experience at Amazon, they also provide suggestions for its optimization (cf. [DHJ+07, p. 213f]):

  • The code running on a Dynamo node consists of a request coordination, a membership and a failure detection component, each of them implemented in Java.
  • Dynamo provides pluggable persistence components (e. g. Berkeley Database Transactional Data Store and BDB Java Edition [Ora10d], MySQL [Ora10c], an in-memory buffer with a persistent backing store). Experience from Amazon shows that “[applications] choose Dynamo’s local persistence engine based on their object size distribution”.
  • The component in charge of coordinating requests is implemented “on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages”. Internode communication employs Java NIO channels (see [Ora10b]).
  • When a client issues a read or write request, the contacted node will become its coordinator. It then creates a state machine that “contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies and packaging the response to the client. Each state machine instance handles exactly one client request.”
  • If a request coordinator receives a stale version of data from a node, it will update it with the latest version. This is called read-repair “because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it”. (A minimal sketch of read-repair follows after this list.)
  • To achieve even load-distribution, write requests can be addressed to “any of the top N nodes in the preference list”.
  • As an optimization reducing latency for both read and write requests, Dynamo allows client applications to act as coordinators of these operations.
  • As write requests usually follow read requests, Dynamo pursues a further optimization: 
    in a read response, the storage node replying fastest to the read coordinator is transmitted to the client; in a subsequent write request, the client will contact this node. “This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting ‘read-your-writes’ consistency”.
  • A further optimization to reach Amazon’s desired performance in the 99.9th percentile is the introduction of an object buffer in the main memory of storage nodes. 
    In this buffer, write requests are queued and persisted to disk periodically by a writer thread. 
    In read operations, both the object buffer and the persisted data of storage nodes have to be examined. 
    Using an in-memory buffer that is persisted asynchronously can result in data loss when a server crashes. To reduce this risk, the coordinator of the write request will choose one particular replica node to perform a durable write for the data item. This durable write will not impact the performance of the request, as the coordinator only waits for W − 1 nodes to answer before responding to the client (cf. [DHJ+07, p. 215]).
  • Dynamo nodes do not only serve client requests but also perform a number of background tasks. A component named admission controller is in charge of dividing resources between request processing and background tasks; it “constantly monitors the behavior of resource accesses while executing a “foreground” put/get operation”. Monitoring includes disk I/O latencies, lock contention and transaction timeouts resulting in failed database access, as well as waiting periods in the request queue. Via this monitoring feedback, the admission controller decides on the number of time slices for resource access or consumption to be given to background tasks. It also coordinates the execution of background tasks, which have to explicitly apply for resources with the admission controller (cf. [DHJ+07, p. 218]).
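
To make the read-repair idea above concrete, here is a minimal sketch. It deliberately simplifies Dynamo: versions are plain counters instead of vector clocks, and the Replica interface is hypothetical (a real Dynamo node communicates over a messaging substrate):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replica interface standing in for Dynamo's inter-node messaging.
interface Replica {
    Versioned read(String key);
    void write(String key, Versioned value);
}

// A value tagged with a version; Dynamo uses vector clocks, simplified
// here to a monotonically increasing counter.
record Versioned(byte[] value, long version) {}

class ReadCoordinator {
    // Read from the replicas, return the freshest value, and opportunistically
    // push that value back to any replica that returned a stale version.
    static Versioned readWithRepair(String key, List<Replica> replicas) {
        List<Versioned> replies = new ArrayList<>();
        Versioned newest = null;
        for (Replica r : replicas) {
            Versioned v = r.read(key);
            replies.add(v);
            if (v != null && (newest == null || v.version() > newest.version())) {
                newest = v;
            }
        }
        if (newest != null) {
            for (int i = 0; i < replicas.size(); i++) {
                Versioned v = replies.get(i);
                if (v == null || v.version() < newest.version()) {
                    replicas.get(i).write(key, newest);  // read-repair of a stale replica
                }
            }
        }
        return newest;
    }
}
```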

 

4.1.5. Evaluation

To conclude the discussion of Amazon’s Dynamo, a brief evaluation based on Bob Ippolito’s talk “Drop ACID and think about Data” shall be presented in table 4.2.

Advantages:

  • “No master” 
  • “Highly available for write” operations 
  • “Knobs for tuning reads” (as well as writes) 
  • “Simple”

Disadvantages:

  • “Proprietary to Amazon” 
  • “Clients need to be smart” (support vector clocks and conflict resolution, balance clusters) 
  • “No compression” (client applications may compress values themselves, keys cannot be compressed) 
  • “Not suitable for column-like workloads” 
  • “Just a Key/Value store” (e. g. range queries or batch operations are not possible)

 

4.2. Project Voldemort

Project Voldemort is a key-/value-store initially developed for and still used at LinkedIn.

http://project-voldemort.com/design.php

http://www.54chen.com/document/dynamo-based-systems.html (Chinese translation)

The design doc does not seem to offer much substance or anything new; it does not even explain why they had to build this system in the first place. Would using Dynamo directly not work?

In terms of design, a layered architecture is adopted. Keeping each layer independent means layers can be mixed and matched to satisfy different runtime needs. For example, a compression layer can be added that compresses byte values below the serialization level. Similarly, flexible, intelligent routing can be applied when routing data to partitions.

The following figure depicts options for the physical deployment of Project Voldemort with a focus on routing and load-balancing.

The trade-off depicted in the figure is reducing latency by minimizing network hops vs. stronger coupling caused by routing logic that moves up the software stack towards the frontend. A minimal sketch of such client-side routing follows below.
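
To illustrate how a routing layer decides where a key lives, here is a minimal client-side consistent-hashing router in Java. The class and method names are illustrative, not Voldemort’s actual API, and real routing also has to account for replication and node failures:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: a key is routed to the first node whose
// position on the ring is >= hash(key), wrapping around at the end.
class ConsistentHashRouter {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    // Each physical node is placed on the ring several times ("virtual nodes")
    // to even out the key distribution.
    void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put((node + "#" + i).hashCode(), node);
        }
    }

    String route(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }
}
```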

 

4.2.3. Versioning and Consistency

Like Amazon’s Dynamo, Project Voldemort is designed to be highly available for write operations, allows concurrent modifications of data and uses vector clocks to allow causal reasoning about different versions (see sections 3.1.3 and 4.1.3). If the datastore itself cannot resolve version conflicts, client applications are requested to resolve conflicts at read time.
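
Vector clocks can be sketched compactly. A minimal version, assuming one logical counter per node ID (real implementations also prune old entries):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal vector clock: one logical counter per node that wrote the value.
class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // true if every counter in this clock is <= the corresponding counter in
    // 'other', i.e. 'other' has seen everything this clock has seen.
    boolean happenedBefore(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) {
                return false;
            }
        }
        return true;
    }

    // Two versions conflict if neither happened before the other; in that
    // case the store hands both versions to the client at read time.
    boolean conflictsWith(VectorClock other) {
        return !this.happenedBefore(other) && !other.happenedBefore(this);
    }
}
```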

 

4.2.4. Persistence Layer and Storage Engines 
Project Voldemort provides pluggable persistence: the lowest layer of the logical architecture allows for different storage engines. Out of the box, Berkeley DB (the default storage engine), MySQL as well as in-memory storage are delivered with Project Voldemort.

To use another existing or self-written storage engine, the operations get, put and delete, as well as an iterator over values, have to be implemented.

So it is extensible: you only have to implement the get, put and delete interfaces (plus an iterator) and you are done.
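
A sketch of such a storage-engine contract in Java; the method names are illustrative (see Voldemort’s source for the actual interface):

```java
import java.util.Iterator;
import java.util.Map;

// Illustrative storage-engine contract: byte-array keys and values, plus an
// iterator over all entries, which background tasks need to scan the store.
interface StorageEngine {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
    void delete(byte[] key);
    Iterator<Map.Entry<byte[], byte[]>> entries();
}
```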

 

4.2.5. Data Model and Serialization 
On the lowest level, keys and values in Project Voldemort are simply byte-arrays. In order to allow applications a more sophisticated notion of keys and values, data models can be configured for each Voldemort store. These data models define serializers that are responsible for converting byte-arrays into the desired data structures and formats. Project Voldemort already contains serializers for the following data structures and formats:

JSON (JavaScript Object Notation) is a binary and typed data model which supports the data types list, map, date, boolean as well as numbers of different precision (cf. [Cro06]).

String to store uninterpreted strings, which can also be used for XML blobs. 
Java Serialization provided by Java classes implementing the java.io.Serializable interface (cf.[Ora10a], [Blo01, p. 213ff]). 
Protocol Buffers are “Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data” which also contains an interface description language to generate code for custom data interchange. Protocol Buffers are widely used at Google “for almost all of its internal RPC protocols and file formats” (cf. [Goo10a], [Goo10b]). 
Identity does no serialization or deserialization at all but simply hands over byte-arrays.
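
The serializer contract itself is small. A sketch, with a trivial string serializer (interface and class names are illustrative):

```java
import java.nio.charset.StandardCharsets;

// A serializer turns typed objects into byte-arrays and back; one is
// configured per store, for keys and for values.
interface Serializer<T> {
    byte[] toBytes(T object);
    T toObject(byte[] bytes);
}

// The "string" data model: UTF-8 encoded, uninterpreted text (or XML blobs).
class StringSerializer implements Serializer<String> {
    public byte[] toBytes(String s)  { return s.getBytes(StandardCharsets.UTF_8); }
    public String toObject(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
}
```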

 

4.2.6. Index Precalculation 
Project Voldemort allows prebuilding indexes offline (e. g. on Hadoop), uploading them to the datastore and transparently swapping to them. This is especially useful for batch operations that insert or update large amounts of data, which would cause full index rebuilds if the data is uploaded as a whole, or index fragmentation if it is inefficiently inserted in small portions. With Project Voldemort, large amounts of data can be inserted as a whole, and the offline index-building feature disburdens live systems from full index rebuilds and index fragmentation.
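
The “transparent swap” can be pictured as an atomic reference flip: readers keep using the old index until the prebuilt one is fully in place. A minimal sketch under that assumption (the Index type is hypothetical; in Voldemort the prebuilt artifacts are index files fetched onto the node):

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical read-only index, standing in for index files built offline.
record Index(Map<String, byte[]> entries) {}

class SwappableStore {
    private final AtomicReference<Index> current =
            new AtomicReference<>(new Index(Map.of()));

    byte[] get(String key) {
        return current.get().entries().get(key);  // reads always see a complete index
    }

    // Swap in a fully built index; readers switch atomically, and the live
    // node never has to rebuild anything itself.
    void swap(Index prebuilt) {
        current.set(prebuilt);
    }
}
```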

 

4.3. Other Key-/Value-Stores

4.3.1. Tokyo Cabinet and Tokyo Tyrant

Tokyo Cabinet ([FAL10a], http://fallabs.com/tokyocabinet/) is the core library of this datastore, persisting data and exposing a key-/value-interface to clients, thereby abstracting from internal data structures such as hash-tables or B+tree-indexes.

Tokyo Tyrant ([FAL10b], the network interface of Tokyo Cabinet, http://fallabs.com/tokyotyrant/) provides network access to this database library via a proprietary binary protocol, via HTTP, as well as through the memcached protocol.

Tokyo Cabinet manages data storage on disk and in memory in a fashion similar to paging/swapping. Memory pages are flushed to disk periodically, “which leaves an open data loss hole”, as Hoff comments (cf. [Hof09a]).

And the winner is: MySQL or Memcached or Tokyo Tyrant?, http://highscalability.com/blog/2009/10/28/and-the-winner-is-mysql-or-memcached-or-tokyo-tyrant.html

Tokyo Tyrant is a disk based key-value store. Tyrant was nearly 2x faster than MySQL + memcached and about 20% slower than a memcached only solution. Keep in mind one concern about Tyrant is that it is not distributed, it's not a scale-out solution, so it will face sharding issues similar to MySQL. It also flushes data to disk periodically which leaves an open data loss hole and variable performance during syncs.

 

Tokyo Cabinet allows compressing pages with the LZW algorithm, which can achieve good compression ratios (see e. g. [Lin09]).

Tokyo Cabinet does not partition data automatically and therefore has to be replicated with a strategy similar to MySQL. In addition to lookups by key, it can match prefixes and ranges if keys are ordered.

Regarding transactional features, it is worth mentioning that Tokyo Cabinet provides write-ahead logging and shadow paging. The Tokyo suite is actively developed, well documented and widely regarded as highly performant: 1 million records can be stored in 0.7 seconds using the hash-table engine and in 1.6 seconds using the B-tree engine, according to North (cf. [Nor09], [Ipp09], [See09], [Lin09], [Hof09a]).

 

4.3.2. Redis 
Redis is a relatively new datastore which its developers unspecifically refer to as a “data structure store”; it is commonly subsumed under key-/value-stores because of its map/dictionary API. What is special about Redis is that it allows matching on key ranges, e. g. matching of numeric ranges or regular expressions. In contrast to other key-/value-stores, Redis does not only store bytes as values but also supports lists and sets as values directly.
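
For illustration, here are list and set values accessed from Java via the Jedis client, one of several Redis clients (assuming a Redis instance on localhost; the keys are made up for the example):

```java
import redis.clients.jedis.Jedis;

public class RedisStructures {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Values are not just opaque bytes: Redis supports lists and sets natively.
            jedis.lpush("recent:logins", "alice", "bob");      // list value
            jedis.sadd("tags:article:42", "nosql", "redis");   // set value

            System.out.println(jedis.lrange("recent:logins", 0, -1));
            System.out.println(jedis.smembers("tags:article:42"));
        }
    }
}
```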

A major disadvantage of Redis is that the amount of main memory limits the amount of data it can store; capacity cannot be expanded by using hard disks. Ippolito comments that for this reason Redis is probably a good fit for a caching layer (cf. [Ipp09]).

 

4.3.3. Memcached and MemcacheDB 
Memcached is the popular and fast memory-caching solution, widely used among large and ultra-large-scale web sites to reduce database load.

Though memcached is not intended for persistent storage (cf. [F+10b]), there is a solution named MemcacheDB (cf. [Chu09]) that conforms to the memcached protocol (cf. [F+10c]) and adds persistence based on Berkeley DB (cf. [Nor09], [Ipp09]).
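
Because MemcacheDB speaks the memcached protocol, a standard memcached client can talk to it unchanged. A sketch using the spymemcached Java client (the server address and expiry value are illustrative):

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcacheExample {
    public static void main(String[] args) throws Exception {
        // Point this at memcached (port 11211) or at a MemcacheDB node: same protocol.
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));

        client.set("greeting", 3600, "hello").get();  // expiry in seconds; wait for the ack
        System.out.println(client.get("greeting"));

        client.shutdown();
    }
}
```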

As memcached does not provide any replication between nodes and is also not tolerant towards machine failures, simply adding persistent storage to it is not enough, as blogger Richard Jones of Last.fm remarks. He notes that solutions like repcached can replicate whole memcached servers (in a master-slave setup) but, without fault-tolerant partitioning, they still cause management and maintenance effort (cf. [Jon09]).

 

4.3.4. Scalaris 
Scalaris is a key-/value-store written in Erlang, taking advantage of this language’s strengths, e. g. in implementing a non-blocking commit protocol for atomic transactions.

Scalaris uses an adapted version of the chord service (cf. [SMK+01]) to expose a distributed hash table to clients. As it stores keys in lexicographic order, range queries on prefixes are possible.
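
Why lexicographic key order enables prefix queries is easy to see with an ordered map. A minimal sketch using Java’s TreeMap as a stand-in for the ordered key space (the keys are made up for the example):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixQuery {
    public static void main(String[] args) {
        SortedMap<String, String> store = new TreeMap<>();
        store.put("user:1", "alice");
        store.put("user:2", "bob");
        store.put("video:9", "clip");

        // Because keys are kept in lexicographic order, all keys with a given
        // prefix form one contiguous range: [prefix, prefix + maxChar).
        SortedMap<String, String> users = store.subMap("user:", "user:" + '\uffff');
        System.out.println(users);  // {user:1=alice, user:2=bob}
    }
}
```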

In contrast to other key-/value-stores Scalaris has a strict consistency model, provides symmetric replication and allows for complex queries (via programming language libraries). It guarantees ACID properties also for concurrent transactions by implementing an adapted version of the Paxos consensus protocol (cf. [Lam98]). Like memcached, Scalaris is a pure in-memory key-/value-store (cf. [Nor09], [Jon09]).


This article was excerpted from cnblogs; original publication date: 2012-06-27.
