[Erlang 0068] Erlang dict

简介:
  dict是动态哈希表实现的字典.在接口上和 orddict保持一致,在实现上和 array动态扩展的思路类似, 与proplists,orddict相比它能够支持更大的数据量,你可以在数据量膨胀的时候从orddict转为dict.dict使用的是动态哈希技术实现,理论依据是论文: "The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon" ,论文地址:  http://www.2007.cccg.ca/~morin/teaching/5408/refs/a99.pdf
 
 数组寻址容易,插入和删除困难;链表寻址困难,插入和删除容易;哈希表插入和删除的时间均取决于查找时间.哈希表在数据和数据存储位置之间建立了确定的函数关系,所以获得了高效的查询效率,而线性表和树,数据项在结构中的位置是随机的,和数据项取值没有确定的关系,这种结构上进行查找数据项是基于"比较",查找效率依赖比较次数.
 

Segment,Slot,bucket

 
  在维基百科中Hash表中slot和bucket是同等的概念:
   The hash function is used to transform the key into the index (the  hash) of an  array element (the  slot or  bucket) where the corresponding value is to be sought.
   
  在dict的实现中,Segment,Slot,bucket是三个逐渐逐渐变小的概念,从fetch可以看出它们的关系:
复制代码
fetch(Key, D) ->
    Slot = get_slot(D, Key),
    Bkt = get_bucket(D, Slot),
    try fetch_val(Key, Bkt)
    catch
           badarg -> erlang:error(badarg, [Key, D])
    end.
%% get_slot(Hashdb, Key) -> Slot.
%%  Get the slot.  First hash on the new range, if we hit a bucket
%%  which has not been split use the unsplit buddy bucket.

get_slot(T, Key) ->
    H = erlang:phash(Key, T#dict.maxn),
    if
     H > T#dict.n -> H - T#dict.bso;
     true -> H
    end.

%% get_bucket(Hashdb, Slot) -> Bucket.
get_bucket(T, Slot) -> get_bucket_s(T#dict.segs, Slot).
复制代码

 

     Segment大小固定我们只需要随着数据大小不断修改顶层tuple的size就可以.Segments tuple的最后一个元素是一个空的segment用于后续扩展.Segments每次成倍扩展收缩,这对性能并无很大损害.注意dict对外暴露的接口并没有包含数据实际的位置信息.store/3,append/3,append_list/3,update/3,update/4,update_counter/3这些接口都检查是否需要进行扩展,

filter/2 erase/2会检查是否需要进行收缩.由于dict能够随着数据量的变化动态调整进行缩放,兼顾了内存消耗和访问效率.

复制代码
%% Note: mk_seg/1 must be changed too if seg_size is changed.
-define(seg_size, 16).
-define(max_seg, 32).
-define(expand_load, 5).
-define(contract_load, 3).
-define(exp_size, (?seg_size * ?expand_load)).
-define(con_size, (?seg_size * ?contract_load)).

%% Define a hashtable.  The default values are the standard ones.
-record(dict,
     {size=0                 %元素的数量
     n=?seg_size          % 已经激活的slot数量
     maxn=?seg_size     % 最大slots数
     bso=?seg_size div 2  %最大bucket数散列表中当前允许的最大 bucket 数量,扩张操作需要据此判断是否要增加新的 bucket 区段,初始为 16;
     exp_size=?exp_size    %扩张阈值 初始值为16*5=80
     con_size=?con_size    %收缩阈值 初始值为16*3=48
     empty                :: tuple(),          % Empty segment
     segs                :: tuple()                % Segments 所有的数据存放的地方
     }).
复制代码

   在新建一个dict的时候,Empty会初始化成为一个数据模版

new() ->
    Empty = mk_seg(?seg_size),
    #dict{empty=Empty,segs={Empty}}.

mk_seg(16) -> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}. %%16估计也是测试经验值

 

K-V格式

-define(kv(K,V), [K|V]).               % Key-Value pair format
 
dict的键值存储不是improper list,看下面append_bkt的实现,猜测这样做的目的是把Bag当做一个整体处理.
复制代码
Eshell V5.9.1  (abort with ^G)
1> dict:new().
{dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}
2> dict:store(k,v,v(1)).
{dict,1,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],[],[],[],[],[],[],[],[[k|v]],[],[],[],[]}}}
3> dict:store(k2,v2,v(2)).
{dict,2,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],
        [[k2|v2]],
        [],[],[],[],[],[],[],[],
        [[k|v]],
        [],[],[],[]}}}
4>
复制代码

 

复制代码
%% append_bkt(Key, Val, Bucket) -> {NewBucket,PutCount}.
append_bkt(Key, Val, [?kv(Key,Bag)|Bkt]) -> {[?kv(Key,Bag ++ [Val])|Bkt],0};
append_bkt(Key, Val, [Other|Bkt0]) ->
    {Bkt1,Ic} = append_bkt(Key, Val, Bkt0),
    {[Other|Bkt1],Ic};
append_bkt(Key, Val, []) -> {[?kv(Key,[Val])],1}.

%% app_list_bkt(Key, L, Bucket) -> {NewBucket,PutCount}.

app_list_bkt(Key, L, [?kv(Key,Bag)|Bkt]) -> {[?kv(Key,Bag ++ L)|Bkt],0};
app_list_bkt(Key, L, [Other|Bkt0]) ->
    {Bkt1,Ic} = app_list_bkt(Key, L, Bkt0),
    {[Other|Bkt1],Ic};
app_list_bkt(Key, L, []) -> {[?kv(Key,L)],1}.
复制代码

 

When should you use gb_trees over dicts? Well, it's not a clear decision. As the benchmark module I have written will show, gb_trees and dicts have somewhat similar performances in many respects. However, the benchmark demonstrates that dicts have the best read speeds while the gb_trees tend to be a little quicker on other operations. You can judge based on your own needs which one would be the best.

Oh and also note that while dicts have a fold function, gb_trees don't: they instead have aniterator function, which returns a bit of the tree on which you can call gb_trees:next(Iterator) to get the following values in order. What this means is that you need to write your own recursive functions on top of gb_trees rather than use a generic fold. On the other hand, gb_trees let you have quick access to the smallest and largest elements of the structure with gb_trees:smallest/1and gb_trees:largest/1.

link: http://learnyousomeerlang.com/a-short-visit-to-common-data-structures

相关资料

 
由于接口和orddict一致,所以不再赘述,常用接口可以看下下面的两篇文章
[1] ERLANG DICTIONARY EXAMPLE
 
[2] Working with dictionaries in Erlang
 

 

 

目录
相关文章
|
Shell Perl
erlang 小技巧总结
开一个页面总结一些erlang的使用技巧,随时添加新的技巧。
|
消息中间件 网络协议 Go
[Erlang 0123] Erlang EPMD
 epmd进程和Erlang节点进程如影随形,在Rabbitmq集群,Ejabberd集群,Couchbase集群产品文档中都会有相当多的内容讲epmd,epmd是什么呢?   epmd 是Erlang Port Mapper Daemon的缩写,全称足够明确表达它的功能了(相比之下,OTP就.
2076 0
|
Shell 自然语言处理 网络协议
|
编解码 计算机视觉 存储
|
API 数据建模 计算机视觉