Concept of Key in Data Warehouse

简介:

Keys and history
In a star schema, each dimension table is given a surrogate key. This column is a unique identifier, created exclusively for the data warehouse. Surrogate keys are assigned and maintained as part of the process that loads the star schema. The surrogate key has no intrinsic meaning; it is typically an integer. Surrogate keys are sometimes referred to as warehouse keys. The surrogate key is the primary key of the dimension table.
Figure 1-5 A simple star schema for the orders process

In this article, surrogate keys will be easily identifiable by the suffix “_key” in the column name. In Figure 1-5, for example, the surrogate key for the customer dimension is called customer_key, the surrogate key for the salesperson dimension is called salesperson_key, and so forth. Illustrations in this book will always list the surrogate key for a dimension table as its first attribute.
Dimension tables also contain key columns that uniquely identify something in an operational system.Examples in Figure 1-5 include customer_id, sku, and salesperson_id. In the operational systems, these columns identify specific customers, products, and salespeople, respectively. These key columns are referred to as natural keys.
The separation of surrogate keys and natural keys allows the data warehouse to track changes, even if the originating operational system does not. For example, suppose that customer ABC Wholesalers is identified by customer_id 10711 in an operational system.
If the customer changes its headquarters location, the operational system may simply overwrite the address for customer_id 10711. For analytic purposes, however, it may be useful to track the history of ABC Wholesalers. Since the star schema does not rely on customer_id to identify a unique row in the customer dimension, it is possible to store multiple versions of ABC Wholesalers, even though both have the same customer_id—10711. The two versions can be distinguished by different surrogate key values. While it would also be possible to support change tracking by supplementing a natural key with a sequence number, the surrogate key allows fact and dimension tables to be joined based on a single column.
Surrogate Keys and Natural Keys
In the star schema in Figure, surrogate keys for each dimension table are labeled “SK.” Separate and distinct from surrogate keys, one or more natural keys will also be present in most dimension tables.

The natural keys are identifiers carried over from source systems. They may not uniquely identify a row in the data warehouse, but they do identify a corresponding entity in the source system. In above Figure, natural key columns are designated “NK.”
Unlike surrogate key values, the values in natural key columns may have meaning to users of the data warehouse. Even when they do not carry significant meaning, their presence is necessary for the ETL routines that load fact tables.
Sometimes, the natural key for a dimension table consists of more than one column.This occurs when the source system uses a multi-part key to identify the entity. For example,a purchasing contract may be identified by a type code and sequence number. When more than one system can be the source for a dimension, the natural key may be composed of the identifier from the source system and an additional identifier that indicates which source it came from. For example, a bank may have more than one system for deposit accounts after acquiring another bank. The natural key for a customer dimension might,therefore, consist of an identifier used in a source system, in conjunction with a column that indicates the system from which the identifier came.
The use of surrogate keys as unique identifiers allows the data warehouse to respond to changes in source data in whatever manner best fits analytic requirements. Because the dimension table does not rely on the natural key to identify unique rows, it can maintain history even if the source system does not. For example, an order entry system might contain a record for customer_id 404777, which includes the customer’s address. If the system overwrites the address when a customer moves, it is not tracking history. Were the
customer dimension table to use the same customer_id to identify unique rows, it would be able to store only one row for customer_id 404777. It would be unable to maintain the history of the address. By using a surrogate key, it becomes possible to maintain two versions of customer_id 404777. This technique is known as a type 2 slow change.
A possible alternative to the use of a surrogate key is to supplement the natural key with a sequence number.For example, the primary key of the customer dimension table might consist of the customer_id and a version_number column that contains a sequence number. Like the use of surrogate keys, this technique permits the data warehouse to track history independently of the source system, allowing the table to store multiple versions of a customer. This approach provides no value in simplifying the schema design or load
process, however, which must still identify and maintain version history. More importantly,this technique requires multi-part foreign keys to be maintained in the fact table. If customers are identified by customer_id and sequence_no, this pair of columns must be present in order_facts as well. This multi-column foreign key complicates the join process, makes SQL more difficult to read, and in some cases may befuddle efforts to optimize the RDBMS for star join query execution.
Another theoretical alternative to the use of a surrogate key is to supplement a natural key with time stamps.While time stamps may be useful, a multi-part foreign key would still be required in fact tables, potentially leading to the same difficulties as the sequence number. In addition, a designer may be tempted to eliminate multi-column joins by storing only the natural key in the fact table; however, this severely complicates queries and risks error. For example, assume that customer_id 404777 has moved, and therefore has two
rows in the dimension table. Each fact table row contains only the natural key 404777. To identify which version of the customer corresponds to each fact table row, it is necessary to compare order date with the time stamps in the dimension table. This process can be onerous, particularly if one is constructing a report that aggregates a large number of orders. It also becomes difficult for database administrators to tune the system, preventing them, for example, from declaring a foreign key in the fact table and potentially leading to
poor query optimization. Worse, if the date qualification is omitted, facts associated with customer_id 404777 will be double-counted.
It is not common practice to use version numbers or time stamps as part of a unique identifier. Surrogate keys simplify the schema design and allow for clean, single-column joins. Time stamps are frequently included in dimension tables, but not as part of the unique identifier.

 

参考至:《Star Schema The Complete Reference》

如有错误,欢迎指正

邮箱:czmcj@163.com

作者:czmmiao  文章出处:http://czmmiao.iteye.com/blog/2199475
相关文章
|
机器学习/深度学习 弹性计算 运维
ECS阿里云监控服务
ECS阿里云监控服务
371 3
|
运维 Prometheus 监控
自动化运维的魔法:使用Python脚本简化日常任务
【8月更文挑战第50天】在数字化时代的浪潮中,自动化运维成为提升效率、减少人为错误的利器。本文将通过一个实际案例,展示如何利用Python脚本实现自动化部署和监控,从而让运维工作变得更加轻松和高效。我们将一起探索代码的力量,解锁自动化运维的神秘面纱,让你的工作环境焕然一新。
277 81
|
10月前
|
API
周公解梦免费API接口教程
该接口提供数万个解梦数据,支持通过用户ID、用户KEY和关键词查询梦的解析。支持POST/GET请求,返回状态码、消息内容、结果集等信息。示例请求地址:https://cn.apihz.cn/api/mingli/zhougong.php?id=88888888&key=88888888&word=捞鱼。返回数据包括梦的标题、内容等。
|
存储 弹性计算 固态存储
阿里云企业级云服务器价格,企业级云服务器活动价格与收费标准参考
阿里云企业级云服务器多少钱?阿里云服务器有多种实例分类,其中通用型、计算型、内存型、通用算力型、大数据型、本地SSD、高主频型和增强型均属于企业级云服务器,目前在阿里云活动中,通用算力型、通用型、计算型均有优惠,下面是阿里云企业级云服务器价格表,包含活动报价表和最新收费标准价格表。
679 0
阿里云企业级云服务器价格,企业级云服务器活动价格与收费标准参考
|
存储 NoSQL 算法
如何借助Redis更高效统计UV?——Hyperloglog篇
Redis的HyperLogLog数据类型是用于近似计算大规模数据集中不重复元素基数的工具,它以低空间开销(约12KB)提供高精度的估算(误差率约0.81%)。通过`pfadd`添加元素,`pfcount`统计数量,`pfmerge`合并多个HyperLogLog,实现去重计数。尽管内部存储为字符串,但它是概率数据结构,适合高效UV统计和其他大数据场景。
195 0
|
存储 SQL NoSQL
深入理解数据库管理系统(DBMS)及其在现代应用中的重要性
一、引言 随着信息技术的飞速发展,数据已成为现代社会中不可或缺的资源
1419 3
|
机器学习/深度学习 算法
【机器学习】十大算法之一 “随机森林”
随机森林算法(Random Forest, RF)是由Leo Breiman和Adele Cutler于2001年提出的一种集成学习(Ensemble Learning)算法。它是由多个决策树构成的分类器,通过对每个决策树的投票结果来确定最终的预测结果。随机森林算法可以用于分类和回归分析。在分类问题中,每个决策树的输出结果为一个类别标签,通过投票来确定样本所属的类别。在回归问题中,每个决策树的输出结果为一个连续值,取所有决策树输出结果的平均值作为最终结果。可以处理高维度数据;可以处理不平衡的数据集。
1857 1
【机器学习】十大算法之一 “随机森林”
|
JSON 数据安全/隐私保护 开发者
uos系统如何设置开发者模式并获取和使用root权限
uos系统如何设置开发者模式并获取和使用root权限
1757 0
|
缓存 负载均衡 监控
Dubbo介绍、原理
Dubbo是一个分布式服务框架,以及SOA治理方案。其功能主要包括:高性能NIO通讯及多协议集成,服务动态寻址与路由,软负载均衡与容错,依赖分析与降级等。
Dubbo介绍、原理
|
存储 芯片
内存条和主板的关系是什么?底层原理是什么?
内存条和主板的关系是什么?底层原理是什么?
1514 0

热门文章

最新文章