第五课|学习笔记-阿里云开发者社区

开发者学堂课程【高校精品课-西安交通大学-数据库理论与技术：第五课】学习笔记，与课程紧密联系，让用户快速学习知识。

课程地址：https://developer.aliyun.com/learning/course/12/detail/15861

第五课

内容介绍

一、Data Guides

二、Existence of Multiple DataGuides

三、Annotations

四、Strong Data Guides

五、Mapping Long Records to Pages

六、Allocation and Free-Space Management

七、文件组织

八、文件中的记录组织

一、Data Guides

首先将前面剩下的内容讲完，有这样一个定义， OEM 源对象 s 的数据向导是 OEM 对象 d ，使得 s 的每个标签路径在 d 中恰好有一个数据路径实例，并且 d 的每个标签路径是 s 的标签路径。这也是Data Guide的一个模式模式图，如图所示：

图1：

图片91.png

图2：

图片90.png

图2显示了图1所示的源OEM数据库的数据指南。

图1中的1、2、3、4都是根结点，2、3结点共用8结点，中间结点是没有原始值的。模式图将所有相同的路径都取消了，即压缩了（图2所示）。在 Data Guide 里只有数据路径，没有标记路径，到达节点后，可以发现它没有原子值了。 Data Guide 就是用于结构汇总，将复杂的路径压缩简化。

使用 Data Guide ，我们可以通过考虑数据向导中最多 n 个对象来检查长度为 n 的给定标签路径是否存在于原始数据库中。假如写一个查询路径，这个路径在原本的数据库中占比较大，当我们查询 F 是否存在时，在原数据库中可能较为繁琐，而在 Data Guide 里检查，可以很快的看到结果，提高了查询速度。

例如，在图2中，我们只需要检查对象12和13的外出边缘，用来验证路径餐馆。数据库中存在所有者。 Data Guide 的结构信息可以更好的帮助查询处理。类似地，如果我们遍历数据指南中标签路径的单个实例，并到达一些对象 O ，则 O 的输出边上的标签表示可能跟随源数据库所有可能的标签。例如，在图2中，对象13的五个不同的带标签的外出边，它们表示源中所有可能的跟随餐馆的标签。

需要注意的是，在 Data Guide 的叶结点上不包含原子值，这是它与普通数据图的区别，普通数据图的叶结点是有原子值的。由于 Data Guide 旨在反映数据库的结构，因此原子值是不必要的。由于 Data Guide 中的任何标签路径都只有一个数据路径实例，因此目标集合只包含一个对象，即该数据路径中的最后一个对象。 Data Guide 没有原子值，这样就可以在叶节点上存放一些特殊的信息，可以更好地帮助查询和产品优化。

这里要注意的是， Data Guide 中的每个目标集合都是单例集合，单例集合里只有一个元素。

Data Guide 也是有理论基础的，可以在1997年的一篇论文中找到。那篇论文证明了创造一个源数据库上的 Data Guide ，等效于非确定性有限自动机（ NFA ）转化到确定性有限自动机（ DFA ）的有效时间，它存在于自动机理论里，是一个被充分研究的问题。

二、Existence of Multiple DataGuides

From automata theory, we know that a single NFA may have many equivalent DFAs

Similarly, as shown in Figure 3,one OEM source database may have multiple DataGuides.Figures 3(b) and (c) are both DataGuides of the source in Figure 3(a).

图片89.png

Each label path in the source appears exactly once in each DataGuide, and neither DataGuide introduces any label paths that do not exist in the source.

Figure 3(c) is in fact minimal: the smallest possible DataGuide.( Well-known state minimization algorithms can be used to convert any DataGuide into a minimal one [Hop71].

Given the existence of multiple DataGuides for a source, it is important to decide what kind of DataGuide should be built and maintained in a semistructured database system

Intuitively, a minimal DataGuide might seem desirable, furthering our goal of having as concise a summary as possible; But, as we now explain,a minimal DataGuide is not always best.

三、Annotations

First, incremental maintenance of a minimal DataGuide can be very difficult.

For example,in Figure 3(a), suppose we add a new child object to 10,via the label E.

To correctly reflect this source insertion in Figure 3(b), we simply add a new object via label E to object 17.

图片88.png

But to reflect the same insertion in the minimal DataGuide in Figure 3(c), we must do more work in order to somehow generate the same DataGuide as our updated version of Figure 3(b) ,since it now is the minimal DataGuide for the source.

In general, maintaining a minimal DataGuide in response to a source update may require much of the original database to be reexamined.

Beyond using a DataGuide to summarize the structure of a source, we may wish to keep additional information in a DataGuide.

For example, consider a source with a label path l.To aid query formulation, we might want to present to a user sample database values that are reachable via 1.(Such a feature is very useful in OEM , since there are no constraints on the type or format of atomic data.)

As another example, we may wish to provide the user or the query processor with the statistical odds than an object reachable via l has any outgoing edges with a specific label.

Finally, for query processing, direct access through the DataGuide to all objects reachable via l can be very useful,The following definition classifies all of these examples.

Definition 6: In a source database sgiven a label path l, a property of the set of objects that comprise the target set of l in s is said to be an annotation of l.

That is, an annotation of a label path is a statement about the set of obiects in the database reachable by that path.

A DataGuide should quarantees that each source label path l reaches exactly one object o in the DataGuide.Object o seems like an ideal place to store annotations for l,since we can access all annotations of l simply by traversing the DataGuide's single data path instance of l.

Unfortunately, nothing in our definition of a DataGuide prevents multiple label paths from reaching the same object in a DataGuide,even if the label paths have different target sets in the source.

Referring to Figure 3(c), we see that label paths A.C and B.C both reach the same object.Thus,if we store an annotation on object 20, we cannot know if the annotation applies to label path A.C ,label path B.C, or both.

In the DataGuide in Figure 3(b), however, we have two distinct objects for the two label paths, so we can correctly separate the annotations.

Next, we formalize DataGuide characteristics that enable unambiguous annotation storage.

四、Strong Data Guides

We define a class of DataGuides that supports annotations as described in the previous subsection.

Intuitively , we are interested in DataGuides where each set of label paths that share the same ( singleton ) target set in the DataGuide is exactly the set of label paths that share the same target set in the source.

Definition 7.:Consider OEM objects s and d where d is a DataGuide for a sources.Given a label path l of slet Ts(l) be the target set of l in s, and let Td(l) be the ( singleton ) target set of l in d. Let Ls(l) = { m | Ts(m) =Ts(l)}. That is, Ls(l) is the set of all label paths in s that share the same target set as l.Similarly,let Ld(l)={ m | Td(m)=Td(l) }. That is, Ld(l) is the set of all label paths in d that share the same target set as l. If, for all label paths l of s, Ls(l) = Ld(l), then d is a strong DataGuide for s.

For example , Figure 3(c) is not a strong DataGuide for Figure 3(a). The source target set Ts(B.C)is {6,7},and the DataGuide target set Td(B.C) is {20}.

In the source, Ls(B.C) is {B.C}, since no other source label paths have the same target set.

In the DataGuide(c),however,Ld(B.C)is{B.C,A.C}. SinceLs(B.C)# Ld(B.C), the DataGuide is not strong.The reader may verify that Figure 3(b)is in fact a strong DateGuide

Next, we prove that a strong DataGuide is sufficient for storage of annotations.

Theorem 1:Suppose d is a strong DataGuide for a source s. lf an annotation p of some label path l is stored on the object o reachable via l in d,then p describes the target set in s of each label path that reaches o.

We also prove that a strong DataGuide induces a straightforward one-to-one correspondence between source target sets and DataGuide objects.This property is useful for incremental maintenance and query processing.

Theorem 2:Suppose d is a strong DataGuide for a source s.Given any target sett of s,t is by definition the target set of some label path l.Compute Td(l),the target set of l in d,which has a single element o.Let F describe this procedure, which takes a source target set as input and yields a DataGuide object as output. In a strong DataGuide,F induces a one-to-one correspondence between source target sets and DataGuide objects.

五、Mapping Long Records to Page

1.Long Pages

（1）Allows long pages (up to 64K bytes).

（2）Internally, long records are managed just as short ones.

（3）Simple solution, but inadequate for very large records.

（4）Used in DB2 and Oracle.

2.Separate Pages for Long Fields

（1）Instead of keeping entire long field in a record.

（2）maintains only a pointer to the long field stored in another page.

（3）The main record is then treated as a short tuple.

（4）Good results if long attributes are accessed infrequently.

（5）Unfortunately, performance suffers in situations when ong attributes are accessed frequently.

图片87.png

六、Allocation and Free-Space Management

1.Free-Space Information Pages

Advantages:

（1）FSIPs need not always be updated during allocation and deallocation.

（2）Deallocations are relatively fast and simple

（3）Provides good storage utilization and reclamation(回收) of free and unused space.

Disadvantages:

Allocations may require several disk accesses.

部分习题如下：

图片86.png

图片85.png

七、文件组织

文件组织是数据库保存为若干文件的集合。其中文件是记录的序列，记录是字段的序列。记录方法有定长记录、变长记录、指针法。

1.定长记录

简单方法：

（1）记录 i 从字节 n * ( i - 1 ) 开始存储,其中 n 是记录大小。

（2）记录存取简单，但记录可能跨块。唯一的变化是不允许记录跨越块边界。

图片84.png

删除记录 i 可选则的方法:

（1）将记录 i+1 ,...， n 前移成 i ...， n-1 。

（2）将记录 n 前移成 i 。

（3）不移动记录，而是将所有自由记录链成一条自由链表。

2.自由链表

自由链表是指在文件头中存储第一条被删记录的地址，然后在第一条被删记录中，存储第二条被删记录的地址，等等。这些存储的地址可视为指针，因为它们“指向”记录的位置。它是更有效利用空间的表示法，可以重新利用自由记录的正常字段空间来存储指针（使用中的记录则不存储指针）。

图片83.png

3.变长记录

数据库系统中在多种情况下需要变长记录。一个文件中需要存储多种记录类型，允许一个或多个变长字段的记录类型，允许重复字段的记录类型，以上几种情况都需要用到变长记录。

它的方法有预留空间法、字节串表示法、指针法、 Slotted Page 结构.

（1）预留空间法

它是利用具有已知最大长度的定长记录，较小记录中的未用空间用 null 或 end - of - record 符号填充。

图片82.png

（2）字节串表示法

它在每条记录末尾附加控制字符 end - of - record ( L )，缺点是难以删除或增长。

图片81.png

（3）指针法

变长记录可用若干通过指针念在一起的定长记录表示，即使是最大记录，长度未知时也可以使用。缺点是除了链中首记录，其他所有记录中都可能有空间浪费。解决方法是文件中采用两种块，锚块和溢出块，锚块包含链中的首记录，溢出块包含链中其他记录。

图片80.png

解决方法是文件中采用两种块，锚块和溢出块，锚块包含链中的首记录，溢出块包含链中其他记录。

图片79.png

（4） Slotted Page 结构

Slotted page 页头包含记录登记项数目、块中自由空间末端、每条记录的位置和大小。记录可以在页内移动以便保持记录间的连续存储，但页头中的登记项必须更新。指针不能直接指向记录，而是指向该记录在页头中的登记项。

图片78.png

八、文件中的记录组织

文件中的记录组织有堆、顺序、散列、聚簇、动机。堆，记录可至于文件中的任何有空间的地方。顺序，记录按顺序存储，基于每条记录在搜索键上的值。散列，对记录的某些属性 k 计算散列函数 h（ k ），计算结果决定该记录应该置于文件的哪个块中。每一个关系的记录可存储在单独的文件中，但在聚簇文件组织下，多个关系的记录可存储于同一文件中。动机，相关联的记录存储在同一块中可减少磁盘 I/O 。

1.顺序文件组织

它适用于需要顺序处理整个文件的应用，文件中的记录按搜索键排序。

图片77.png

删除时可利用指针链。插入时需要确定带插入记录的位置。若有自由空间，则在该处插入，若无自由空间，则插入到溢出块，两种情况都有，则需要更新指针链。需要不时重组文件，以恢复物理存储的顺序。

2.聚簇文件组织

简单的文件结构将每个关系存储在一个单独的文件中，利用聚簇文件组织可以将多个关系存储在一个文件中。

图片92.png

有利于涉及 depositor 和 customer 的查询，以及涉及单个客户和他的账户的查询，对只涉及 customer 的查询不好(可用指针链，见上图)，容易导致变长记录。

第五课|学习笔记