The Hybrid Response
When requirements call for changed data to result in both type 1 and type 2 behavior, the solution is disarmingly simple: provide for both. This cannot be achieved with a single attribute, but it can be achieved by providing a pair of attributes. Each attribute has the same source, but one will react to changes as a type 1 response and the other as type 2. This solution is known as a type 1/2 hybrid. It’s a good idea to name these attributes clearly, since each serves a different function. Use the suffix _current to identify the type 1 attribute, and _historic to identify the type 2 attribute. When the source for the pair of attributes changes, the ETL process must do two things:
- Update the _current value for all records that share the same natural key.
- Insert a new row, initializing both _current and _historic values to the new value.
When reporting on the data, developers choose which column to use based on the requirements for the report.
- Use the _current column to group all facts under the current values.
- Use the _historic column to group all facts under the historic values.
A Type 1/2 Hybrid Response in Action
A hybrid design for handling changes in company names is illustrated in Below figure. The customer dimension in the figure records company names in two columns:
- company_name_current will capture the current name of the company. This attribute will exhibit type 1 behavior. When a company name changes, it will be updated.
- company_name_historic will capture the detailed history of company names. This attribute will exhibit type 2 behavior. When a company name changes, a new row will be created to capture the new value. The new row will be associated with any subsequent transactions added to a fact table.
Processing a Change
Suppose one of the companies in the customer table is Apple Computer. You may recall that in 2007 the company changed its name from “Apple Computer, Inc.” to “Apple Inc.” Imagine that products have been ordered by Apple, both before and after the name change. Below figure illustrates the two-step process that occurs when the name of the company changes.
The top of the figure shows the row for Apple that was in place prior to the name change. The natural key for Apple is BB770. It has one row in the customer table, with a surrogate key value of 1011. The column company_name_current shows the name of the company as “Apple Computer, Inc.” There have been no changes as of yet, so company_name_historic contains the same value. The lower half of Figure shows what happens when the name of the company changes:
- The first row shows the previously existing record, with surrogate key value 1011. The company_name_current column is updated with the new company name “Apple Inc.” This is the type 1 response. The company_name_historic value is untouched, since it is treated as type 2.
- To process the type 2 change, a new row is added. It contains the new name “Apple Inc.” in both the company_name_current and company_name_historic columns.
This new row is assigned a new surrogate key, 2822, which will be used for new rows in associated fact tables.
This may appear convoluted, but it is no different from processing any table that contains both type 1 and type 2 attributes. The only difference is that both attributes happen to have the same source.
Using the Company Name Columns
Facts can now be studied in two ways. When you want to be able to capture all orders under the new name (Apple Inc.), you can use the column company_name_current. This might be done to filter a query or to group aggregated facts. All activity with Apple is associated with the same name: Apple Inc. This is particularly useful for producing this year versus last year comparisons; all orders will have the same name, even if it recently changed. When you want to group orders under historically accurate company names, you can use the company_name_historic. Transactions that took place before Apple’s name change are grouped under “Apple Computer, Inc.” Transactions that took place after the name
change are grouped under “Apple Inc.”Each time a company changes,the same two-step process is followed. First, all existing rows with the same natural key have their _current value updated. Second, a new row is added, with _current and _historic initialized to the new value.
Suppose that Apple decides to change its name again. The iPod and iPhone have been
so successful they might rename the company “iApple Inc.” Below figure shows how this
change would be processed.
First, the existing records in the dimension table for Apple, which is company BB770, are updated with the new company_name_current. This time around, there are two records to be updated: those identified by surrogate keys 1011 and 2822. That takes care of the type 1 processing.
Second, a new row is added to the table for iApple. This row records the new name, iApple Inc.,in both the company_name_current and company_name_historic columns.
That takes care of the type 2 processing. This new row has surrogate key value 3100. Any new orders from iApple will use this key value in the fact table.
This process can be repeated as many times as the name of a given company changes.
It will always be possible to use the current name of the company to study all facts by using company_name_current. It will always be possible to study all facts with historically accurate company names by using company_name_historic.
Evaluating and Extending the Hybrid Approach
The hybrid approach addresses conflicting slow change requirements at the expense of
understandability. Hybrid solutions generate no end of confusion among end users and
analysts, who are often uncertain about which column to use in a given report. When the
wrong column is used for a situation, the results can be perplexing.
It is, therefore, important to use hybrid solutions judiciously. Evaluate the requirements carefully. In the case of the company name changes, for example, why do some people want access to the historically accurate name? If it is so they can reproduce a historic invoice, that purpose is better served by the operational system. On the other hand, if it is needed for certain forms of statutory reporting, the requirement may be valid.
When a hybrid approach is implemented, most analysis tends to focus on one of the columns. In the company example, the column of primary interest contains the current company name. To simplify the model exposed to end users, you can create a view of the customer dimension that hides the historic version. Developers trained in the complexities of the star can be provided with the full version of the table, for use in creating canned reports.
In extremely rare situations, you may find requirements that call for a type 1/2/3 hybrid. In this case, the attribute in question will be represented by three columns: the current value, the previous value, and the historic value. This situation will compound the confusion of users trying to choose which column to show in a report. Again, you will want to try to insulate end users from this complexity, exposing them only to the version required most often.
At this point, the reader may be thinking about other ways to combine these techniques. There is no need to describe a type 1/3 hybrid, because a type 3 change already includes a column that exhibits type 1 behavior. For the same reason, a type 2/3 hybrid is no different from the 1/2/3 hybrid described in the previous paragraph. It is possible to incorporate time stamps with any hybrid technique that involves a type 2 component. Remember that complexity increases the ETL workload and reduces usability; be sure that any solution is truly warranted by the requirements.
参考至:《Star Schema The Complete Reference》
本文原创,转载请注明出处、作者
如有错误,欢迎指正
邮箱:czmcj@163.com