Pages

Monday, 30 November 2009

NOTE: Star Schemas and Slowly Changing Dimensions in Data Warehouses

Most data warehouses include some kind of star schema in their data model. If the dimensional data in the warehouse is likely to change over time, i.e. you have a slowly changing dimension, then you should consider this whilst designing your data model. This article provides a simple introduction to star schemas and slowly changing dimensions (SCDs). These terms are important if you are to understand some of the more powerful transforms in DI Studio.

The star schema, at its simplest, consists of a fact table surrounded by a number of associated dimension tables. The visual representation of this derives the name.

The fact table contains the facts, i.e. the central data of the warehouse (typically numeric), along with keys to the various dimension tables. A simple data model with one fact table and three dimension tables is shown below.

The keys in the dimension tables are called “Primary keys”; they uniquely identify each row. The same keys in the fact table are known as “Foreign keys”; they do not necessarily uniquely identify each row. Often the fact table will have an extra column called the “Compound Key”. This will contain a combination of the dimension keys, making it unique to each row of the fact table.

Consider the situation whereby the supplier named Beta (the supplier of both transactions in the fact table) moved its location sometime between August and October 2009, i.e. between the first and the second transaction. We have a slowly changing dimension; we can reasonably expect our suppliers to occasionally change their location. There are a number of standard patterns for dealing with this in our data model. The most commonly used are Type 0 through to Type 3.

SCD Type 0 – The null option, i.e. do nothing. Data models that adopt SCD Type 0 simply retain the values as they were at the time that they were loaded. Thus, SCD Type 0 provides little or no control over your historic information.

SCD Type 1 – Using this option you will overwrite old data with new, so your dimension tables always contain the most up-to-date information. Clearly there is no historic record in this case. In our example we would overwrite Beta’s record in the supplier dimension with the new location, thereby causing the earlier transaction’s supplier reference to specify the supplier’s current location rather than their location at the time that the August transaction occurred.

SCD Type 2 – With Type 2 SCDs we track changes by adding rows to the dimension table for each change. Typically we add start and end columns to allow us to track the periods in which the values were in use. Leaving the end date as missing, or as a special value such as 31-Dec2999, highlights the current value. Using SCD Type 2 our supplier dimension table might look as follows.

SUPPLIER_ID  NAME   COUNTRY  REGION  Start       End
1            Alpha  England  S.East  1-Jan-2000
2            Beta   England  N.West  1-Jan-2000  10-Sep-2009
3            Beta   England  N.East  10-Sep-2009

Notice how each supplier has one (only one) row with a missing end-date. These rows indicate the current values. With the above supplier table, the fact table would look as follows (using foreign supplier key values of 2 and 3).

SUPPLIER_ID  DATE_ID  PRODUCT_ID  Price  Profit
2            1        2            4.99  1.25
3            2        1           10.50  3.50

SCD Type 3 – Finally, we have a type that records a subset of historical values. By creating one column in the dimension table for each historical (and current) value that we want to keep, we can retain the values in a denormalised fashion. See the (simplified) supplier table below.

SUPPLIER_ID  NAME   REGION_DATE  FIRST_REGION  CURRENT_REGION …
1            Alpha   1-Jan-2000  S.East        S.East
2            Beta   10-Sep-2009  N.West        N.East

In summary, you need to consider the volatility of your dimensional values when designing your data model. Include the design pattern that is appropriate for your data.

In a subsequent post I’ll talk about business keys and DI Studio transforms that support star schemas and slowly changing dimensions. As is so often the case, it’s a lot easier to incorporate the correct design from the beginning than it is to reengineer your work later in the process.

For further reading, consider: