The star schema, at its simplest, consists of a fact table surrounded by a number of associated dimension tables. The visual representation of this derives the name.
The fact table contains the facts, i.e. the central data of the warehouse (typically numeric), along with keys to the various dimension tables. A simple data model with one fact table and three dimension tables is shown below.
The keys in the dimension tables are called “Primary keys”; they uniquely identify each row. The same keys in the fact table are known as “Foreign keys”; they do not necessarily uniquely identify each row. Often the fact table will have an extra column called the “Compound Key”. This will contain a combination of the dimension keys, making it unique to each row of the fact table.
Consider the situation whereby the supplier named Beta (the supplier of both transactions in the fact table) moved its location sometime between August and October 2009, i.e. between the first and the second transaction. We have a slowly changing dimension; we can reasonably expect our suppliers to occasionally change their location. There are a number of standard patterns for dealing with this in our data model. The most commonly used are Type 0 through to Type 3.
SCD Type 0 – The null option, i.e. do nothing. Data models that adopt SCD Type 0 simply retain the values as they were at the time that they were loaded. Thus, SCD Type 0 provides little or no control over your historic information.
SCD Type 1 – Using this option you will overwrite old data with new, so your dimension tables always contain the most up-to-date information. Clearly there is no historic record in this case. In our example we would overwrite Beta’s record in the supplier dimension with the new location, thereby causing the earlier transaction’s supplier reference to specify the supplier’s current location rather than their location at the time that the August transaction occurred.
SCD Type 2 – With Type 2 SCDs we track changes by adding rows to the dimension table for each change. Typically we add start and end columns to allow us to track the periods in which the values were in use. Leaving the end date as missing, or as a special value such as 31-Dec2999, highlights the current value. Using SCD Type 2 our supplier dimension table might look as follows.
SUPPLIER_ID NAME COUNTRY REGION Start End
1 Alpha England S.East 1-Jan-2000
2 Beta England N.West 1-Jan-2000 10-Sep-2009
3 Beta England N.East 10-Sep-2009
Notice how each supplier has one (only one) row with a missing end-date. These rows indicate the current values. With the above supplier table, the fact table would look as follows (using foreign supplier key values of 2 and 3).
SUPPLIER_ID DATE_ID PRODUCT_ID Price Profit
2 1 2 4.99 1.25
3 2 1 10.50 3.50
SCD Type 3 – Finally, we have a type that records a subset of historical values. By creating one column in the dimension table for each historical (and current) value that we want to keep, we can retain the values in a denormalised fashion. See the (simplified) supplier table below.
SUPPLIER_ID NAME REGION_DATE FIRST_REGION CURRENT_REGION …
1 Alpha 1-Jan-2000 S.East S.East
2 Beta 10-Sep-2009 N.West N.East
In summary, you need to consider the volatility of your dimensional values when designing your data model. Include the design pattern that is appropriate for your data.
In a subsequent post I’ll talk about business keys and DI Studio transforms that support star schemas and slowly changing dimensions. As is so often the case, it’s a lot easier to incorporate the correct design from the beginning than it is to reengineer your work later in the process.
For further reading, consider:
- SAS(R) Data Integration Studio 3.4: User's Guide, Working with Slowly Changing Dimensions…
- What’s new in DI Studio 4.2, Eric Hunley and Nancy Rausch, SGF 2009
- Stars and Models – How to Build and Maintain Star Schemas, Nancy Rausch, SUGI 31 (SAS 9.1)