What is Slowly Changing Dimension

A data warehouse dimensional model comprises of facts and measures defined in the context of their dimensions.
Dimensions and their attributes, are relatively constant, but they do change over time.

The term Slowly Changing Dimension (SCD) is about tracking the variation in dimensional attributes over time.
The word slowly in this context might seem incorrect.
For example, a salesperson or store dimension, might change rapidly if a company reorganizes.
But in general, when compared to a measure in a fact table, changes to dimensional data occur slowly.

Within a data warehouse, you can handle slowly changing dimensions in several ways.

Type 0 – The passive method
In this method no special action is performed upon dimensional changes.
Some dimension data can remain the same as it was first time inserted, others may be overwritten.

Type 1 (overwrite):
A Type One change overwrites an existing dimensional attribute with new information.
In the customer name-change example, the new name overwrites the old name, and the value for the old version is lost.
A Type One change updates only the attribute, doesn’t insert new records, and affects no keys.
No history information is stored. Existing data is overridden by new values.

Before the change:

Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate

After the change:

Customer_ID Customer_Name Customer_Type
1 Cust_1 Retail

Type 2 (add a row):
A Type Two change writes a record with the new attribute information and preserves a record of the old dimensional data.
Type Two changes let us preserve historical data.
The history of data changes is preserved.
A new record is inserted each time a change is made.
Every data row has a valid from date and valid to date indicating the time period of the data’s validity, and each row usually has as isCurrent type of field that is set to Yes for the active record with the others set to No.
When a fact table record is inserted, it will be given the appropriate surrogate key of the dimension record.
Since Type Two changes add records, they can increase the database’s size.

Before the change:

Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 31-12-9999 Y

After the change:

Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 17-05-2012 N
2 Cust_1 Retail 18-05-2012 31-12-9999 Y

Type 3 (add a column):
This method traces changes using separate columns (but no new rows).
This means there is a limit to history preservation based on the number of columns in each row that are designated for storing historical data.
For example, a record may have the fields Territory1, Territory1EffectiveDate, Territory2, Territory2EffectiveDate, etc.
Type Three changes is implemented only if we have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name.
Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record.
We can create multiple fields to hold distinct values for separate points in time.
In the case of a name change, you could create an OLD_NAME and NEW_NAME field and a NAME_CHANGE_EFF_DATE field to record when the change occurs.
This method preserves the change.
But how would we handle a second name change, or a third, and so on?
The side effects of this method are increased table size and, more important,
increased complexity of the queries that analyze historical values from these old fields.
After more than a couple of iterations, queries become impossibly complex, and ultimately you’re constrained by the maximum number of attributes allowed on a table.

Before the change:

Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Corporate Corporate

After the change:

Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Retail Corporate

Type 4 – Using historical table.
In this method a separate historical table is used to track all dimension’s attribute historical changes for each of the dimension.

Current table:

Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate

Historical table:

Customer_ID Customer_Name Customer_Type Start_Date End_Date
1 Cust_1 Retail 01-01-2010 21-07-2010
1 Cust_1 Oher 22-07-2010 17-05-2012
1 Cust_1 Corporate 18-05-2012 31-12-9999

Type 6 – Combine approaches of types 1,2,3 (1+2+3=6). In this type we have in dimension table such additional columns as:
current_type – for keeping current value of the attribute. All history records for given item of attribute have the same current value.
historical_type – for keeping historical value of the attribute. All history records for given item of attribute could have different values.
start_date – for keeping start date of ‘effective date’ of attribute’s history.
end_date – for keeping end date of ‘effective date’ of attribute’s history.
current_flag – for keeping information about the most recent record.

In this method to capture attribute change we add a new record as in type 2.
The current_type information is overwritten with the new one as in type 1.
We store the history in a historical_column as in type 3.
The ‘main’ dimension table keeps only the current data e.g. customer and customer_history tables.

Customer_ID Customer_Name Current_Type Historical_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate Retail 01-01-2010 21-07-2010 N
2 Cust_1 Corporate Other 22-07-2010 17-05-2012 N
3 Cust_1 Corporate Corporate 18-05-2012 31-12-9999 Y

References:

SQLMag Reference

Datawarehouse4u Reference

jamesserra blog reference

Advertisements

Data Warehouse Facts Dimensions

There are different definitions for a data warehouse.
Lets start with Bill Inmon, who provided the following:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

Ralph Kimball defines data warehouse:

A data warehouse is a copy of transaction data specifically structured for query and analysis.

This can be also considered as functional view of a data warehouse.

Facts and dimensions form the core of any business intelligence effort.
These tables contain the basic data used to conduct detailed analyses and derive business value.

Fact tables contain the data corresponding to a particular business process.
Each row represents a single event associated with that process and contains the measurement data associated with that event. For example, a retail organization might have fact tables related to customer purchases, customer service telephone calls and product returns. The customer purchases table would likely contain information about the amount of the purchase, any discounts applied and the sales tax paid.

In other words, a fact is a record in the fact table which consists of a number of measures along with multiple dimension keys which anchor the measures in multi dimensional space.

A measure is typically a numeric, additive value which is used to measure the performance of the business. We typically perform a number of aggregations on measures such as SUM, MIN, MAX and AVG. Measures could be split into 3 main categories:

Fully additive – These types of measures can be added across all dimensions. They are the most popular type of measures since we generally retrieve a large number of rows when we query a data warehouse and the most useful thing to do with the measures is to add them up. Examples include Profit and Price Paid.

Semi Additive – These types of measures can only be added across certain dimensions. If we try adding them across for example the Time dimension, the resultant value won’t make any sense. We can however, perform other aggregate functions on these measures such as AVG, MIN and Max. Examples include Account Balance and Inventory Level.

Non Additive – These types of measures cannot be added across any dimensions. These are typically the result of some mathematical calculation. Examples include Ratio and Percentage.

Fact Table Grain
When designing a fact table, developers must pay careful attention to the grain of the table — the level of detail contained within the table.

The developer designing the purchase fact table described above would need to decide, for example, whether the grain of the table is a customer transaction or an individual item purchase. In the case of an individual item purchase grain, each customer transaction would generate multiple fact table entries, corresponding to each item purchased.

Within a fact table, only facts consistent with the declared grain are allowed. For example, in a retail sales transaction, the quantity of a product sold and its extended price are good facts, whereas the store manager’s salary is disallowed.

As per Kimball, the four key decisions made during the design of a dimensional model include:
1.Select the business process.
2.Declare the grain.
3.Identify the dimensions.
4.Identify the facts.

What are Dimensions?
Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects. For example, in the retail scenario, we discussed that purchases, returns and calls are facts. On the other hand, customers, employees, items and stores are dimensions and should be contained in dimension tables.

Dimension tables contain details about each instance of an object. For example, the items dimension table would contain a record for each item sold in the store. It might include information such as the cost of the item, the supplier, color, sizes, and similar data.

As per Kimball, dimensions provide the “who, what, where, when, why, and how” context surrounding a business process event. Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts. With the grain of a fact table firmly in mind, all the possible dimensions can be identified. Whenever possible, a dimension should be single valued when associated with a given fact row.

Using dimensions, users can slice and dice the data in various ways. Every dimension in a data warehouse is generally made up of the following:

A Surrogate Key – these replace the production keys. They are used to uniquely identify the rows within each dimension. Dimension tables are joined to the fact table through these keys.

Informational Attributes – These are the attribute which are simply added to a dimension for informational purposes. Typically, the history of these attributes doesn’t need to be maintained. Examples of informational attributes within a Customer dimension include Customer Name and Customer Email.

Analytical Attributes – These are the attributes which we use to analyse (group and sub set) the data by. The history of these attributes needs to be preserved. Therefore, if an attribute value changes with time, we mustn’t simply over write the respective value. There are a number of techniques which could be employed in order to handle changing dimension attribute values. Such techniques are known as SCD (Slowly Changing Dimensions).

Thanks to following references:

Reference 1

Reference 2

Reference 3