There are different definitions for a data warehouse.
Let's start with Bill Inmon, who provided the following:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even further back. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
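The "integrated", "time-variant" and "non-volatile" properties can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the source systems A and B, their identifiers, the product_key_map cross-reference, and the customer addresses. It is not a real integration process, just one way to picture the ideas.

```python
from datetime import date

# Integrated: two hypothetical sources identify the same product differently.
source_a = [{"sku": "A-1001", "name": "Blue Mug"}]
source_b = [{"item_code": 77, "title": "Blue Mug"}]

# Assumed cross-reference: (source system, source identifier) -> warehouse key.
product_key_map = {
    ("A", "A-1001"): 1,
    ("B", "77"): 1,
}

def warehouse_product_key(source, source_id):
    """Return the single warehouse key for a product from any source."""
    return product_key_map[(source, str(source_id))]

key_from_a = warehouse_product_key("A", source_a[0]["sku"])
key_from_b = warehouse_product_key("B", source_b[0]["item_code"])

# Time-variant and non-volatile: address rows are only appended, never updated.
address_history = []

def record_address(customer_id, address, valid_from):
    """Append a new address version; existing rows are left untouched."""
    address_history.append(
        {"customer_id": customer_id, "address": address, "valid_from": valid_from}
    )

def address_as_of(customer_id, as_of):
    """Return the address that was valid for the customer on the given date."""
    rows = [
        r for r in address_history
        if r["customer_id"] == customer_id and r["valid_from"] <= as_of
    ]
    return max(rows, key=lambda r: r["valid_from"])["address"] if rows else None

record_address(42, "1 Old Street", date(2020, 1, 1))
record_address(42, "9 New Avenue", date(2023, 6, 1))

print(key_from_a == key_from_b)              # True
print(address_as_of(42, date(2021, 1, 1)))   # 1 Old Street
print(address_as_of(42, date(2024, 1, 1)))   # 9 New Avenue
```

Because the older address row is never overwritten, a query "as of" an earlier date still returns the value that was true at that time.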
Ralph Kimball defines a data warehouse as follows:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
This can also be considered a functional view of a data warehouse.
Facts and dimensions form the core of any business intelligence effort.
These tables contain the basic data used to conduct detailed analyses and derive business value.
Fact tables contain the data corresponding to a particular business process.
Each row represents a single event associated with that process and contains the measurement data associated with that event. For example, a retail organization might have fact tables related to customer purchases, customer service telephone calls and product returns. The customer purchases table would likely contain information about the amount of the purchase, any discounts applied and the sales tax paid.
In other words, a fact is a record in the fact table that consists of a number of measures along with multiple dimension keys, which anchor the measures in multidimensional space.
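As a rough sketch of this structure, the following Python snippet uses the standard library sqlite3 module to build a tiny in-memory star schema. The table names (fact_purchase, dim_customer, dim_store), column names, and sample rows are all assumptions made for illustration; they only mirror the customer purchases example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_store    (store_key    INTEGER PRIMARY KEY, store_name    TEXT);

-- Each fact row: dimension keys plus the measures for one purchase event.
CREATE TABLE fact_purchase (
    customer_key    INTEGER REFERENCES dim_customer(customer_key),
    store_key       INTEGER REFERENCES dim_store(store_key),
    date_key        INTEGER,
    purchase_amount REAL,
    discount_amount REAL,
    sales_tax       REAL
);

INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO dim_store    VALUES (10, 'Downtown'), (20, 'Airport');
INSERT INTO fact_purchase VALUES
    (1, 10, 20240105, 100.0, 10.0, 7.2),
    (2, 10, 20240105,  50.0,  0.0, 4.0),
    (1, 20, 20240106,  30.0,  5.0, 2.0);
""")

# The dimension key lets us aggregate a measure by a descriptive attribute.
rows = conn.execute("""
    SELECT s.store_name, SUM(f.purchase_amount)
    FROM fact_purchase AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.store_name
    ORDER BY s.store_name
""").fetchall()

print(rows)  # [('Airport', 30.0), ('Downtown', 150.0)]
```

The fact table itself holds only keys and numbers; the human-readable context comes from joining to the dimension tables.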
A measure is typically a numeric value, most often additive, which is used to measure the performance of the business. We typically perform a number of aggregations on measures, such as SUM, MIN, MAX and AVG. Measures can be split into three main categories:
Fully Additive – These types of measures can be added across all dimensions. They are the most common type of measure, since we generally retrieve a large number of rows when we query a data warehouse and the most useful thing to do with the measures is to add them up. Examples include Profit and Price Paid.
Semi Additive – These types of measures can only be added across certain dimensions. If we try adding them across, for example, the Time dimension, the resultant value won’t make any sense. We can, however, perform other aggregate functions on these measures, such as AVG, MIN and MAX. Examples include Account Balance and Inventory Level.
Non Additive – These types of measures cannot be added across any dimensions. These are typically the result of some mathematical calculation. Examples include Ratio and Percentage.
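The three categories can be made concrete with a small Python sketch. The account balances and sales figures below are made-up sample data, used only to show which aggregations give meaningful results.

```python
# Hypothetical daily balance snapshots: (date, account, balance).
balances = [
    ("2024-01-01", "acct_1", 100.0),
    ("2024-01-01", "acct_2", 200.0),
    ("2024-01-02", "acct_1", 150.0),
    ("2024-01-02", "acct_2", 250.0),
]

# Semi-additive: summing balances across accounts for one date is meaningful.
total_on_jan_2 = sum(b for d, _, b in balances if d == "2024-01-02")

# Summing one account across dates is not meaningful; averaging is.
acct_1_balances = [b for _, a, b in balances if a == "acct_1"]
avg_acct_1 = sum(acct_1_balances) / len(acct_1_balances)

# Hypothetical sales rows: (revenue, profit).
sales = [(100.0, 20.0), (300.0, 30.0)]

# Fully additive: profit can simply be summed across any dimension.
total_profit = sum(p for _, p in sales)

# Non-additive: adding per-row margins gives a meaningless number.
# Recompute the ratio from the additive components instead.
overall_margin = total_profit / sum(r for r, _ in sales)

print(total_on_jan_2)   # 400.0
print(avg_acct_1)       # 125.0
print(total_profit)     # 50.0
print(overall_margin)   # 0.125
```

A common design consequence is to store the additive components (here revenue and profit) in the fact table and derive ratios at query time.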
Fact Table Grain
When designing a fact table, developers must pay careful attention to the grain of the table — the level of detail contained within the table.
The developer designing the purchase fact table described above would need to decide, for example, whether the grain of the table is a customer transaction or an individual item purchase. In the case of an individual item purchase grain, each customer transaction would generate multiple fact table entries, corresponding to each item purchased.
Within a fact table, only facts consistent with the declared grain are allowed. For example, in a retail sales transaction, the quantity of a product sold and its extended price are good facts, whereas the store manager’s salary is disallowed.
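To see how the choice of grain changes the fact table, here is a minimal Python sketch that loads the same two hypothetical transactions at both grains. The transaction contents and field names are assumptions for illustration only.

```python
# Hypothetical source data: each transaction lists (product, quantity, unit_price).
transactions = [
    {"transaction_id": "T1", "items": [("mug", 2, 5.0), ("plate", 1, 8.0)]},
    {"transaction_id": "T2", "items": [("mug", 1, 5.0)]},
]

# Grain: one row per individual item purchased.
line_item_facts = [
    {
        "transaction_id": t["transaction_id"],
        "product": product,
        "quantity": qty,
        "extended_price": qty * unit_price,
    }
    for t in transactions
    for product, qty, unit_price in t["items"]
]

# Grain: one row per customer transaction (product detail is lost).
transaction_facts = [
    {
        "transaction_id": t["transaction_id"],
        "item_count": sum(qty for _, qty, _ in t["items"]),
        "total_amount": sum(qty * price for _, qty, price in t["items"]),
    }
    for t in transactions
]

print(len(line_item_facts))    # 3
print(len(transaction_facts))  # 2
```

Both tables add up to the same total amount, but only the line-item grain can answer questions about individual products.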
As per Kimball, the four key decisions made during the design of a dimensional model include:
1. Select the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.
What are Dimensions?
Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects. For example, in the retail scenario, we discussed that purchases, returns and calls are facts. On the other hand, customers, employees, items and stores are dimensions and should be contained in dimension tables.
Dimension tables contain details about each instance of an object. For example, the items dimension table would contain a record for each item sold in the store. It might include information such as the cost of the item, the supplier, color, sizes, and similar data.
As per Kimball, dimensions provide the “who, what, where, when, why, and how” context surrounding a business process event. Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts. With the grain of a fact table firmly in mind, all the possible dimensions can be identified. Whenever possible, a dimension should be single valued when associated with a given fact row.
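Filtering and grouping facts by dimension attributes can be sketched in plain Python as follows. The dim_item and fact_sales structures, their keys, and all values are hypothetical; they loosely follow the items dimension described above.

```python
from collections import defaultdict

# Hypothetical item dimension: surrogate key -> descriptive attributes.
dim_item = {
    1: {"item_name": "Mug", "color": "blue", "supplier": "Acme"},
    2: {"item_name": "Plate", "color": "white", "supplier": "Acme"},
    3: {"item_name": "Bowl", "color": "blue", "supplier": "Globex"},
}

# Hypothetical fact rows: each points to exactly one item (single valued).
fact_sales = [
    {"item_key": 1, "quantity": 2, "amount": 10.0},
    {"item_key": 2, "quantity": 1, "amount": 8.0},
    {"item_key": 3, "quantity": 4, "amount": 24.0},
    {"item_key": 1, "quantity": 1, "amount": 5.0},
]

# Grouping: total sales amount by the "color" attribute.
amount_by_color = defaultdict(float)
for row in fact_sales:
    color = dim_item[row["item_key"]]["color"]
    amount_by_color[color] += row["amount"]

# Filtering: total sales amount for items from one supplier.
acme_amount = sum(
    row["amount"]
    for row in fact_sales
    if dim_item[row["item_key"]]["supplier"] == "Acme"
)

print(dict(amount_by_color))  # {'blue': 39.0, 'white': 8.0}
print(acme_amount)            # 23.0
```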
Using dimensions, users can slice and dice the data in various ways. Every dimension in a data warehouse is generally made up of the following:
Surrogate Key – This replaces the production (source system) key. It is used to uniquely identify the rows within each dimension. Dimension tables are joined to the fact table through these keys.
Informational Attributes – These are the attributes that are simply added to a dimension for informational purposes. Typically, the history of these attributes doesn’t need to be maintained. Examples of informational attributes within a Customer dimension include Customer Name and Customer Email.
Analytical Attributes – These are the attributes by which we analyse (group and subset) the data. The history of these attributes needs to be preserved. Therefore, if an attribute value changes with time, we mustn’t simply overwrite the respective value. There are a number of techniques which could be employed in order to handle changing dimension attribute values. Such techniques are known as SCD (Slowly Changing Dimensions).
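The difference between the two attribute types can be sketched in Python. This is only one possible approach, assuming the informational attribute (customer_name) is overwritten in place, often called SCD Type 1, while a change to the analytical attribute adds a new row with a new surrogate key, often called SCD Type 2. The "segment" attribute, the apply_change helper, and the valid_from / valid_to / is_current columns are hypothetical names chosen for this illustration.

```python
from datetime import date
from itertools import count

_next_key = count(1)   # generator of surrogate keys
dim_customer = []      # the dimension table, one dict per row

def _new_row(customer_id, name, segment, valid_from):
    """Build a new current version row with a fresh surrogate key."""
    return {
        "customer_key": next(_next_key),  # surrogate key
        "customer_id": customer_id,       # production key from the source
        "customer_name": name,            # informational attribute
        "segment": segment,               # analytical attribute (assumed)
        "valid_from": valid_from,
        "valid_to": None,
        "is_current": True,
    }

def apply_change(customer_id, name, segment, effective):
    """Load one source record into the dimension."""
    versions = [r for r in dim_customer if r["customer_id"] == customer_id]
    current = next((r for r in versions if r["is_current"]), None)
    if current is None:
        dim_customer.append(_new_row(customer_id, name, segment, effective))
        return
    # Informational attribute: overwrite on every version, no history kept.
    for row in versions:
        row["customer_name"] = name
    # Analytical attribute: expire the current row and add a new version.
    if current["segment"] != segment:
        current["valid_to"] = effective
        current["is_current"] = False
        dim_customer.append(_new_row(customer_id, name, segment, effective))

apply_change("C-1", "Alice Smith", "Retail", date(2023, 1, 1))
apply_change("C-1", "Alice Jones", "Retail", date(2023, 6, 1))     # name only
apply_change("C-1", "Alice Jones", "Wholesale", date(2024, 1, 1))  # segment

for row in dim_customer:
    print(row["customer_key"], row["customer_name"], row["segment"], row["is_current"])
# 1 Alice Jones Retail False
# 2 Alice Jones Wholesale True
```

Fact rows loaded before the segment change keep pointing at surrogate key 1, so historical analysis by segment still reflects what was true at the time.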
Thanks to the following references: