A schema refers to the structure or organization of a database. It contains a logical description of the entire database, which includes names and descriptions of tables, records, views, and indexes. While a general-purpose database is usually described with a relational model, data warehouse schemas are more specialized because their structure is optimized for reporting and analysis.
The two types of schemas in a dimensional data warehouse that we’ll discuss in this article are:
- Star Schema
- Snowflake Schema
But before we go into detail about the two types, we must first understand what Fact and Dimension tables are.
What is a Dimension Table?
In the schema diagram, you can see dimension tables such as dim_time and dim_product. Each dimension table groups a set of related descriptive attributes. Dimensions are essentially collections of information that can be referenced to answer meaningful business questions when used together with fact tables.
For instance, the dim_time dimension table here contains time attributes, and dim_product contains product names and types. When these descriptive attributes are used with the fact_sales table, a business can find out the quantity of a specific product sold over a defined period, or the revenue generated from a specific product. Such questions are answered by joining the fact table to the relevant dimension tables, and because dimension tables are generally denormalized, fewer joins are needed to answer them.
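The question above (how much of one product sold over a defined period, and for how much revenue) can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the article's actual schema: only the table names dim_time, dim_product, and fact_sales come from the text, while every column name, the sample rows, the helper product_sales, and the reading of price as a unit price are assumptions.

```python
import sqlite3

# In-memory database; column names and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY,
    year    INTEGER,
    month   INTEGER
);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    product_type TEXT
);
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    price      REAL  -- assumed to be the unit price
);
INSERT INTO dim_time VALUES (1, 2023, 1), (2, 2023, 2), (3, 2023, 3);
INSERT INTO dim_product VALUES (10, 'Laptop', 'Electronics'), (20, 'Desk', 'Furniture');
INSERT INTO fact_sales VALUES
    (1, 10, 2, 900.0),
    (2, 10, 1, 900.0),
    (3, 10, 4, 900.0),
    (2, 20, 3, 150.0);
""")


def product_sales(conn, product_name, year, first_month, last_month):
    """Return (total quantity, total revenue) for one product over a month range."""
    return conn.execute(
        """
        SELECT COALESCE(SUM(f.quantity), 0),
               COALESCE(SUM(f.quantity * f.price), 0.0)
        FROM fact_sales AS f
        JOIN dim_time    AS t ON t.time_id = f.time_id
        JOIN dim_product AS p ON p.product_id = f.product_id
        WHERE p.product_name = ?
          AND t.year = ?
          AND t.month BETWEEN ? AND ?
        """,
        (product_name, year, first_month, last_month),
    ).fetchone()


# Laptops sold in January and February 2023.
laptop_sales = product_sales(conn, "Laptop", 2023, 1, 2)
print(laptop_sales)
```

The filters in the WHERE clause use only dimension attributes, while the aggregated numbers come only from the fact table, which is the division of labor the two kinds of table are meant to have.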
What is a Fact Table?
As seen in the diagram above, multiple dimension tables are linked to one fact table, which contains ‘keys’ and ‘measures’. By ‘keys’, we’re referring to the foreign keys of every associated dimension. Keys are used to join the fact table with dimension tables when running queries. ‘Measures’ are numeric data, such as price and quantity, that represent business events or transactions. They add detail to dimension data so that effective reports can be generated. Measures like price and quantity are useful, but on their own they give the business no context for analyzing sales (see the example in ‘What is a Dimension Table?’).
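The split between keys and measures can be made concrete with a small sketch using Python's built-in sqlite3 module. The column names and rows are assumptions for illustration. The sketch also turns on SQLite's foreign key enforcement, which is off by default, to show that a fact row pointing at a non-existent dimension row is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    -- keys: one foreign key per associated dimension
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    time_id    INTEGER NOT NULL REFERENCES dim_time(time_id),
    -- measures: numeric values describing the business event
    quantity   INTEGER NOT NULL,
    price      REAL NOT NULL
);
INSERT INTO dim_product VALUES (10, 'Laptop');
INSERT INTO dim_time VALUES (1, 2023, 1);
INSERT INTO fact_sales VALUES (10, 1, 2, 900.0);
""")

# Measures on their own: a total with no business context attached.
total_quantity = conn.execute("SELECT SUM(quantity) FROM fact_sales").fetchone()[0]

# A fact row whose key matches no dimension row is rejected.
try:
    conn.execute("INSERT INTO fact_sales VALUES (99, 1, 1, 50.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(total_quantity, orphan_rejected)
```

Whether a real warehouse enforces these constraints in the database or in its loading process is a design choice. The sketch only shows what the keys are for.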
What is a Star Schema?
The star schema categorizes business data into facts and dimensions to optimize the structure for reporting. As discussed, fact tables contain quantitative data about a business, while dimension tables contain data that describes the facts, helping the user make sense of the quantitative data in context. In a star schema, the fact table typically sits at the center with the dimension tables arranged around it, so the diagram resembles a star, hence the name. It is the simplest reporting schema and also the most popular.
Should You Use Star Schema in Your Data Warehouse?
The inherent simplicity of star schemas makes them ideal for reporting on smaller data sets, like when you’re creating a data mart that focuses only on similar data, grouped so a specific group of people can address their reporting needs. Data marts could be considered a mini data warehouse for, let’s say, the sales department in your organization. Using a star schema for such a data mart allows rapid access to data because of denormalization.
However, problems arise when a star schema is used as the foundation for an entire data warehouse, because reporting requirements change often. When requirements change, the star schema needs to change too, because its structure is designed to answer business questions from specific perspectives only. If you want a different perspective, you have to add dimensions to the star schema. In most cases this cannot be done without a complete redesign of the schema, unless you’re building an agile data warehouse where you can quickly rebuild data models and schemas.
With data warehouse automation tools designed to manage and change structures with agility, you can take the base of the data warehouse and automatically reshape it however you want, in order to address whichever business intelligence issue you need to resolve.
If, however, you’re not sure that a star schema alone will fulfill your data warehousing requirements, you should consider exploring the snowflake schema.
What is a Snowflake Schema?
In the snowflake schema, dimensions are stored in multiple dimension tables instead of a single table per dimension. The multiple tables associated with a particular dimension branch out further, which you can see from the dim_time dimension in the above diagram. This “branching out” results in a diagram that resembles a snowflake, thus the name. In other words, it is an extension of a star schema.
When multiple tables are created for a single dimension, a certain degree of normalization is involved. On the plus side, this reduces the redundancy and the extra disk space that are typical of a star schema, where the same values are repeated across many dimension records. On the other hand, more complex joins are required to answer business queries, which slows down query performance.
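The trade-off described above, fewer repeated values at the cost of an extra join, can be sketched with Python's built-in sqlite3 module. Here the product type is moved out of the product dimension into its own table. All table and column names other than fact_sales are assumptions made for this example, as are the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one table per dimension; the type name repeats on every product row.
CREATE TABLE star_dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    product_type TEXT
);
-- Snowflake: the type moves to its own table and products reference it.
CREATE TABLE dim_product_type (
    product_type_id INTEGER PRIMARY KEY,
    product_type    TEXT
);
CREATE TABLE snow_dim_product (
    product_id      INTEGER PRIMARY KEY,
    product_name    TEXT,
    product_type_id INTEGER REFERENCES dim_product_type(product_type_id)
);
CREATE TABLE fact_sales (product_id INTEGER, quantity INTEGER);

INSERT INTO star_dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Desk', 'Furniture');
INSERT INTO dim_product_type VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO snow_dim_product VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
INSERT INTO fact_sales VALUES (1, 2), (2, 5), (3, 1);
""")

# Star: a single join reaches the product type.
STAR_QUERY = """
SELECT p.product_type, SUM(f.quantity)
FROM fact_sales AS f
JOIN star_dim_product AS p ON p.product_id = f.product_id
GROUP BY p.product_type
ORDER BY p.product_type
"""

# Snowflake: the same question needs a second join through dim_product_type.
SNOWFLAKE_QUERY = """
SELECT pt.product_type, SUM(f.quantity)
FROM fact_sales AS f
JOIN snow_dim_product AS p  ON p.product_id = f.product_id
JOIN dim_product_type AS pt ON pt.product_type_id = p.product_type_id
GROUP BY pt.product_type
ORDER BY pt.product_type
"""

star_result = conn.execute(STAR_QUERY).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_QUERY).fetchall()

# How many times a type name is physically stored in each design.
star_type_values = conn.execute(
    "SELECT COUNT(product_type) FROM star_dim_product").fetchone()[0]
snowflake_type_values = conn.execute(
    "SELECT COUNT(product_type) FROM dim_product_type").fetchone()[0]

print(star_result, snowflake_result, star_type_values, snowflake_type_values)
```

Both queries return the same totals. The snowflake version stores each type name once but pays for it with one more join, and that cost grows as dimensions branch further.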
Should You Use Snowflake Schema in Your Data Warehouse?
As with the star schema, the snowflake schema has its own use cases. If a dimension has an attribute whose value is NULL for the majority of its records, it is advisable to move that attribute into a separate dimension table, which turns the star schema into a snowflake schema.
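Moving a mostly-NULL attribute into its own table can be sketched as follows with Python's built-in sqlite3 module. The attribute recall_notice, the table names dim_product_wide and dim_product_recall, and the sample rows are all hypothetical, chosen only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Original dimension: recall_notice (hypothetical) is NULL for most products.
CREATE TABLE dim_product_wide (
    product_id    INTEGER PRIMARY KEY,
    product_name  TEXT,
    recall_notice TEXT
);
INSERT INTO dim_product_wide VALUES
    (1, 'Laptop', NULL), (2, 'Phone', NULL), (3, 'Desk', NULL),
    (4, 'Heater', 'Faulty thermostat');

-- Snowflaked version: the sparse attribute lives in its own table,
-- which holds a row only for products that actually have a value.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_product_recall (
    product_id    INTEGER PRIMARY KEY REFERENCES dim_product(product_id),
    recall_notice TEXT NOT NULL
);
INSERT INTO dim_product
    SELECT product_id, product_name FROM dim_product_wide;
INSERT INTO dim_product_recall
    SELECT product_id, recall_notice FROM dim_product_wide
    WHERE recall_notice IS NOT NULL;
""")

recall_rows = conn.execute("SELECT COUNT(*) FROM dim_product_recall").fetchone()[0]

# A LEFT JOIN rebuilds the original wide view when the attribute is needed.
rebuilt = conn.execute("""
    SELECT p.product_id, p.product_name, r.recall_notice
    FROM dim_product AS p
    LEFT JOIN dim_product_recall AS r ON r.product_id = p.product_id
    ORDER BY p.product_id
""").fetchall()
original = conn.execute(
    "SELECT * FROM dim_product_wide ORDER BY product_id").fetchall()

print(recall_rows, rebuilt == original)
```

Queries that never touch the sparse attribute read only the narrower dim_product table, and the LEFT JOIN keeps products without a value in the result.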
Another use case is when you have attributes that are part of a hierarchy but are generally queried independently. The dim_time dimension in our snowflake schema diagram is a prime example of this. Weekday, month, or year are all part of a natural hierarchy and could be grouped, but your sales team will generally want to focus on a certain attribute at one time. It would make more sense to separate these attributes, as shown in our diagram.
The key to utilizing the right reporting schema is to ensure you have an agile approach to data warehousing, where your schemas are not set in stone and you have the flexibility to reshape and redesign them as per your business intelligence requirements without having to rewrite the code.