Today, one of the most common debates among data warehouse managers is which schema design delivers better performance. However, it's important to recognize that neither normalization nor denormalization can be written off, since each has its pros and cons.
Therefore, before detailing their differences through use cases, let’s look at normalization and denormalization.
Normalization is the process of reorganizing data in a data warehouse to meet two fundamental requirements:
- Eliminate redundancy by storing each piece of data in exactly one place
- Preserve data dependencies, i.e., related data items are stored together
Normalization is critical for several reasons, but primarily because it enables a data warehouse to occupy as little disk space as possible, which results in improved performance.
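The redundancy-removal idea above can be sketched with plain Python dicts. The customer and order records here are hypothetical, chosen only to show how normalization stores each fact exactly once:

```python
# Denormalized: the customer's city is repeated in every order row.
orders_redundant = [
    {"order_id": 1, "customer": "Ada", "city": "London", "total": 40},
    {"order_id": 2, "customer": "Ada", "city": "London", "total": 15},
]

# Normalized: each fact lives in one place; orders reference customers by key.
customers = {"C1": {"name": "Ada", "city": "London"}}
orders = [
    {"order_id": 1, "customer_id": "C1", "total": 40},
    {"order_id": 2, "customer_id": "C1", "total": 15},
]

# Updating the city now touches one record instead of every order row.
customers["C1"]["city"] = "Paris"
cities = {customers[o["customer_id"]]["city"] for o in orders}
print(cities)  # {'Paris'}
```

In the redundant layout, the same update would have to be applied to every order row, risking inconsistency if one row is missed.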
Denormalization is the opposite strategy, used to enhance read performance. Denormalization adds redundant data to a normalized data warehouse to minimize the running time of specific database queries that combine data from many tables into one.
In fact, denormalization is defined in terms of normalization: the database is first organized into tables with repetition removed, and redundancy is then deliberately reintroduced for a given use case. Remember, a denormalized database should never be mistaken for a database that was never normalized.
Normalization and Denormalization – Use Cases
The choice between a normalized and a denormalized schema in a NoSQL database, such as DynamoDB, depends on your use case. However, professionals recommend designing your DynamoDB tables with a denormalized schema for two reasons:
- Amazon DynamoDB is schemaless: when you create a table, you only need to specify the primary key attributes (the partition key, or the partition key plus a sort key). You don't have to define any other attributes beforehand.
- DynamoDB doesn't support join operations across tables.
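Because there are no joins, an application that keeps normalized tables must stitch related items together itself, one extra lookup at a time. A minimal sketch of that application-side join, using in-memory dicts as stand-ins for two hypothetical tables:

```python
# In-memory stand-ins for two normalized tables (names are hypothetical).
users_table = {"u1": {"user_id": "u1", "name": "Ada"}}
posts_table = [{"post_id": "p1", "author_id": "u1", "title": "Hello"}]

def fetch_feed():
    # One extra lookup per post -- the cost of normalization without joins.
    return [
        {**post, "author": users_table[post["author_id"]]["name"]}
        for post in posts_table
    ]

feed = fetch_feed()
print(feed[0]["author"])  # Ada
```

In a real DynamoDB deployment each of those lookups would be a separate `GetItem` request, which is exactly the round-trip cost a denormalized schema avoids.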
Nonetheless, there are certain situations where you can instead use a normalized schema.
Normalized Schema can be Considered When:
- You need to store items larger than 400 KB, which is the maximum item size DynamoDB permits. Larger attributes can be stored in Amazon S3 or in a separate DynamoDB table.
- You're expecting varied access patterns. For instance, take a product order table, which is accessed every time a client orders a product, and a product availability table, which is only accessed occasionally. The two tables have different read and write capacity requirements.
- Your applications perform frequent updates. In Amazon DynamoDB, a WCU (write capacity unit) is defined as one write per second for an item up to 1 KB in size. Writes of items larger than 1 KB consume additional WCUs. Also, even if you're updating just one attribute, the WCU calculation is based on the size of the complete item.
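The WCU rule above is simple arithmetic and is worth seeing worked out. A small sketch of the standard-write cost, rounding the whole item's size up to the next KB:

```python
import math

def wcus_per_write(item_size_kb: float) -> int:
    """Standard writes: one WCU per full or partial KB of the *whole* item."""
    return max(1, math.ceil(item_size_kb))

print(wcus_per_write(0.5))  # 1
print(wcus_per_write(3.2))  # 4
```

Note that updating a single small attribute of a 3.2 KB item still costs 4 WCUs, which is why update-heavy workloads favor splitting large items into a normalized schema.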
Denormalized Schema can be Considered When:
- You need to store small items with few attributes. Ideally, the item size shouldn't exceed 4 KB for reads, since an RCU (read capacity unit) is defined as one strongly consistent read per second for an item up to 4 KB in size. For writes, the item size ideally shouldn't exceed 1 KB.
- Your applications need to read and write data in a high-traffic environment without worrying about synchronizing data for consistency across multiple tables.
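The RCU sizing rule mirrors the WCU one, with a 4 KB unit and a discount for eventually consistent reads. A minimal sketch:

```python
import math

def rcus_per_read(item_size_kb: float, eventually_consistent: bool = False) -> float:
    """One RCU = one strongly consistent read/second of an item up to 4 KB;
    eventually consistent reads cost half as much."""
    units = max(1, math.ceil(item_size_kb / 4))
    return units / 2 if eventually_consistent else units

print(rcus_per_read(3.5))        # 1
print(rcus_per_read(9))          # 3
print(rcus_per_read(3.5, True))  # 0.5
```

Keeping denormalized items under 4 KB means every read costs a single RCU, which is what makes the flat schema cheap at high traffic.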
Normalization Use Case in Healthcare Data Warehousing
Today, data normalization plays a broad role across healthcare settings, as it provides a framework for interpreting data consistently across systems and use cases.
To simplify things, the following is a common area that employs data normalization:
Making Data Useful Within the Data Warehouse
A typical data warehouse fetches patient data from a host of IT and clinical systems. That's because organizations want to centralize reporting projects and quality programs, use data to build predictive models, and combine claims and clinical data to monitor care delivery.
So, what are the applications of normalized data in healthcare?
Let's say a hospital intends to trend hemoglobin A1C levels for a population of diabetic patients. However, it's common for laboratory systems to use their own proprietary lab codes that are not standardized to LOINC. The hospital's data warehouse may contain Lab A results coded as 4321/Hgb A1c blood. In the same repository, the hospital may have Lab B results already standardized to LOINC code 17855-8.
Both represent A1C lab results, but the data must be normalized to a standard like LOINC for the healthcare system to correctly interpret and analyze both lab results together.
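A minimal sketch of that code normalization step, using a hypothetical site-specific mapping table from proprietary codes to LOINC (the 17855-8 code is the one named above; the mapping itself is illustrative):

```python
# Hypothetical site-specific mapping from proprietary lab codes to LOINC.
LOCAL_TO_LOINC = {"4321/Hgb A1c blood": "17855-8"}

def normalize_lab_code(code: str) -> str:
    # Codes already in LOINC pass through unchanged.
    return LOCAL_TO_LOINC.get(code, code)

results = [
    {"lab": "A", "code": "4321/Hgb A1c blood", "value": 6.1},
    {"lab": "B", "code": "17855-8", "value": 7.4},
]
normalized = [{**r, "code": normalize_lab_code(r["code"])} for r in results]
# Both results can now be trended together under a single LOINC code.
```

After normalization, a single query on the LOINC code retrieves A1C results from both labs.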
Denormalization Use Case in a Blogging Platform
Consider a blogging platform where you store posts and users. Posts reference their authors through an Author_ID field. You would store these two entities in separate collections because posts accumulate views, likes, comments, and other statistics, and therefore need more write throughput than users.
Now, suppose you want to serve a feed of the most recent posts with the author details embedded in each post. In a relational data warehouse, this is accomplished by making the posts' Author_ID a foreign key and querying the users and posts tables with a join to build the feed in real time.
However, that option is not available here; instead, you have to denormalize your data by maintaining the feed in a separate container that is ready to be queried. This container holds a copy of the data stored in the users and posts containers, with the author details embedded in each post.
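The pre-built feed container described above can be sketched with in-memory collections standing in for the users, posts, and feed containers (all names hypothetical):

```python
users = {"u1": {"name": "Ada", "avatar": "ada.png"}}
posts = [{"post_id": "p1", "author_id": "u1", "title": "Hello", "likes": 3}]

# Denormalized feed container: author details are copied into each post,
# so serving the feed is a single read with no join or per-post lookup.
feed = [
    {**post, "author_name": users[post["author_id"]]["name"]}
    for post in sorted(posts, key=lambda p: p["post_id"], reverse=True)
]
print(feed[0]["author_name"])  # Ada
```

The trade-off is write-side: if a user changes their name, every copied entry in the feed container must be updated, which is why this pattern fits read-heavy feeds.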
The schema you choose for your data warehouse primarily depends on your applications' access patterns and the size of your data items. You can consult our solution architects to apply advanced automation techniques when normalizing or denormalizing your data warehouse.