referential integrity constraint

Implementing Referential Integrity in a Data Warehouse: A (Controversial) Decision with a Lasting Impact

No other feature in relational database management is, arguably, as integral as referential integrity (RI) constraints . The feature plays the all-important, two-fold role of ensuring the accuracy of data between tables in a database. Firstly, by preventing users from entering data that might not be entirely accurate, and secondly, limiting applications from pointing to nonexistent data.

Relational databases work by breaking the storage of data into elements that need to be put together to produce meaningful results. So, suffice to say that the absence of RI would lead to data getting dropped or duplicated, both of which means data losing its integrity.

The particular characteristic mentioned above requires you to be either all for it or all against it. This is one of the reasons why referential integrity continues to be a heated topic of debate, even for the champions of data warehousing.

Disagreements aside, there is no denying the criticality of referential integrity. So, we are here with some pros and cons to help you decide whether you should implement referential integrity in your data warehouse or rely on the Extract, Transform, and Load (ETL) processes in your application to prevent breaches of referential integrity.

Let the referencing begin!

What is Referential Integrity?

Referential Integrity in the data warehouse is a form of data integrity that is particularly important in relational databases. Relational databases break the storage of data down into elements that usually have to be joined back together again to produce meaningful results. If it is not implemented properly, data would get dropped or duplicated.

Referential Integrity & Data

Before moving forward, it is vital to make one thing clear: Referential integrity is a decision, NOT a standard practice. The choice you make depends on the data you are dealing with. Enabling referential integrity constraints, such as primary and foreign keys, in a relational data warehouse or database may, at times, increase the load-time. This is a valid concern, especially if you are talking about enforced constraints that are checked in real-time during the loading process.

At the other end of the spectrum, a valid argument for not enforcing referential integrity at the data warehouse level is that ETL processes sometimes have to load data in such a way as to violate referential integrity. In such situations, there’s no alternative than letting the ETL process enforce referential integrity constraints In cases such as these, you must have strong standards and control over your ETL processes to prevent breaches of referential integrity.

Referential Integrity & Business Intelligence

Many business intelligence (BI) professionals insist on implementing referential integrity constraints, but problems arise when loading facts and dimensions since we tend to dump all the data and do a full re-load of tables. This simplifies critical lookups and ensures that the keys line up correctly. However, things so south when it comes to truncating a table with RI enabled. The error informs that you cannot truncate the table because a foreign key is referencing it. So, most BI developers do not implement RI, as it becomes a bit of a pain at the end of the day.

Referential Integrity & SQL Server

Blindly implementing RI because you’re “supposed to” is a presumptuous, neglectful response to a complex problem. A vast majority of industry insiders will agree that leaving constraints at the application layer is a novice mistake. Nevertheless, it may be our best choice when it comes to data. Yet, having table-driven constraints is always preferred, and that is where SQL Server comes to your rescue.

Internally, SQL Server holds a lot of metadata about tables, indexes, relationships, and column types, to name a few. Better yet, all that information is query-able through the system catalog views, provided you have the appropriate permissions. This is how you can build referential integrity in your data warehouse and re-establish it to create your ETL application using SQL Server.

Yay or Nay for RI Constraints in Your Data Warehouse?

Despite all the problems stopping you from implementing RI, it still plays a vital role in saving your data from being permanently corrupted or deleted. Careless mistakes such as deleting data in one table, or even erroneously dropping a table can be avoided by using the Foreign Key constraint of RI. Talk about a job saver! But data deletion isn’t the only concern; performance is an equally important factor to consider.

Retaking SQL Server as an example; it doesn’t just enforce RI, it actually feeds this knowledge into the database optimizer. This means that certain queries can be more efficient when RI is in place. For instance, you might see a performance degradation on “writes”, you’ll see a performance gain on “reads”. Since data warehousing is our focus, having slow writes and fast reads is an OK trade-off.

A proven way to make an informed decision about whether to implement referential integrity constraints or not is to sit with your team and ask:

  • Will implementing RI improve or impede the team’s performance?
  • Does our data warehouse architecture have adequate data integrity checks?
  • What will the downstream impact be on our development and support lifecycles?

Putting the Pieces Together

Since RI is an important decision, DWA’s must test and question everything they implement. Researching and testing your solutions, even as you continue to implement them, is the best way to get your data warehouse to work efficiently.

For further assistance, feel free to get in touch with our experts to build your ETL.