Data Cleansing - Get Rid of Dirty Data in Your Data Warehouse

Data Warehouse Cleansing: Ensure Consistent, Trusted Enterprise Data

Over the past few decades, enterprise data warehouses have evolved to accommodate and process ever-growing volumes of data. Businesses are revamping their legacy architectures with emerging data warehousing techniques such as data warehouse automation (DWA) and data virtualization. Ultimately, every business wants to work with high-quality, trusted data for well-grounded decision making, and that is where data warehouse cleansing comes in.

Yet only 27% of corporations with business programs in place report any substantial success with analytics and reporting. The reason for such low success rates is that companies don’t have quality data to work with. According to a study published in the Harvard Business Review, only 3% of the enterprises surveyed met the quality standards for business data. With erroneous and duplicate data feeding your analytics tools, effective business intelligence for analytics and reporting is nearly impossible.

When structuring your data warehouse architecture, you must design processes that churn out clean data for analysis. This blog post focuses on the importance of an effective data cleansing process, and the challenges involved in building one, for a successful BI experience.

Why Cleanse Your Data Warehouse?

Data cleansing is a process that enterprises employ to eliminate or “cleanse” poor quality information, sometimes referred to as coarse data, from a data repository. The process entails first detecting duplicate, corrupt, or inaccurate datasets within a data repository and then applying corrective actions like replacing, modifying, or even deleting data records to ensure data consistency and integrity.
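
To make this concrete, here is a minimal sketch in Python using pandas, assuming a small, hypothetical employee extract: it first detects duplicate and incomplete records, then applies simple corrective actions. The column names and sample values are purely illustrative.

```python
import pandas as pd

# Hypothetical extract; names, SSNs, and columns are illustrative only.
records = pd.DataFrame({
    "employee": ["Eva Smith", "Emily Miller", "Eva  Smith", None],
    "ssn":      ["987654",    "123456",       "987654",     "555555"],
})

# Detect: flag records that share an SSN and records missing mandatory fields.
duplicates = records[records.duplicated(subset="ssn", keep=False)]
incomplete = records[records["employee"].isna()]
print(f"{len(duplicates)} duplicate and {len(incomplete)} incomplete records found")

# Correct: normalize whitespace, drop redundant copies, remove incomplete rows.
cleansed = (
    records
    .assign(employee=records["employee"].str.replace(r"\s+", " ", regex=True).str.strip())
    .drop_duplicates(subset=["employee", "ssn"], keep="first")
    .dropna(subset=["employee"])
)
print(cleansed)
```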

When integrating data from even a single source, inconsistencies often arise because incomplete or erroneous information is fed into the databases. If an effective data cleansing strategy is not implemented in time, these errors may surface in your analytics reports, leading users to base their decisions on inaccurate insights.

In scenarios where data comes in from disparate sources, combining them into a consolidated repository, such as a data warehouse or a federated system of operational databases, can further complicate things. When dealing with dissimilar sources, structural differences and redundant data representations across multiple databases make it challenging to represent enterprise data in a unified format in the data warehouse.

One way of fixing these errors is to traverse entire databases and correct information along the way. But with ever-increasing volumes of enterprise data, such a laborious, time-intensive technique is infeasible to perform on a daily basis.

Alternatively, you can use data quality software specifically built to cleanse EDW data.

How to Choose a Data Warehouse Cleansing Strategy?

The availability of reliable, precise, and unique data is vital for an operational business model because it is a critical aspect of intelligent decision making. Generally, you first load data into a staging area and then perform data cleaning and scrubbing by applying various transformations.

To speed up the process, modern data cleansing tools provide the option to stage and process data in memory, saving the time and resources spent on physical data movement. This also lets you rigorously test your data cleansing configurations before exporting the data back to its destination.
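
As a rough illustration of that stage-transform-validate flow, the sketch below uses Python and pandas on an in-memory staging copy. The table and column names (customer_id, city) are assumptions made for the example, and the read_sql/to_sql calls that would connect it to a real source and warehouse are only indicated in a comment.

```python
import pandas as pd

def cleanse(staged: pd.DataFrame) -> pd.DataFrame:
    # Transformations run against the in-memory staging copy, not the source system.
    out = staged.copy()
    out["city"] = out["city"].str.strip().str.title()
    return out.drop_duplicates(subset="customer_id")

def validate(df: pd.DataFrame) -> None:
    # Test the cleansing configuration before anything is exported to the warehouse.
    assert df["customer_id"].is_unique, "duplicate customer_id after cleansing"
    assert df["city"].notna().all(), "missing city values after cleansing"

# In practice the staging copy would come from the source system, e.g. via
# pd.read_sql(...), and the result would be written back with df.to_sql(...).
staged = pd.DataFrame({"customer_id": [1, 1, 2], "city": [" chicago", " chicago", "south fork "]})
cleansed = cleanse(staged)
validate(cleansed)
print(cleansed)
```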

When working with a data cleansing tool, make sure that it is able to:

  • Minimize manual intervention and require minimal programming effort
  • Perform data cleaning with schema-related transformations that depend on the available metadata
  • Identify and remove all data inconsistencies in single and multiple data source integrations
  • Accommodate additional data sources in the future
  • Provide mapping functions for data cleaning in a declarative way that can be reused for query processing along with other data sources
  • Support a workflow infrastructure that efficiently transforms large amounts of data from multiple sources

Common Data Cleaning Problems

Data transformations are often applied to address a myriad of issues with data sources, such as schema evolution, migration of data from a legacy information system to a new one, or integration of data from diverse sources.

Generally, when integrating data for unified views, several schema-level and instance-level complications can arise. Based on where these issues originate, we have broadly categorized them into:

  • Single-source data integration
  • Multi-source data integration

Single-source data integration

For any given source, data quality depends significantly on its schema and integrity constraints, which control the range of permissible values of its attributes. Sources with well-defined schemas, such as relational databases with a precise data model and application-specific constraints, suffer data quality problems less frequently. On the other hand, schema-less sources, such as flat files, have insufficient constraints in place, which results in poor-quality data reaching the data warehouse.

Therefore, schema-related problems generally arise from unaddressed model- or application-specific complications. Possible causes include:

  • A weak schema design
  • Data model restrictions
  • Only a limited set of constraints, defined to reduce the overhead of integrity checking

Let’s understand schema-related issues with the help of an example:

Existing data entry | Employee: ‘Eva Smith’, SSN: ‘987654’
New data entry      | Employee: ‘Emily Miller’, SSN: ‘987654’
Conflict            | Data uniqueness violated

Table 1

In the table above, notice how data uniqueness is violated when a new employee with the same social security number is inserted into a database. This violation generally happens due to a lack of schema constraints in place.
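
For illustration, the snippet below reproduces this scenario with Python’s built-in sqlite3 module: a UNIQUE constraint on the SSN column is the kind of schema-level control whose absence allows the conflict shown in Table 1. The table definition is, of course, a simplified assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A UNIQUE constraint on SSN is the schema-level control that Table 1 is missing.
conn.execute("CREATE TABLE employee (name TEXT NOT NULL, ssn TEXT UNIQUE NOT NULL)")
conn.execute("INSERT INTO employee VALUES ('Eva Smith', '987654')")

try:
    conn.execute("INSERT INTO employee VALUES ('Emily Miller', '987654')")
except sqlite3.IntegrityError as err:
    # The database rejects the second record instead of silently storing bad data.
    print(f"Rejected: {err}")
```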

Another discrepancy that data professionals encounter in an EDW relates to the record instances themselves, where erroneous information is sometimes fed into the systems. Instance-specific issues arise from irregularities (such as incorrect spellings) that cannot be addressed at the schema level.

Let’s look at an instance-specific example:

Record   | City = ‘Chicago’, Zip = 77777
Conflict | Violates attribute dependencies

Table 2

Notice in Table 2 how the city name and the zip code do not correspond, violating attribute dependency. This is an instance-specific error which has no relation to the data source schema.
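
One way to catch this kind of violation is to check records against a reference list of valid city/ZIP pairs. The sketch below, in Python with pandas, assumes a tiny hand-made reference set; a real pipeline would validate against a complete postal dataset.

```python
import pandas as pd

# Hypothetical reference of valid (city, ZIP) pairs; a real check would use a postal dataset.
reference = {("Chicago", "60633"), ("South Fork", "48503")}

records = pd.DataFrame({
    "city": ["Chicago", "South Fork"],
    "zip":  ["77777",   "48503"],
})

# Flag rows whose (city, zip) pair is not in the reference, i.e. violates the dependency.
records["zip_valid"] = [
    (city, zip_code) in reference for city, zip_code in zip(records["city"], records["zip"])
]
print(records[~records["zip_valid"]])
```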

The availability of the necessary constraints in a data source can considerably reduce the need for extensive data cleansing mechanisms. If, however, you limit these constraints, you will need an efficient data cleaning infrastructure in place to ensure quality source data.

Enterprises can ensure consistent data quality by integrating data quality tools with their systems, along with a pre-defined set of constraints that control the data entering these structures.

Multi-source data integration

With multiple data sources, the issues described above are further magnified. Data in stand-alone systems is designed to work independently, following a specific structure for a set purpose. So when enterprises integrate data from diverse sources, where each source may contain bad data of its own, or combine data held in different representations, the information can overlap, potentially resulting in inaccuracies and bad data within the data warehouse.

Schema-related conflicts across multiple sources generally arise from naming conflicts, such as homonyms (the same name representing different entities) or synonyms (different names for the same entity). In addition, there can be structural conflicts between the varying data representations of dispersed sources, such as differing integrity constraints or data types, to name a few.

Instance-related complexities when integrating data from diverse sources are largely similar to the single-source ones discussed above. Additional problems arise when the value representations of related attributes differ (e.g., different encodings of marital status) or when values are interpreted differently (e.g., amounts recorded in US dollars vs. euros).
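
A common fix is to standardize such representations during cleansing. The sketch below, again in pandas, assumes one source encodes gender as 0/1 and another as 'F'/'M', and that amounts arrive in either US dollars or euros; the mapping and exchange rate are illustrative placeholders, not universal standards.

```python
import pandas as pd

# Source A encodes gender as 0/1, source B as 'F'/'M'; amounts arrive in USD and EUR.
GENDER_MAP = {0: "F", 1: "M", "F": "F", "M": "M"}
USD_PER_EUR = 1.08  # placeholder rate; a real pipeline would look this up

source_a = pd.DataFrame({"gender": [0, 1], "amount": [100.0, 250.0], "currency": ["USD", "USD"]})
source_b = pd.DataFrame({"gender": ["M", "F"], "amount": [90.0, 210.0], "currency": ["EUR", "EUR"]})

# Standardize value representations before loading into the unified schema.
combined = pd.concat([source_a, source_b], ignore_index=True)
combined["gender"] = combined["gender"].map(GENDER_MAP)
combined["amount_usd"] = combined.apply(
    lambda r: r["amount"] * USD_PER_EUR if r["currency"] == "EUR" else r["amount"], axis=1
)
print(combined)
```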

Another common problem in this scenario is data duplication. This happens when various records in several systems point to the same entity. With the surplus of customer and vendor data in business systems, duplicated records exacerbate the object identity problem and decrease the operational efficiency of enterprise systems.

Therefore, detecting these matching records, merging them into a single consolidated record, and purging the redundant copies before loading data into the warehouse enables better BI and analytics.

Let’s work through an example to grasp these concepts better.

Customer (source A)

CID | Name          | Street      | City                 | Sex
12  | Kristen Smith | 2 Hurley Pl | South Fork, MN 48503 | 0
25  | Summer Smith  | Hurley St 2 | S Fort MN            | 1

Client (source B)

Cno | LastName | FirstName | Gender | Address                                   | Phone
25  | Smith    | Sophia    | M      | 23 Harley St, Chicago IL, 60633-2394      | 444-333-6543
494 | Smith    | Kris L.   | F      | 2 Hurley Place, South Fork MN, 48503-5998 | 555-666-7777

Customers (sample integrated solution)

No | LName | FName      | Gender | Street         | City       | Zip        | Phone        | CID | Cno
1  | Smith | Kristen L. | F      | 2 Hurley Place | South Fork | 48503-5998 | 555-666-7777 | 12  | 494
2  | Smith | Summer     | M      | 2 Hurley Place | South Fork | 48503-5998 |              | 25  |
3  | Smith | Sophia     | M      | 23 Harley St   | Chicago    | 60633-2394 | 444-333-6543 |     | 25

In the above example, notice that both sources follow the relational database model but have different schemas. Moreover, when trying to integrate them, several data conflicts may arise.

Schema-level conflicts here refer to the naming differences (CID/Cno, Sex/Gender, etc.), while the structural differences concern the address representations, which vary between the two sources. Furthermore, there appears to be a duplicate record (Kristen Smith in source A and Kris L. Smith in source B).

These data representation issues can be resolved using the merge/purge technique, which integrates data from multiple sources, detects duplicates, and purges the redundant records from the databases.
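
The sketch below walks through a highly simplified merge/purge pass in pandas over extracts modeled on the tables above: it maps both schemas onto a common layout (the merge step), then collapses records that share a crude match key (the purge step). The match key, gender mapping, and address parsing are illustrative assumptions; commercial tools apply far richer matching rules.

```python
import pandas as pd

# Simplified extracts modeled on the source tables above; values are abridged.
customer_a = pd.DataFrame({
    "CID": [12, 25],
    "Name": ["Kristen Smith", "Summer Smith"],
    "Street": ["2 Hurley Pl", "Hurley St 2"],
    "Sex": [0, 1],
})
client_b = pd.DataFrame({
    "Cno": [25, 494],
    "LastName": ["Smith", "Smith"],
    "FirstName": ["Sophia", "Kris L."],
    "Gender": ["M", "F"],
    "Address": ["23 Harley St, Chicago IL, 60633-2394",
                "2 Hurley Place, South Fork MN, 48503-5998"],
})

# Merge step: map both schemas onto one common layout.
a = pd.DataFrame({
    "LName": customer_a["Name"].str.split().str[-1],
    "FName": customer_a["Name"].str.split().str[0],
    "Gender": customer_a["Sex"].map({0: "F", 1: "M"}),
    "Street": customer_a["Street"],
    "CID": customer_a["CID"],
})
b = pd.DataFrame({
    "LName": client_b["LastName"],
    "FName": client_b["FirstName"],
    "Gender": client_b["Gender"],
    "Street": client_b["Address"].str.split(",").str[0],
    "Cno": client_b["Cno"],
})
combined = pd.concat([a, b], ignore_index=True)

# Purge step: collapse records that share a crude match key (last name, first
# initial, street number), keeping the first non-null value per column so the
# CID and Cno of a matched pair both survive in the merged record.
combined["match_key"] = (
    combined["LName"].str.lower() + "|"
    + combined["FName"].str[0].str.lower() + "|"
    + combined["Street"].str.extract(r"(\d+)", expand=False)
)
deduplicated = combined.groupby("match_key", as_index=False).first()
print(deduplicated)
```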

Integrating data from dispersed sources for BI and analytics is vital for robust decision making. With incomplete, erroneous data plaguing information systems, investing in such data quality tools can help you achieve the desired results.

Need help in incorporating a reliable data cleansing system into your enterprise data warehouse? Our solution experts can help you out.
