Over the past few decades, enterprise data warehouses have evolved to accommodate and process zettabytes of data. Businesses are revamping their legacy architectures with the help of emerging data warehousing techniques like data warehouse automation (DWA) and data virtualization, because every business wants to work with high-quality, trusted data for well-grounded decision making. Data warehouse cleansing is what makes that possible.
Yet only 27% of corporations with business programs in place report any substantial success with analytics and reporting. The reason for these low success rates is that companies don’t have quality data to work with. According to a study published in the Harvard Business Review, only 3% of the enterprises surveyed met basic quality standards for business data. With erroneous and duplicate data feeding your analytics tools, effective business intelligence for analytics and reporting is nearly impossible.
When structuring your data warehouse architecture, you must design processes that churn out clean data for analysis. This blog post focuses on the importance of an effective data cleansing process for a successful BI experience, and on the challenges that come with it.
Why Cleanse Your Data Warehouse?
Data cleansing is a process that enterprises employ to eliminate or “cleanse” poor quality information, sometimes referred to as coarse data, from a data repository. The process entails first detecting duplicate, corrupt, or inaccurate datasets within a data repository and then applying corrective actions like replacing, modifying, or even deleting data records to ensure data consistency and integrity.
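As a rough illustration of this detect-then-correct cycle, the Python sketch below flags duplicate rows and an obviously invalid value in a small, made-up customer table; the column names and the validity rule are assumptions for illustration only:

```python
import pandas as pd

# Illustrative records with one exact duplicate and one invalid email.
records = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "email": ["a@example.com", "a@example.com", "not-an-email", None],
})

# Detect and delete exact duplicates, keeping the first occurrence.
cleansed = records.drop_duplicates(keep="first").copy()

# Detect values that fail a simple validity rule and blank them out
# for later correction, rather than silently keeping bad data.
invalid_email = ~cleansed["email"].str.contains("@", na=False)
cleansed.loc[invalid_email, "email"] = None
```

Real pipelines apply far richer validation and correction rules, but the detect-then-correct structure stays the same.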
Even when integrating data from a single source, inconsistencies often arise because incomplete or erroneous information is fed into the databases. Without an effective data cleansing strategy in place, these errors eventually surface in your analytics reports, leading users to base their decisions on inaccurate insights.
In scenarios where data comes in from disparate sources, combining them into a consolidated repository, like a data warehouse or a federated system of operational databases, can further complicate things. When dealing with dissimilar sources, structural differences and redundant data representations across multiple databases make it challenging to present enterprise data in a unified format in the data warehouse.
One way of fixing these errors is by traversing entire databases and correcting information along the way. But with the ever-increasing amounts of enterprise data, such a laborious and time-intensive technique is infeasible to perform on a daily basis.
Alternatively, you can use data quality software specifically built to cleanse EDW data.
How to Choose a Data Warehouse Cleansing Strategy?
The availability of reliable, precise, and unique data is vital to an operational business model because it underpins intelligent decision making. Generally, you first load data into a staging area, where cleaning and scrubbing are performed by applying various transformations.
To speed up the process, modern data cleansing tools provide the option to store and process data in temporary memory, saving the time and resources spent on physical data movement. This also enables you to rigorously test your data cleansing configuration before loading the data back to its destination.
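A minimal sketch of that pattern, assuming a pandas/SQLAlchemy setup and hypothetical table names (raw_customers, clean_customers), could look like this:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection.
engine = create_engine("postgresql://user:password@host/warehouse")

# Extract the source data into an in-memory staging frame.
staging = pd.read_sql("SELECT * FROM raw_customers", engine)

# Apply cleansing transformations on the in-memory copy only.
staging["city"] = staging["city"].str.strip().str.title()
staging = staging.drop_duplicates(subset=["customer_id"])

# Test the cleansing configuration before anything is written back.
assert staging["customer_id"].is_unique
assert staging["city"].notna().all()

# Only then load the cleansed data to its destination.
staging.to_sql("clean_customers", engine, if_exists="replace", index=False)
```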
When working with a data cleansing tool, make sure that it is able to:
- Limit manual intervention and require minimal programming effort
- Perform data cleaning with schema-related transformations that depend on the available metadata
- Identify and remove all data inconsistencies in single and multiple data source integrations
- Accommodate additional data sources in the future
- Provide declarative mapping functions for data cleaning that can be reused for other data sources and for query processing (see the sketch after this list)
- Support a workflow infrastructure that efficiently transforms large amounts of data from multiple sources
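To make the declarative-mapping requirement concrete, here is one way such reusable rules could be expressed in Python; the rule format and column names are illustrative assumptions, not any particular tool’s syntax:

```python
import pandas as pd

# Each rule declares a target column and a corrective transform, so the same
# rule set can be reused across sources and pipelines.
RULES = [
    {"column": "gender", "transform": lambda s: s.astype(str).replace({"0": "F", "1": "M"})},
    {"column": "zip", "transform": lambda s: s.astype(str).str.extract(r"(\d{5})", expand=False)},
]

def apply_rules(df: pd.DataFrame, rules: list) -> pd.DataFrame:
    """Apply declarative cleaning rules to a copy of the input frame."""
    out = df.copy()
    for rule in rules:
        if rule["column"] in out.columns:
            out[rule["column"]] = rule["transform"](out[rule["column"]])
    return out
```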
Common Data Cleaning Problems
Data transformations are often applied to address a myriad of issues with data sources, such as schema evolution, migration from a legacy information system to a new one, or integration of data from diverse sources.
Generally, when integrating data for unified views, several schema-level and instance-level complications can arise. Considering the specific complexities of these issues, we have broadly categorized them into:
- Single-source data integration
- Multi-source data integration
Single-source data integration
The quality of a source’s data depends significantly on its schema and integrity constraints, which control the range of permissible values. Sources with well-defined schemas, such as relational databases with a precise data model and application-specific constraints, run into data quality problems less frequently. On the other hand, sources with no schema, such as flat files, have insufficient constraints in place, which results in poor-quality data reaching the data warehouse.
Therefore, schema-related problems generally arise due to unaddressed model- or application-specific complications. Possible reasons could be:
- A weak schema design
- Data model restrictions
- Constraints deliberately limited to reduce the cost of integrity checking
Let’s understand schema-related issues with the help of an example:
|  |  |
| --- | --- |
| Existing data entry | Employee: ‘Eva Smith’, SSN: ‘987654’ |
| New data entry | Employee: ‘Emily Miller’, SSN: ‘987654’ |
| Conflict | Data uniqueness violated |

Table 1
In the table above, notice how data uniqueness is violated when a new employee with the same social security number is inserted into the database. This violation generally happens because no schema constraint is in place to enforce uniqueness.
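In the absence of a database-level uniqueness constraint, a cleansing step can enforce the same rule before new records are accepted. A minimal sketch, using an illustrative employees table, might look like this:

```python
import pandas as pd

employees = pd.DataFrame({"employee": ["Eva Smith"], "ssn": ["987654"]})
new_entry = {"employee": "Emily Miller", "ssn": "987654"}

# Reject the insert if the SSN already exists, mirroring a UNIQUE constraint.
if new_entry["ssn"] in set(employees["ssn"]):
    raise ValueError(f"SSN {new_entry['ssn']} already exists: uniqueness violated")

employees = pd.concat([employees, pd.DataFrame([new_entry])], ignore_index=True)
```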
Another discrepancy that data professionals encounter in an EDW relates to the record instances themselves, where erroneous information is fed into the systems. Instance-specific issues arise from irregularities (like incorrect spellings or mismatched values) that cannot be addressed at the schema level.
Let’s look at an instance-specific example:
|  |  |
| --- | --- |
| Record | City = ‘Chicago’, zip = 77777 |
| Conflict | Violates attribute dependencies |

Table 2
Notice in Table 2 how the city name and the zip code do not correspond, violating attribute dependency. This is an instance-specific error which has no relation to the data source schema.
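Catching such violations requires a reference that encodes the dependency itself, for example a zip-to-city lookup. The sketch below uses a tiny hard-coded lookup purely for illustration; a real pipeline would validate against a postal reference dataset:

```python
# Stand-in for a real postal reference dataset.
ZIP_TO_CITY = {"60614": "Chicago", "48503": "South Fork"}

record = {"city": "Chicago", "zip": "77777"}

# Flag the record if the zip code does not map to the stated city.
expected_city = ZIP_TO_CITY.get(record["zip"])
if expected_city != record["city"]:
    print(f"Attribute dependency violated: zip {record['zip']} does not belong to {record['city']}")
```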
The availability of the necessary constraints in a data source can considerably reduce the need for extensive data cleansing mechanisms. If, however, you limit the constraints on your sources, you will need an efficient data cleaning infrastructure in place to ensure quality source data.
Enterprises can ensure standard data quality by integrating data quality tools with their systems, along with a pre-defined set of constraints to control the data that goes into these structures.
Multi-source data integration
With multiple data sources, the aforementioned issues are further magnified. Stand-alone systems are designed to work independently, each with its own infrastructure and purpose. When enterprises integrate data from such diverse sources, each source may contain bad data of its own, and combining data held in different representations can cause information to overlap, potentially resulting in inaccuracies and bad data within the data warehouse.
Schema-related conflicts across multiple sources generally arise from naming conflicts such as homonyms (the same name used for different entities) or synonyms (different names used for the same entity). In addition, there can be structural conflicts in how dispersed sources represent data, such as differing integrity constraints or data types, to name a few.
Instance-related complexities when integrating data from diverse sources are mostly similar to the single-source ones discussed above. Additional problems arise when attribute value representations differ (e.g. codes for marital status) or when values are interpreted differently (e.g. US dollars vs. euros).
Another common problem in this scenario is data duplication, which happens when records in several systems point to the same entity. With the surplus of customer and vendor data in business systems, duplicated data aggravates the object identity problem and decreases the operational efficiency of enterprise systems.
Therefore, eliminating duplicate records and merging them into a single representation in the data warehouse enables enhanced BI and analytics.
Let’s work through an example to grasp these concepts better.
Customer (source A)

| CID | Name | Street | City | Sex |
| --- | --- | --- | --- | --- |
| 12 | Kristen Smith | 2 Hurley P1 | South Fork, MN 48503 | 0 |
| 25 | Summer Smith | Hurley St 2 | S Fort MN | 1 |

Client (source B)

| Cno | LastName | FirstName | Gender | Address | Phone |
| --- | --- | --- | --- | --- | --- |
| 25 | Smith | Sophia | M | 23 Harley St, Chicago IL, 60633-2394 | 444-333-6543 |
| 494 | Smith | Kris L. | F | 2 Hurley Place, South Fork MN, 48503-5998 | 555-666-7777 |

Customers (sample integrated solution)

| No | LName | FName | Gender | Street | City | Zip | Phone | CID | Cno |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Smith | Kristen L. | F | 2 Hurley Place | South Fork | 48503-5998 | 555-666-7777 | 12 | 494 |
| 2 | Smith | Summer | M | 2 Hurley Place | South Fork | 48503-5998 |  | 25 |  |
| 3 | Smith | Sophia | M | 23 Harley St | Chicago | 60633-2394 | 444-333-6543 |  | 25 |
In the above example, notice that both sources follow the relational model but have different schemas, and several data conflicts arise when trying to integrate them.
Schema-level conflicts here include naming differences (CID/Cno, Sex/Gender, etc.), while structural conflicts show up in the different address representations used by the two sources. Furthermore, there appears to be a duplicate record: Kristen Smith in source A and Kris L. Smith in source B refer to the same person.
These data representation issues can be resolved using the merge/purge technique, which integrates data from multiple sources, detects and eliminates duplicates, and purges unnecessary records from the databases.
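As a rough sketch of merge/purge over the sample sources above (real tools rely on fuzzy matching and survivorship rules rather than this simple exact key):

```python
import pandas as pd

source_a = pd.DataFrame({
    "CID": [12, 25],
    "Name": ["Kristen Smith", "Summer Smith"],
    "City": ["South Fork, MN 48503", "S Fort MN"],
    "Sex": [0, 1],
})
source_b = pd.DataFrame({
    "Cno": [25, 494],
    "LastName": ["Smith", "Smith"],
    "FirstName": ["Sophia", "Kris L."],
    "Gender": ["M", "F"],
    "Address": ["23 Harley St, Chicago IL, 60633-2394",
                "2 Hurley Place, South Fork MN, 48503-5998"],
})

# Resolve schema-level conflicts: align column names and value representations.
a = pd.DataFrame({
    "LName": source_a["Name"].str.split().str[-1],
    "FName": source_a["Name"].str.split().str[0],
    "Gender": source_a["Sex"].map({0: "F", 1: "M"}),
    "Zip": source_a["City"].str.extract(r"(\d{5})", expand=False),
})
b = pd.DataFrame({
    "LName": source_b["LastName"],
    "FName": source_b["FirstName"],
    "Gender": source_b["Gender"],
    "Zip": source_b["Address"].str.extract(r"(\d{5})", expand=False),
})

# Merge, then purge: concatenate, build a match key, and collapse duplicates.
combined = pd.concat([a, b], ignore_index=True)
combined["match_key"] = combined["LName"].str.lower() + "|" + combined["Zip"].fillna("?")
deduplicated = combined.drop_duplicates(subset="match_key", keep="first")
```

On these rows the match key collapses Kristen Smith (source A) and Kris L. (source B) into one record, mirroring record 1 of the integrated table above.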
Integrating data from dispersed sources for BI and analytics is vital for robust decision making. But with incomplete, erroneous data plaguing information systems, investing in the right data quality tools is what yields the desired results.
Need help in incorporating a reliable data cleansing system into your enterprise data warehouse? Our solution experts can help you out.