Are you embarking on a cleansing journey for your data warehouse initiative? In this article, we will cover the 3 phases of data cleansing to help you define a process within your organization. Let’s jump right into it.
1. Analyze Your Data
The story begins with data analysis. You can’t fix what you don’t know. Start by analyzing your data for both schema and instance-related issues to determine the scale of data cleansing and the inconsistencies that require fixing. In most cases, you will have to conduct both manual and programmatic analysis to uncover all data quality issues.
When using an automated approach to data analysis, you may be tempted to turn to metadata to assess data quality. You will find that schema metadata alone is insufficient for measuring data quality, especially when integrity constraints are few and far between. You can still use metadata, just not only what is initially declared in the schema. Dive deeper into the data instances themselves to engineer “new” metadata based on patterns of unusual values and other data characteristics.
There are two approaches to data analysis:
- Data profiling
- Data mining
We’ve already talked about analyzing data at the instance level. This approach is called data profiling. By opting for the data profiling route, you can derive data types, value ranges, lengths, discrete values along with their frequencies, uniqueness, variance, null value occurrences, and commonly occurring string patterns, such as phone number formats. This metadata gives you a more in-depth view of the quality of the attributes in your source system.
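To make this concrete, here is a minimal profiling sketch in plain SQL. The `customers` table and its columns are hypothetical placeholders for whatever your source system contains, and function names may vary slightly by dialect (for example, `LENGTH` vs `LEN`); most profiling tools compute similar statistics for you.

```sql
-- Basic column profiling against a hypothetical customers table.
-- Table and column names are illustrative only.
SELECT
    COUNT(*)                                        AS total_rows,
    COUNT(DISTINCT email)                           AS distinct_emails,   -- uniqueness
    SUM(CASE WHEN phone IS NULL THEN 1 ELSE 0 END)  AS null_phones,       -- null occurrences
    MIN(LENGTH(phone))                              AS min_phone_length,  -- length range
    MAX(LENGTH(phone))                              AS max_phone_length
FROM customers;

-- Frequency of discrete values, useful for spotting misspellings or rogue codes.
SELECT country_code, COUNT(*) AS occurrences
FROM customers
GROUP BY country_code
ORDER BY occurrences DESC;
```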
Data mining, on the other hand, is used to identify patterns when you’re working with big data. Techniques like summarization, clustering, and association discovery are applied for this purpose. The patterns it uncovers yield integrity constraints that can be used to fill missing values in records, fix illegal values, and point out duplicates across multiple data sources.
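The mining itself usually happens in specialized tooling, but once a candidate rule has been discovered, you can verify it with plain SQL. A minimal sketch, assuming the same hypothetical `customers` table and a mined rule that every `zip_code` should map to exactly one `city`:

```sql
-- Surface zip codes that violate the mined "one city per zip_code" rule,
-- so the offending records can be reviewed or corrected.
SELECT zip_code, COUNT(DISTINCT city) AS distinct_cities
FROM customers
GROUP BY zip_code
HAVING COUNT(DISTINCT city) > 1;
```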
2. Decide on Data Transformations
During the data transformation phase, you will have to decide on the type of operations you need to perform on your data to cleanse it and attain the required data quality. In each step, you will be applying different data transformations to instance and schema-related issues in your source systems. How do you specify data transformations though?
In the early days of Extract-Transform-Load (ETL) processes for data integration, ETL tools supported several rule-based languages: users could specify rules, and the tool would generate transformation code in one of its proprietary languages. Newer versions of SQL have largely rendered this approach obsolete, paving the way for a far more flexible and adaptable way of writing transformation code.
At present, you can either write SQL code yourself to apply cleansing transformations to source data, or you can work with a data management tool that lets you visually define business rules and select data transformations, then generates the corresponding SQL for you. Such tools often have built-in data quality and data profiling modules, so you can pick quality constraints to apply to the source from a drop-down and automatically assign an error status to records that fail your defined checks. This approach sharply reduces manual analysis and coding while letting you pick data transformations from ready-made libraries. For instance, to normalize your data you can select a data source, visually connect it to the normalization rule object, and choose a destination; your integration flow is now complete. Any other transformation in your tool’s library can be applied in the same way.
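If you go the hand-written route, the transformations are ordinary SQL. The sketch below standardizes a couple of fields and flags records that fail a simple quality check; the table names, column names, and email rule are assumptions for the sake of illustration, and a visual tool would generate comparable code from the rules you configure.

```sql
-- Load a cleansed staging table from raw source data.
-- All names and the validation rule are hypothetical examples.
INSERT INTO stg_customers_clean (customer_id, full_name, email, quality_status)
SELECT
    customer_id,
    TRIM(full_name)     AS full_name,   -- strip stray whitespace
    LOWER(TRIM(email))  AS email,       -- standardize casing
    CASE
        WHEN email IS NULL OR email NOT LIKE '%_@_%._%' THEN 'ERROR: invalid email'
        ELSE 'OK'
    END                 AS quality_status  -- error status for records failing the check
FROM src_customers;
```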
3. Resolve Conflicts
When integrating data from multiple sources, you will face several schema-related conflicts. Often, this will require you to restructure individual source schemas to obtain a unified, integrated schema that works for all the different representations. To do so, you may need to perform merging, splitting, and folding/unfolding of tables and attributes. These operations may seem complex but can be performed easily if you’ve chosen the right tool to manage and cleanse your data while building your data warehouse.
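As a rough illustration, the SQL below merges two differently shaped source tables into one integrated target and unfolds repeating phone columns into rows. All table and column names are assumptions, and string concatenation syntax (`||` here) varies by dialect.

```sql
-- Merge two source schemas into a unified target table.
-- Source A keeps a single name field; source B splits first/last name.
INSERT INTO dw_customer (source_system, customer_key, full_name, phone)
SELECT 'crm_a', a.cust_id, a.name, a.phone1
FROM crm_a_customers a
UNION ALL
SELECT 'crm_b', b.id, b.first_name || ' ' || b.last_name, b.phone
FROM crm_b_customers b;

-- "Unfold" repeating phone columns (phone1..phone3) from source A into rows.
SELECT cust_id, phone1 AS phone FROM crm_a_customers WHERE phone1 IS NOT NULL
UNION ALL
SELECT cust_id, phone2 FROM crm_a_customers WHERE phone2 IS NOT NULL
UNION ALL
SELECT cust_id, phone3 FROM crm_a_customers WHERE phone3 IS NOT NULL;
```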
At the instance level, you need to apply further data transformations to deal with conflicting records. Resolve conflicts and work on eliminating record duplication across your disparate source systems. Ideally, you should have cleaned each single source before moving on to eliminating duplication across multiple systems. To do so, first identify all the records that point to the same entity, then merge them into one record that carries the attributes of all of them without redundancy. This enriches entities while minimizing data redundancy.
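Here is a minimal survivorship sketch in SQL, assuming records that share the same lower-cased email refer to the same customer. Real-world matching is usually fuzzier (name similarity, address standardization), and the table and column names below are illustrative.

```sql
-- Collapse duplicate customer records into one enriched survivor row.
-- MAX() acts as a simple "pick any non-null value" rule per attribute.
INSERT INTO dw_customer_dedup (match_key, full_name, phone, city)
SELECT
    LOWER(email)    AS match_key,
    MAX(full_name)  AS full_name,
    MAX(phone)      AS phone,
    MAX(city)       AS city
FROM dw_customer
WHERE email IS NOT NULL   -- records without an email need a different match key
GROUP BY LOWER(email);
```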
Work out a plan to scrub, clean, and validate data when building your data warehouse by getting in touch with our solution architects, and ensure your business users get accurate analytics.