After the initial phases of gathering requirements and conceptualizing data warehouse design, the execution phase begins. Extract-Transform-Load from source data systems to destination staging database or data warehouse. The process naturally begins with extraction, where you integrate data from source systems across the enterprise for further processing. This is one of the most intricate stages in data warehousing owing to the complexities of each source system. The extraction approach you choose will be determined primarily by your source systems and business requirements in terms of how current you need your data to be in your data warehouse.
In this article, we will look at data extraction methods that can be applied once you’ve identified your source databases.
Understanding Data Extraction Types
While you may hear terms like full extraction and incremental extractions commonly when researching extraction methods in data warehousing, the process actually begins with logically defining extraction and then physically performing the extraction. Approaches like full load or incremental load come later and will be discussed here too.
Logical Data Extraction
This is the part where you will be involved for the most part. Creating a visual integration flow is advisable when considering how you will extract data logically. It will help your developers create a physical data extraction plan.
With the logical map ready, you will need to decide on which extraction approach to take:
- Full Extraction
- Incremental Extraction
As the name says, in full extraction, data is copied from the source system in its entirety. There’s no need to consider any logical data like timestamps to be associated with source data, because you’re copying everything that the source system contains – entire tables in one go. This is the simplest data extraction approach and also the fastest in many cases.
Let’s say your source database has 500 or so records. It would probably be faster to use the SELECT and FROM database commands to copy the table. If, on the other hand, you add the WHERE clause on timestamps, extraction would take longer to start, depending on the size of the table and whether the timestamp column is indexed.
The latter approach would still be advisable if you have large tables in the source systems, with millions of records. A rule of thumb is to go with full extraction if your source table has less than 1000 records. One way to accurately determine whether to extract in full or incrementally is to do both for a smaller sample and then evaluate performance and feasibility as per your business needs.
While we discussed full load in terms of performance, there are other instances where full extraction is necessary, irrespective of how long it takes.
First, when you’re building the data warehouse and need to integrate data from all source systems for the first time, full extraction is the only possible route. You’d still have to decide whether to apply the full or incremental extraction technique in subsequent passes. Second, when you’re creating an export file for a specific table, to be used for offline physical loading.
Data is extracted in increments with this approach. This could mean extracting data that has been changed or added after a well-defined event in the source database. By well-defined event, we mean anything that can be tracked within the source system using timestamps, triggers, or a custom extraction logic that you have built within the source system.
By that definition, the event could be the change of day, the end of the fiscal year, or even the end of a discount period at your stores. This approach is ideal when you’re extracting large tables. In transactional systems, common master tables like Product and Customer contain millions of records, so it is impractical to perform full extraction every time and compare the previous extraction with the new copy to record the changed data. To perform incremental extraction, your data warehouse must support Change Data Capture.
A variety of methods exist to perform incremental extraction, but the two most common are:
If your source tables have update and create timestamp columns, and they’re indexed, you’re in luck. You can extract records based on these identification columns to keep your data as recent as possible. The only issue with this method is that you cannot identify deletions. If your source system marks records for deletion instead of removing them physically, then you don’t need to worry about this. But if direct physical removal is the case, you may have to resort to a different method for deletions to be reflected in your data warehouse.
It is the simplest incremental data extraction method if you’re lucky enough to be allowed to modify source tables and build triggers. In many cases where source tables have different owners, you won’t be able to modify the table though. But if you are, then you could implement triggers to execute with creation, updates, or deletions of records. With triggers, you typically need a ‘change table’ where primary keys or changed records are stored so you can incrementally load only these records.
While these two are the most common approaches for incremental data extraction, you can also use identity tables, transaction data, or a combination of all four methods to extract your data as per business requirements and available resources.
Physical Data Extraction
Once you’ve logically extracted your data and chosen approaches, it’s time to decide whether you want to extract data directly from the source system by connecting the extraction process to it or extract source data from a data structure like a flat file, which is then later used to load data to the destination.
Automated Data Extraction
While we have explained data extraction types and approaches in theory, modern tools exist that automate the creation of logical maps and allow you to visually choose the extraction approach for specific read and write operations in your data warehouse. Change Data Capture can also be handled automatically with such tools. If you’re unsure, you can always ask for a Live Demo of the features you’re interested in.