Starting off in the early 90s for data warehousing, large companies that ran substantial transactions and had huge user base used Extract-Transform-Load (ETL) processes to consolidate transactional data across all their systems for reporting and analysis. Fast forward to today, and ETL still remains at the core of data warehousing, albeit with vastly expanded functionality, applications, and variants. We’re going to take a look at one such variant in this article, designed to deliver enormous efficiencies in certain data management cases by reordering the Extract-Transform-Load process to Extract, Load, and Transform (ELT).
With ETL, data integration is performed by copying data from disparate sources into a staging server, also called ETL server, where a transformation engine runs computations and operations to transform data into required styles and formats. This “transformed” data, having been prepared to conform with destination data structures, is loaded into the target data warehouse. The presence of the middle transformation step means that:
- Data takes longer to load because it is first loaded for transformation into a staging server, and then loaded again to a target repository post-transformation
- Time taken to perform complex transformations on high volumes of data quickly adds up as the amount of data to be processed increases
- Maintenance is higher because data needs to be selected and loaded to be transformed
These are the issues that are addressed effectively by reordering the ETL process.
What Happens When We Switch “L” and “T” in ETL?
Once data is extracted from source systems, it can also be loaded directly into the target data warehouse without transforming it in a staging area first. The transformations and data scrubbing can be done directly in the database, utilizing its own resources post-load. Note that, in the cycle outlined, data is being Extracted, Loaded, and Transformed. The ETL process is reordered to ELT, which brings its own set of benefits.
Both data integration architectures offer practical benefits, but IT must consider internal capabilities of target databases and the transformations a business might want to perform in those databases. Recent evolutions in available processing power such as Massively Parallel Processing (MPP) and virtual clustering have made ELT increasingly viable when large amounts of data need to be processed fast. This is the primary goal of pushing the transformation logic down to the target database, which brings us to the term push-down optimization.
What is Push-Down Optimization?
While ELT has been around for quite some time and its benefits were clear, it was still not very common because implementation required deep knowledge of ETL and ELT and specialized tools, along with extensive custom scripting. Modern ETL tools with advanced automation capabilities are changing that, with some offering a built-in Push-Down Optimization mode that allows users to choose when to use ELT and push the transformation logic down to the database engine with a click of a button. This approach, as discussed before, offers enormous performance benefits by removing data movement to and from the ETL server.
When processing large amounts of data that requires minimal transformation, ELT should be the preferred approach and choosing Push-Down mode is advisable, if your ETL tool provides it. On the other hand, when your data requires extensive conversion and transformation before it is ready to be loaded onto the target data warehouse, leveraging a separate transformation engine to perform operations pre-load might be a better idea. This approach also allows you to perform integrity checks and validations before data is loaded onto its destination database.
When to Use ELT
As discussed above, ELT delivers data directly from source to target, but this also means that it will need further processing before it becomes useful. Let’s take a look at three instances where reordering the ETL process is more beneficial:
- When speed is critical: Data is ingested much faster with ELT because there’s no need for an ETL server, and transformation and loading can happen simultaneously in the target database.
- When more raw information is better: In cases when having refined, transformed data sets are not as important as having a large volume of raw data so it can be mined for hidden patterns, ELT automatically becomes the preferred approach, speeding up the loading process at the cost of delivering raw data.
- When using high-end processing engines: Modern cloud data warehouse appliances and databases offer native support for parallel processing. This allows your ELT processes to take advantage of more processing power for greater scalability.
The flip side: you might be loading poor quality data into your data warehouse unless you make sure that you’re using ELT for replication processes with massive workloads but limited requirements of transformation, scrubbing, and validation. You will no longer have an ETL server designed to process complex transformations; instead, you will have massive amounts of data with minimal latency, available readily for analysis by data scientists. If you have both use-cases and would like to choose whether to process your data with ETL or ELT at run-time, or even if you want to process data partially through ELT and the rest with ETL, make sure you choose an ETL tool with built-in push-down mode that offers this capability.