Integrating, reorganizing, and consolidating large amounts of data from a variety of different sources is a key consideration when planning your data warehouse architecture. Extract-Transform-Load (ETL) processes extract, clean, transform, and load data from source systems, bringing it all together into a unified source of information for business intelligence.
A key design concept, ETL is at the core of data warehouse architecture. It ensures that all the processes connect seamlessly and data continues to flow as defined by the business, shaping and modifying itself where and when needed according to your workflow.
Here are five things you should do when considering data warehouse architecture from an ETL perspective:
- Determine business requirements
- Understand and profile your data sources
- Determine data extraction methods
- Establish data transformation requirements
- Decide how you will orchestrate the ETL process
Let’s take a look at a typical data warehouse environment to understand the basic architecture and delve deeper into these five steps:
Data warehouse architecture diagram
Determine Business Requirements
When deciding on your data warehouse architecture, you must ensure that the output of your data warehouse aligns perfectly with organizational goals. This step is critical as it can make or break the success of your business intelligence initiative.
Figure out what your business users and stakeholders expect to achieve from the data warehouse and understand the needs of each specific group of users. Ask them to clearly outline the ‘why’ so you can filter and prioritize data warehouse requirements, determine the source systems needed to fulfill those requirements, and think about how and when this data will be consumed by the enterprise data warehouse.
Ask such questions, and answer them, from as many perspectives and in as much depth as possible. The answers will determine how you architect the solution and how you perform ETL when building the data warehouse.
Understand and Profile Your Data Sources
In the architectural diagram above, you can see a list of typical data sources on the left. Data warehouses are naturally resistant to structural change, so source systems must be carefully analyzed and chosen during data warehouse development. According to TDWI, it takes 7.1 weeks on average to add a new data warehouse source after the system has been built.
Profile each data source based on its type. For instance, you could begin by listing your production databases, such as MS SQL or PostgreSQL; SaaS applications for sales and marketing, like HubSpot or Google AdWords; customer support sources like ZenDesk; ecommerce sources like Shopify and Stripe; legacy sources like COBOL copybooks and IBM mainframes; and unstructured report sources like PDFs and Word files.
Identifying the sources will allow you to prioritize better and think about how data from each of your sources must be extracted. You cannot perform ETL without understanding source data.
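Profiling can start simply: pull a sample of records from each source and measure per-column null rates and cardinality. A minimal sketch in Python (the sample records and field names below are invented for illustration):

```python
def profile_rows(rows):
    """Summarize a sample of source records: per-column null rate and
    distinct-value count, a first pass at understanding a source's shape."""
    total = len(rows)
    columns = {key for row in rows for key in row}
    profile = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        nulls = sum(1 for v in values if v is None)
        distinct = len({v for v in values if v is not None})
        profile[col] = {
            "null_rate": nulls / total if total else 0.0,
            "distinct": distinct,
        }
    return profile

# Hypothetical sample pulled from a CRM export.
sample = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
]
profile = profile_rows(sample)
```

A high null rate or unexpectedly low cardinality in a column is an early warning that the source needs cleansing rules before it can feed the warehouse.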
Determine Data Extraction Methods
With your data sources listed, think about the unique extraction challenges each one poses. Traditionally, ETL-based extraction was associated with transactional databases, but enterprises increasingly rely on SaaS applications while also moving from paper to digital reports. All of this data must be fed into the data warehouse if it can help with decision making. This means business intelligence teams must think about how to:
- extract data from unstructured sources, using report mining tools to convert it into structured formats
- perform API-based integration to extract data from SaaS applications
- integrate with legacy systems, such as COBOL, and extract data from copybooks
- determine the extraction method for regular relational databases
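For SaaS applications, API-based extraction usually means draining a paginated endpoint until the vendor stops returning a continuation cursor. The page shape and cursor names differ per vendor, so the fake client below is purely illustrative of the pattern, not any specific API:

```python
def fetch_all_pages(fetch_page):
    """Drain a cursor-paginated API: fetch_page(cursor) must return
    (records, next_cursor), with next_cursor=None on the last page."""
    records, cursor = [], None
    while True:
        batch, cursor = fetch_page(cursor)
        records.extend(batch)
        if cursor is None:
            return records

# Fake two-page "API" standing in for a real HTTP client.
PAGES = {None: ([1, 2], "page2"), "page2": ([3], None)}

def fake_fetch(cursor):
    return PAGES[cursor]

extracted = fetch_all_pages(fake_fetch)
```

Separating the paging loop from the transport like this also makes the extractor easy to test without hitting the vendor's rate limits.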
In addition to the extraction method, you must also devise an extraction strategy before and after the system is in place. The data warehouse design should accommodate both full and incremental data extraction. When data is being loaded for the first time, full extraction is needed, but after that, you can use incremental data extraction techniques like Change Data Capture (CDC) to regularly update only records that have been modified.
While considering data extraction, determine whether the extracted data needs to be copied to a staging database first (as seen in the diagram above), or to the data warehouse directly. It’s not always possible to extract all required data at the exact same time. Reasons could include varying business cycles, geographical factors, limitations of processing resources, etc.
Financial data, for example, often requires reconciliation at the end of the month before it makes sense to end users, while sales data could be extracted and loaded daily. In such cases, a business can consolidate data in its staging database and then load it into the data warehouse at a pre-specified time and frequency using its data warehouse tool’s workflow orchestration capabilities.
Establish Data Transformation Requirements
Data transformation is closely tied to the extraction stage: extracted data usually needs to be converted to conform to the standard schema the data warehouse uses for storage. Restructuring data and cleansing it to fix inconsistencies are the key concerns in the transformation phase of ETL in the data warehouse architecture.
The amount of work this requires necessitates a degree of automation. When considering your data warehouse design, think about the various ways you’d need to validate, clean, and convert source data to transform it into the finished product for loading into the data warehouse. As seen in the architectural diagram, source data undergoes a number of transformations at several stages, which must be predefined in your data warehouse workflow.
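A transformation step typically applies a fixed set of validation and conversion rules to each source record so it conforms to the warehouse schema. A minimal sketch (the field names and rules here are invented examples, not from any particular schema):

```python
def transform_customer(raw):
    """Conform a raw source record to a target warehouse schema:
    cast types, normalize text, and standardize date formatting."""
    return {
        "customer_id": int(raw["CustID"]),          # string key -> integer surrogate
        "country": raw["Country"].strip().upper(),  # trim whitespace, standardize case
        "signup_date": raw["SignupDate"].replace("/", "-"),  # unify date separators
    }

clean = transform_customer(
    {"CustID": "42", "Country": " us ", "SignupDate": "2021/03/01"}
)
```

In practice these rules are defined once in the warehouse workflow and applied automatically to every incoming record, which is the degree of automation the article calls for.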
Decide How You Will Orchestrate the ETL Process
We have talked about the different phases of building an ETL architecture for your data warehouse, but it all boils down to how you orchestrate each phase and develop the functionality needed to do so. Your data warehouse architecture design is not complete until you figure out how to piece all the components together and ensure that data is delivered to end-users reliably and accurately.
Choose a data warehouse automation tool with built-in job scheduling, data quality, lineage analysis, and monitoring features so you can orchestrate the ETL process easily. This ensures that ETL plays its role correctly in your data warehouse architecture. Better still, choose a solution that provides end-to-end automation, letting you visually model your data warehouse and orchestrate integration flows while the associated ETL code is generated automatically in the background.
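Conceptually, orchestration means running extract, transform, and load as ordered steps where each step can see its predecessors' output. Production tools such as Apache Airflow or the automation tools described above handle scheduling, retries, and monitoring on top of this; the toy runner below only illustrates the core idea of ordered, dependent steps:

```python
def run_pipeline(steps):
    """Run named ETL steps in declared order; each step receives a dict
    of the results produced by the steps before it."""
    results = {}
    for name, func in steps:
        results[name] = func(results)
    return results

# Illustrative three-step pipeline: extract raw values, sort them, load (count) them.
pipeline = [
    ("extract", lambda r: [3, 1, 2]),
    ("transform", lambda r: sorted(r["extract"])),
    ("load", lambda r: len(r["transform"])),
]
out = run_pipeline(pipeline)
```

A real orchestrator adds what this sketch lacks: schedules, failure handling, lineage, and alerting, which is exactly why the article recommends a tool with those features built in.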