Date Virtualization for Agile Data Warehousing

Enterprises, owing to their operations in diverse global markets, can no longer depend on traditional data warehouse architectures to fulfill their Business Intelligence (BI) requirements. This is because of the redundant and inflexible nature of ETL processes, which are unable to support the new BI specifications, like investigative analytics, agile reporting and analyses, and self-service BI. Over the years, new methods and advanced technology have been introduced to lay down the foundation for agile data warehousing.

An agile data warehouse follows a logical architecture to establish a system for BI and analytics. It presents data from different sources as a single, coherent database and hides the details of the source data, such as structure, location, size, and others, from the consuming applications and users.

This simplicity is only possible by decoupling consumers from the data storage areas as depicted in the figure below.

Taking the complexity of the architecture into consideration, there is no single method that can transform a physical data warehouse architecture into a logical one. Several tools and techniques, like master data management, metadata management, distributed processing, and data virtualization can be used to build logical architecture for your data warehouse. In this blog, we will focus on data virtualization and why this technology is critical for achieving an agile data warehouse framework.

What is Data Virtualization?

The data virtualization technology allows applications to retrieve and manipulate data without delving into data-related complexities. This is accomplished by adding a virtual layer to the existing enterprise data architecture, which hides the back-end data processes through front-end user applications, like dashboards and analytics tools.

Through data virtualization, the data rests at one place and is not moved or replicated to another data repository. This allows delivering real-time updates to the information systems, enabling a better analytics experience for robust decision making.

The technology helps combine data from disparate sources like data warehouses, data lakes, NoSQL databases, and legacy systems without investing in a separate physical layer infrastructure for data integration. These features make it an agile and cost-effective method of making information easily accessible across dispersed business units of an enterprise.

Data virtualization serves as an alternative to ETL and traditional data warehousing since it utilizes a virtual layer for accessing and consolidating data from multiple sources. It can deliver an effective system without investing extensive resources and time into performing a data warehouse optimization project. In addition, it can be extended to accommodate or replace both, ETL and ELT processes.

Does Data Virtualization Deliver on the Performance Standards of an Agile Data Warehouse?

Since lengthy and resource-intensive ETL processes are run numerous times to prepare data for analyses and reporting, the traditional data warehouse systems generally fail to meet BI requirements of large-scale businesses. As a result, companies can opt for a hybrid approach, comprising of data virtualization tools to achieve an agile data warehouse design.

Let’s discuss how data virtualization for data warehousing can help in speeding up various data processes for accurate BI and analytics.

1. Intelligent query optimization and parallel processing

The legacy concept of first retrieving data and then applying transformations is not followed here. As an alternative, data virtualization works on the concept of ‘smart data retrieval’. It only retrieves data relevant to the query made by users or back-end systems. So, instead of moving huge amounts of data across the network, only the relevant data is called for further processing after executing the query.

The SQL query optimizer component of a DV system divides the query into sub-queries, which are executed in parallel. Each sub-query is responsible for a dedicated data source. Since most user queries are simple, it generally doesn’t take long for them to execute and combine the results. While it can take up relatively more time for complex queries, it is still faster than the traditional ETL methods.

2. Query pushdown and parallel processing

Even with queries requiring a large amount of data retrieval, data virtualization ensures fast, optimized resultant datasets. Some of the key query pushdown features implemented in DV platforms are:

Full aggregation pushdown: Filtering and retrieving data from the source at a concise aggregation level.
Partial aggregation pushdown: Once the data is retrieved and filtered, the necessary joins and aggregation queries execute.

Let’s understand this through an example.

Assume that a car manufacturer is selling 2,000 products per month. For each product, the following details are entered in a MySQL database:

A primary key
Car name
Car category
Price

For this information, the sales department stores the following facts in a data warehouse:

A primary key
Sale date
Sale price

The data warehouse, consisting of over a million rows, stores information of the past one year. The rest of the data is transferred to a Hadoop-based data repository, which has the same sales material as the data warehouse.

Suppose, the BI team wants to extract a report that contains details regarding the car sales of the past two years. This requires:

Data of the past year from the data warehouse
Older data from the Hadoop-based hive

Note that the final report required for analysis will consist of just 2,000 rows. In a typical scenario, the query will be pushed down to both the data sources, returning a combined dataset to satisfy the query conditions. However, the number of rows in the resultant database will be in hundred-thousands, compiling and filtering information from which will be a time-intensive process.

On the other hand, working with an agile data warehouse system, which incorporates a data virtualization platform in its architecture, will execute this query in the following manner:

The query is divided into two sub-queries:
- One is pushed down to the data warehouse to compute last year’s result
- The other query executes in parallel after being pushed down to the Hadoop-based hive
Both the queries return a single row for each product, creating a database of a total of 2,000 rows from each data source).
The data virtualization layer simply adds these partial results to obtain the desired output.
Simultaneously, the DV platform also queries the database products table to integrate the additional product details with the result before displaying it as an output.

In most DV platforms, these techniques are automatically activated as the need for them arises.

3. Database Caching

Database caching is a technique used to store frequently accessed data in-memory. So, when the same information is accessed again, it can be directly retrieved from the cache instead of the data source. Caching serves many purposes within a DV platform, such as improving query performance for extracting data, especially when dealing with data sources of diverse latencies or reducing the unnecessary data duplication costs.

Although in contradiction with the standard data virtualization principles, you can activate caching in an agile data warehouse for faster data processing. This option is available because in some scenarios, like a slow query execution process, it is the only solution. In such cases, you can revert to cached data instead of going through the whole process of data replication from individual sources every time.

While it can be a laborious task if used inappropriately, intelligent caching, however, can greatly help in enhancing query execution performance. For example, if your business requires you to frequently access data on the cloud, e.g. SaaS, caching some of it will result in fast query results. Cached data, however, can be outdated if not updated in a timely manner. Since the whole idea behind data virtualization is to access data in real-time, you may have to schedule database caching to keep it up-to-date with the fresh data. You can refresh the cached data at set intervals, or to increase performance value, you can just update the modified data warehouse items in the cache.

Benefits of Adding a Virtual Layer to an Enterprise Data Warehouse

In addition to improved performance for your data warehouse, data virtualization offers several benefits, such as:

Enhanced security: Hides the intricate data details from unauthorized users.
Hidden data complexity: Provides a unified presentation of data upfront.
Varied data source integration: Combines data from diverse sources, like NoSQL databases or data lakes.
Improved business intelligence: Enhances data access in near real-time leading to faster analytics and improved decision making.
Query optimization: Streamlines queries for different types of data requests.
Real-time information access: Makes up-to-date data readily available for analytics and reporting.
Integration of diverse data sources: Enables users to access data from cloud and database sources in an agile manner.

To hold the reins of the ever-increasing data complexity, data integration techniques like data virtualization have become a necessity. Enterprises can retrieve data from disparate sources in a timely manner and make accessing data for front-end applications easier for users through a single, logical point. With the help of a DV tool, you can induce agility in your data warehouse environment by leveraging the query performance, database caching and other capabilities, making them ideal for effective business analytics.

Want to incorporate a reliable data virtualization solution for agile data warehousing? Our experts can help you out!