At its core, data virtualization falls within the domain of data integration. But unlike traditional data integration, where Extract-Transform-Load (ETL) processes physically move copies of data from disparate sources into a single, unified data warehouse, data virtualization involves no physical movement of data. Source data remains where it is, and no additional copies are created. Instead, different views or snapshots of enterprise-wide data are provided through a virtualized service layer built on top of the disparate sources. In other words, data is accessed from where it resides.
This virtual layer provides a needed level of abstraction, hiding the complexity of data access mechanisms, such as what type of system is being used to store data, what APIs are used to connect with source systems, and where exactly the data is stored. The end-user can simply run ad-hoc queries on the data virtualization layer without bothering themselves with the technical aspects of accessing required data.
With this approach, organizations no longer need to pay for resources to physically move and store copies of data, resulting in significant savings. As an architectural layer deployed above the organization’s source systems, data virtualization also offers the additional benefit of managing access, privacy, security, and governance protocols from a single, virtual point of control. Regulatory requirements governing the handling of data, such as the General Data Protection Regulation (GDPR), become easier to meet when data governance is managed through the data virtualization layer. This is one of the biggest reasons data virtualization is fast becoming mainstream today, with 56% of all respondents in a 2017 Forrester survey of global technology decision makers reporting that they’re currently using data virtualization, up from 45% in 2016.
How Data Virtualization is Being Used Today
The rapid rise of data virtualization has birthed a variety of business initiatives driven by the technology. Business tech leaders are using data virtualization today to promote real-time data sharing, provide a 360-degree view of enterprise data, control access to sensitive data through a centralized point, provide a self-service platform to both business and technical users, and complement their data warehouse design and enhance BI processes – to name a few uses.
Data Virtualization in the Modern Data Warehouse
Organizations typically decide to build a data warehouse when the data is fragmented and too many data silos exist. Each silo or data location has its own method of access, and with that comes the inability of business users to access the information they need, when they need it. Data virtualization makes the data location-agnostic, meaning that you don’t have to worry about where it resides and the various access mechanisms behind it. You simply need to access the virtual layer, pull the data you need within that layer, and drill down as deep as you need to while remaining insulated from the details of the underlying data.
Your data virtualization software takes care of temporarily pulling the required data onto the virtual layer, performing joins as needed to present it to you, and caching this data if you need it for later use. The simplicity of data virtualization lies in how it makes it look like you’re retrieving data from a single source, while the software is, in reality, pulling it from a number of disparate sources on the backend. The best part is that you don’t have to worry about the “how” of dealing with complex systems and source connections.
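To make the idea concrete, here is a minimal Python sketch of a virtual layer, not how any particular data virtualization product works. Everything in it is hypothetical: the `VirtualLayer` class, the `customers` table, and the CSV export are stand-ins for real source systems. It shows the three steps described above: pulling data from two different kinds of sources on demand, joining it, and caching the result for later use.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: an operational database (in-memory SQLite here).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Hypothetical source 2: a CSV export from a separate order system.
ORDERS_CSV = "customer_id,amount\n1,100.0\n1,50.0\n2,75.0\n"


class VirtualLayer:
    """Presents two unrelated sources as one queryable view (illustrative only)."""

    def __init__(self, crm_conn, orders_csv):
        self._crm = crm_conn
        self._orders_csv = orders_csv
        self._cache = {}
        self.source_reads = 0  # counts how often the sources are actually hit

    def revenue_by_customer(self):
        # Serve from the cache when the view was already materialized.
        if "revenue_by_customer" in self._cache:
            return self._cache["revenue_by_customer"]

        self.source_reads += 1
        # Pull customer names from the database source.
        names = dict(self._crm.execute("SELECT id, name FROM customers"))
        # Pull order amounts from the CSV source and total them per customer.
        totals = {}
        for row in csv.DictReader(io.StringIO(self._orders_csv)):
            cid = int(row["customer_id"])
            totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
        # Join the two sources on the customer id.
        result = {names[cid]: total for cid, total in totals.items()}
        self._cache["revenue_by_customer"] = result
        return result


layer = VirtualLayer(crm, ORDERS_CSV)
print(layer.revenue_by_customer())
print(layer.revenue_by_customer())  # second call is answered from the cache
```

The caller only ever sees `revenue_by_customer()`. Whether the data came from a database, a file, or the cache stays hidden behind the layer, which is the abstraction the text describes.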
Let’s take a look at a few use-cases where data virtualization works in tandem with data warehousing to support and/or complement business intelligence.
Access External Data: A BI Leadership Forum survey found that 77% of respondents use data virtualization to access data not available in the data warehouse. This external data could be clickstream data, data from IoT devices, data from public websites or subscription services, or simply a new enterprise data source that was built after the data warehouse had already been created.
Augment Data Warehouse with Real-time Data: Data warehouses are, by nature, built to store historical data for reporting and analysis. While technologies like Change Data Capture enable a business to record data updates and reflect them in a data warehouse, the database still won’t always have real-time data. Data virtualization software can be used to query real-time data from operational databases or any other source, join it with relevant historical data on-the-fly from your data warehouse, and display a complete picture of the required perspective.
Prototype Your Data Applications: A virtual layer allows you to experiment with different sets of data and prototype the end result before moving the data to a physical store, such as a data warehouse. This is a common use of data virtualization technology, allowing developers to test data-driven applications or even data warehouse source systems when deciding on the architecture of the system.
Use It as a Source for ETL: When populating the data warehouse, organizations often use a staging database to bring together data from disparate systems. This can be done using a virtual layer, where you pull data from different sources, test it, perform any operations on it, and then use the data virtualization layer as a single source for your ETL tool rather than connecting it to a variety of source systems separately and figuring out access mechanisms for each.
Build a Logical Data Warehouse: In organizations with a decentralized structure, each department operates with a significant degree of autonomy and has its own departmental data warehouse or data mart. Data virtualization software can be used to build a virtual layer that spans all of these departmental data marts and forms a logical enterprise data warehouse for decision-making. This enables users to dynamically query data across every data store in the organization, without having to build a physical, consolidated enterprise data storage system.
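The logical data warehouse idea can be sketched on a very small scale with SQLite, which lets one connection attach several independent databases and query across them. This is only an illustration under stated assumptions: the `sales_mart` and `support_mart` databases, their tables, and the `customer_360` view are all made up for the example, and a real data virtualization product would federate far more varied sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each ATTACH creates a separate in-memory database, standing in for an
# independently owned departmental data mart (hypothetical names).
conn.execute("ATTACH DATABASE ':memory:' AS sales_mart")
conn.execute("ATTACH DATABASE ':memory:' AS support_mart")

conn.execute("CREATE TABLE sales_mart.orders (customer TEXT, amount REAL)")
conn.execute("CREATE TABLE support_mart.tickets (customer TEXT, open_tickets INTEGER)")
conn.executemany(
    "INSERT INTO sales_mart.orders VALUES (?, ?)",
    [("Acme", 100.0), ("Acme", 50.0), ("Globex", 75.0)],
)
conn.executemany(
    "INSERT INTO support_mart.tickets VALUES (?, ?)",
    [("Acme", 2), ("Globex", 0)],
)

# The view is the "logical warehouse": it stores no data of its own and is
# evaluated against the marts at query time. It is a TEMP view because SQLite
# only allows temporary views to reference tables in other attached databases.
conn.execute(
    """
    CREATE TEMP VIEW customer_360 AS
    SELECT o.customer, SUM(o.amount) AS revenue, t.open_tickets
    FROM sales_mart.orders AS o
    JOIN support_mart.tickets AS t ON t.customer = o.customer
    GROUP BY o.customer, t.open_tickets
    """
)

rows = conn.execute(
    "SELECT customer, revenue, open_tickets FROM customer_360 ORDER BY customer"
).fetchall()
print(rows)
```

Because the view is resolved when it is queried, a row inserted into either mart afterwards shows up in `customer_360` without any copy or reload step, which is the same property that lets a virtual layer combine fresh operational data with historical warehouse data.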
ETL vs. Data Virtualization
Data virtualization does not seek to replace ETL. Rather, it complements the tried-and-true methodology by making data integration more agile. In certain cases, the physical movement of data is, in fact, more beneficial for a business, and that’s where ETL shines. But the same data may need to be integrated temporarily with external applications to provide a 360-degree view, and that’s where data virtualization can complement ETL. That’s just one example among hundreds of potential use-cases. The key to benefiting from both technologies is to use data warehouse automation software that gives you the flexibility to use either ETL or data virtualization while making data warehousing easy enough for non-developers to benefit from it too.