“The world is getting more distributed and it is never going back the other way.”
If you’ve been actively researching big data analytics, you are likely familiar with the architectures that research firms propose. They understand that there are too many systems today, spread across too many paradigms, for a purely physical architecture to be feasible. The inherent plurality of analytical source systems mandates a logical abstraction layer in the big data analytics architecture.
A typical enterprise may be using one or a whole host of systems for reporting and analytics, such as Operational Data Stores (ODS), NoSQL databases, Hadoop clusters, analytical appliances, data marts, streaming tools, in-memory data warehouses, etc. Each of these systems stores data differently and has separate access mechanisms. This means that a physical architecture is not possible without significant data conversion and transformation, triggering the shift to logical architectures for big data analytics.
Big Data and the Data Warehouse
When enterprises first recognized the need for data analysis at scale in the 90s and 2000s, the popularity of data warehouses skyrocketed. The aim was to build a single source of truth – one repository where all enterprise data from every source would be replicated. While ideal, this vision is rarely realized at the enterprise level, with data sources numbering in the hundreds and many of them unregulated.
While popular, the approach revolves around integrating data from key transactional databases and other systems – those containing the metadata that defines transactional data – into the data warehouse for analysis. With big data, though, enterprises are dealing with huge volumes of data generated too quickly to be funneled into a structured data warehouse and analyzed in a streamlined manner. This has led to the adoption of systems like NoSQL databases and Hadoop clusters.
Let’s take a look at the value of a logical architecture for big data analytics across distributed systems.
Logical Architecture – The ‘What’ and the ‘How’
Big data analytics architectures referenced in modern tech share one component: an abstraction layer. It creates a virtual, unified interface where consuming applications can access data from any source while being insulated from the technical details of how, and from where, that data is accessed. Such a component must encapsulate three critical characteristics:
- Data transformation capabilities to combine data from different sources and present it coherently while allowing smooth access to consuming applications for analytics
- A single point of access which ensures enterprises can apply data governance and security policies while maintaining necessary compliance
- Ability to insulate consuming applications from the underlying infrastructure of source systems, and provide business consumers rapid access with reduced complexity
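The three characteristics above can be sketched in a few lines of code. This is a toy illustration, not a real platform: the connector classes, field names, and merge logic are all hypothetical stand-ins for what a data virtualization product does declaratively, with governance and query optimization built in.

```python
# Hypothetical sketch of an abstraction layer: consumers query one interface
# and never touch the source systems directly.

class SourceConnector:
    """Base interface: each source hides its own access mechanism."""
    def fetch(self, entity):
        raise NotImplementedError

class WarehouseConnector(SourceConnector):
    def fetch(self, entity):
        # In practice: a SQL query against the data warehouse.
        return [{"customer_id": 1, "region": "EMEA"}]

class NoSQLConnector(SourceConnector):
    def fetch(self, entity):
        # In practice: a document-store lookup returning semi-structured data.
        return [{"customer_id": 1, "clicks": 42}]

class VirtualLayer:
    """Single point of access: one place to enforce governance and security."""
    def __init__(self, connectors):
        self.connectors = connectors

    def query(self, entity, key):
        # Combine records from every source on a shared key,
        # presenting one coherent view to the consuming application.
        combined = {}
        for connector in self.connectors:
            for row in connector.fetch(entity):
                combined.setdefault(row[key], {}).update(row)
        return list(combined.values())

layer = VirtualLayer([WarehouseConnector(), NoSQLConnector()])
print(layer.query("customers", "customer_id"))
# One merged record, regardless of where each attribute actually lives.
```

The consuming application sees a single `query` call; which systems hold the `region` and `clicks` attributes is entirely the layer’s concern.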
Here are some approaches currently available for building this unified abstraction layer into an overarching big data analytics architecture:
1. Virtual Layer with Business Intelligence Tools
Business intelligence tools have progressed rapidly, with many now offering analytics across disparate data sources. However, query optimization becomes a serious issue with this approach. To perform a distributed query, a business intelligence application must load all data from the relevant source systems and then perform the joins itself to return results. The repercussions are extreme with big data, where queries can require the BI application to load billions of rows, and your network will suffer under the throughput that such distributed queries demand.
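A toy calculation makes the throughput problem concrete. The row counts below are hypothetical, but the contrast holds at any scale: a client-side join transfers every row before joining, while a layer that pushes filters down to the sources transfers only the rows that matter.

```python
# Hypothetical data sizes illustrating client-side joins vs. pushdown.
orders = [{"order_id": i, "customer_id": i % 1000, "amount": 10.0}
          for i in range(100_000)]
customers = [{"customer_id": i, "region": "EMEA" if i % 2 else "APAC"}
             for i in range(1000)]

# Client-side join (the BI-tool approach): pull everything over the
# network, then join locally.
rows_transferred_naive = len(orders) + len(customers)

# Pushdown: the sources filter to one region before anything crosses
# the network; only matching rows are transferred.
emea_ids = {c["customer_id"] for c in customers if c["region"] == "EMEA"}
emea_orders = [o for o in orders if o["customer_id"] in emea_ids]
rows_transferred_pushdown = len(emea_orders) + len(emea_ids)

print(rows_transferred_naive, rows_transferred_pushdown)
```

Even in this small example the naive approach moves roughly twice as many rows; with billions of rows in real source systems, the gap is what saturates the network.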
The other issue is that other consuming applications cannot access a virtual abstraction layer created inside a BI tool, effectively creating tool lock-in in your big data analytics architecture. These are two major reasons BI tools are best used for reporting, not integration.
2. Virtual Layer with Enterprise Service Bus (ESB)
One way to tap into data from different sources in a unified manner is to create data services and publish them. If you’re considering this route, you’ve probably thought of using an ESB to accomplish this, seeing as the technology has become quite popular for creating service layers.
Issues arise from how ESBs work: they depend on procedural workflows, in which processes for data manipulation are defined step by step. If you run a query through an ESB, a matching process must already be defined in the ESB – otherwise, the query simply fails. The same issue applies to query optimization, since manually defined workflows cannot account for every possible execution strategy and case.
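The limitation can be sketched as follows. The workflow registry and step functions here are hypothetical, but they capture the core constraint: an ESB executes only the fixed sequences someone has already wired up, with no ad-hoc query planning.

```python
# Hypothetical sketch of an ESB's procedural-workflow model.

class ESB:
    def __init__(self):
        self.workflows = {}

    def register(self, name, steps):
        # Each workflow is a fixed, manually designed sequence of steps.
        self.workflows[name] = steps

    def run(self, name, payload):
        if name not in self.workflows:
            # No ad-hoc planning: an undefined request simply fails.
            raise LookupError(f"no workflow registered for '{name}'")
        for step in self.workflows[name]:
            payload = step(payload)
        return payload

bus = ESB()
bus.register("customer_report", [
    lambda rows: [r for r in rows if r["active"]],        # step 1: filter
    lambda rows: sorted(rows, key=lambda r: r["name"]),   # step 2: sort
])

data = [{"name": "Zed", "active": True}, {"name": "Amy", "active": False}]
print(bus.run("customer_report", data))  # works: workflow was predefined
# bus.run("ad_hoc_join", data)           # would raise LookupError
```

A data virtualization layer, by contrast, plans execution per query rather than replaying a hand-built sequence.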
As with BI tools, ESBs have their place, but it is not in creating a virtual layer for analytics.
3. Data Virtualization at the Data Warehouse
A select few data warehouse design vendors have understood the value of a virtual layer as the basis for a logical architecture for comprehensive, enterprise-wide analytics. Data virtualization complements traditional data warehouses with unstructured and semi-structured data sources, including big data lakes. It lets you create a unified, virtual layer that can be configured as a single point of access while insulating users from the underlying complexity of reaching source data.
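As a minimal sketch of that idea, the snippet below joins a relational stand-in for the warehouse with semi-structured JSON documents behind one function. The table, fields, and documents are hypothetical; a real data virtualization platform would expose such a view declaratively rather than in application code.

```python
# Hypothetical "virtual view" over a relational source and a JSON source.
import json
import sqlite3

# Structured side: a small in-memory stand-in for the data warehouse.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120.0), (2, 75.5)])

# Semi-structured side: documents, as from a data lake or NoSQL store.
docs = json.loads('[{"customer_id": 1, "segment": "retail"},'
                  ' {"customer_id": 2, "segment": "wholesale"}]')
segments = {d["customer_id"]: d["segment"] for d in docs}

def sales_by_segment():
    """The virtual view: one shape for consumers, two sources underneath."""
    rows = db.execute(
        "SELECT customer_id, amount FROM sales ORDER BY customer_id"
    ).fetchall()
    return [{"segment": segments[cid], "amount": amt} for cid, amt in rows]

print(sales_by_segment())
```

Consumers call one view and never learn that `segment` lives outside the warehouse, which is exactly the insulation the logical architecture promises.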
Depending on the data warehouse automation tool you choose, you may find the available data sources limited and query optimization still lacking, but this domain is developing rapidly. Gartner predicts that, by 2020, 35% of all enterprises will be using data virtualization as their primary approach to data integration, potentially making this the most future-proof route to a logical architecture for your big data analytics.
Consult with our solution architects for free and learn how to use data virtualization with your data warehouse.