An enterprise uses an average of 928 applications, along with many other on-premises systems. That means you could have roughly a thousand source systems from different vendors, each storing data differently. When data is scattered across so many systems throughout the enterprise, how do you make sense of it? If you know even a little about the data management realm, you already know that data integration is the answer. The real question is: which type of data integration is best suited to your organization's business intelligence needs?
Let's take a look at four common data integration approaches, the pros and cons of each, and the scenarios where each works best.
Export and Import Data
The simplest and earliest data integration approach: export data from a source system into a file, then import that file into your target system. Let's say you need to bring your marketing campaign data into your sales application. You could export data from individual campaigns as a CSV file and import it into the sales application manually. The other option is to develop a custom program that automatically exports data from specified campaigns and imports it at a pre-configured time. Either way, there's a lot of work involved: running the export/import by hand with the former option, or doing custom development with the latter.
Even in the manual case, your source and target may have different fields. Perhaps your marketing system has separate FirstName and LastName fields while the sales app only has a FullName field. You'd need to transform the data to make it consistent before importing. Another major limitation is that you can only export and import data between two systems at a time; in enterprise environments, you could be required to integrate data from hundreds of applications.
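A field mismatch like the one above can be handled with a small transformation step before the import. Here is a minimal sketch in Python; the field names (FirstName, LastName, Email, FullName) are hypothetical stand-ins for whatever your marketing and sales systems actually use:

```python
import csv
import io

def transform_campaign_export(csv_text):
    """Reshape a marketing CSV export so its separate FirstName/LastName
    columns become the single FullName column a (hypothetical) sales app
    expects on import."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["FullName", "Email"])
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "FullName": f"{row['FirstName']} {row['LastName']}".strip(),
            "Email": row["Email"],
        })
    return out.getvalue()

# Example export file contents from the marketing system.
export = "FirstName,LastName,Email\nAda,Lovelace,ada@example.com\n"
print(transform_campaign_export(export))
```

In practice you would read and write real files rather than strings, but the reshaping logic is the same.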
Extract, Transform, Load (ETL)

Once you start thinking about data integration at scale, ETL becomes a viable option, one that has been around for decades thanks to its utility and scalability. As the abbreviation suggests, the ETL process revolves around Extracting the desired data from the source system, Transforming it to blend and convert it into a consistent format, and finally Loading it into the target system. The entire process is largely automated: modern tools offer a workflow designer where you can specify the source and destination systems, define the transformations and business rules to apply, and configure how data is read and written. A workflow can include multiple integrations from a variety of source systems. Once it's built, you execute the workflow and the ETL jobs run behind the scenes.
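The three stages can be sketched as plain functions chained together. This is an illustrative toy, not what an ETL tool generates; the record fields and the "active" filter are assumptions for the example:

```python
def extract(source_rows):
    # Extract: pull only the records of interest from the source system.
    return [r for r in source_rows if r.get("status") == "active"]

def transform(rows):
    # Transform: blend and convert into a consistent target format.
    return [
        {"full_name": f"{r['first']} {r['last']}", "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, target):
    # Load: append to the target system (a list stands in for a warehouse table).
    target.extend(rows)
    return target

source = [
    {"status": "active", "first": "Ada", "last": "Lovelace", "amount": "120.50"},
    {"status": "inactive", "first": "Alan", "last": "Turing", "amount": "99.00"},
]
warehouse = load(transform(extract(source)), [])
print(warehouse)
```

A visual ETL tool lets you compose many such extract-transform-load chains, from many sources, into one workflow.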
While ETL does have its own set of challenges, many of them are not well understood. One major misconception is that ETL is only practical when integrating data between two systems. In reality, it depends on how you're creating the ETL code. If you're hand-coding it, integrating multiple systems becomes a chore, but a tool that lets you build integration flows visually can handle as much complexity as you need through drag-and-drop operations. Take your workflows further and you could even build an Enterprise Data Warehouse or data marts by thinking of your integration flows on a macro level.
Another ETL misconception is that it only loads data in batches on fixed hourly, daily, or weekly schedules. Today's data integration software offers advanced Change Data Capture (CDC) features that enable real-time data integration, so your business intelligence platforms stay up to date and support decision making on current data.
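To make the idea of CDC concrete, here is a naive sketch that detects changes by diffing two keyed snapshots of a table. Real CDC implementations typically read the database's transaction log instead of comparing snapshots, so treat this purely as an illustration of what "capturing changes" means:

```python
def capture_changes(old_snapshot, new_snapshot):
    """Naive change detection between two {key: row} snapshots.
    Returns a list of (operation, key, row) change events."""
    changes = []
    for key, row in new_snapshot.items():
        if key not in old_snapshot:
            changes.append(("insert", key, row))      # new record appeared
        elif old_snapshot[key] != row:
            changes.append(("update", key, row))      # existing record changed
    for key, row in old_snapshot.items():
        if key not in new_snapshot:
            changes.append(("delete", key, row))      # record disappeared
    return changes

old = {1: {"name": "Ada"}, 2: {"name": "Alan"}}
new = {1: {"name": "Ada L."}, 3: {"name": "Grace"}}
print(capture_changes(old, new))
```

Only the change events, not full table reloads, are then pushed to the target, which is what makes near-real-time loading feasible.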
Point-to-Point Integration

While ETL has been around since the 1970s, point-to-point integrations remained popular until the 2000s, when the growing number of enterprise applications made the approach unsustainable. If you want all your enterprise applications to communicate with each other, you'd need to build (n*(n-1))/2 bi-directional point-to-point integrations, where 'n' is the number of enterprise applications.
Let's use the average of 928 applications mentioned at the start to put this in perspective. If your enterprise has 928 applications and each needs to communicate with every other, you'd need to develop code for 430,128 point-to-point connections. That is clearly impractical, even more so when you account for maintenance: with each new application release, you'd need to regression-test and fix issues, locking you into an endless cycle of testing and maintenance.
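The arithmetic above is just the pairwise-combinations formula applied to the application count:

```python
def point_to_point_connections(n):
    """Bi-directional links needed so that n applications can all talk
    to each other: n*(n-1)/2 (each unordered pair needs one link)."""
    return n * (n - 1) // 2

print(point_to_point_connections(928))  # 430128
```

Even doubling to 20 applications from 10 quadruples the link count (45 to 190), which is why the approach stops scaling long before 928 systems.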
Just as ETL has evolved, so have point-to-point integrations, carving out their own place in the data integration space with modern variants such as Enterprise Application Integration (EAI), which uses an Enterprise Service Bus (ESB) as the solution. This model centers on a hub-and-spoke approach to building point-to-point integrations. ESB software offers a pre-built environment for rapid development of point-to-point connections across the enterprise, along with the ability to define transformations, error handling, and performance metrics within that same environment. The result is an integrated layer of services, with business intelligence applications invoking services from that layer. This has made point-to-point integration viable again for complex integrations, though it still requires IT involvement.
Data Virtualization

The data virtualization approach is becoming increasingly popular because it eliminates physical data movement altogether. Data stays where it is created, in the source systems, and queries run against a virtualization layer that insulates users from the technical details of accessing the data. Queries can come from your reporting application or any business intelligence system that retrieves data, blends it, and displays the results to users.
To the connecting applications, the virtualization layer looks like a single, consolidated database, even though the data is actually retrieved from different source systems. That's why the data virtualization layer is also referred to as a 'logical data warehouse'. Modern data virtualization software also supports caching, so results from repeated queries on the same data can be reused, reducing query time and load on the source systems.
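The core idea, a single query facade over live source systems plus a result cache, can be sketched in a few lines. This toy uses an in-memory SQLite database to stand in for one source system; real virtualization layers federate many heterogeneous sources and handle query planning, which this sketch deliberately omits:

```python
import sqlite3

class VirtualLayer:
    """Toy data-virtualization facade: data stays in the source systems,
    and repeated identical queries are served from a cache."""

    def __init__(self, sources):
        self.sources = sources  # name -> live connection to a source system
        self.cache = {}         # (source, sql) -> cached result rows

    def query(self, source_name, sql):
        key = (source_name, sql)
        if key not in self.cache:
            # Cache miss: run the query against the real source system.
            cursor = self.sources[source_name].execute(sql)
            self.cache[key] = cursor.fetchall()
        return self.cache[key]

# One "source system": an in-memory CRM database.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (name TEXT)")
crm.execute("INSERT INTO customers VALUES ('Ada')")

layer = VirtualLayer({"crm": crm})
print(layer.query("crm", "SELECT name FROM customers"))  # [('Ada',)]
```

A second identical call returns the cached rows without touching the source, which is the time-and-effort saving the caching feature provides.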
The primary benefit of this approach is that you don't need a separate database to consolidate data from disparate sources, nor do you need to apply complex transformations to make formats consistent across each source.
Keep in mind that data virtualization is not a replacement for the Enterprise Data Warehouse (EDW). Rather, it complements the data warehouse by providing convenient access to unstructured data types. Together, the EDW and data virtualization can serve as your 'single source of truth' for BI.
While we have explained the common data integration approaches here, you can talk to our experts for a free consultation if you're still unsure. Ideally, your chosen solution should support multiple data integration types so you can build integrations to match your business needs, with automation features to speed up the process.