Too many buzzwords, too little understanding. With the ubiquity of data and analytics, IT vocabulary is expanding fast, and terms like data marts, data lakes and data warehouses are being used interchangeably. While they come from the same domain, they mean quite different things when you set out to build a data repository. This article gives clear definitions of data marts, lakes, and warehouses, from both a technical and a business perspective.
Let’s start our differentiation with the term all of us are most familiar with, one that has been around for decades: Data Warehouse.
What is a Data Warehouse?
A data warehouse is a central data repository where data is integrated from different enterprise systems, optimized for reporting and made available for business intelligence. The data warehouse is, therefore, the foundation of an organization’s data analysis and reporting needs.
A data warehouse is essentially a collection of databases. Large organizations generally use a multitude of systems, each of which has a database for content storage and management at the back-end. Extract-Transform-Load (ETL) processes are used to extract data from individual databases spread across the enterprise, transform the data to convert it into a single, consistent format for your target data warehouse, and finally load it into the data warehouse.
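The extract, transform and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems (`crm_rows`, `erp_rows`), their field names, and the `sales` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Extract: hypothetical records pulled from two source systems that
# describe the same kind of fact in different formats.
crm_rows = [{"customer": "Acme Corp", "amount_usd": "1200.50", "date": "2023-01-15"}]
erp_rows = [{"cust_name": "ACME CORP ", "total_cents": 99900, "order_date": "15/02/2023"}]


def transform_crm(row):
    """Normalize a CRM record to (customer, amount_usd, ISO date, source)."""
    return (row["customer"].strip().upper(), float(row["amount_usd"]), row["date"], "crm")


def transform_erp(row):
    """Normalize an ERP record: cents to dollars, DD/MM/YYYY to ISO date."""
    day, month, year = row["order_date"].split("/")
    return (
        row["cust_name"].strip().upper(),
        row["total_cents"] / 100,
        f"{year}-{month}-{day}",
        "erp",
    )


def run_etl(conn, crm, erp):
    """Transform both extracts to one format and load them into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, amount_usd REAL, sale_date TEXT, source TEXT)"
    )
    rows = [transform_crm(r) for r in crm] + [transform_erp(r) for r in erp]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# SQLite in memory stands in for the target data warehouse (an assumption).
warehouse = sqlite3.connect(":memory:")
loaded = run_etl(warehouse, crm_rows, erp_rows)

# One query now answers a question that spans both source systems.
total = warehouse.execute(
    "SELECT SUM(amount_usd) FROM sales WHERE customer = 'ACME CORP'"
).fetchone()[0]
```

Once both extracts share one format, a single query covers data that originally lived in two separate systems, which is the point of the consolidation.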
Data from disparate systems exists together here, enabling analysts to query data anywhere in the enterprise and get the answers they need without having to query bits and pieces across multiple transactional databases and then piece them together manually. This is the business intelligence system from which every reporting and analysis application or system is derived.
Usage | Primary repository to support operational and performance analytics |
Time-to-market | Weeks, days, hours – depending on approach |
Cost | Medium-to-High |
Users | High |
Data growth | Low-to-Medium |
How Is a Data Warehouse Different from a Data Mart?
It’s not that different actually.
Data marts are a subset of the data warehouse, designed to fulfill the reporting needs of a specific operational department or subject. The data warehouse could then also be called a collection of data marts.
Let’s look at an analogy to understand this concept better. If the data warehouse is a library that contains all the books an organization needs, data marts are sections of that library where books about a particular subject are grouped together. Readers who are concerned with only a specific subject can simply go to the relevant library section, or data mart in this case, and get the information they need faster because they won’t have to search through the entire library to find it.
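One simple way to picture a data mart as a subset of the warehouse is a view that exposes only one department's rows. The sketch below assumes a hypothetical `warehouse_facts` table and a `sales_mart` view in an in-memory SQLite database; real data marts are often separate, remodeled stores rather than plain views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding facts for every department.
conn.execute("CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "revenue", 5000.0),
        ("sales", "orders", 42.0),
        ("hr", "headcount", 120.0),
        ("finance", "budget", 90000.0),
    ],
)

# The "data mart": a subject-specific slice of the warehouse.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'sales'"
)

# A sales analyst queries only the mart and never sees HR or finance rows.
sales_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric"
).fetchall()
```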
Data marts are also a core consideration when deciding on your data warehouse design approach. One way to build a data warehouse is to consolidate data on a departmental level, model your data and create individual data marts, and then bring these data marts together to form the enterprise data warehouse. This is a more agile approach to data warehousing, allowing you to focus on specific requirements rather than spending weeks developing a deep understanding of business processes on an enterprise scale and then building an overarching data warehouse to derive individual data marts.
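The mart-first approach can be sketched as follows: each department gets its own mart, and the marts are later combined into one enterprise table. The table names, the shared column layout, and the use of in-memory SQLite are all assumptions made for illustration; in practice the hard part is agreeing on common dimensions across the marts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: build independent departmental marts (hypothetical data).
# They share the same column layout so they can be combined later.
marts = {
    "sales": [("2023-01", "revenue", 5000.0)],
    "hr": [("2023-01", "headcount", 120.0)],
}
for dept, rows in marts.items():
    conn.execute(f"CREATE TABLE {dept}_mart (period TEXT, metric TEXT, value REAL)")
    conn.executemany(f"INSERT INTO {dept}_mart VALUES (?, ?, ?)", rows)

# Step 2: bring the marts together into one enterprise warehouse table,
# tagging each row with the department it came from.
conn.execute(
    "CREATE TABLE enterprise_warehouse "
    "(department TEXT, period TEXT, metric TEXT, value REAL)"
)
for dept in marts:
    conn.execute(
        f"INSERT INTO enterprise_warehouse "
        f"SELECT ?, period, metric, value FROM {dept}_mart",
        (dept,),
    )

departments = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT department FROM enterprise_warehouse ORDER BY department"
    )
]
```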
Usage | Front-line business reporting |
Time-to-market | Minutes, hours |
Cost | Low |
Users | Low |
Data growth | Low |
Are Data Lakes a Subset of a Data Warehouse Too?
No.
Data warehouses and data lakes serve completely different purposes. In a data warehouse, we first analyze requirements, map out architecture, identify sources, determine transformation, model data for reporting schemas, and execute the ETL process for moving data, which by this point is optimized for reporting and analysis. In stark contrast, data is “dumped” into a data lake.
Data lakes are designed to maintain all types of data, while data warehouses and marts store structured data. When your organization generates different types of data on a massive scale, and you know that data needs to be analyzed to derive strategic insights but aren’t yet sure how, using a data lake may be the preferred approach. Data lakes can store text, images, weblogs, social network activity, or any other non-traditional data source, without needing data to be cleaned or converted first. Traditional databases neither support these non-traditional data types, nor are they built to enable querying on very large datasets.
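The contrast with a warehouse can be sketched as "store first, interpret later". In the example below a temporary directory stands in for the lake, and the file names, the log format, and the helper functions `ingest` and `read_weblog_status_counts` are hypothetical. Files are written exactly as they arrive; structure is applied only when someone reads them.

```python
import json
import tempfile
from pathlib import Path


def ingest(lake_dir, name, payload):
    """Store raw bytes as-is: no cleaning, no conversion, no schema."""
    path = Path(lake_dir) / name
    path.write_bytes(payload)
    return path


def read_weblog_status_counts(lake_dir):
    """Apply structure at read time: count HTTP statuses in *.log files.

    Assumes each log line looks like 'METHOD PATH STATUS'.
    """
    counts = {}
    for path in sorted(Path(lake_dir).glob("*.log")):
        for line in path.read_text().splitlines():
            parts = line.split()
            if len(parts) < 3:
                continue  # skip lines that do not match the assumed format
            status = parts[2]
            counts[status] = counts.get(status, 0) + 1
    return counts


with tempfile.TemporaryDirectory() as lake:
    # Different kinds of data land side by side in their original form.
    ingest(lake, "web-2023-01.log", b"GET /home 200\nGET /missing 404\nPOST /login 200\n")
    ingest(lake, "tweet-1.json", json.dumps({"user": "a", "text": "hello"}).encode())
    ingest(lake, "logo.png", b"\x89PNG\r\n")

    stored = sorted(p.name for p in Path(lake).iterdir())
    status_counts = read_weblog_status_counts(lake)
```

Nothing about the JSON or image files had to be decided at ingestion time; only the weblog reader imposes a format, and only on the files it cares about.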
So, if data isn’t cleaned and converted or otherwise optimized for reporting in data lakes, how do you use them for business intelligence?
That’s where data scientists come in, using advanced predictive modeling and statistical analysis tools to make sense of large data sets and identify patterns. That’s a whole different discipline in itself.
Usage | Advanced predictive analytics |
Time-to-market | Weeks, months |
Cost | Very high |
Users | Low |
Data growth | Very high |
Which Option is Better?
As with everything, the answer depends on how you want to make data available to business users. If you don’t need a consolidated repository and just want to open up a specific subject area in your organization for reporting, an independent data mart would make sense.
If, on the other hand, you need reporting capabilities that encompass all your enterprise systems, go with a data warehouse. You could also create a number of independent data marts as the need arises throughout your organization and then bring them all together later on to create your data warehouse in stages. However, if your organization is generating many types of data in immense volumes and you need a way to maintain all of that data for eventual analysis, a data lake would be a good option.
Data lakes vs. data warehouses is a long-running debate. Proponents of data lakes generally argue that data warehouse design doesn’t permit change easily and is resource-intensive in terms of development cost and time. Modern data warehouse automation platforms are changing that, allowing agile data modeling and letting new data marts be created in minutes.