Built to serve as the enterprise’s single source of truth, the Enterprise Data Warehouse (EDW) integrates data from disparate sources and applications, optimizes it for analytics and reporting, and presents it to business users in a simplified form to deliver business intelligence. To build such a central repository and meet business intelligence objectives, you will need to combine a number of different components and ensure data flows through each component correctly.
A dysfunctional data warehouse results in bad or poor quality data, which costs businesses $600 billion a year. Let’s take a look at the five key components of a data warehouse, understanding which can help you structure your data warehouse better and therefore minimize bad data.
1. Data Warehouse Storage
At its core, the data warehouse is a database that stores all enterprise data and makes it accessible for reporting in a simplified and optimized manner. Naturally, this means you need to decide on which type of database you will use to store your data warehouse. There are four basic types of databases you can for this purpose:
Typical Relational Databases: These are the row-based databases you probably use on a day-to-day basis and include Microsoft SQL Server, SAP, Oracle, and IBM DB2. Common relational databases are usually used for transactional systems, but you can use multiple databases running in parallel to serve as collective storage for your data warehouse, making your repository highly scalable.
Analytics Databases: These are databases, specifically designed for the storage of data to maintain and manage analytics, are commonly used as data warehouse storage. Major brand names in this category include Teradata and Greenplum.
Data Warehouse Appliances: Not exactly a type of storage, numerous vendors now provide appliances which provide both software to manage the data warehouse and hardware for storage. These appliances come pre-installed with a database management system. Major brand names include SAP Hana, Oracle Exadata, and IBM Netezza.
Cloud-hosted Databases: With the increasing focus on cloud, databases too can be hosted and accessed on the cloud, meaning that you don’t need to purchase hardware to install your data warehouse. Major brand names include Amazon Redshift, Microsoft Azure SQL, and Google BigQuery – although Amazon dominates the landscape, capturing over two-thirds of the market share.
Check out Teradata’s comprehensive whitepaper if you’re having trouble deciding on which DBMS to choose for your data warehouse.
2. Extract-Transform-Load (ETL) Tools
Often considered the backbone of data warehousing, you will need an ETL tool to extract data from disparate source systems across the enterprise, transform this data to convert it in a format suited for your data warehouse, and load it into your data warehouse.
While that covers the broad strokes, your ETL tool of choice will determine:
- The time you spent extracting data
- Methods of extraction
- Type of transformations you’ll apply and the ease with which you’ll do so
- Business rule definition to validate your data and clean it, so the end-product analytics are enhanced
- Populating missing data
- Defining information delivery from your central repository to your business intelligence applications
However, developing ETL scripts to manage these tasks takes considerable time. Your best bet to minimize implementation time is to go with an ETL tool that auto-generates ETL code while allowing you to perform all associated actions visually. You could drag and drop tables in a visual designer, specify source and target databases, and configure validations and transformations anywhere in the flow as per business requirements.
Metadata refers to data that defines the data warehouse and provide context to data.
A record in your customer database may look like:
Robin 76 13,000 94,923.00
This data is not comprehensible unless you review associated metadata:
Customer Name: Robin
Purchase SKU: 76
Purchase Quantity: 13,000
Order Amount: $94,923.00
Besides providing business context as shown in the above example, metadata also contains information about:
- Source data systems
- The number of times data has been changed or reloaded from the source
- The transformations or operations that were applied to source data when loading into the data warehouse
- The overall picture, i.e., the tables, keys, and attributes that reside in the repository
Metadata is a critical EDW component for both business and technical teams to make sense of the data contained within and convert it into information.
4. Data Warehouse Access Tools
At the backend, the data warehouse is built on top of a database or collection of databases. Business users cannot be expected to interact with databases, so tools designed to facilitate access to it are needed. The type of access tool you choose determines the level of access:
Query and reporting: These tools help users generate business reports for analysis. Reports could be in the form of spreadsheets, calculations, or interactive graphics, depending on the tool you choose. Modern business intelligence is increasingly moving towards graphical reporting, so ensure that your data warehouse tool offers integration with a graphical reporting tool.
Application development: In some cases, canned reports may not serve your business intelligence requirements, necessitating application development to create highly customized reports and presenting them in views designed for specific reporting objectives.
Data mining: Used for reporting in big data warehouses where data is not structured, data mining tools automate the process of finding patterns and correlations in large amounts of data based on advanced statistical modeling techniques.
OLAP tools: These tools help build a multi-dimensional data warehouse and enable analysis of enterprise data from multiple perspectives. OLAP tools are generally used in tandem with traditional reporting tools for end-user reporting.
5. Data Warehouse Management
A data warehouse spans the enterprise. The scale of operations necessitates a number of management and administrative operations, including but not limited to:
- Database updates
- Managing priority of tasks running in parallel
- Scheduling jobs
- Ensuring data quality checks are in place and are working correctly
- Monitoring the status of systems on which the EDW is dependent
- Managing backup and disaster recovery
- Auditing data warehouse usage
- Reducing redundancy to optimize storage
- Overseeing changes and iterations in EDW design
Most organizations use dedicated data warehouse management tools to accomplish this, while some vendors offer a number of management functions out-of-the-box.
When starting a data warehouse project, you should ideally choose a solution that helps you bring together each component of the data warehouse to form a unified whole. From facilitating requirements gathering, prototyping of reports, ETL processes, data modeling, metadata management, to data visualization, your preferred tool should bring it all together, while providing automation to enhance the creation of your EDW.