Building a Data Warehouse

Business decisions need to be taken fast to stay ahead of the competition and the key to that is right data, at the right time. The enterprise data warehouse acts as the central data repository, optimized for reporting and analysis. But the development of a traditional data warehouse requires skills and expertise that are scarce and expensive, which is why decision makers are hesitant to make such investments. However, with careful planning and a proper implementation plan in place, the risks associated with the development of a data warehouse can be reduced.

Discover

The data warehouse building process must start with the why, what, and where. The output of your data warehouse must align perfectly with organizational goals. This requires an investigative approach. If it starts with no clearly defined objective in place, it is bound to end as well with no returns on investment.

In the discovery stage of the project, business decision makers and technology experts should work together to identify business objectives and set milestones for project deliverables based on those objectives. Each department within an organization has different reporting needs and an enterprise data warehouse should be able to collectively cater to those requirements. Business users should clearly identify and communicate those requirements to the IT teams for them to implement and translate their ideas in a physical design.

Design

In the design phase of a data warehousing project, logical concepts are translated into a physical design. Different data warehousing tools use different design approaches which are based on the specific problems those tools are developed to address. If the goal is to optimize BI capabilities and make data reporting faster, Kimball’s bottom-up approach of developing a data warehouse from data marts is the best way to go. However, Inmon’s top-down approach is best suited if the primary objective of a data warehouse is to record the atomic level of data and treat the data warehouse as a central information repository for a business.

Each department within an organization usually maintains a different transactional database. These transactional data sources are used for data entry but lack the capabilities to present the data for reporting. Dimensional modeling is used to denormalize the entities to incorporate the BI capabilities for faster reporting and analysis. A dimensional data model contains facts and dimensions and specifies the relationships between them. Facts are entities that contain numeric information and dimensions contain descriptive data about a certain fact. Usually a business fact is explained using multiple dimensions and the structure of the dimensional data model resembles a star. This resemblance is why a dimensional model is also called a star schema. When creating dimensional models, attributes in a dimension entity are also identified as slowly changing dimensions or fast changing dimensions.

You start designing a data warehouse by establishing connections to the transaction systems. With the help of data virtualization technology, users can also connect to multiple structured and unstructured data sources at once. Once the connections are established, the next step is modeling the data. The entities in the transactional database are denormalized, and facts and dimensions are identified to create dimensional models.

Develop

After identifying facts and dimensions and creating data models, the next stage is the development phase. The development stage of a data warehouse comprises of two main components. The first is to map data models to their physical counterparts. Source entities can be manually mapped to the destination entities. Some advanced data warehousing technologies also provide in-built auto-mapping functionality. The auto-mapping feature makes use of fuzzy name logic to map source entities to the respective destinations.

Another important component in the development stage is to define data load settings for the ETL of data from external sources into the data warehouse. Usually, a full load is performed the first time and then incremental load is sufficient to extract and load new data entered in the transactions systems or relational databases. Data load requirements depend on the business needs and volume and velocity of the incoming data.

Deploy

Deploying a data warehouse is different than deploying a transactional database. A data warehouse is usually deployed incrementally, in different tiers. Each layer is tested after it is deployed so that if there are any problems, they are identified and solved simultaneously. A data warehouse is not rolled out or made available to all the users within an organization at once. The pace at which the deployment takes place, and the order in which various groups get access to the data warehouse, depends on the requirements identified in the discovery stage.

The deployment stage begins with putting infrastructure (servers, hardware, storage, etc.) in place. Then software is installed and tested to ensure that it is ready for production. Next comes the design part where transactional sources are connected, facts and dimensions are identified, entities are denormalized, and dimensional models are created. Once relational and OLAP databases for data warehouse are set up, ETL processes are brought online by specifying data load settings. The application and BI integration layers are added last. A data warehouse without a proper UI (user interface) is of no use to users. It will be able to maintain historical data but without application and BI integration layers, a data warehouse will lack the capabilities to present the data to the users for analysis or creating reports.