MMH Data Warehousing, Corporate Portal & e-business Applications.

Data Warehousing, Corporate Portal & e-Business Intelligence Applications

 

Mistakes to Avoid in Building Data Warehouses #3

Table of Contents:

Data-Mart-in-a-Box

In an attempt to overcome the limitations of virtual data warehouses, many organizations are attracted to an inexpensive product known as a data-mart-in-a-box. As shown in Figure 2, a data-mart-in-a-box is a packaged product sold by many vendors that integrates multiple data warehousing features, including a data replication tool, a target database, and an end-user data access and analysis tool. Data is extracted from multiple source databases, stored in a target database, and accessed by business analysts with user-friendly data analysis tools. Local metadata describing data definitions in business terms is stored in the target database.

The data-mart-in-a-box eliminates interference with operational databases since it creates a physically separate target database that is used exclusively to support DSS operations. However, it creates a host of new problems, including:

  • Lack of support for common business rules, semantics, and data definitions across business areas
  • Population of data marts with "dirty" source data containing missing, inconsistent, and erroneous data values
  • Multiple, non-integrated extraction processes

data-marts-in-a-box tend to proliferate rapidly as managers from multiple business areas purchase packaged data warehousing products from different vendors. The architecture supports multiple, non-integrated, independent data marts, each of which provides business analysts with the ability to access and analyze data for their business unit from a single target database. Each business unit may be very pleased with the functionality provided by their data mart. However, the data marts are not integrated across business units and do not support an enterprise data warehousing strategy. There are no common business rules, semantics, and data definitions across business areas. The lack of consistent data definitions makes it difficult for business end users to obtain meaningful answers to their queries.

Recommendation: Avoid using the data-mart-in-a-box architecture. This architecture rapidly leads to a chaotic environment, which is difficult to maintain. It is characterized by multiple data marts, each of which supports the needs of individual business units, but do not support integration of clean, consistent DSS information across data marts.

Independent (Stovepipe) Data Marts

Most organizations that implement a data-mart-in-a-box architecture quickly realize that it is essential to replace the multiple, uncoordinated source data extraction processes shown in Figure 3 with a single, integrated data extraction, transformation, and load process. However, they are faced with two major choices to implement a single extraction, transformation, and load process:

  • Write a custom program in COBOL or C to extract data from multiple source files, transform the data, and load the target databases
  • Purchase an off-the-shelf extraction/transformation/load (ETL) tool

A common mistake is to write custom COBOL or C programs to perform the extraction/transformation/load functions. Writing an extraction/transformation/load (ETL) program by hand may seem to be a viable option because the program does not appear to be too complex and programmers are available. However, there are serious problems with hand-coded ETL programs:

  • Unlike OLTP applications, the functions to be supported by individual data marts cannot be predicted in advance. In a typical data mart, over 50% of the required functionality is defined by end users after the data mart goes into production. To keep up with the high volume of changes initiated by end users, hand-written ETL programs have to be constantly modified and in many cases rewritten. The effort required to maintain these programs often becomes a major burden for the project.
  • Metadata is not generated automatically by hand-generated ETL programs. As discussed below, metadata is the key to integrating data marts across business units. If metadata is not available, it is difficult to avoid the development of stovepipe data marts that satisfy the needs of individual business units, but cannot be integrated across the enterprise.
  • Hand-coded ETL programs are likely to have a slower speed of execution, compared with directly executable code generated by off-the-shelf ETL tools. Hand-generated programs are typically single-threaded, while modern ETL tools generate multi-threaded, directly executable code that can run on multiple, parallel, high-speed engines.

To avoid these problems, automated, off-the-shelf ETL tools are increasingly being used to extract, cleanse, transform, and load data into the target database. An important function of these tools is to generate and maintain centralized metadata. As shown in Figure 4, the data extraction/transformation/load tool and its associated central metadata repository enable all data sources to be directed to a logically single ETL facility, which in turn populates all target databases for the data warehouse. The ETL facility consists of an integrated family of tools from one vendor that supports data extraction, data cleansing, data transformation, loading of target databases, generation and maintenance of a central metadata repository, a metadata exchange architecture, data warehouse administration tools, and an interface to a data modeling tool. Analysts define source-to-target mappings, data cleansing rules, and data transformation rules using a graphical point-and-click interface provided by the ETL tool. When all mappings and transformations have been specified, the ETL tool automatically generates the data extract/transformation/load programs, which typically run in batch mode. Functions supported by automatically generated ETL programs include extraction of data from multiple source environments, data cleansing, reorganization, transformation, aggregation, calculation, automatic loading of data into the target database, and automatic generation of executable code to perform parallel processing of transformations on multiple engines. Many ETL tools are quite old and were designed to run on host computers. However, these older tools are being rapidly replaced by less expensive second-generation ETL tools designed to run on low-cost server platforms. These newer ETL tools are specifically designed to support data warehousing applications and exploit parallelism by running on multiple, low-cost servers, in a multi-threaded environment. Physical data models for all the target databases are stored in the central metadata repository, which is maintained by the ETL tool. The central metadata repository provides a “single version of the truth” that can be used to define enterprisewide source data definitions, data models for target databases, and transformation rules that convert source data into target data. Selection of an appropriate ETL tool is the most important decision that has to be made in choosing components of a data warehouse. Unlike other components of a data warehouse, it is very difficult to switch from one ETL tool to another. Due to a lack of standards, data definitions and transformation rules cannot be moved from one tool to another. The ETL facility is the one component of the architecture that is not easily replaceable. The architecture presented in Figure 4 solves several important architectural problems for data warehouses:

  • The ETL tool acts as a single point of data extraction that provides coordinated access to source data. The tool is a single, automated facility that performs data extraction, transformation, and load functions
  • The ETL tool also addresses the dirty data problem - data from source files can be cleansed and inconsistencies in the data resolved as part of the extraction and transformation process
  • The ETL tool is used to generate and maintain central metadata
  • The administrative component of the ETL tool is used to support control and administration of the data warehouse

Unfortunately, there is a major problem with the architecture in Figure 4 that has caused the failure of many data warehouses. The problem is the lack of integration between the metadata maintained in the central repository and metadata maintained by individual data marts. As shown in Figure 4, most vendors of data access and analysis tools maintain local metadata that defines the local meaning of business data to end users. However, many vendors do not provide a direct link to the central metadata maintained in the central metadata repository. This missing link makes it impossible for users of individual data marts to share data definitions and business semantics across the enterprise.

Lack of integration with central metadata results in the development of independent data marts, often called stovepipe data marts. Stove-pipe data marts are small data warehouses that support the needs of individual business units, but cannot be integrated at the corporate level because they do not conform with enterprisewide data definitions.

IS departments have known for years that they should not build stovepipe applications. In spite of the well-known problems with stovepipe applications, most data warehouses built today incorporate stovepipe data marts.

To minimize development risk and provide rapid return on investment, many organizations build data warehouses “bottom-up”, one business area at a time. Although the bottom-up approach minimizes development risk, it encourages the implementation of stovepipe data marts. In their rush to deploy individual data marts, many organizations do not take the time to implement the infrastructure required to support consolidation of data across data marts. The result is a profusion of stovepipe data marts that satisfy the requirements of individual business units, but cannot be used to support corporate requirements for an integrated view of data across data marts. Subsequent attempts to install the infrastructure required to integrate data marts with a central repository often fail. The failure is due to the complexity and risk associated with making major changes to an operational data warehousing facility that is growing rapidly and supports large numbers of business end users.

It is extremely difficult for organizations to avoid building stove-pipe data marts. For example, a business unit may use departmental funds to build a data mart that supports their specific DSS requirements. The data mart maintains local metadata (a local semantic layer) that provides intuitive access to data in the data mart. Subsequently, another business unit identifies the need to build a data mart for their area of the business. They build a data mart that meets their business requirements, but their local metadata does not conform with the metadata definitions contained in the first data mart. Neither data mart conforms with definitions stored in the central metadata repository.

Recommendation: Do not buy data warehousing components that cannot integrate with a central metadata repository generated and maintained by an ETL tool. Most vendors of components of data marts support only local metadata for each data mart and do not provide links to a central metadata repository. The result is that each data mart stores local, inconsistent, incompatible definitions of business entities, semantics, and business rules. Data marts meet the needs of individual business units, but cannot be integrated with data marts supporting other business units.

[ TOP ]