![]() |
Data Warehousing, Corporate Portal & e-Business Intelligence Applications |
|||||||||||||||||||||
![]() |
![]() |
![]() |
||||||||||||||||||||
|
Mistakes to Avoid in Building Data Warehouses #3Table of Contents:
Data-Mart-in-a-BoxIn an attempt to overcome the limitations of virtual data warehouses, many organizations are attracted to an inexpensive product known as a data-mart-in-a-box. As shown in Figure 2, a data-mart-in-a-box is a packaged product sold by many vendors that integrates multiple data warehousing features, including a data replication tool, a target database, and an end-user data access and analysis tool. Data is extracted from multiple source databases, stored in a target database, and accessed by business analysts with user-friendly data analysis tools. Local metadata describing data definitions in business terms is stored in the target database. The data-mart-in-a-box eliminates interference with operational databases since it creates a physically separate target database that is used exclusively to support DSS operations. However, it creates a host of new problems, including:
data-marts-in-a-box tend to proliferate rapidly as managers from multiple business areas purchase packaged data warehousing products from different vendors. The architecture supports multiple, non-integrated, independent data marts, each of which provides business analysts with the ability to access and analyze data for their business unit from a single target database. Each business unit may be very pleased with the functionality provided by their data mart. However, the data marts are not integrated across business units and do not support an enterprise data warehousing strategy. There are no common business rules, semantics, and data definitions across business areas. The lack of consistent data definitions makes it difficult for business end users to obtain meaningful answers to their queries. Recommendation: Avoid using the data-mart-in-a-box architecture. This architecture rapidly leads to a chaotic environment, which is difficult to maintain. It is characterized by multiple data marts, each of which supports the needs of individual business units, but do not support integration of clean, consistent DSS information across data marts. Independent (Stovepipe) Data MartsMost organizations that implement a data-mart-in-a-box architecture quickly realize that it is essential to replace the multiple, uncoordinated source data extraction processes shown in Figure 3 with a single, integrated data extraction, transformation, and load process. However, they are faced with two major choices to implement a single extraction, transformation, and load process:
A common mistake is to write custom COBOL or C programs to perform the extraction/transformation/load functions. Writing an extraction/transformation/load (ETL) program by hand may seem to be a viable option because the program does not appear to be too complex and programmers are available. However, there are serious problems with hand-coded ETL programs:
To avoid these problems, automated, off-the-shelf ETL tools are increasingly being used to extract, cleanse, transform, and load data into the target database. An important function of these tools is to generate and maintain centralized metadata. As shown in Figure 4, the data extraction/transformation/load tool and its associated central metadata repository enable all data sources to be directed to a logically single ETL facility, which in turn populates all target databases for the data warehouse. The ETL facility consists of an integrated family of tools from one vendor that supports data extraction, data cleansing, data transformation, loading of target databases, generation and maintenance of a central metadata repository, a metadata exchange architecture, data warehouse administration tools, and an interface to a data modeling tool. Analysts define source-to-target mappings, data cleansing rules, and data transformation rules using a graphical point-and-click interface provided by the ETL tool. When all mappings and transformations have been specified, the ETL tool automatically generates the data extract/transformation/load programs, which typically run in batch mode. Functions supported by automatically generated ETL programs include extraction of data from multiple source environments, data cleansing, reorganization, transformation, aggregation, calculation, automatic loading of data into the target database, and automatic generation of executable code to perform parallel processing of transformations on multiple engines. Many ETL tools are quite old and were designed to run on host computers. However, these older tools are being rapidly replaced by less expensive second-generation ETL tools designed to run on low-cost server platforms. These newer ETL tools are specifically designed to support data warehousing applications and exploit parallelism by running on multiple, low-cost servers, in a multi-threaded environment. Physical data models for all the target databases are stored in the central metadata repository, which is maintained by the ETL tool. The central metadata repository provides a “single version of the truth” that can be used to define enterprisewide source data definitions, data models for target databases, and transformation rules that convert source data into target data. Selection of an appropriate ETL tool is the most important decision that has to be made in choosing components of a data warehouse. Unlike other components of a data warehouse, it is very difficult to switch from one ETL tool to another. Due to a lack of standards, data definitions and transformation rules cannot be moved from one tool to another. The ETL facility is the one component of the architecture that is not easily replaceable. The architecture presented in Figure 4 solves several important architectural problems for data warehouses:
Unfortunately, there is a major problem with the architecture in Figure 4 that has caused the failure of many data warehouses. The problem is the lack of integration between the metadata maintained in the central repository and metadata maintained by individual data marts. As shown in Figure 4, most vendors of data access and analysis tools maintain local metadata that defines the local meaning of business data to end users. However, many vendors do not provide a direct link to the central metadata maintained in the central metadata repository. This missing link makes it impossible for users of individual data marts to share data definitions and business semantics across the enterprise. Lack of integration with central metadata results in the development of independent data marts, often called stovepipe data marts. Stove-pipe data marts are small data warehouses that support the needs of individual business units, but cannot be integrated at the corporate level because they do not conform with enterprisewide data definitions. IS departments have known for years that they should not build stovepipe applications. In spite of the well-known problems with stovepipe applications, most data warehouses built today incorporate stovepipe data marts. To minimize development risk and provide rapid return on investment, many organizations build data warehouses “bottom-up”, one business area at a time. Although the bottom-up approach minimizes development risk, it encourages the implementation of stovepipe data marts. In their rush to deploy individual data marts, many organizations do not take the time to implement the infrastructure required to support consolidation of data across data marts. The result is a profusion of stovepipe data marts that satisfy the requirements of individual business units, but cannot be used to support corporate requirements for an integrated view of data across data marts. Subsequent attempts to install the infrastructure required to integrate data marts with a central repository often fail. The failure is due to the complexity and risk associated with making major changes to an operational data warehousing facility that is growing rapidly and supports large numbers of business end users. It is extremely difficult for organizations to avoid building stove-pipe data marts. For example, a business unit may use departmental funds to build a data mart that supports their specific DSS requirements. The data mart maintains local metadata (a local semantic layer) that provides intuitive access to data in the data mart. Subsequently, another business unit identifies the need to build a data mart for their area of the business. They build a data mart that meets their business requirements, but their local metadata does not conform with the metadata definitions contained in the first data mart. Neither data mart conforms with definitions stored in the central metadata repository. Recommendation: Do not buy data warehousing components that cannot integrate with a central metadata repository generated and maintained by an ETL tool. Most vendors of components of data marts support only local metadata for each data mart and do not provide links to a central metadata repository. The result is that each data mart stores local, inconsistent, incompatible definitions of business entities, semantics, and business rules. Data marts meet the needs of individual business units, but cannot be integrated with data marts supporting other business units.
|