![]() |
Data Warehousing, Corporate Portal & e-Business Intelligence Applications |
|||||||||||||||||||||
![]() |
![]() |
![]() |
||||||||||||||||||||
|
Mistakes to Avoid in Building Data Warehouses #5Table of Contents:
Development of Additional Architected Data MartsFollowing completion of the initial architected data mart, additional business areas may be identified that require data marts. Figure 7 illustrates the implementation of multiple architected data marts within an integrated data warehousing framework. A user-needs assessment is performed for each business area to define the additional dimensions and attributes that will be required to support the business area. Additional tasks for each data mart include definition of the data mart architecture, identification of functional requirements, design of the target database, including conformed dimensions, specifying the data mapping, extraction, and transformation rules, implementing aggregations, summarization, and partitions, and ensuring that all components of the data marts integrate with central meta data. An important objective is to phase the development effort to deliver additional data marts every three months. This will provide business managers with a steadily increasing Return On Investment and will ensure the ongoing commitment of the business to the data warehouse implementation effort. The implementation of multiple architected data marts, as described above, avoids development of stovepipe data marts that support individual business areas, but do not support enterprise requirements for access to data across multiple business areas. The solution of this problem requires integration of all data marts with a central metadata repository. The recommended solution also avoids loading dirty source data that contains missing, incomplete or inconsistent data into the data warehouse. Expansion to an Enterprise Data WarehouseComponents of an enterprise data warehousing architecture include:
The enterprise data warehouse is a complex environment characterized by high development cost and risk. The recommended approach is to implement the enterprise data warehouse step-by-step, using a bottom-up development methodology. Building the enterprise data warehouse in a single, large development effort is extremely costly and risky, and is likely to defer the Return on Investment for the data warehouse project by many years. In a bottom-up development plan, data marts are typically implemented for one business area at a time. The organization may or may not choose to implement a central data warehouse. If a central data warehouse is not implemented, detail data for each business unit is stored in each data mart, together with summary and aggregated data. At a later time, the organization may decide to move most of the detail data to a central data warehouse. To facilitate this process, it is important to develop consistent detail data models for each data mart that can be consolidated easily in a central data warehouse. Maintenance and Administration of the Data WarehouseData warehouse administration is an ongoing process that grows in complexity as the size of the database and the number of users increases. The process can be simplified by using administrative tools to set up users, authorize access, set security, and monitor access and usage patterns. The recommendation is to use data warehouse administration tools that are incorporated as integral components of data extraction/transformation/load tools. Administrative tools can be used to monitor and manage all aspects of data warehouse operations. To guard against runaway queries, administrative tools can be used to block long queries and reschedule them to run in batch mode. Data administration tools can also be used to restructure the physical database schemas to improve performance. An important function of data warehouse administrators is to monitor business issues facing end users so that queries can be predicted and planned for in advance. The amount of effort required to administer and maintain data warehouses is often grossly underestimated. Data warehouses are never completed. They grow organically and require a high level of continuous effort to overcome performance problems and respond to the evolving requirements of end users. Overcoming Scalability and Performance ProblemsAlmost every data warehouse ever built eventually runs into scalability and performance problems. The initial pilot project may have been highly successful, providing excellent performance for a relatively small number of users. However, after the data warehouse goes into production, the size of the target database may double, the number of users increases rapidly, and queries become more complex. Sooner or later, response time goes up, performance suffers, users complain, and the system bogs down. What can a data warehouse administrator do to overcome scalability and performance problems? First, make sure that every component of your data warehouse architecture is scalable. Select hardware platforms that can scale from Symmetric Multiprocessors (SMPs) to Massively Parallel Processors (MPPs). Choose RDBMSs that scale seamlessly from shared disk environments to shared-nothing environments (very few RDBMSs meet this test!). Make sure that applications, particularly extraction/transformation/load tools, can run in a multi-threaded environment on multiple, parallel, high-speed servers. Finally, ensure that your networks are scalable to handle a rapidly escalating load. Next, specify metrics that define your performance expectations. These include data extract times, data cleansing and transformation times, load times, aggregate calculation times, query response times, report-generation times, etc. Measure the performance of the data warehouse frequently against these metrics to determine when an application exceeds the acceptable performance range. When the allowable performance bounds are exceeded, you have only a few options. Your first line of defense is to pre-compute more aggregate tables. Aggregate tables provide fast response for users because they contain pre-computed answers to common queries. However, effective use of aggregate tables means that you must anticipate the pattern of queries from users; the query tools you use must support aggregate navigation; and there must be sufficient time within the database update window to pre-compute the aggregates. Caching is another approach that can increase performance by storing the results sets of executed queries and pre-caching popular reports. Performance can also be increased by physically partitioning the target database into subsets, by time, by business area, or another parameter. In combination with a shared-nothing hardware and software architecture, partitioning can dramatically reduce the number of rows in a fact table that must be processed by each CPU, in response to a query. Careful use of indexes, in combination with partitioning, can speed up queries and reduce the time required to re-index portions of the fact table following a database update. Finally, you can utilize tools that distribute query processing efficiently across a multi-tier architecture, such as the new category of hybrid OLAP tools. True hybrid OLAP tools increase query performance by embedding multidimensional technology within the kernel of conventional relational databases, and by providing a mid-tier server that performs multidimensional calculations using the MDX language and implements a persistent, multidimensional cache of results sets.
|