MMH Data Warehousing, Corporate Portal & e-business Applications.

Data Warehousing, Corporate Portal & e-Business Intelligence Applications

 

Mimno, Myers & Holum

Enterprise DW Architecture

The recommended enterprise data warehousing architecture is described in this document. It incorporates a general-purpose infrastructure, a central data warehouse, an optional Operational Data Store, and multiple data marts. The architecture is representative of what the data warehousing environment may look like as it evolves to support future enterprisewide requirements. As shown in Figure 1, components of the general-purpose, enterprise data warehousing architecture include the following:

  • Source databases. Data for the data warehouse is accessed from multiple data sources, including legacy files, relational databases, ERP sources (i.e., SAP R/3, PeopleSoft, Oracle Financials, Baan, etc.), e-commerce, Web log files, and external information sources
  • Data extraction/transformation/load tool. A second-generation ETL tool, such as PowerMart/PowerCenter from Informatica, or DataStage Suite from Informix Software (formerly Ardent Software), is used to access data sources, transform the data to make it understandable to end users, and load the transformed data into the central data warehouse and multiple data marts. It is assumed that the ETL tool will be used to support these functions, rather than coding ETL functions by hand using a programming language, such as COBOL, C, or PL/SQL. The ETL tool will be used to support the following functions:

-Extract data from a wide variety of source databases using native interfaces, flat-file interfaces, or an FTP interface. In addition to data extraction and data staging, the ETL tool will be used to cleanse the data, identify anomalies in the data, transform the data, compute summaries and aggregates, compute derived data, and load the data at high speed into the target databases

-Generate and manage a central metadata repository that contains definitions of source data, target data models, transformation rules, derived data computations, etc. The central metadata repository and metadata exchange architecture supported by the ETL tool provide a common source of definitions that can used to integrate all components of the enterprise data warehousing architecture

-Support the import of data models used to define the physical organization of data in each of the target environments (i.e., the data warehouse and each of the data marts). Only the portion of the data model that applies to a specific data mart needs to be specified. Each additional data mart adds detail to the model. The ETL tool will be used to load the physical data models from an external data modeling tool

-Utilize the integrated data warehouse administration tool to support rapid response to end-user requests for additional data dimensions and attributes, analyze the impact of requested changes, and implement changes in the rules governing cleansing, transformation, summarization, aggregation, and computation of target data

  • Data modeling tool. The data modeling tool is used to specify logical and physical models for target databases. Data modeling tools that integrate with widely used ETL tools include ERWin from Computer Associates, PowerDesigner from Sybase, Inc., or Oracle Designer from Oracle Corporation. Physical data models specified by the data modeling tool are imported to the ETL tool repository using an integrated interface
  • Data warehouse. The data warehouse is a database that stores large amounts of detail data (individual atomic transactions) and lightly summarized data that spans multiple subject business areas. An important requirement of the data warehouse is to capture and store a complete history of detailed, transaction-level data for multiple business areas within the organization. The data warehouse is an efficient facility to store a complete record of clean, consistent, atomic transactions accessed from all data sources. During the data update window at night, the ETL tool is used to capture detailed transactions and changes to reference files from all databases that feed the data warehouse. The ETL tool is used to stage the data, cleanse the data, transform the data, and prepare the data for loading into the data warehouse. The ETL tool loads the data warehouse by appending the record of detailed transactions for the current day to the record of transactions for previous days, producing a complete history of transactions. Queries that require access to atomic-level transaction data are automatically directed to the data warehouse by the end-user tools. Queries directed to the data warehouse include requests for data that are not available in higher level, precomputed aggregates, as well as queries that require access to atomic transactions across business areas, such as market basket analysis and fraud analysis.
  • Operational Data Store: An Operational Data Store (ODS) consolidates data from multiple source systems and provides a near real-time, integrated view of volatile, current data. An ODS may also be used as the real-time database for purchased application packages, such as financial packages, claims processing applications, etc. The architecture shown in Figure 1 supports both a data warehouse used for decision support functions and an operational data store (ODS) used for on-line operations. It is important to ensure the physical separation of operational data in the ODS from decision-support data in the data warehouse.
  • Multiple data marts. Data marts contain subject-specific information supporting the requirements of end users in individual business units. Typically, data marts store summarized or aggregated data, rather than detailed transaction-level data. Aggregated data represents precomputed answers to common queries. Data marts can provide rapid response to end-user requests if most queries are directed to precomputed, aggregated data stored in the data mart. A major function of the data warehouse administrator is to identify candidates for aggregates that can be precomputed during the nightly update cycle and stored in data marts. If a precomputed aggregate is available in a data mart to satisfy a query, the response to the query is almost instantaneous. However, if an aggregate is not available in the data mart and the query must be directed to detail data in the data warehouse, the response to the query may be much slower.

    In many applications, the incremental aggregation function of the ETL tool will be used to precompute aggregates as part of the data update window at night. Aggregates are computed in a single pass as data streams through the ETL tool server. At the end of the update cycle, the ETL tool loads the precomputed aggregates into the target databases for individual data marts. Precomputed aggregates are stored in data marts in the form of star schemas for relational databases, and in the form of indexed arrays for multidimensional databases.

    For many projects, the target database for the initial development effort will consist of a single data mart. Following a bottom-up development methodology, additional data marts are implemented incrementally, one business area at a time. Future development phases may include implementation of a data warehouse and an Operational Data Store.

    It is assumed that the development of the initial data mart is not dependent on implementation of the data warehouse. Until the data warehouse is available, the ETL tool will load the initial target database for the data mart with both transaction-level data and precomputed, aggregated data. When the data warehouse is available, the ETL tool will be used to move the complete history of detailed, transaction-level data from one or more data marts to the data warehouse.

    Data bases for data marts may utilize either relational database management systems (RDBMS), specialized multidimensional database management systems (MDB), or a combination of both database technologies. Relational target databases are used to support data marts that require analysis of large amounts of data in an ad hoc, unpredictable, query environment. Multidimensional databases (MDBs) are often used to support specialized data marts that require high-speed access to moderate amounts of data (under 100 GB of summarized or calculated data) in a well-defined, predictable query environment. Financial and budgeting applications may require multidimensional databases to provide rapid response for complex cross-dimensional calculations. Representative multidimensional databases include Oracle Express from Oracle Corporation, Essbase from Hyperion Solutions, Inc., and Microsoft OLAP Services (a component of SQL Server 7)
  • Front-end data access and analysis tools. Off-the-shelf, data access and analysis tools are used to specify queries, access pre-calculated reports, create ad hoc reports, and analyze the data using drill-down and On-Line Analytical Processing (OLAP) functions. These tools typically access aggregated data from individual data marts and detailed, transaction-level data from the data warehouse. Data warehouse administrators generally provide end users with a choice of approved data access and analysis tools to support a wide range of end-user requirements. Data access and analysis tools should provide identical functionality in both client/server and Web environments.

    Data access and analysis tools are used with both RDBMSs and MDBs to support high-speed access and analysis of data in the target database. An important function supported by many of these tools is On-Line Analytical Processing (OLAP), which enables business analysts to access and manipulate business-oriented information from the data warehouse using multidimensional analysis techniques. Categories of end-user access, reporting, and analysis tools include:

-Desktop OLAP tools that support managed query, reporting, and light OLAP processing by providing multidimensional views of relational data

-Relational OLAP (ROLAP) tools for power end users that provide multidimensional views of relational data and support for complex OLAP processing

-Multidimensional OLAP (MOLAP) tools for financial analysts that provide multidimensional views of multidimensional data arrays and support for high-speed OLAP processing

-Data mining tools used to uncover hidden patterns in large data sets and support predictive modeling.

-Data visualization tools used to display complex data in graphical form for human pattern recognition analysis

Multiple categories of tools may be required to support the range of data warehousing requirements at an organization. However, for the initial pilot project, the recommendation is to utilize a simple, low cost, off-the-shelf desktop OLAP tool to support the query, reporting, drill-down, and OLAP processing needs of business analysts and general business end users. Representative desktop OLAP tools include BusinessObjects from Business Objects, Inc., Brio Enterprise from Brio Technologies, Inc., Impromptu/PowerPlay from Cognos, Inc., Seagate Info from Seagate Software, and Hummingbird BI/Suite from Hummingbird Communications Ltd. The desktop OLAP tool provides direct access to summarized and aggregated data stored in the target database for the data mart, as well as drill-through access to detailed, transaction-level data stored in the data warehouse. The desktop OLAP tool will be used in the pilot project to support dynamic query, reporting, drill-down, and server-based OLAP analysis functions over the LAN and the Web.

[ TOP ]