|
|
Mimno, Myers & Holum
Best Practices in Data Warehousing
Numerous issues must be considered in the design and implementation of successful
data warehousing applications. These issues include definition of the business
purpose of the data warehouse, specification of an enterprise data warehousing
architecture, and selection of appropriate components and tools. In my seminars
and consulting practice, I define a set of best practices in data warehousing
that can be used as the basis for the specification of data warehousing architectures
and selection of tools. These best practices, which are derived from extensive
consulting experience, include the following:
-
Ensure that the data warehouse is business-driven, not technology-driven
-
Define the long-term vision for the data warehouse in the form of an Enterprise
data warehousing architecture
-
Provide flexibility to allow for significant changes in data warehousing functionality
and integration of new business requirements
-
Identify best practice solutions and introduce new functionality in a piloted
or prototyped manner
-
Reduce dependence on individuals who have key information
-
Do not build "virtual" data warehouses that access data directly from
source environments and have no target database
-
Avoid "stovepipe" data marts that do not integrate at the metadata
level with a central metadata repository, generated and maintained by an ETL
tool. Replace existing stovepipe or tactical data marts by developing fully integrated,
dependent data marts, using best practices
-
Buy, don’t build data warehousing components
-
Use a data modeling tool to perform logical and physical modeling. Ensure that
a single logical data model drives the generation of all physical, target data
models
-
Create a hub-and-spoke architecture, with the ETL tool as the Hub, to extract
data from all data sources, resolve inconsistencies in data sources, and generate
clean, consistent target databases for decision support
-
Use a 2nd-generation ETL tool to automate the extraction, transformation, and
load functions
-
Use the ETL tool, as the hub of a hub-and-spoke architecture, to access source
data from all data sources, apply procedural cleansing, compute aggregates, apply
integrity constraints, implement business rules, generate and maintain central
metadata, and load clean, consistent, target databases
-
Do not code extraction, transformation, and load functions by hand using procedural
code, such as COBOL, C, C++, PL/SQL, Perl scripts, Cold Fusion, etc. In many
organizations, procedural code is used extensively to access source data, transform
the data, and load target databases. Procedural code is difficult and expensive
to modify and enhance. In addition, procedural code does not automatically generate
metadata, which leads to the development of non-integrated, stovepipe data marts.
An important objective is to eliminate the need to write procedural code by using
only point-and-click tools, e.g., point-and-click ETL tools, data modeling tools,
data cleansing tools, and BI tools
-
The ETL tool should incorporate native interfaces to source and target databases,
including legacy files, relational databases, as well as ERP, CRM, Web log, and
real-time data sources
-
Use the ETL tool to access atomic data from data sources, not summary data
-
Ensure that the ETL tool is platform-independent, i.e., it runs in mainframe,
Windows and Unix environments, and supports multiple relational and non-relational
target databases
-
Use the incremental aggregation capability of the ETL tool to compute aggregates.
The aggregates provide fast response to user queries
-
Ensure data security and integrity through automatic encryption, authentication,
data recovery and data integrity
- Populate the target databases using the ETL tool:
-
Re-organize the target data into business-centric dimensional models that facilitate
analysis using metrics, dimensions, and attributes that are intuitively meaningful
to business users
-
Use dimensional star schemas as the physical organization of data in data marts.
Establish a goal to achieve 60 – 70 % conformed dimensions across data
marts, and low overlap at the atomic level of detail across data marts
-
Eliminate the use of Access databases and Excel spreadsheets to maintain and
manage departmental data
-
Integrate all components of the data warehousing architecture with central metadata
-
Buy only components that integrate with central metadata
-
Look for product suites that are integrated at the metadata level across all
components of the DW architecture, i.e., that provide an end-to-end solution
across data sources, data profiling tools, data cleansing tools, data modeling
tools, ETL tools, target databases, analytic applications, analytic delivery,
and general-purpose BI functions
-
Use the ETL tool to generate and maintain central metadata. Metadata generation
should be automatic and transparent. Do not generate metadata by hand using procedural
code
-
Use the ETL metadata repository as the “single version of the truth” for
entity definitions, data models, business rules, transformation logic, calculations,
etc.
-
Support global, synchronized metadata by integrating multiple, networked, ETL
engines with common, global metadata generated and maintained by a central ETL
engine. Beware of “multiple versions of the truth”
-
Map business terms to the data mart schema (business metadata). Ensure that there
is consistency in the definition and use of corporate definitions and metrics
across business functions, including systems of record across all elements
-
Do not use an external metadata repository, such as CA Platinum or Rochade, as
a global metadata repository. This creates multiple versions of the truth that
cannot be resolved through bi-directional metadata synchronization
-
Generate extensible, CWM (Common Warehouse MetaModel) -compliant metadata and
LDAP-compliant directories
-
Ensure that local business rules are 100% dependent on central entity definitions.
This important constraint enables business analysts to generate reports using
business rules that are defined locally, while supporting the needs of central
administrators to generate reports across business units that can be consolidated
and reconciled
-
Track where metadata and data models are used across tools, including data profiling,
data cleansing, data modeling, ETL, and BI tools
-
Support data lineage reporting, i.e., automatic documentation and mapping of
data flow from data sources, through transformations, to target databases
-
Use the real-time features of the ETL tool to access near real-time data and
support implementation of near real-time, closed-loop, analytic applications/li>
-
Support strategic DSS and tactical DSS using the same architecture
- Strategic DSS uses the ETL tool to process all accumulated source data transactions
in a batch mode update at night. Strategic DSS is oriented toward access of summary
data that is accurate as of end-of-business yesterday. Aggregates are pre-computed
as part of the nightly update process
-
Tactical DSS uses the ETL tool to process near real-time data. It may be used
to support 24X7 operation of the data warehouse and provide a high level of reliability
and availability. The target databases are updated in near real-time without
a nightly data update window. Tactical DSS is oriented toward access of near
real-time detail data that is no more than a few minutes old. It is not practical
to pre-compute aggregates for tactical DSS. Instead, if aggregates are required,
they are computed in real-time
-
To support strategic DSS and tactical DSS in the same target databases, use timestamps
on both the aggregates and the detail data. Constrain strategic DSS to access
only detail data that has a timestamp that is equal to or prior to the timestamp
of the aggregates
-
Do not load dirty source data into the data warehouse
-
Use a data cleansing tool, in combination with a data profiling tool and an ETL
tool, to analyze the source data, clean the source data, and resolve logical
inconsistencies. Data profiling tools provide insight into the state of source
data before it is extracted, including data types, dimensions, measures, primary
and foreign keys, normalization, etc. Functions supported by data cleansing tools
include assessment of data quality, standardization, integration, de-duping name
and address data, and identifying household relationships.
-
Look for data cleansing tools that integrate at the meta data level with the
ETL tool. Data cleansing function calls should be available directly from ETL
mappings or as a pre/post process
-
Use the data cleansing tool to resolve inconsistencies in the Customer, Employee,
and Household dimensions, and to append geocoding data
- Do not let an ERP package, such as SAP R/3 or PeopleSoft, dominate the data warehouse.
Treat the ERP application as one of many data sources. Leverage the off-the-shelf
functionality of packaged data warehousing products, such as SAP BW or PeopleSoft
EPM, but maintain the ability to build custom data warehousing applications using
the same ETL tool utilized to populate the packaged product
-
Support pre-packaged analytic applications that are fully integrated with the
ETL tool and analytic delivery platform. Look for analytic applications that
implement common business processes, provide an end-to-end solution, complement
the existing infrastructure, deliver metrics in the form of intuitive, dashboard
displays, and are modular, extensible, scalable, and customizable
-
Build the data warehouse bottom-up, not top-down
-
Bottom-up development maximizes ROI and minimizes risk
-
Develop the data warehouse incrementally, one business area at a time
-
Focus initially on the development of multiple data marts using a Data Warehouse
Bus Architecture. Design data marts that support differentiable business processes,
such as sales, finance, customer touchpoints, procurement, supply-chain management,
asset management, human resources, etc.
-
The Bus Architecture consists of data sources, data cleansing tool, data modeling
tool, ETL tool, target databases, and BI tools. All components of the architecture
are integrated at the metadata level. In the bottom-up methodology, a central
data warehouse, Operational Data Store, and persistent staging area are optional;
they are not required components of the architecture
-
Avoid normalized, relational database structures, which can be extremely slow
in joining large normalized tables, such as tables of transactions, customers,
products, and employees, in response to a query
-
Implement target databases for data marts as dimensional structures. Compared
with normalized relational structures, dimensional structures generally provide
higher performance, greater flexibility, and less sensitivity to changes in data
models. Dimensional structures are easier to navigate for end users and map much
better to business processes than normalized relational structures. Ensure that
the dimensional model is data-driven and not report-specific.
-
Store atomic transactions for a specified business process, together with multiple,
pre-computed aggregates, in the target database for each data mart. The set of
data marts, implemented as dimensional structures, is the logical equivalent
of a central data warehouse
-
Access facts across data marts using the star-join capability of the RDBMS, rather
than joining normalized relational tables in a central data warehouse. The star-join
technique provides faster response to queries across business processes at the
atomic level, in comparison with joining normalized relational tables
-
Design data marts to achieve minimal overlap in atomic transactions across data
marts and a high level of conformed dimensions across data marts
-
Consider implementation of a central data warehouse and/or a persistent staging
area only if there is significant overlap in atomic transactions across data
marts and a low level of conformed dimensions across data marts
-
Prove each step before moving on to the next step
-
Deliver a major increment of functionality, in the form of a data mart, every
90 days or less. Measure the ROI from each delivered data mart
-
Keep the incremental, bottom-up development effort under control through use
of logical data modeling techniques and integration of all components of the
architecture with central metadata, generated and maintained by the ETL tool.
Create logical and physical models of the target databases with a data-modeling
tool. Do not allow DBAs to modify physical table structures directly. Modifications
to table structures must be performed at the logical level using a data modeling
tool. Do not assume that OLTP data modeling experience applies to data warehousing
-
Anticipate scalability and performance issues:
-
Select an ETL tool and server that support a linear increase in performance through
incorporation of additional CPUs and RAM in the server. Ensure that the hardware
resources can scale as the data warehousing application grows
-
Provide high performance through the use of pre-computed aggregates
- Use an automated tool to identify candidates for aggregates
- Use the incremental aggregation feature of an ETL tool to pre-calculate aggregates
in one pass of the source data. Alternatively, use the DMBS to pre-calculate
aggregates, or develop materialized views
- Use the aggregate navigation functions of a BI tool or a database to re-direct
queries to pre-computed aggregates
-
Select an ETL tool that supports a high degree of multi-tasking, high-speed operations
in RAM memory, concurrent processing of parallel data streams, load balancing
across CPUs and servers, and fail-over across multiple servers
-
Select an ETL tool that supports transparent migration from a Windows to a Unix
environment. Consider implementing the initial data marts in a low-cost Windows
environment and migrating to a Unix environment as performance requirements increase
-
Work with the DBA to specify indexing and partitioning of the target data to
minimize I/Os, provide faster database access, and faster loads
-
Simplify the target data model to eliminate extraneous joins, typically caused
by excessive snowflaking (i.e., normalized dimension tables)
-
Use a more powerful RDBMS, such as Teradata, DB2 UDB, Informix, or Tandem NonStop
-
Use hybrid tools, multidimensional query languages, and a persistent multidimensional
cache to improve the performance of the BI tool and target database. Move to
hybrid OLAP tools as soon as they integrate fully with ETL tools
-
Support data integration standards, including XML, CWM (Common Warehouse MetaModel),
LDAP, UML, JMS, SOAP J2EE, etc.
-
Do not code reports by hand:
-
Use intuitive, point-and-click, Web-enabled Business-Intelligence tools to support
query, reporting, OLAP analysis, drill-down, and drill-through functions
-
Deliver information from the data warehouse over the Web
-
Support ad hoc reporting, “push” reporting (reports are pushed to
users on occurrence of user-defined events), and “pull” reporting
(named, template-driven reports are pulled to the user)
-
Provide business analysts with intuitive, user-managed, digital dashboard displays,
KPIs, balanced scorecards, etc., with business-defined threshold values, color-coded
in red, yellow, and green
-
Provide an intuitive interface to communicate the meaning of a business rule
or entity to the business user
-
Use a BI-oriented corporate portal as a gateway to both structured and unstructured
information
-
Build for the short-term, but strategize and architect for the future
-
Set reasonable expectations on the initial Data Mart delivery and make good on
meeting and exceeding those expectations within the timebox allocated
|
 |