MMH Data Warehousing, Corporate Portal & e-business Applications.

Data Warehousing, Corporate Portal & e-Business Intelligence Applications

 

How to Select an Extraction/Transformation/Load Tool

Selecting an ETL Tool

Selection of an appropriate ETL tool is the most important decision that has to be made in choosing components of a data warehousing application. The ETL tool operates at the heart of the data warehouse, extracting data from multiple data sources, transforming the data to make it accessible to business analysts, and loading multiple target databases. Unlike other components of a data warehousing architecture, it is very difficult to switch from one ETL tool to another. Due to a lack of standards, data definitions and transformation rules cannot be moved from one tool to another. If the selected ETL tool fails in production, it is the one component of the architecture that is not easily replaceable.

Organizations are faced with two major choices to implement an extraction, transformation, and load process:

  1. Write a custom program in COBOL, C, or PL/SQL to extract data from multiple source files, transform the data, and load the target databases
  2. Purchase an off-the-shelf extraction/transformation/load (ETL) tool
Hand-Coded ETL Processes

A common mistake is to write custom programs to perform the extraction, transformation, and load functions. Writing an ETL program by hand may seem to be a viable option because the program does not appear to be too complex and programmers are available. However, there are serious problems with hand-coded ETL programs:

  • Unlike OLTP applications, the functions to be supported by individual data marts cannot be predicted in advance. In a typical data mart, over 50% of the required functionality is defined by end users after the data mart goes into production. To keep up with the high volume of changes initiated by end users, hand-written ETL programs have to be constantly modified and in many cases rewritten. The effort required to maintain these programs often becomes a major burden for the project
  • Metadata is not generated automatically by hand-generated ETL programs. Metadata is the key to integrating data marts across business units. If metadata is not available, it is difficult to avoid the development of “stovepipe” data marts that satisfy the needs of individual business units, but cannot be integrated across the enterprise
  • Hand-coded ETL programs are likely to have a slower speed of execution, compared with directly executable code generated by off-the-shelf ETL tools. Hand-generated programs are typically single-threaded, while modern ETL tools generate multi-threaded, directly executable code that can run on parallel, high-speed engines
Off-The-Shelf ETL Tools

To avoid these problems, automated, off-the-shelf ETL tools are increasingly being used to extract, cleanse, transform, and load data into target databases. An important function of these tools is to generate and maintain centralized metadata.

ETL tools solve several important architectural problems for data warehouses:

  • The ETL tool provides coordinated access to multiple data sources. Functions supported by ETL tools include extraction of data from multiple source environments, data cleansing, reorganization, transformation, aggregation, calculation, automatic loading of data into the target database, and automatic generation of executable code to perform parallel processing of transformations on multiple engines
  • ETL tools are used to generate and maintain a central metadata repository. The metadata repository provides a “single version of the truth” that can be used to define enterprisewide source data definitions, data models for target databases, and transformation rules that convert source data into target data. A metadata exchange architecture is used to synchronize central business rules with local business rules, maintained as local metadata by end-user BI tools
  • The ETL tool also addresses the dirty data problem - data from source files can be cleansed and inconsistencies in the data resolved as part of the extraction and transformation process, using procedural data cleansing techniques. Name and address correction, deduping, and householding functions require use of an external data cleansing tool

Analysts define source-to-target mappings, data cleansing rules, and data transformation rules using a graphical point-and-click interface provided by the ETL tool. When all mappings and transformations have been specified, the ETL tool automatically generates the data extract/transformation/load programs, which typically run in batch mode.

[ TOP ]