Tuesday, October 21, 2008

1. Datastage modules

Datastage modules

The DataStage components:

* Administrator - Administers DataStage projects, manages global settings and interacts with the system. Administrator is used to specify general server defaults, add and delete projects, set up project properties and provides a command interface to the DataStage repository.
With DataStage Administrator, users can set job monitoring limits, user privileges, job scheduling options and parallel job defaults.
* Manager - the main interface to the DataStage Repository; it allows browsing and editing its contents. It displays table and file layouts, routines, transforms and jobs defined in the project. It is mainly used to store and manage reusable metadata.
* Designer - used to create DataStage jobs, which are compiled into executable programs. It is a graphical, user-friendly application which applies a visual data-flow method to develop job flows for extracting, cleansing, transforming, integrating and loading data. It is the module mainly used by DataStage developers.
* Director - used to run, validate, schedule and monitor DataStage jobs. It is mainly used by operators and testers.

Datastage manager view




Datastage Administrator view and project properties




Datastage Designer view with a job sequence

2. Designing jobs - datastage palette

Designing jobs - datastage palette


A list of all stages and activities used in the Datastage server edition is shown below.
Datastage server palette - general stages:



Datastage server palette - file stages:




Datastage server palette - database stages:



Datastage server palette - processing (transforming, filtering) stages:



Datastage server palette - sequence activities:


3. Extracting and loading data - ODBC and ORACLE stages

Designing jobs - ODBC and ORACLE stages


ODBC stages allow DataStage to connect to any data source that supports the Open Database Connectivity (ODBC) standard.
ODBC stages are mainly used to extract or load data. However, an ODBC stage can also be very helpful for aggregating data and as a lookup source (in which case it can take the place of an Aggregator stage or a hashed file).
Each ODBC stage can have any number of inputs or outputs.
The input links specify the data which is written to the database (they act as INSERT, UPDATE or DELETE statements in SQL). Input link data can be defined in various ways: using an SQL statement constructed by DataStage, a user-defined SQL query or a stored procedure.
Output links specify the data that is extracted (they correspond to SQL SELECT statements). The data on an output link is passed through the ODBC driver and processed by the underlying database.
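As an illustration, the SQL behind a simple ODBC stage with one output link and one input link might look like the snippets below (the table and column names are invented for the example; the parameter markers stand for the values arriving on the link):

-- output link: generated SELECT statement extracting the source rows
SELECT CUSTOMER_ID, CUSTOMER_NAME, COUNTRY_CODE
FROM SRC_CUSTOMERS;

-- input link: generated INSERT statement writing rows to the target table
INSERT INTO STG_CUSTOMERS (CUSTOMER_ID, CUSTOMER_NAME, COUNTRY_CODE)
VALUES (?, ?, ?);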

If the processing target is an Oracle database, it may be worth considering the ORACLE (ORAOCI9) stage. It offers significantly better performance than the ODBC stage and allows setting up more configuration options and parameters native to the Oracle database.
There is a very useful option to issue SQL before or after the main dataflow operations (Oracle stage properties -> Input -> SQL). For example, when loading a big chunk of data into an Oracle table, it may improve performance to drop indexes in the 'before SQL' tab and to recreate the indexes and analyze the table in the 'after SQL' tab (with an 'ANALYZE TABLE xxx COMPUTE STATISTICS' statement).
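A sketch of what the before/after SQL could contain in such a scenario (the index and table names are invented for the example):

-- 'before SQL': drop the index so the bulk load runs faster
DROP INDEX CUSTOMERS_NAME_IDX;

-- 'after SQL': recreate the index and refresh optimizer statistics
CREATE INDEX CUSTOMERS_NAME_IDX ON CUSTOMERS (CUSTOMER_NAME);
ANALYZE TABLE CUSTOMERS COMPUTE STATISTICS;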

Update actions in Oracle stage
The destination table can be updated using various update actions in the Oracle stage. Be aware that it is crucial to select the key columns properly, as they determine which columns appear in the WHERE clause of the generated SQL statement. The update actions available from the drop-down list are given below (a sketch of the SQL generated by one of them follows the list):

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and, if it succeeds, the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and, if it succeeds, the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.
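As an illustration, for the 'Update existing rows or insert new rows' action with CUSTOMER_ID selected as the key column, the SQL issued per row is roughly equivalent to the following (table and column names are hypothetical):

UPDATE CUSTOMERS
SET CUSTOMER_NAME = ?, COUNTRY_CODE = ?
WHERE CUSTOMER_ID = ?;

-- issued only when the UPDATE above did not match any row
INSERT INTO CUSTOMERS (CUSTOMER_ID, CUSTOMER_NAME, COUNTRY_CODE)
VALUES (?, ?, ?);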

Examples of the use of ODBC and ORACLE stages in Datastage
SQL generated by ODBC stage



ODBC columns view



Oracle destination stage update action

4. Extracting and loading data - sequential files

Designing jobs - sequential (text) files


Sequential File stages are used to interact with text files, which may involve extracting data from or writing data to a text file. A Sequential File stage can read a file, create (or overwrite) a file, or append data to an existing text file. The file can be located on any local drive, on a drive mapped as a shared folder, or even on an FTP server (when combined with an FTP stage). Each Sequential File stage can have any number of inputs or outputs. However, writing to a sequential file simultaneously from two data streams will cause an error.


The use of sequential files in DataStage is pretty straightforward. Please refer to the example below to see how sequential files are used in DataStage jobs.

Datastage sequential file properties

5. Transforming and filtering data

Designing jobs - transforming and filtering data


It is a very common situation, and good practice, to design DataStage jobs in which the data flow goes the following way:
EXTRACT SOURCE -> DATA VALIDATION, REFINING, CLEANSING -> MAPPING -> DESTINATION
The data refining, validation and mapping part of the process is mainly handled by a Transformer stage. A Transformer stage does not extract data or write data to a target database. It processes the extracted data, performs conversions, mappings and validations, passes values along and controls the data flow.
Transformer stages can have any number of input and output links. Input links can be primary or reference (used for lookups); there can be only one primary input but any number of reference inputs.
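To give a feel for the kind of rules typically placed in a transformer, a simple validation and mapping step could be expressed in SQL terms roughly as follows (the table and column names are invented for the illustration; in DataStage the same rules are defined as derivations and constraints inside the Transformer stage):

SELECT TRIM(CUSTOMER_NAME) AS CUSTOMER_NAME,
       NVL(COUNTRY_CODE, 'N/A') AS COUNTRY_CODE,
       CASE WHEN PHONE IS NULL THEN 'UNKNOWN' ELSE PHONE END AS PHONE
FROM SRC_CUSTOMERS
WHERE CUSTOMER_ID IS NOT NULL;  -- rows failing this condition would go to a reject link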

Please refer to the examples below to see how transformers are used.

The job design depicted below implements a typical job flow. The job loads customers into the data warehouse. The data is extracted from an ODBC data source, then filtered, validated and refined in the first transformer. Rejected (not validated) records are logged to a sequential file. The second transformer performs a lookup (into a country dictionary hashed file) and does some other data mappings. The data is then loaded into an Oracle database.
Design of a common datastage job with validations and mapping



Design of a simple transformer



Design of a sample transformer with data filter and reject flow


6. Performing lookups in Datastage

Designing jobs - looking up data using hash files


The data in Datastage can be looked up from a hashed file or from a database (ODBC/ORACLE) source. Lookups are always managed by the transformer stage.

A hashed file is a reference table based on key fields which provides fast access for lookups. Hashed files are very useful as a temporary or non-volatile program storage area. An advantage of hashed files is that remote data can be loaded into them locally for better performance.
To increase performance, hashed files can be preloaded into memory for fast reads and support write-caching for fast writes.

There are also situations where loading a hashed file and using it for lookups is much more time consuming than accessing a database table directly. This usually happens when there is a need to access data more complex than a simple key-value mapping, for example when the data comes from multiple tables, or must be grouped or processed in a database-specific way. In such cases it is worth considering an ODBC or Oracle stage instead.

Please refer to the examples below to see how lookups are used in DataStage.

In the transformer depicted below there is a lookup into a country dictionary hashed file. If a country is matched, it is written to the right-hand column; if not, a "not found" string is generated.

Design of a datastage transformer with lookup
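For comparison, performing the same country lookup directly against a database instead of a hashed file would be roughly equivalent to the following SQL (the table and column names are made up for the illustration):

SELECT c.CUSTOMER_ID,
       NVL(d.COUNTRY_NAME, 'not found') AS COUNTRY_NAME
FROM SRC_CUSTOMERS c
LEFT JOIN COUNTRY_DICT d ON d.COUNTRY_CODE = c.COUNTRY_CODE;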



In the job depicted below there is a sequential file lookup: the sequential file is linked to a hashed file which stores the temporary lookup data.
Sequential file lookup

7. Implementing ETL process in Datastage

Implementing ETL process in Datastage to load the DataWarehouse

ETL process

By its definition, the ETL process involves three tasks:

* extract data from an operational source or archive systems which are the primary source of data for the data warehouse.
* transform the data - which may involve cleaning, filtering and applying various business rules
* load the data into a data warehouse or any other database or application that houses data

ETL process from a Datastage standpoint

In DataStage the ETL execution flow is managed by controlling jobs, called Job Sequences. A master controlling job provides a single interface to pass parameter values down to the controlled jobs and to launch hundreds of jobs with the desired parameters. Changing runtime options (for example, moving a project from a testing to a production environment) is done in the job sequences and does not require changing the 'child' jobs.
Controlled jobs can be run in parallel or in serial (when a second job depends on the first). In the case of serial execution it is very important to check whether the preceding set of jobs completed successfully.
A typical DataStage ETL process can be broken up into the following segments (each segment can be implemented by a set of DataStage jobs):

* jobs accessing source systems - extract data from the source systems. They typically do some data filtering and validation, like trimming white spaces, eliminating (replacing) nulls and filtering out irrelevant data (and sometimes detect whether the data has changed since the last run by reading timestamps).
* loading lookups - these jobs usually need to be run before the transformations can run. They load lookup hashed files, prepare surrogate key mapping files, set up data sequences and set up some parameters.
* transformation jobs - these are the jobs where most of the real processing is done. They apply business rules and shape the data that will be loaded into the data warehouse (dimension and fact tables).
* loading jobs - load the transformed data into the database. A typical data warehouse load involves assigning surrogate keys, loading dimension tables and loading fact tables (in a star schema, for example); a simplified SQL sketch of this step is shown after the list.
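A minimal sketch of the loading step in SQL terms, assuming an Oracle sequence for the surrogate keys and using invented staging, dimension and fact table names (in DataStage the same logic is normally implemented with lookup and transformer stages rather than plain SQL):

-- load the customer dimension, assigning surrogate keys from a sequence
INSERT INTO DIM_CUSTOMER (CUSTOMER_KEY, CUSTOMER_ID, CUSTOMER_NAME)
SELECT DIM_CUSTOMER_SEQ.NEXTVAL, CUSTOMER_ID, CUSTOMER_NAME
FROM STG_CUSTOMERS;

-- load the fact table, resolving the natural key to the surrogate key
INSERT INTO FACT_ORDERS (CUSTOMER_KEY, ORDER_DATE, ORDER_AMOUNT)
SELECT d.CUSTOMER_KEY, s.ORDER_DATE, s.ORDER_AMOUNT
FROM STG_ORDERS s
JOIN DIM_CUSTOMER d ON d.CUSTOMER_ID = s.CUSTOMER_ID;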

Datawarehouse master load sequence

Usually the whole set of daily executed DataStage jobs is run and monitored by one sequence job. It is created graphically in DataStage Designer in a similar way to a normal server job.
Very often the following job sequencer stages/activities are used to build a master controller:

* Wait for file activity - checks for a file which triggers the whole processing
* Execute command - executes operating system commands or DataStage commands
* Notification - sends an email with a notification and/or the job execution log. It can also be invoked when an exception occurs, for example to notify the support team so they are aware of a problem straight away
* Exception - catches exceptions and can be combined with the notification stage

Example of a master job sequence architecture

It is good practice to follow one common naming convention for jobs. The job names proposed in the example below are clear, easy to sort and make the job hierarchy easy to analyze.

-Master job controller: SEQ_1000_MAS
--Job sequences accessing source: SEQ_1100_SRC
----loading customers: SEQ_1110_CUS
----loading products: SEQ_1120_PRD
----loading time scale: SEQ_1130_TM
----loading orders: SEQ_1140_ORD
----loading invoices: SEQ_1150_INV
--Job filling up lookup keys : SEQ_1200_LK
----loading lookups: SEQ_1210_LK
--Job sequences for transforming data: SEQ_1300_TRS
----transforming customers (dimension): SEQ_1310_CUS_D
----transforming products (dimension): SEQ_1320_PRD_D
----transforming time scale (dimension): SEQ_1330_TM_D
----transforming orders (fact): SEQ_1340_ORD_F
----transforming invoices (fact): SEQ_1350_INV_F
--Job sequence for loading the transformed data into the DW: SEQ_1400_LD

The master job controller (sequence job) for the data warehouse load process, SEQ_1000_MAS, can be designed as depicted below. Notice that it will not start until a trigger file is present (Wait For File activity). The extract-transform-load job sequences (each of which may contain server jobs or further job sequences) are triggered in serial fashion (not in parallel) and an email notification finishes the process.

Master job sequence for loading a data warehouse