Tuesday, October 21, 2008

1. Datastage modules


The DataStage components:

* Administrator - administers DataStage projects, manages global settings and interacts with the system. It is used to specify general server defaults, add and delete projects, set up project properties, and it provides a command interface to the DataStage repository.
With DataStage Administrator, users can set job monitoring limits, user privileges, job scheduling options and parallel job defaults.
* Manager - the main interface to the DataStage Repository, which allows browsing and editing its contents. It displays the table and file layouts, routines, transforms and jobs defined in the project. It is mainly used to store and manage reusable metadata.
* Designer - used to create DataStage jobs, which are compiled into executable programs. It is a graphical, user-friendly application which applies a visual data-flow method to develop job flows for extracting, cleansing, transforming, integrating and loading data. It is the module mainly used by DataStage developers.
* Director - manages running, validating, scheduling and monitoring DataStage jobs. It is mainly used by operators and testers.

Datastage manager view




Datastage Administrator view and project properties




Datastage Designer view with a job sequence

2. Designing jobs - datastage palette



A list of all stages and activities used in the Datastage server edition is shown below.
Datastage server palette - general stages:



Datastage server palette - file stages:




Datastage server palette - database stages:



Datastage server palette - processing (transforming, filtering) stages:



Datastage server palette - sequence activities:


3. Extracting and loading data - ODBC and ORACLE stages

Designing jobs - ODBC and ORACLE stages


ODBC stages are used to allow DataStage to connect to any data source that supports the Open Database Connectivity (ODBC) standard.
ODBC stages are mainly used to extract or load data. However, an ODBC stage may also be very helpful when aggregating data and as a lookup stage (in that case it can play the role of an Aggregator stage or a hashed file and can be used instead).
Each ODBC stage can have any number of inputs or outputs.
The input links specify the data which is written to the database (they act as INSERT, UPDATE or DELETE statements in SQL). Input link data can be defined in various ways: using an SQL statement constructed by DataStage, a user-defined SQL query or a stored procedure.
Output links specify the data that is extracted (they correspond to the SQL SELECT statement). The data on an output link is passed through the ODBC connector and processed by the underlying database.

If the processing target is an Oracle database, it may be worth considering the ORACLE (ORAOCI9) stage. It has significantly better performance than the ODBC stage and allows setting more configuration options and parameters native to the Oracle database.
There is a very useful option to issue SQL before or after the main dataflow operations (Oracle stage properties -> Input -> SQL). For example, when loading a big chunk of data into an Oracle table, it may increase performance to drop indexes in the 'before SQL' tab and recreate the indexes and analyze the table in the 'after SQL' tab (the 'ANALYZE TABLE xxx COMPUTE STATISTICS' SQL statement).
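A minimal sketch of what such before/after SQL could look like for a big load into a hypothetical DW_CUSTOMERS table (the table and index names are invented for illustration):

    -- 'before SQL' tab: drop the index so the bulk load runs faster
    DROP INDEX idx_dw_customers_cust_id;

    -- 'after SQL' tab: rebuild the index and refresh optimizer statistics
    CREATE INDEX idx_dw_customers_cust_id ON dw_customers (cust_id);
    ANALYZE TABLE dw_customers COMPUTE STATISTICS;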

Update actions in Oracle stage
The destination table can be updated using various update actions in the Oracle stage. Be aware that it is crucial to select the key columns properly, as they determine which columns appear in the WHERE clause of the generated SQL statement. The update actions available from the drop-down list are (a sketch of the SQL generated for one of them follows the list):

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.
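As a rough illustration, for a hypothetical DW_CUSTOMERS table with CUST_ID selected as the key column, the 'Update existing rows or insert new rows' action behaves roughly like the following pair of statements (the column and table names are assumptions; the actual SQL is generated by the stage):

    -- the UPDATE is issued first; the key columns drive the WHERE clause
    UPDATE dw_customers
       SET cust_name = ?, cust_country = ?
     WHERE cust_id = ?;

    -- only when the UPDATE matches no rows is the INSERT issued
    INSERT INTO dw_customers (cust_id, cust_name, cust_country)
    VALUES (?, ?, ?);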

Examples of the use of ODBC and ORACLE stages in Datastage
SQL generated by ODBC stage



ODBC columns view



Oracle destination stage update action

4. Extracting and loading data - sequential files

Designing jobs - sequential (text) files


Sequential File stages are used to interact with text files, which may involve extracting data from or writing data to a text file. A Sequential File stage can read files and create (overwrite) or append data to a text file. The file can reside on any drive that is local or mapped as a shared folder, or even on an FTP server (combined with an FTP stage). Each Sequential File stage can have any number of inputs or outputs. However, trying to write to a sequential file simultaneously from two data streams will cause an error.


The use of sequential files in Datastage is pretty straightforward. Please refer to the examples below to find out how to use sequential files in datastage jobs.

Datastage sequential file properties

5. Transforming and filtering data

Designing jobs - transforming and filtering data


It is a very common situation, and a good practice, to design Datastage jobs in which the data flow goes the following way:
EXTRACT SOURCE -> DATA VALIDATION, REFINING, CLEANSING -> MAPPING -> DESTINATION
The data refining, validation and mapping part of the process is mainly handled by a Transformer stage. The Transformer stage does not extract or write data to a target database. It operates on the extracted data, performs conversions, mappings and validations, passes values and controls the data flow.
Transformer stages can have any number of input and output links. Input links can be primary or reference (used for lookups); there can be only one primary input and any number of reference inputs.

Please refer to the examples below to see how transformers are used.

The job design depicted below implements a typical job flow. The job loads customers into the data warehouse. The data is extracted from an ODBC data source, then filtered, validated and refined in the first transformer. Rejected (not validated) records are logged to a sequential file. The second transformer performs a lookup (into a country dictionary hashed file) and does some other data mappings. Finally, the data is loaded into an Oracle database.
Design of a common datastage job with validations and mapping



Design of a simple transformer



Design of a sample transformer with data filter and reject flow


6. Performing lookups in Datastage

Designing jobs - looking up data using hash files


The data in Datastage can be looked up from a hashed file or from a database (ODBC/ORACLE) source. Lookups are always managed by the transformer stage.

A hashed file is a reference table based on key fields which provides fast access for lookups. Hashed files are very useful as a temporary or non-volatile program storage area. An advantage of hashed files is that remote data can be loaded into them locally for better performance.
To increase performance, hashed files can be preloaded into memory for fast reads and support write-caching for fast writes.

There are also situations where loading a hashed file and using it for lookups is much more time consuming than accessing a database table directly. This usually happens when there is a need to access data more complex than a simple key-value mapping, for example when the data comes from multiple tables, or must be grouped or processed in a database-specific way. In that case it is worth considering an ODBC or Oracle stage instead.
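For instance, a database lookup can push the join and the grouping down to the database with a user-defined SELECT along these lines (the table and column names are purely illustrative):

    -- lookup source built from several tables and aggregated in the database
    SELECT c.cust_id,
           c.country_code,
           SUM(o.net_value) AS total_net_value
      FROM customers c
      JOIN orders o ON o.cust_id = c.cust_id
     GROUP BY c.cust_id, c.country_code;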

Please refer to the examples below to see how lookups are used in Datastage.

In the transformer depicted below there is a lookup into a country dictionary hashed file. If a country is matched, it is written to the right-hand column; if not, a "not found" string is generated.

Design of a datastage transformer with lookup



In the job depicted below there is a sequential file lookup, linked together with a hashed file which stores the temporary data.
Sequential file lookup

7. Implementing ETL process in Datastage

Implementing ETL process in Datastage to load the DataWarehouse

ETL process

By definition, the ETL process involves the following three tasks:

* extract data from an operational source or archive systems which are the primary source of data for the data warehouse.
* transform the data - which may involve cleaning, filtering and applying various business rules
* load the data into a data warehouse or any other database or application that houses data

ETL process from a Datastage standpoint

In Datastage the ETL execution flow is managed by controlling jobs, called Job Sequences. A master controlling job provides a single interface to pass parameter values down to the controlled jobs and to launch hundreds of jobs with the desired parameters. Changing runtime options (like moving a project from the testing to the production environment) is done in the job sequences and does not require changing the 'child' jobs.
Controlled jobs can be run in parallel or in serial (when a second job is dependent on the first). In the case of serial job execution it is very important to check whether the preceding set of jobs finished successfully.
A normal Datastage ETL process can be broken up into the following segments (each of the segments can be realized by a set of Datastage jobs):

* jobs accessing source systems - extract data from the source systems. They typically do some data filtering and validation, such as trimming white space, eliminating (replacing) nulls and filtering irrelevant data (and sometimes detect whether the data has changed since the last run by reading timestamps).
* loading lookups - these jobs usually need to be run before the transformations. They load lookup hashed files, prepare surrogate key mapping files, set up data sequences and set up some parameters.
* transformation jobs - these are the jobs where most of the real processing is done. They apply business rules and shape the data that will be loaded into the data warehouse (dimension and fact tables).
* loading jobs - load the transformed data into the database. A typical Data Warehouse load involves assigning surrogate keys, loading dimension tables and loading fact tables (in a Star Schema example); see the sketch below.
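As a loose, set-based illustration of what the surrogate key assignment during a fact load amounts to, the following query joins staged data to the dimension tables to fetch their surrogate keys (all table and column names are invented for this example):

    -- load the fact table, assigning surrogate keys via joins to the dimensions
    INSERT INTO f_orders (time_key, customer_key, product_key, quantity, net_value)
    SELECT t.time_key,
           c.customer_key,
           p.product_key,
           s.quantity,
           s.net_value
      FROM stg_orders s
      JOIN d_time     t ON t.calendar_date = s.order_date
      JOIN d_customer c ON c.cust_id       = s.cust_id
      JOIN d_product  p ON p.product_id    = s.product_id;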

Datawarehouse master load sequence

Usually the whole set of daily executed Datastage jobs is run and monitored by one Sequence job. It is created graphically in Datastage Designer in a similar way to a normal server job.
Very often the following job sequencer stages/activities are used to build a master controller:

* Wait for file activity - checks for a file which triggers the whole processing
* Execute command - executes operating system commands or Datastage commands
* Notification - sends an email with a notification and/or the job execution log. It can also be invoked when an exception occurs, for example to notify the support team so they are aware of a problem straight away
* Exception - catches exceptions and can be combined with the notification stage

Example of a master job sequence architecture

It is good practice to follow one common job naming convention. The job names proposed in the example are clear, easy to sort, and make the job hierarchy easy to analyze.

-Master job controller: SEQ_1000_MAS
--Job sequences accessing source: SEQ_1100_SRC
----loading customers: SEQ_1110_CUS
----loading products: SEQ_1120_PRD
----loading time scale: SEQ_1130_TM
----loading orders: SEQ_1140_ORD
----loading invoices: SEQ_1150_INV
--Job filling up lookup keys: SEQ_1200_LK
----loading lookups: SEQ_1210_LK
--Job sequences for transforming data: SEQ_1300_TRS
----transforming customers (dimension): SEQ_1310_CUS_D
----transforming products (dimension): SEQ_1320_PRD_D
----transforming time scale (dimension): SEQ_1330_TM_D
----transforming orders (fact): SEQ_1340_ORD_F
----transforming invoices (fact): SEQ_1350_INV_F
--Job sequence for loading the transformed data into the DW: SEQ_1400_LD

The master job controller (sequence job) for the data warehouse load process, SEQ_1000_MAS, can be designed as depicted below. Please notice that it will not start until a trigger file is present (Wait For File activity). The extract-transform-load job sequences (each of them may contain server jobs or job sequences) will be triggered serially (not in parallel) and an email notification will finish the process.

Master job sequence for loading a Data Warehouse

8. SCD implementation in Datastage

SCD Type 2

A Slowly Changing Dimension Type 2 is a model where the whole history is stored in the database. An additional dimension record is created for each change, so the old record values and the new (current) values are easy to separate and the history stays clear.
The fields 'effective date' and 'current indicator' are very often used in this kind of dimension, and the fact table usually stores the dimension key and the version number.
SCD 2 implementation in Datastage

The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of many possible designs which can implement this dimension.
For this example, we will use a table with customer data (its name is D_CUSTOMER_SCD2) which has the following structure and data:
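As a rough sketch of such a dimension with the housekeeping columns described above (everything except CUST_ID and REC_VERSION, and all data types, are assumptions made up for illustration):

    CREATE TABLE d_customer_scd2 (
      cust_id        VARCHAR2(10) NOT NULL,  -- natural (business) key
      rec_version    NUMBER       NOT NULL,  -- record version, increased with every change
      cust_name      VARCHAR2(100),          -- example customer attribute (assumed)
      effective_date DATE,                   -- date from which this version is valid
      current_ind    CHAR(1),                -- 'Y' for the current version, 'N' for history
      CONSTRAINT pk_d_customer_scd2 PRIMARY KEY (cust_id, rec_version)
    );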

Datastage SCD2 job design



The most important facts and stages of the CUST_SCD2 job processing:

# The dimension table with customers is refreshed daily and one of the data sources is a text file. For the purpose of this example the record with CUST_ID=ETIMAA5 differs from the one stored in the database and it is the only record with changed data. The file has the following structure and data:
SCD 2 - Customers file extract:



# There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
# A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD 2 lookup transformer



# A T002_Check_Discrepacies_exist transformer compares the old and new values of records and passes through only the records that differ.
SCD 2 check discrepancies transformer




# A T003 transformer handles the UPDATE and INSERT actions on a record. The old record is updated with the current indicator flag set to no, and the new record is inserted with the current indicator flag set to yes, the record version increased by 1 and the current date.
SCD 2 insert-update record transformer



# ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only'; the selected key columns are CUST_ID and REC_VERSION, so they will appear in the WHERE clause of the constructed SQL statement.
# ODBC Insert stage (O_DW_Customers_SCD2_Ins) - update action 'Insert rows without clearing'; the key column is CUST_ID. A sketch of the SQL the two stages effectively issue follows below.
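Put together, the two ODBC stages effectively issue statements along these lines (the non-key columns are assumptions; the real SQL is generated by the stages from the column definitions and key selection):

    -- O_DW_Customers_SCD2_Upd: close the old version of the changed record
    UPDATE d_customer_scd2
       SET current_ind = 'N'
     WHERE cust_id = ? AND rec_version = ?;

    -- O_DW_Customers_SCD2_Ins: insert the new, current version
    -- (rec_version = old version + 1, effective_date = current date, current_ind = 'Y')
    INSERT INTO d_customer_scd2
      (cust_id, rec_version, cust_name, effective_date, current_ind)
    VALUES (?, ?, ?, ?, ?);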

9. Datastage jobs real-life solutions

Datastage server job designs

Examples of Datastage server job designs which solve real problems in a data warehouse environment.

Design of a DataStage server job with an Oracle PL/SQL procedure call



Example of a DataStage server aggregation job with use of containers



Datastage sequences examples

Design of a Datastage job sequence with email notification



Design of a complex Datastage server job sequence with parallel job execution


10. Header - Items processing example

Processing a header and trailer text file in Datastage

Example case

The input file in our scenario is a stream of records representing invoices. It is a text file in a header-trailer format and has the following structure:
- A header starts with the letter H, which is followed by an invoice number, a customer number, a date and the invoice currency
- Every item starts with the letter I, which is followed by a product ID, product age, quantity and net value
- The trailer starts with the letter T and contains two values which fulfill the role of a checksum: the total number of invoice lines and the total net value

Header and trailer input text file in Datastage:





Solution
Datastage job design which solves the problem of loading an extract structured into headers and items:





Detailed illustration of each component of the job


# trsfGetKinds transformer - the type of record is extracted from each input line and written into a kind column.
The transformer reads the first character of each line to mark the records and divide them into kinds:




The trsfAssignHeaders transformer assigns an invoice header to each line of the input flow. The invoice header is an 8-character key which is extracted from the data line using the following formula: L02.data_line[1,8]
A Datastage transformer which assigns headers using a stage variable:





The trsfReformat transformer splits the data flow into invoice headers and invoice lines. An InvoiceLineID stage variable is used to populate a sequence number for each invoice line. In order to load the net value correctly, the ereplace function replaces commas with dots in this numeric field.
The following transformer splits the flow into headers and items and reformats the records into the corresponding structure:





Target Oracle tables with correctly loaded invoice headers and lines:
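As a rough idea of what those target structures could look like, the header and line tables might be defined along these lines (all table names, column names and data types are assumptions based on the fields listed earlier; the real layout is whatever the job design defines):

    -- invoice headers: one row per H record
    CREATE TABLE invoice_header (
      invoice_no  VARCHAR2(8),   -- 8-character invoice key taken from the header line
      cust_no     VARCHAR2(10),
      invoice_dt  DATE,
      currency    CHAR(3)
    );

    -- invoice lines: one row per I record, numbered within its invoice
    CREATE TABLE invoice_line (
      invoice_no      VARCHAR2(8),
      invoice_line_id NUMBER,    -- sequence number populated by the InvoiceLineID stage variable
      product_id      VARCHAR2(10),
      product_age     NUMBER,
      quantity        NUMBER,
      net_value       NUMBER     -- decimal separator converted from comma to dot in the job
    );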