Datareign

All Datastage processing may be considered as being a form of Extract, Transform & Load (ETL), although it is possible to develop other processes with the tool.

Because Datastage jobs tend to be made up from a large number of sub-jobs, carefull planning beforehand is essential if the process is not to 'get away from' the implementors, with a confusion of roles and a profusion of competing and even conflictiong components.

Structure of a System

It is normal Datastage practice to design a single master job, which manages any number of child jobs and, more importantly, provides a single interface for parameters and constants. This is the preferred mechanism for achieving portability by ensuring that differences between hosts are contained in a single location. Without a Job Sequence, every Datastage job in a batch would need to have its parameters updated when moved to a different host.

Unconnected jobs may then be run in parallel by the job sequence. Connected jobs must be overtly run serially and the job sequence must contain logic to confirm successful completion of a predecessor before starting its successor.

It's standard practice to seperate jobs into four notional types:

  • Extractors typically clean incoming data, preparing it for following stages.
  • Preprocessors use extracted data to create hashed files for lookups, produce surrogate key mapping files, set up sequences and define parameters for later jobs.
  • Transformers read the incoming data from any previous stages and amend it as required.
  • Loaders write the amended data into the target database.

See Datastage Job Sequences for more detail.

Sequence Job

This is the master job referred to above. It typically contains four sections:

  • Wait Process polls the file system for one or more required files, holding up processing until everything required is on-line.
  • Executor fires off the required Datastage and operating system processes required by the overall job.
  • Notifier handles overall logging.
  • Exception Handler deals with problems, typically by emitting emails to the appropriate queues.

There can, of course, be more than one sequence job in an application. In such a case, judgements will have to be made as to whether subsequent sequence jobs should be called from a single master or kicked off individually by the host operating system's scheduler.

Job Mapping

Datastage processing has a tendency to create large numbers of job files. It is therefor a basic survival strategy to ensure that a logical naming convention is applied. One possible style is illustrated below.

     DS45000MASTER     Overall controller for European load
     DS45010FRANCE     French data reader
     DS45020GERMANY    German data reader
     DS45030ITALY      Italian data reader
     DS45040SPAIN      Spanish data reader
     DS45110LOOKUPS    Lookup file creator
     DS45210ACCEPTED   Split out accepted applications
     DS45220DENIED     Split out denied applications
     DS45230WAITING    Split out incomplate applications
     DS45310ACCEPTED   Load accepted applications
     DS45320DENIED     Load denied applications
     DS45330WAITING    Load incomplate applications

Component Naming

As with the jobs, so the components within a job should be named to follow a logical pattern that allows easy understanding of the relationships between them.

Let's consider a simple example: suppose that we want to pick up credit card applications, link them to product names, apply a score to each application and then drop the updated records into our warehouse staging area. Our job might look something like this…

                                   
                           -------------- 
                          | Product List |
                           --------------
                                 |
                                 |
                                 |
                                 v
                           --------------
                          | Product Hash |
                           --------------
                                 |
                                 |
                                 |
                                 v
  --------------         -----------------         -----------------        ---------
 | Applications |------>| Lookup Products |------>| Calculate Score |----->| Load DW |
  --------------         -----------------         -----------------        ---------

We could use the names assigned in the above sketch, but this might lead to some confusion, if we need to build several similar loaders. Much better to assign names that relate the components to the job, whilst providing them with unique identifications:

                                   
                      ----------- 
                     | DS45310F2 |
                      -----------
                           |
                           |
                           |
                           v
                      -----------
                     | DS45310H1 |
                      -----------
                           |
                           |
                           |
                           v
  -----------         -----------         -----------        -----------
 | DS45310F1 |------>| DS45310L1 |------>| DS45310T1 |----->| DS45310W1 |
  -----------         -----------         -----------        -----------

Then, in the documentation component, we can clarify the functionality in detail…

 Job: DS45310ACCEPTED (Load accepted applications)
    
 Component    Description
 -----------  ---------------------------------------------------------------
 DS45310F1    Amalgamated Applications file
 DS45310F2    Product List reference file (from marketing database)
 DS45310H1    Generated product list reference hash
 DS45310L1    Lookup products and apply to DS_45310_F1
 DS45310T1    Calculate Score transform. (Uses rule from spec. section 4.5.2)
 DS45310W1    Load to OracleGlobal.DW.LiveApplications
 -----------  ---------------------------------------------------------------
Last modified: 2009/01/15 10:40