Datareign

The Ab Initio System

Ab Initio consists of a run-time (the Co>Operating System) and a range of support tools. The product is aimed at customers who need to process massive amounts of data and is generally run on very large machines with multiple processors.

Ab Initio is almost never used as a stand-alone system. A typical installation consists of the Co>Operating System running on a multi-processor machine with 4 to 64 nodes. The interface to the system is called 'mp', and it is typically invoked from within a shell script.

In addition, there will be one or more development workstations running the Graphical Development Environment (GDE). The GDE is used to generate shell scripts that access mp. It is normal to use the Korn shell when running the Ab Initio system.

Because Ab Initio is only economical when used to process massive amounts of data, the main machine will typically be connected to multi-terabyte disk arrays. The data on these disks will be stored in a dedicated 'multifile' system, a set of directories that spread data out to mimic the manner in which it will be processed.

Thus, if the machine has 8 processors, all of which will be used by mp, the multifile system is typically implemented as an 8-way system. If 16 processors will be used, a 16-way multifile system is usually indicated, and so on. Ab Initio supplies utilities for tasks such as implementing and manipulating the multifile systems.

All such settings are defined by a series of environment variables which must be set in the run-time user's environment. These variables all begin with $AI_ and, particularly on large multi-processor machines, are often set dynamically by a set-up script called before the main script.
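As a rough illustration of the set-up step, a job could check that the expected settings are present before it starts. The sketch below is in Python rather than Korn shell, and the variable names (AI_EXAMPLE_WAYS, AI_EXAMPLE_ROOT) are hypothetical placeholders, not real Ab Initio settings; only the AI_ prefix comes from the text above.

```python
import os


def collect_settings(environ=None, prefix="AI_"):
    """Return the variables whose names start with `prefix`, sorted by name."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in sorted(env) if name.startswith(prefix)}


def missing_settings(required, environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Hypothetical environment, standing in for what a set-up script might export.
example_env = {
    "AI_EXAMPLE_WAYS": "8",          # assumed name: degree of parallelism
    "AI_EXAMPLE_ROOT": "/data/mfs",  # assumed name: multifile root directory
    "HOME": "/home/etl",
}

print(collect_settings(example_env))
print(missing_settings(["AI_EXAMPLE_WAYS", "AI_EXAMPLE_LOG"], example_env))
```

Passing the environment in as a dictionary keeps the check easy to test; calling either function without it falls back to the real process environment.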

The Develop & Run Cycle

Ab Initio is intended to provide rapid development as well as fast execution. An idealised cycle is…

  • Design each job (graph) in the Graphical Development Environment (GDE)
  • Test the graph and save it into the Enterprise Metadata Environment (EME)
  • Deploy the graph directly from the EME, running it against a Co>Operating System (Co>Op)
  • Store logs and other metadata back to the EME for later analysis

The following diagram illustrates this process…

Multi-Tasking

mp can be told which processors to use by means of the appropriate variable. On a 64-processor machine it's not uncommon to have three 16-processor sessions running while the remaining 16 processors are available for unrelated tasks.

Ab Initio programmes are generally written using the GDE. This currently runs on the Microsoft Windows platform and takes a CASE-style approach to programming, using Ab Initio's Component Library. These building blocks provide most of the facilities required in processing large amounts of data and are shown on the GDE's canvas as code blocks.

The programmer selects the required block from a palette, places it on the canvas, attaches it to other blocks and alters the parameters. In addition, the programmer has access to a fairly advanced programming language, which can be used to implement new components or modify the actions of existing ones. Finally, the programmer may set up shell commands in the 'start' and 'end' scripts. These commands will be executed before the graph commences and after a successful run, respectively.
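The ordering of the 'start' and 'end' scripts can be sketched as a small wrapper. This is an illustration in Python, not how the GDE itself generates its scripts: the function name run_with_hooks is hypothetical, the three commands are trivial stand-ins, and the choice to abort when the start command fails is an assumption.

```python
import subprocess
import sys


def run_with_hooks(start_cmd, graph_cmd, end_cmd):
    """Run start_cmd, then graph_cmd; run end_cmd only if the graph succeeded.

    Returns (graph return code, whether the end command ran).
    """
    # Assumption: a failing start command aborts the job (raises CalledProcessError).
    subprocess.run(start_cmd, check=True)
    result = subprocess.run(graph_cmd)
    ran_end = False
    if result.returncode == 0:
        # The end command runs only after a successful graph run.
        subprocess.run(end_cmd, check=True)
        ran_end = True
    return result.returncode, ran_end


# Stand-in commands: each one is just a tiny Python process.
py = [sys.executable, "-c"]
print(run_with_hooks(py + ["pass"], py + ["import sys; sys.exit(0)"], py + ["pass"]))
print(run_with_hooks(py + ["pass"], py + ["import sys; sys.exit(3)"], py + ["pass"]))
```

In the second call the stand-in graph exits with a non-zero status, so the end command is skipped.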

One other tool that deserves mention at this point is the EME, a repository management tool used to store and co-ordinate work on Ab Initio programmes (known colloquially as 'graphs'). It can connect directly to the GDE, allowing programming teams to co-ordinate their development efforts. In addition, the EME can store a wide variety of meta-data, such as definitions, reports and logs.

Types of Parallelism

There are three forms of parallelism employed by Ab Initio: Data Parallelism, Pipeline Parallelism and Processor Parallelism. These three mechanisms interact in different ways and all have to be taken into account when designing systems.

Data Parallelism

Ab Initio can store data, pre-sorted or otherwise, in aggregate file systems known generically as multifiles. The creation and management of these file systems is handled by Ab Initio utilities. Data in a multifile can be ordered by specified keys, on a statistical basis using hashing algorithms or left unordered.
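To make the idea of ordering data across partitions concrete, here is a minimal sketch of two of the schemes mentioned above: hashing on a key, and leaving the data unordered (shown here as round-robin, which is an assumption about what "unordered" might look like in practice). It is plain Python working on in-memory lists, not Ab Initio's own partitioning, and the function names are hypothetical.

```python
import zlib


def hash_partition(records, key, ways):
    """Place each record in one of `ways` partitions by hashing its key field.

    zlib.crc32 is used because it is stable between runs, unlike Python's
    built-in hash() on strings.
    """
    partitions = [[] for _ in range(ways)]
    for record in records:
        index = zlib.crc32(str(record[key]).encode("utf-8")) % ways
        partitions[index].append(record)
    return partitions


def round_robin_partition(records, ways):
    """Deal records out in turn, ignoring their content."""
    partitions = [[] for _ in range(ways)]
    for position, record in enumerate(records):
        partitions[position % ways].append(record)
    return partitions


# Made-up records: 20 accounts spread over 4 customers.
records = [{"customer": f"C{n % 4}", "amount": n} for n in range(20)]

for number, part in enumerate(hash_partition(records, "customer", 8)):
    customers = sorted({r["customer"] for r in part})
    print(number, len(part), customers)
```

The property that matters for later processing is that every record with the same key lands in the same partition, so each partition can be processed without reference to the others. Round-robin gives an even spread but no such guarantee.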

The main purpose of multifiles is to keep data ready for parallel processing and thus avoid the overhead of ordering it on the fly. Normally, this is a good thing. On the rare occasions that it is not, Ab Initio's sorting and partitioning tools are sufficiently powerful to take the sting out of the need to re-order a file.

Data often comes from other systems that either offer no data parallelism or offer it in a form incompatible with that employed by Ab Initio. It is therefore common practice to have one or more data import graphs which take the incoming serial data and distribute it to suitable multifiles.
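The job of such an import step can be sketched as follows: read one serial file and spread its lines across a set of partition directories. The directory layout used here (part0, part1, ... each holding a data.txt) is purely hypothetical and is not the real multifile format, which is created and managed by Ab Initio's own utilities.

```python
import contextlib
import os
import tempfile


def distribute_serial_file(source_path, target_dir, ways):
    """Split the lines of a serial file round-robin into `ways` partition files.

    Returns the list of partition file paths, in partition order.
    """
    paths = []
    for number in range(ways):
        part_dir = os.path.join(target_dir, f"part{number}")  # hypothetical layout
        os.makedirs(part_dir, exist_ok=True)
        paths.append(os.path.join(part_dir, "data.txt"))

    with contextlib.ExitStack() as stack:
        outputs = [stack.enter_context(open(p, "w", encoding="utf-8")) for p in paths]
        source = stack.enter_context(open(source_path, encoding="utf-8"))
        for position, line in enumerate(source):
            outputs[position % ways].write(line)
    return paths


# Demonstration in a throwaway directory with a made-up input file.
with tempfile.TemporaryDirectory() as work:
    serial = os.path.join(work, "incoming.txt")
    with open(serial, "w", encoding="utf-8") as handle:
        handle.writelines(f"record {n}\n" for n in range(10))

    for path in distribute_serial_file(serial, os.path.join(work, "mfs"), 4):
        with open(path, encoding="utf-8") as part:
            print(os.path.basename(os.path.dirname(path)), len(part.readlines()))
```

A real import graph would normally partition on a key rather than round-robin, so that downstream graphs find related records together.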

Pipeline Parallelism

Inside a graph, data may be processed either serially or in parallel. This is irrespective of the underlying hardware and is achieved by the use of the operating system's pipeline programming facility.

The Ab Initio coder need know nothing about the mechanics of this system in order to write graphs but an understanding of a particular operating system's pipeline model may help in producing an efficient solution.
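For readers unfamiliar with the pipeline model, the sketch below shows the general shape: each stage starts work on a record as soon as the previous stage hands it over, rather than waiting for the whole data set. Note the substitution: it uses Python threads and in-process queues instead of operating system pipes between processes, so it illustrates the data flow only, not the performance. All names are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push items through the stages in funcs, all stages running concurrently."""
    # Small bounded queues play the part of pipe buffers between stages.
    queues = [queue.Queue(maxsize=4) for _ in range(len(funcs) + 1)]

    def feed():
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    workers = [threading.Thread(target=feed)]
    workers += [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))  # [20, 30, 40]
```

The sketch has no error handling: a stage that raises an exception would stall the stages after it. The bounded queues mean a fast stage is held back when the next stage is slow, which is also how a full pipe buffer behaves.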

Processor Parallelism

When the Co>Operating System runs on a machine with multiple processors, it automatically translates both the Data and Pipeline parallelism of a graph to the most efficient model for the underlying hardware.

It does no harm for the coder to know how many processors will be available to run a job, and in time-critical cases it's considered good practice to specify how many processors must be available to a particular graph.

Discussions on Specific Topics

Last modified: 2009/01/21 18:02