The Organiser is where the building blocks of an Ab Initio 'graph' may be found and is simply a view of the directory 'Components' under the GDE's home directory.
Within the Organiser, components are grouped into more or less logical categories, corresponding to the sub-directories of 'Components'. If the user wishes, it's a simple matter to move the components around, rename groups and generally customise the Organiser to his own needs.
The components themselves take the form of simple text files which define the parameters passed to mp, the CoOperating system runtime. Those with a .mdc extension define datasets while those ending in .mpc define processing components. The GDE translates the component files into the graphical elements that a user works with.
The following is a guide to a typical starter installation.
The compression components are similar in function to stand-alone compression utilities such as gzip and Zip.
Compress reduces the size of files.
Uncompress restores data to its original format.
Continuous components permit Ab Initio to deal with data which appears in a non-synchronous manner. As an example, consider a stock control system which receives updates from shops on a 'just in time' basis. These updates can simply be dropped into a landing zone from which a constantly running graph extracts and processes them.
Batch Subscribe reads data from a source file until it meets an EOF marker or until a specified condition is false. This provides a mechanism for continuous flows to be terminated programmatically. You'd use this where you wanted to spot that a set of data was now complete.
Continuous Rollup is an aggregating process which will give out results at specified intervals. For a fuller description of how it works, see Aggregate (below).
Continuous Scan can provide running totals or other summaries for groups of related records with the advantage of not requiring that the input flow be in sorted order.
Continuous Update Table will update a database table from a continuous flow and can commit the update at specified intervals.
Publish writes to a data set which may then be treated as the source of a continuous flow.
Subscribe reads from a dataset, typically one created by Publish (above).
Components which interface Ab Initio to the user's database manager.
Input Table defines a data source in a database. Although it is typically pointed at a named table for speed, it may just as easily specify a view or a complex select statement.
Output Table defines a target table to write to in a database.
Run SQL can execute any valid command against the target database. It's possible to get quite creative with this but it requires stringent testing for anything sophisticated to ensure that timing and other issues don't cause unexpected problems.
Truncate Table deletes all data from the target table.
Update Table is a special purpose version of Run SQL which can insert into or update a specified table.
The input and output table components are as shown under database above.
The input, intermediate, lookup and output files are actually the same component with different default settings. You can easily change one into the other from the properties dialogue.
There are often special purpose components added here such as SAS input and output file types.
The main file type can address serial files, multifiles or compound multifiles as required. Depending on how the properties are set, the GDE displays an appropriate icon for the component.
Departition components convert multiple flows into a single flow.
Concatenate takes a number of flows and attaches them 'head to tail', that is, the first partition is read out, followed by the second and so on.
Gather reads records from multiple flows and combines them in an arbitrary fashion to create a single flow. As a result, there is no telling in what order records will appear in the output. For this reason, it's common to follow gather with a sort. That said, it's not necessary to precede a sort component with a gather because sort contains its own gather.
Interleave reads each flow in turn, rather like someone collating a document from a number of piles of pages.
Merge expects the data on all its flows to be in sorted order. It then combines the flows so as to maintain that order.
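The difference between three of these is easier to see with toy data. The following is only a Python sketch of the ideas, not Ab Initio code: the function names are my own, flows are plain lists, and I've assumed Interleave takes a single record from each flow per turn. Gather is left out because its output order is arbitrary by definition.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # marks an exhausted flow inside interleave()


def concatenate(flows):
    """Head to tail: all of the first flow, then all of the second, and so on."""
    return list(chain.from_iterable(flows))


def interleave(flows):
    """Take one record from each flow in turn, skipping flows that have run dry."""
    out = []
    for batch in zip_longest(*flows, fillvalue=_MISSING):
        out.extend(rec for rec in batch if rec is not _MISSING)
    return out


def merge(flows, key=None):
    """Combine flows that are already sorted, keeping the result sorted."""
    return list(heapq.merge(*flows, key=key))


flows = [[1, 4, 7], [2, 5], [3, 6]]
print(concatenate(flows))
print(interleave(flows))
print(merge(flows))
```

Note that merge only gives a sorted result because every input list was sorted to begin with, which is exactly the condition the Merge component imposes.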
This is where old components go to die. Anything in this group is available to maintain backwards compatibility with previous versions of the CoOperating system but should not be used for new development.
This group is where the Ab Initio tutorial material lives.
Components to communicate with computers which are not themselves running Ab Initio. Note, however, that for these components to work, the remote computer must be running an FTP server.
FTP From gets files from the remote server.
FTP To drops files onto the remote server.
The odds and sods department where things end up which just don't fit elsewhere. Always worth looking at if you're visiting a new site.
Transitive Closure contains two components, Compute Closure and Recirculate. I've never really found out what these actually do but they seem to be in every Ab Initio installation. They're not even documented in the help file!
Gather Logs (originally known as Logger) writes the logs from each component connected to it to a protected file which is not affected by Ab Initio's rollback mechanism. It's basically a debugging tool.
Redefine Format copies all the data on its input port to the output port, changing the record format as it goes. You could do the same job with a Transform but this is easier if you're not changing values and better from the documentation point of view.
Replicate is a form of Gather that can write several duplicate versions of its output. The same caveats apply to both components.
Run Program does exactly what it says on the tin. It will pass back the output of the programme it executes so you can use that later in the graph.
Trash is, oddly enough, one of the most useful components of all during development. It works just like /dev/null in Unix, sending all output to the great bit-bucket in the sky. However, all the normal GDE facilities of Counts and Watchers can be applied to a Trash line, so you can use it to spot, for example, anomalous events on reject ports.
Drop any components which you create yourself in here.
These components convert serial flows into parallel flows. It is always better to process in parallel with Ab Initio because that's what the CoOperating system is best at. Speed gains on a big machine can be quite spectacular if you get it right.
Broadcast is kind of like Replicate. It takes any number of source flows, interleaves them arbitrarily, then copies the combined flow to any number of outputs.
Partition by Expression takes a piece of DML and applies it to each input record in turn. The value generated by the expression must be a number from 0 to n where n is one less than the number of output flows. The input record is then written to the output flow defined by the number.
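As a rough illustration of the mechanism, here is a Python sketch rather than DML. The function name is hypothetical, the expression is an ordinary Python callable, and the sketch assumes that output flows are numbered from zero.

```python
def partition_by_expression(records, expression, n_flows):
    """Route each record to the flow whose number the expression returns.

    Assumption of this sketch: flows are numbered 0 .. n_flows - 1.
    """
    flows = [[] for _ in range(n_flows)]
    for rec in records:
        target = expression(rec)
        if not 0 <= target < n_flows:
            raise ValueError(f"expression returned {target!r} for {rec!r}")
        flows[target].append(rec)
    return flows


# Split account numbers three ways on their remainder modulo 3.
flows = partition_by_expression(range(6), lambda rec: rec % 3, 3)
print(flows)
```

The range check matters: an expression that wanders outside the permitted values is a bug you want to hear about straight away.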
Partition by Key (previously Hash Partition) basically works like an updated version of the old ISAM files, so beloved of COBOL programmers. You give the component a key expression and it applies a hashing algorithm such that records are spread more or less evenly across all the output flows with identical keys ending up in the same flow.
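The essential property, that identical keys always land in the same flow, can be shown in a few lines of Python. This is a sketch only: the function name is mine, and CRC-32 is used purely as a convenient stable hash. I don't know which hashing algorithm the component itself uses.

```python
import zlib


def partition_by_key(records, key, n_flows):
    """Hash each record's key and use the result to choose an output flow."""
    flows = [[] for _ in range(n_flows)]
    for rec in records:
        # CRC-32 stands in for the real hashing algorithm (an assumption).
        digest = zlib.crc32(str(key(rec)).encode("utf-8"))
        flows[digest % n_flows].append(rec)
    return flows


orders = ["apple", "pear", "apple", "fig", "pear"]
for number, flow in enumerate(partition_by_key(orders, key=lambda rec: rec, n_flows=3)):
    print(number, flow)
```

How evenly the records spread depends on the keys: a flow full of one very common key value will always be bigger than its neighbours.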
Partition by Percentage takes an input flow and distributes it to the output flows on a percentage basis. You need to define the split either by setting the percentages parameter or supplying a file in which each record is a number. Whichever way you do it, the total must come to 100 or less. Let's say that we supply the parameter '20 10 40' and we have four output flows. Now, if we have, say, ten input records, the output flows will receive 2, 1, 4 and 3 records (the last flow receives any surplus if the parameter totals less than 100). The distribution isn't quite arbitrary but I've never been able to work out exactly what's happening.
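The arithmetic in that example can be checked with a short Python sketch. The function name is hypothetical, and it only works out the share each flow should receive. It makes no attempt to reproduce the order in which the component actually hands records out.

```python
def effective_percentages(percentages, n_flows):
    """Work out each flow's share, giving any surplus to the last flow."""
    shares = list(percentages)
    if len(shares) > n_flows:
        raise ValueError("more percentages than output flows")
    if sum(shares) > 100:
        raise ValueError("percentages must total 100 or less")
    shares += [0] * (n_flows - len(shares))  # flows with no entry start at zero
    shares[-1] += 100 - sum(shares)          # the last flow takes the surplus
    return shares


shares = effective_percentages([20, 10, 40], n_flows=4)
counts = [10 * share // 100 for share in shares]  # ten input records
print(shares, counts)
```

For the parameter '20 10 40' and four flows this gives shares of 20, 10, 40 and 30 percent, which for ten records is the 2, 1, 4 and 3 quoted above.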
Partition by Range takes an unsorted input flow and splits it into equal sized output flows. To do this, it relies on a separate input flow of 'splitters' which are generated by the Find Splitters component. Given that the latter is deprecated, one can only assume that Partition by Range will soon be dropped also.
Partition by Round-robin is the simplest partitioning component. It takes the input flow records and writes them to the output flows in turn, the first record to the first flow, the second record to the second flow and so on. When it runs out of flows it starts again with the first flow. This is the component of choice when all you're looking to do is take advantage of parallelism without needing to worry about differentiating records by key values, etc.
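In Python terms the whole scheme comes down to a modulo operation. The function name below is my own and the flows are plain lists, so treat it as an illustration of the dealing order and nothing more.

```python
def partition_round_robin(records, n_flows):
    """Deal records out to the flows in turn, like dealing a hand of cards."""
    flows = [[] for _ in range(n_flows)]
    for position, rec in enumerate(records):
        flows[position % n_flows].append(rec)
    return flows


print(partition_round_robin(range(7), 3))
```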
Partition with Load Balance used to be known as Load Level which describes its function very well. It reads records from its input and writes them to the first output flow, stopping when the output flow's buffer is full. It then starts writing to the next output flow, again stopping when the output buffer signals full. This process continues until there are no more input records. As a method of getting high through-put it has its merits but, like the round robin partitioning scheme, gives absolutely no control over what data goes into what flows, so it is only useful where speed of processing is paramount and all flows can be processed identically.
This group contains a large number of components for use with the SAS (Statistical Analysis System) language.
The Sort group, strangely enough, contains the Ab Initio sort components.
Internal is a directory containing sub-components of the sorts. You could play around with them but Ab Initio don't recommend it and neither do I.
Checkpointed Sort performs a memory sort on groups of records, merging the groups to produce a sorted flow. Basically, this is two components, PartialSort and MergeRuns. PartialSort reads data from the input flow until it's filled up its buffer (the size of which is defined by the parameter max-core). Once the buffer is filled, the contents are sorted in memory and the sorted block is passed to MergeRuns, as are all subsequent sorts. MergeRuns then interleaves the sorted blocks to produce a completely sorted output. So, you ask, why all this complexity? The answer lies in checkpoints, Ab Initio's mechanism for controlling programme flow, rollback and restarts. Checkpoints are, essentially, a managed set of intermediate files. If a graph fails due to a data error, it's possible to fix the failing data and restart from the last checkpoint. When dealing with hundreds of gigabytes of data, which is what Ab Initio is intended for, this can save hours or even days of run-time.
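The two-stage sort described above can be sketched in Python as follows. The function names merely echo PartialSort and MergeRuns and are not the real sub-components. The buffer limit is counted in records rather than bytes for simplicity, and the checkpoint files themselves are not modelled at all.

```python
import heapq
from itertools import islice


def partial_sort(records, max_records):
    """Fill a buffer of at most max_records, sort it in memory, emit it as a run."""
    iterator = iter(records)
    while True:
        run = sorted(islice(iterator, max_records))
        if not run:
            return
        yield run


def merge_runs(runs):
    """Interleave the sorted runs so that the combined output is fully sorted."""
    return list(heapq.merge(*runs))


def sort_merge(records, max_records):
    return merge_runs(list(partial_sort(records, max_records)))


print(list(partial_sort([5, 3, 8, 1, 9], 2)))
print(sort_merge([5, 3, 8, 1, 9, 2, 7], 3))
```

In a real checkpointed run each sorted block would be a natural thing to save, since a restart could then pick up from the saved blocks rather than re-reading the input. That is my reading of why the sort is split this way, not something the sketch demonstrates.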
Partition by Key and Sort is another hybrid. In this case, the first part is our old friend Partition by Key (see above). Once the entries have been spread across the output flows, each flow is then sorted on the key field.
Sample is a useful tool for those cases where you want to reduce a larger data set for testing purposes or to use for statistical analysis. It's claimed that the chance of any particular record appearing in the output is the same for all records, regardless of the size of the input file or the position of the record.
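One simple way to get that equal-chance property is to make an independent keep-or-drop decision for every record. The sketch below does exactly that in Python. It is an assumption about the approach, not a description of how the Sample component is implemented, and the names are mine.

```python
import random


def sample(records, probability, seed=None):
    """Keep each record with the same independent probability."""
    rng = random.Random(seed)  # a fixed seed makes a test sample repeatable
    return [rec for rec in records if rng.random() < probability]


subset = sample(range(1000), probability=0.1, seed=42)
print(len(subset))
```

Because each decision is independent, the size of the output varies a little from run to run unless the seed is fixed.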
Sort is the basic sort component. It actually performs a sort-merge, in the same way as the Checkpointed Sort described above, but without writing checkpoint files. The advantage is raw speed and the disadvantage is that if the data contains a single invalid record you lose the whole of the process. Basically, you'd use Sort whenever you're confident that the data will always be valid or where the loss of the process is irrelevant, such as where the flow to be sorted is relatively small (say, less than one hundred megabytes).
Sort within Groups is an awkward name for a component that probably ought to be called subsort or something like that. Basically, it expects the input to be pre-sorted on the 'major key'. The component reads data from the input until either the major key changes or it exceeds the buffer size set by max-core. In the first case it sorts the records on the minor-key (or sub key if you prefer) and writes the sorted data to the output flow. In the second case, it throws an error. You may, bizarrely, set the component to allow data that is not sorted on a major-key, although you still have to define a major-key field. If you do this, the component works as described previously but now reads until the major-key changes without caring if the next major-key is higher or lower in collating order. This latter behaviour is actually quite useful if you're dealing, for example, with invoices where the data is in date order but the major-key is, for example, customer ID. This allows you to sort the invoice lines on a particular order without jumbling up the invoices (although, come to think on it, you'd have problems if you had two invoices for the same customer following one another - who said life should be easy!).
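The relaxed mode, where a new group starts whenever the major key changes, maps neatly onto Python's itertools.groupby. The sketch below uses my own function and parameter names, and max_group is a record count standing in for the max-core byte limit.

```python
from itertools import groupby


def sort_within_groups(records, major_key, minor_key, max_group=None):
    """Sort each run of records sharing a major key on the minor key."""
    out = []
    for _, group in groupby(records, key=major_key):
        block = list(group)
        if max_group is not None and len(block) > max_group:
            raise ValueError("group is larger than the buffer allows")
        out.extend(sorted(block, key=minor_key))
    return out


# (customer, line number) pairs: customer A turns up twice, non-adjacently.
lines = [("A", 3), ("A", 1), ("B", 2), ("B", 1), ("A", 2)]
print(sort_within_groups(lines, major_key=lambda r: r[0], minor_key=lambda r: r[1]))
```

The second run of customer A stays where it is and is sorted on its own, which is the invoice behaviour described above. Two adjacent invoices for the same customer would be treated as one group, which is the problem also mentioned above.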
These are the 'power tools' of Ab Initio.
Aggregate produces subtotals or counts. It expects the input flow to contain data sorted on the grouping key. So long as the keys on the input flow match, Aggregate assumes they're in the same group. The actual aggregation is done by a function which is expected to provide the logic in standard Ab Initio DML. This means that, once you've broken the pain barrier of understanding exactly how it works, Aggregate allows you to perform an almost unlimited variety of actions on the flow.
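To show the grouping rule without any DML, here is a minimal Python sketch that produces a subtotal and a count per key. The function name and the fixed choice of sum and count are mine. The real component lets you supply whatever logic you like.

```python
from itertools import groupby


def aggregate(sorted_records, key, value):
    """Emit (key, subtotal, count) for each run of records with matching keys."""
    out = []
    for k, group in groupby(sorted_records, key=key):
        values = [value(rec) for rec in group]
        out.append((k, sum(values), len(values)))
    return out


sales = [("A", 1), ("A", 2), ("B", 5)]
print(aggregate(sales, key=lambda r: r[0], value=lambda r: r[1]))
```

Like the component, groupby only compares neighbouring keys, so unsorted input quietly produces several partial groups for the same key instead of an error.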
Dedup Sorted splits an incoming flow of sorted records into two output flows. The first time it encounters a key, it sends the record to its 'out' port. All subsequent records are sent to the 'dup' port. The interesting thing to note here is that you can ignore either the 'out' or the 'dup' port. Thus, you could only read from the 'out' port to get the first example or only read from the 'dup' port to ignore a header record. You could, of course, process records from both ports if you wish.
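The two ports can be modelled as two lists. This Python sketch (the function name is my own) keeps the first record for each key on 'out' and sends the rest to 'dup', as described above.

```python
from itertools import groupby


def dedup_sorted(sorted_records, key):
    """Return (out, dup): the first record per key, and all the later ones."""
    out, dup = [], []
    for _, group in groupby(sorted_records, key=key):
        first, *rest = group
        out.append(first)
        dup.extend(rest)
    return out, dup


out, dup = dedup_sorted([("A", 1), ("A", 2), ("B", 3)], key=lambda r: r[0])
print(out)
print(dup)
```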
Denormalize Sorted takes a sorted input stream and groups the related records into one. It's similar to Aggregate (above) but it's used to create an output similar to Excel's PivotTable, where a value from each input record becomes a field in the output record. The big trick is to make sure that your logic traps the case where you have more input records than expected - Denormalize Sorted will collapse in a heap if you fail to do that.
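Here is the pivoting idea as a Python sketch, including the trap for groups that are bigger than the output record can hold. The names, the max_items parameter and the use of None for empty slots are all my own assumptions.

```python
from itertools import groupby


def denormalize_sorted(sorted_records, key, value, max_items):
    """Turn each group of records into one wide record with max_items slots."""
    out = []
    for k, group in groupby(sorted_records, key=key):
        values = [value(rec) for rec in group]
        if len(values) > max_items:
            # The trap: more input records than the output record has fields.
            raise ValueError(f"group {k!r} has {len(values)} records, limit {max_items}")
        values += [None] * (max_items - len(values))  # pad the unused slots
        out.append((k, *values))
    return out


readings = [("A", 10), ("A", 20), ("B", 5)]
print(denormalize_sorted(readings, key=lambda r: r[0], value=lambda r: r[1], max_items=3))
```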
Filter by Expression applies a test to an input flow. Records which pass the test appear on the 'out' port while those which don't appear on the 'deselect' port. You must connect the out port to something but deselect can be ignored if you wish.
Join allows you to select data based on matches between two or more input flows. Records which fail to match are sent to the 'unused' ports while the successful joins are available to create an output record using a process similar to Reformat (below).
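For two inputs, the matched and unused outputs can be sketched in Python like this. It is a plain in-memory lookup join of my own devising, chosen because it is short. It says nothing about how the component does the matching internally, and the port names in the code are only labels.

```python
from collections import defaultdict


def join(in0, in1, key0, key1):
    """Return (out, unused0, unused1) for an inner join of two flows."""
    index = defaultdict(list)
    for rec in in1:
        index[key1(rec)].append(rec)

    out, unused0, matched = [], [], set()
    for rec in in0:
        k = key0(rec)
        if k in index:
            matched.add(k)
            out.extend((rec, other) for other in index[k])
        else:
            unused0.append(rec)

    unused1 = [rec for k, recs in index.items() if k not in matched for rec in recs]
    return out, unused0, unused1


customers = [(1, "Ann"), (2, "Bob")]
orders = [(1, "bolt"), (1, "nut"), (3, "gear")]
out, unused0, unused1 = join(customers, orders, key0=lambda r: r[0], key1=lambda r: r[0])
print(out, unused0, unused1)
```

Each matched pair in 'out' is what you would then hand to the transform that builds the output record.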
Match Sorted looks at first blush rather like a simpler version of Join but it's actually a completely different beast. Each port is filtered through its own select statement and the record is only passed on if it satisfies the selection. The component then reads records from its buffers in turn starting with port 0. If port 1 has a record with the same key, that record is made available, the same with port 2 and so on. If a port does not have a matching key, then the previous value is presented again. The selected records are then passed into a transform function, written by the developer, which decides whether the presented records are valid or not. If the function passes the record, it is written to the output, otherwise it's written to the reject port corresponding to the input port. In all fairness, I've never yet found a use for Match Sorted because the logic is just too convoluted. I find it better to use a combination of selects and joins to do the same job on the basis that what you lose in execution speed is more than made up for in readability and consequent reliability.
Normalize does exactly the opposite job to Denormalize Sorted (above). It takes a single record and splits it up to produce new records based on the data in the parent record. You have to define a length function which will decide how often to call the normalize function (which you will also need to define). To see how this works, imagine a record which contains production data with a field for each day of the week. In a case like this, the length function would simply return '7' on each call. The normalize function would therefore be called seven times and on each call it would generate a new output record. A more sophisticated implementation might compute the number of calls to the normalize function dynamically, only producing records for days with non-zero production quantities. Normalize is remarkably flexible and worth learning well.
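The weekly production example looks like this as a Python sketch. The two callables play the parts of the length and normalize functions. The record layout, with a plant name and a list of seven quantities, is invented for the illustration.

```python
def normalize(records, length, normalize_fn):
    """Call normalize_fn(record, index) length(record) times for each record."""
    out = []
    for rec in records:
        for index in range(length(rec)):
            out.append(normalize_fn(rec, index))
    return out


week = {"plant": "P1", "qty": [5, 0, 3, 0, 0, 2, 1]}  # one field per day
rows = normalize(
    [week],
    length=lambda rec: 7,                                       # always seven days
    normalize_fn=lambda rec, day: (rec["plant"], day, rec["qty"][day]),
)
for row in rows:
    print(row)
```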
Reformat is the Swiss Army Knife of the system. Basically, you can define any number of input flows and then select and modify data as required to create the output record. The only thing to remember is that you must have one input record on each flow for each output record. Reformat appears as the guest star in many other components.
Rollup is a souped up version of Aggregate. It can do everything Aggregate does plus a few things that the other component cannot. Later versions of the GDE have a Wizard associated with Rollup which makes programming it just a little easier.
Scan is yet another take on the Aggregate/Rollup idea. It's pretty well as flexible as Rollup plus it gives the option of retrieving intermediate summaries.
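The 'intermediate summaries' are easiest to picture as a running total that restarts for each group. Here is a Python sketch of that reading. The function name is mine, and a real Scan is not limited to simple totals.

```python
from itertools import groupby


def scan(sorted_records, key, value):
    """Pair every record with the running total for its group so far."""
    out = []
    for _, group in groupby(sorted_records, key=key):
        running = 0
        for rec in group:
            running += value(rec)
            out.append((rec, running))
    return out


payments = [("A", 1), ("A", 2), ("B", 5), ("B", 1)]
for rec, total in scan(payments, key=lambda r: r[0], value=lambda r: r[1]):
    print(rec, total)
```

Where an aggregation gives one output record per group, this gives one per input record, with the last record of each group carrying the group's final total.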
This group is where file format translators live.
Read XML takes an XML document and extracts the data in it to a flat file.
Write XML takes a flow and writes the data to an XML document.
This group contains components that are used for validating data.
Check Order is a rather specialised variant on a sort merge. You feed it a file which you expect to be sorted on a particular key. You can optionally pass it a limit value to define how many errors to accept. For each error it finds, Check Order writes a message to its output port. If the number of errors exceeds the specified limit, Check Order throws an error and halts the programme.
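A Python sketch of that behaviour follows. The names and the message wording are my own inventions. Only the rule of one message per error, with a halt once the limit is exceeded, comes from the description above.

```python
def check_order(records, key, limit=0):
    """Return a message per out-of-order record, raising once limit is exceeded."""
    messages = []
    previous = None
    for position, rec in enumerate(records):
        current = key(rec)
        if position > 0 and current < previous:
            messages.append(f"record {position}: key {current!r} follows {previous!r}")
            if len(messages) > limit:
                raise ValueError(f"more than {limit} ordering errors")
        previous = current
    return messages


for message in check_order([1, 2, 5, 3, 4, 2], key=lambda rec: rec, limit=2):
    print(message)
```

With the default limit of zero in this sketch, the very first out-of-order record stops the run.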
Compare Checksums compares the checksum values on its two input ports. You'd use this with Compute Checksum (below) to ensure that data wasn't being mangled, for example, by an electronic transfer process.
Compare Records is rather like Check Order (above) but compares two records arriving at its input ports.
Compute Checksum generates a checksum from a record. The output from this component is a record containing the CRC, length and record count for the input record.
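The checksum-then-compare pattern can be sketched with Python's zlib module. CRC-32 is an assumption on my part, as are the function names and the treatment of records as byte strings. I don't know which CRC variant the components use.

```python
import zlib


def compute_checksum(records):
    """Return (crc, total length in bytes, record count) for a flow of bytes."""
    crc, total_length, count = 0, 0, 0
    for rec in records:
        crc = zlib.crc32(rec, crc)  # carry the running CRC forward
        total_length += len(rec)
        count += 1
    return crc, total_length, count


def compare_checksums(first, second):
    """True when both sides report the same CRC, length and record count."""
    return first == second


sent = compute_checksum([b"hello", b"world"])
received = compute_checksum([b"hello", b"w0rld"])  # one byte mangled in transit
print(compare_checksums(sent, sent), compare_checksums(sent, received))
```

Carrying the length and record count alongside the CRC means a dropped or duplicated record shows up even in the unlikely event that the CRC values happen to collide.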
Generate Random Bytes creates one or more records consisting of random byte patterns.
Generate Records will produce a specified number of records, the fields of which are filled with random patterns. This is more useful than it may sound as it saves a lot of time when large amounts of non-identical test data are required, such as for timing worst case scenarios, etc.
Validate Records will check for the validity of the record format on its input port. This is a great tool if you think you're being passed out-of-spec data as it will generate a stream of error messages detailing each fault. It's also a great time saver if you're debugging a complex graph that you think may have conflicting record formats that aren't being picked up by the GDE.
The installation at each site may well vary from the examples shown here but hopefully this has given you a feel for the main Ab Initio components you're likely to encounter.