Datareign

Notes on the behaviour of the Ab Initio sort component

The Sort component's default behaviour is to attempt to perform everything in memory, only spilling data to disk when it exceeds MAX_CORE. It follows that, provided you have plenty of available memory, you should set MAX_CORE to a suitably high value. Sort is optimised to work in parallel and to make use of data as and when it falls out of each incoming flow.

So what happens when Sort exceeds its allocation of memory? The first thing it does is to finish sorting the data it currently holds. It then creates 16 separate overflow files and dumps the current (sorted) contents of memory into them. Sort next reads in more records up to MAX_CORE, sorts them and, assuming it once more runs out of space, dumps this set to the files. It repeats this behaviour until it runs out of input records or exceeds the size limit of the files.

If it hits the limit on the files, it creates 256 files, spreads the previous data across them and then goes through the sort and dump exercise again. When it finally runs out of input records, it performs a sort-merge on its working files and presents the result at its output(s).
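The sort, dump and merge cycle described above is the classic external merge sort. The Python below is a minimal sketch of that general technique only, not Ab Initio's implementation: the names external_sort, spill and read_run are hypothetical, max_core is counted in records rather than bytes, each sorted batch goes to a single temporary file, and the 16 and 256 file fan-out is not modelled.

```python
import heapq
import os
import tempfile
from itertools import islice


def spill(sorted_run):
    """Write one sorted batch of ints to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{r}\n" for r in sorted_run)
    return path


def read_run(path):
    """Stream a spilled batch back from disk, one record at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(records, max_core):
    """Sort ints while holding at most max_core records in memory at once.

    Returns (sorted_list, number_of_batches_spilled_to_disk).
    """
    it = iter(records)
    run = sorted(islice(it, max_core))
    paths = []
    while True:
        nxt = list(islice(it, max_core))
        if not nxt:
            break                     # input exhausted
        paths.append(spill(run))      # out of room: dump the sorted batch
        run = sorted(nxt)
    if not paths:
        return run, 0                 # everything fitted in memory
    try:
        # Sort-merge the last in-memory batch with the spilled ones.
        merged = list(heapq.merge(run, *(read_run(p) for p in paths)))
    finally:
        for p in paths:
            os.remove(p)
    return merged, len(paths)
```

With seven records and max_core set to 3, external_sort([5, 3, 9, 1, 7, 2, 8], 3) spills two batches before merging; raise max_core to 10 and the same input is sorted without touching the disk at all.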

Obviously, the bigger you set MAX_CORE, the less likely the above scenario becomes and the faster your sort will proceed but, as always, there's a catch…

Well-optimised operating systems may spot an 'excessively' high memory demand and take pre-emptive action to mark the culprit for an early swap. Thus you can end up in a Catch-22 situation where you can't use a high value for MAX_CORE because the system goes into 'swap catatonia', while using an acceptable value leaves you with heavy disk usage as Sort works from its overflow files.

In cases like this there is no substitute for being able to explain exactly how the bottleneck is caused. Working with the system administrators, it's usually possible to achieve a workable compromise but, in such a case, it's vitally important to document what's been done and why, so that an incoming programmer doesn't 'optimise' your solution into the wastebin and end up facing the same problem again.

Another thing to note is that, when working in parallel, Sort will work on each flow independently, emitting records as each partition completes. This behaviour can lead to a subtle 'gotcha!' that often catches novices.

In order to ensure that everything is synchronised for smoother throughput, people sometimes put a phase break upstream of a Sort but this can actually have the opposite effect. If the streams are markedly lopsided, the phase break will halt all processing until the last stream is complete. It will then restart. If the phase break hadn't been present, the smaller flows would have shot through the sort leaving more capacity for the larger ones and thus giving a higher overall throughput.
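To put rough numbers on this, here is a toy model in Python. It is a sketch under loud assumptions, not a description of how Ab Initio schedules work: the function finish_time is hypothetical, the sorting capacity is modelled as a single shared resource that handles one flow at a time in arrival order, and the arrival times and sort costs are invented for illustration.

```python
def finish_time(arrivals, sort_work, phase_break):
    """Return the time at which the last flow finishes sorting.

    arrivals[i]  : time at which flow i's data is ready upstream
    sort_work[i] : time the shared sort resource needs for flow i
    phase_break  : if True, no flow is released until all have arrived
    """
    if phase_break:
        release = max(arrivals)               # wait for the slowest flow
        arrivals = [release] * len(arrivals)
    clock = 0
    # Serve flows in the order they become available.
    for arrive, work in sorted(zip(arrivals, sort_work)):
        clock = max(clock, arrive) + work
    return clock
```

For three lopsided flows arriving at times 2, 3 and 10 with sort costs of 2, 3 and 10, finish_time([2, 3, 10], [2, 3, 10], False) gives 20, whereas the same call with phase_break=True gives 25. Without the phase break, the two small flows are sorted while the large one is still arriving; with it, all of that work queues up behind the slowest flow.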

Worse still is the above scenario with a checkpoint added. Now the system has to write all the data to disk and read it back again. If your goal is speed, this is not the way to achieve it!

It's worth noting that, if the next component after the Sort is a Join, as is often the case, then checkpointing before the Sort is doubly wasteful. This is because Join is intelligent enough, like Sort, to write its input flows to disk when deadlocking or overflow becomes an issue. In this case, you've now got two writes per flow and two reads where you may need none at all.

It's worth considering that Ab Initio components are generally smart enough to do the work for you if you let them.

Last modified: 2009/01/21 17:59