It used to be called 'normalisation' but, whatever the name, the game's the same: keeping data organised as efficiently as possible, while retaining speed of access and ease of maintenance.
Let's take a common requirement: maintaining invoices. In the days before computers, the process was quite simpleā¦
So, all the information for each customer was repeated as often as required on each invoice.
When computers first appeared, this tested paper system was simply carried over, more or less as is, to the electronic system, which meant that a lot of storage, which in those days was very expensive, was tied up with repeated customer and product information.
In the late 'sixties, Edgar Frank Codd realised that computers could handle such data much more efficiently. Taking our invoice example, Codd proposed that repetitive data, such as the customer's address and the product descriptions, could each be stored once only and simply referenced as required.
Codd referred to this process as normalisation (spelt with a 'z' in the U.S.) and applying this process in a simplistic manner to our invoice data results in a record structure that might look like this:
Hmmm, a lot seems to have gone missing. What's this customer number and where are the invoice lines and the totals?
Well, the customer's details are in a seperate file:
and the lines are in another:
which refers back to the invoice and forward to the product file:
While the totals and the tax are stored once only in the sales ledger file:
Notice that we don't store the gross total, after all, the computer can calculate that on the fly, thus allowing for late changes. Nor do we store the date, because we have that in the invoice file.
This is a relatively simple expansion of the relational model. In essence, this type of model refers to main tables as Facts and to subsidiary tables as Dimensions. The entirety of a schema is known as a Constellation. Sticking with our invoice, the dimensional model might look like this:
If you're used to relational data modelling, you could be forgiven for saying that this is simply a relational data model, and that is basically true. The main difference is the slightly artificial rule that a Dimension cannot reference another table, it must be complete unto itself. By this definition, only the Product and Address tables are dimensions; the others are all fact tables.
Despite its slightly artificial flavour, the dimensional model does provide one major benefit. The aim is to drive the normalisation process to the most practical extreme and this can be adventagous when dealing with very large databases, such as are typical in data warehouses.
Another advantage is the removal of doubt when building complex schemas. Because tables are either dimensions or not, it is possible to identify areas of confusion early, especially when a large team is working on the design and it is essential that each analyst can trust his colleagues' decissions. Quality assurance is much easier when one worker can look at a definition and rapidly identify an error in its classification.
The most obvious problem with the dimensional model, relative to the relational model, is that it locks the designer into a straight-jacket, which can lead to awkward and sometimes counter-productive schemas.
The diagram above is a good example. It isn't actually a true dimensional model, because there really shouldn't be references from one fact table to another. However, sticking to this rule can lead to the construction of models where dimension tables exist only to provide a cross reference between two fact tables. This is not invariably problematic and, indeed, it has been argued that such a rule is inherent in the relational model itself. However, if carried to its logical extreme, this can create schemas that are a nightmare in coding and performance terms.
As with all methodologies, the trick is to use the tool intelligently. Before going too far, it always pays to identify, if possible, the most awkward case. Model this and implement the resulting design in a real RDBMS, using quantities of test data that match the anticipated traffic. If you get mired in unbearable complexity, or attempting to write real queries produces unacceptable loads or run times, you can be reasonably sure that your requirement demands a different approach.