In the last installment of Regular Expressions, we stressed the importance of valid XML. In even worse shape than its XML are the world's relational database management systems (RDBMSs).
RDBMSs are more important than XML messages, and at least as flawed. In general, databases--even important ones, like those which encode the medical records at a leading hospital, the communications records of the White House, the parts lists of military craft, or the mortgage histories of a bank's customers--are deeply flawed. Errors fill them. We have a few coding tips for quickly shaking out the worst of these.
Act Normal!
Data management is a huge subject--arguably as big and complex as all the rest of computing combined. We've only begun as a discipline to understand best practices, so the tips in one book, let alone a single column like this one, can only address a tiny fraction of all the errors present now in production databases.
Certain errors are so ubiquitous, though, that it's worth understanding quick ways to identify them. A good starting point is to understand database normalization. Informally, normalization has to do with consistency, integrity, and maintainability of data. An example will help:
Imagine a business that keeps records on, along with much else, customer orders for products, and customer complaints. We'll focus now on data in tables called orders and complaints. Along with information about products, pricing, timestamps, and so on, each of these tables has columns called customer_id and customer_name.
Experienced datamanagers will be on the alert already. This kind of redundancy almost certainly is a symptom of a bad design. It's also extremely common, possibly even more so than in decades past. Although normalization theory was well understood early in the '70s, a combination of causes, including mis-designed Web frameworks, faulty object-relational ideology, and incoherent college training, leaves us with databases in 2009 that are at least as fragile as those common a quarter-century ago.
Compare this to memory errors. Memory-access faults are frustrating, for they were thoroughly understood in C-like languages during the '70s, yet continue to plague real-world programs, including such crucial ones as operating systems and Web browsers. We do seem to be making progress with memory management, though, and even the failures they cause are relatively benign, if only because of their obscurity: often no more grave than a Web session that must be restarted.
When databases go wrong, though, innocent people are arrested, medication given by mistake, and bank accounts wiped out. Look at all the ways our customer_id can go bad:
Missing entries
Presumably a complaint is about an order, so each complaint should trace back to a specific order. With customer_id a copy of what appears in the order, though, rather than a reference, it's inevitable that the two will diverge. This error is particularly common with frameworks that autogenerate data designs based on visual screen layouts. Once this practice introduces redundancy, though, it's the moral equivalent of pushing spaghetti uphill to keep the different entries consistent. The result: when you run a query such as
select distinct customer_id, customer_name from complaints
where customer_id not in (select customer_id from orders)
order by customer_id;
you end up with more than zero records. Each one of these represents an "orphaned" complaint, one where, at best, the complainant appears to be a person who has never done business with the company.
Non-unique keys
Also common are "ids" that don't really identify: customer 12345 appears as both "John Smith" and "John A. Smith". In principle, there might be value in tracking the different names a person uses during different transactions. In our experience, though, organizations never argue this; the different names are simply mistakes.
All commercial-class DBMSs support at least one mechanism for ensuring uniqueness of keys--yet adequately confused developers still find ways to introduce inconsistencies. A query like
select distinct complaints.customer_id,
complaints.customer_name, orders.customer_name
from complaints, orders where
complaints.customer_id = orders.customer_id and
complaints.customer_name <> orders.customer_name
order by customer_id;
spots records where the complaints and orders tables disagree about a customer's name.
Also valuable is a report on just one table of any inconsistencies where the same customer appears under multiple names:
select customer_id, counter from (select customer_id,
count(distinct customer_name))
counter
from orders group by customer_id) subquery where counter > 1;
NULL and other wrong values
A missing customer_id is certainly a problem:
select distinct customer_name from complaints
where customer_id is null;
It's probably just as bad to have a customer with a blank name:
select distinct customer_id from orders
where ltrim(rtrim(customer_name)) = '';
Human-readable values such as names admit plenty of other basic checks of consistency: are any names only one character long? Is capitalization consistent throughout the database? What characters appear in names? "O'Brien" is likely correct, but "0'Br13n" (embedding three digits) deserves at least confirmation.
Summary
Database managers generally understand that their responsibility is great, and this inclines them to conservatism: they avoid changing what seems to work. Recognition of the importance of normalization and other aspects of good data design is far less likely. If you ever need to deal with a defective production database--that is, nearly any real-world application--the queries given above provide a good model for an initial "sanity check". If the design is well normalized, and all these queries return empty results, your database it at least minimally consistent, and can support at least a preliminary level of new development. If, however, basic elements are missing or in conflict between related tables, you need to address your database foundation before trying to construct any new application atop it. These simple reports constitute powerful arguments for database refactoring and correction.
