Is your XML well-formed and valid? It's best if you can answer "Yes!", with confidence. In our experience, anything else is just "asking for trouble." The cost to validate XML documents is pleasantly low.
And the cost of invalid XML is plenty high. We rarely go a week without running into a crucial database or interface that's simply wrong--it lists an entry-level salary as $0.08 / hour, or inventories supplies for all 200 floors of a twenty-story building--and the consequences of even these apparently-trivial errors range from annoying to catastrophic. Cleaning up bad data is often a thankless job; as with backup and recovery work, too many managers only support it after it's too late. Still, we're certain that the general strategy of cleaning up data repositories and interfaces is the right thing to do. It certainly beats the alternative so many organizations choose of trying to program around errors, with the towers of fragile maintenance code that result. More precisely, it's clear to us which is the better choice on a technical basis; we recognize that politicized systems, including several government ones in which we've personally been involved, can sometimes even spawn multiple databases of the errors and omissions in other databases.
Let's look at a few specific examples, to understand better what's involved.
Well-formed? Valid? Correct?
There's a lot of XML in the contemporary computing world, so improvements in our handling of it have the potential to pay off with a large multiplier. DOCX and other office-automation documents, service requests to Amazon's clouds, all the messages service-oriented architecture (SOA) involves, trillions of transactions between financial actors, software build tools, electronic election results, chemical measurements, and thousands of other domains employ XML to store or communicate data.
Even if your XML-based system passes your functional tests, there can be value in validating the XML itself. A fragment such as
...
<transaction dau = "Monday" ...>
...
</transaction>
...
might give correct results, despite the misspelled attribute name, because the default value for day happens to be Monday. Eventually, though, even such small errors will break new releases, new platforms, or new uses. The earlier they're corrected, the better.
XML culture distinguishes three levels of "rightness". A "well-formed" XML document is one that is recognizable as XML: it admits parsing into a tree of nodes identified by "tags", and so on. A fragment such as
...
<ul>
<li>first item
<li>second item
</ul>
...
is malformed, if only because its <li> tags aren't properly closed with corresponding </li>s. As weak as the standard of well-formedness is, it's often violated.
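Well-formedness is the easiest of the three levels to check mechanically, because any XML parser has to do it. As a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (the is_well_formed helper is a name of our own invention, not part of any library), the fragment above can be tested like this:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# The fragment from the text: the <li> elements are never closed.
broken = "<ul>\n<li>first item\n<li>second item\n</ul>"
# The same list with each <li> closed by a matching </li>.
fixed = "<ul>\n<li>first item</li>\n<li>second item</li>\n</ul>"

print(is_well_formed(broken))  # False, plus the parser's complaint
print(is_well_formed(fixed))   # (True, None)
```

A non-validating parser such as this one says nothing about validity or correctness; it only reports whether the text can be read as a tree at all.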
A "valid" XML document is well-formed, but goes beyond that to respect its domain-specific grammar. An Ant configuration file and MathML-coded article might both be well-formed XML, but each is only valid when used within its own domain--Ant knows about <target> and <condition>, but not <matrix>.
Finally, and somewhat more subjectively, an XML instance is "correct" if its content is without errors. There's nothing in the syntactic definitions of XML to distinguish
<fullname>Elvis Presley</fullname>
from
<fullname>Elvis Parsley</fullname>
but presumably only the former correctly fulfills "business requirements".
Checking validity can be automated, however, in the sense that validating parsers are widely available. It's entirely practical for most systems to scan all the XML they manage to ensure validity.
The interesting point from a development perspective is that there are better and worse ways to validate XML documents. The general basis for validation involves application of a parser against two documents simultaneously. One of these is a "schema": an abstract representation of what it means to be valid in a particular domain. Validation thus reduces to comparison by the parser of the schema to a particular XML document or instance or file.
The oldest schema form for XML is the Document Type Definition (DTD). DTDs actually existed before XML, in SGML, and some authors restrict "schema" to more recent schema languages. We'll concentrate on just a single example for this column, without detailing all the historical complexity of schema theory. Consider
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE REPAIR [
<!ELEMENT REPAIR (ITEM,OWNER+)>
<!ATTLIST REPAIR status (OPEN|FINISHED) #REQUIRED>
<!ELEMENT ITEM (SERIAL,MODEL)>
<!ELEMENT SERIAL (#PCDATA)>
<!ELEMENT MODEL (#PCDATA)>
<!ELEMENT OWNER (#PCDATA)>
]>
<REPAIR status="OPEN">
<ITEM>
<SERIAL>1234567</SERIAL>
<MODEL>laser-aligned mattock sharpener</MODEL>
</ITEM>
<OWNER>John Smith</OWNER>
<OWNER>Mary Smith</OWNER>
</REPAIR>
This is the kind of message that a computing system for a repair store might transact. If you feed it to the on-line validator at xmlvalidation.com, the message "passes". If your repair ticket includes no OWNER, though, or tries to name two different ITEMs to repair, or lists the REPAIR status as WAITING, the validator will properly complain. Effective applications will always need to incorporate their own "business logic", of course, and may well check again for themselves that there is at least one OWNER. It's precisely because XML so often plays a role in communication between systems, though, or in the operations of a single system at different times, that basic XML validation is valuable.
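To make concrete what the validator checks for this message, here is a minimal Python sketch that hand-codes the same rules with the standard library's xml.etree.ElementTree. The check_repair function and its messages are hypothetical names of ours, and the sketch assumes the status attribute is required. It illustrates the rules only; a real system would hand this job to a validating parser rather than maintain code like this.

```python
import xml.etree.ElementTree as ET

ALLOWED_STATUS = {"OPEN", "FINISHED"}


def check_repair(text):
    """Return a list of rule violations for a REPAIR message (empty if none).

    Hypothetical helper that hand-codes the rules the DTD declares.
    It is an illustration, not a validating parser.
    """
    root = ET.fromstring(text)  # raises ParseError if not well-formed
    if root.tag != "REPAIR":
        return ["root element must be REPAIR"]
    problems = []
    # Assumption: status is required and limited to the two listed values.
    if root.get("status") not in ALLOWED_STATUS:
        problems.append("status must be OPEN or FINISHED")
    # Content model (ITEM,OWNER+): one ITEM, then one or more OWNERs.
    children = [child.tag for child in root]
    owners = children[1:]
    if children[:1] != ["ITEM"]:
        problems.append("REPAIR must start with an ITEM")
    if not owners:
        problems.append("REPAIR needs at least one OWNER")
    elif any(tag != "OWNER" for tag in owners):
        problems.append("only OWNER elements may follow the single ITEM")
    # Content model (SERIAL,MODEL) for ITEM.
    item = root.find("ITEM")
    if item is not None and [c.tag for c in item] != ["SERIAL", "MODEL"]:
        problems.append("ITEM must contain SERIAL then MODEL")
    return problems


ITEM = ("<ITEM><SERIAL>1234567</SERIAL>"
        "<MODEL>laser-aligned mattock sharpener</MODEL></ITEM>")
OWNERS = "<OWNER>John Smith</OWNER><OWNER>Mary Smith</OWNER>"

good = '<REPAIR status="OPEN">' + ITEM + OWNERS + "</REPAIR>"
no_owner = '<REPAIR status="OPEN">' + ITEM + "</REPAIR>"
two_items = '<REPAIR status="OPEN">' + ITEM + ITEM + OWNERS + "</REPAIR>"
waiting = '<REPAIR status="WAITING">' + ITEM + OWNERS + "</REPAIR>"

for name, doc in [("good", good), ("no_owner", no_owner),
                  ("two_items", two_items), ("waiting", waiting)]:
    print(name, check_repair(doc))
```

Every line of check_repair restates something the schema already says in a single declaration, which is a fair measure of how much work a validating parser saves.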
Better Ways: XSD and RNC
Much, perhaps most, existing literature on XML validation focuses on DTDs. Don't follow that lead, though. There are better ways. At least two newer schema forms, most often abbreviated as XSD (XML Schema Definition) and RNC (RELAX NG Compact Syntax, with "RELAX" itself derived from "regular language description for XML"), are simply better. They're both simultaneously more succinct, more readable by humans, easier for computers to parse, more powerful, and more precise than DTDs.
What does "more precise" mean? The SERIAL number, OWNER name, and so on, in the DTD scheme above, must allow essentially any data. XSD and RNC have the effective ability to define domain-specific content-types, in contrast; it's easy in these forms to require that a serial number start with one or two letters, followed by one to eight digits. The RNC schema shown later in this column restricts the MODEL name to forty characters.
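The serial-number rule just described amounts to a regular expression plus a length limit. Here is a rough sketch of both constraints in Python, assuming "letters" means ASCII letters; the helper names are ours, invented for illustration. A schema would state the same constraints declaratively instead of in code.

```python
import re

# One or two letters followed by one to eight digits, as described in the text.
# Assumption: "letters" means ASCII letters A-Z and a-z.
SERIAL_PATTERN = re.compile(r"[A-Za-z]{1,2}[0-9]{1,8}")


def is_valid_serial(value):
    """True if the whole string matches the serial-number pattern."""
    return SERIAL_PATTERN.fullmatch(value) is not None


def is_valid_model(value, max_length=40):
    """True if the model name is no longer than max_length characters."""
    return len(value) <= max_length


for candidate in ("AB1234567", "X1", "1234567", "ABC123"):
    print(candidate, is_valid_serial(candidate))

print(is_valid_model("laser-aligned mattock sharpener"))
```

Note that fullmatch is used so that the pattern must cover the entire value, not merely appear somewhere inside it. Under this illustrative rule, the all-digit SERIAL in our sample message would be rejected.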
James Clark and other XML luminaries have created a rich collection of open-source tools for working with XSD and RNC, including validators and translators to and from DTD. There are even good descriptions of how to engineer a transition of a large system from DTD to XSD or RNC. Whenever possible, we choose RNC. RNC feels "lighter" and more in tune with our usual practice in "Regular Expressions". XSD, on the other hand, is somewhat better supported by commercial tools. Both represent a huge step forward from DTDs. While there are a few small corners where DTD solutions are apparently more powerful than the corresponding ones in RNC and XSD (default attributes are the most potent example), we are comfortable recommending XSD and RNC for all but the rarest of legacy applications.
Compare for yourself an RNC example that corresponds to the repair-store document above: strip our XML document down to
<?xml version="1.0" encoding="UTF-8"?>
<REPAIR status="OPEN">
<ITEM>
<SERIAL>1234567</SERIAL>
<MODEL>laser-aligned mattock sharpener</MODEL>
</ITEM>
<OWNER>John Smith</OWNER>
<OWNER>Mary Smith</OWNER>
</REPAIR>
If you feed this to a validator such as the validator.nu one, against a schema such as
element REPAIR {
attribute status {"OPEN"|"FINISHED"},
element ITEM {
element SERIAL {text},
element MODEL {xsd:string {maxLength = '40'}}
},
element OWNER {text}+
}
you'll see that the XML document is reported valid. The readability of the RNC schema, compared to that of its DTD correspondent, speaks for itself.
Summary
It's worth your time to understand what it means for XML to be well-formed, valid, and correct, and it's probably also worth your time to validate any XML you send, receive, or store. When you design a validation scheme, use XSD or RNC.
Kathryn and Cameron run their own consultancy, Phaseit, Inc., specializing in high-reliability and high-performance applications managed by high-level languages. They write about scripting languages and related topics in their "Regular Expressions" columns. Phaseit's first contract involved SGML, an ancestor of XML.
