Untaught XML Schema

Tagged with Regular Expressions

Last month's "The Importance of Being Valid" inspired an unusual volume of follow-up directed to us. A critical mass of developers seems ready to leave DTDs (Document Type Definitions) behind, and adopt more modern and capable definitions for their XML work.

As that earlier column explained, it's entirely timely to do so: several XML schema languages exist which provide all or nearly all the functionality DTD gives, while reducing its liabilities. Although we already mentioned our favorite is RNC, we recognize that XML Schema is more widely accepted commercially. This month, therefore, we'll let XSD show off a few of its intermediate-level techniques.

Note on Nomenclature

Note that the "XSD" of the previous sentence is an abbreviation often used for "XML Schema Definition". There are many XML schema languages (note the capitalization), including XML Schema (XSD), RNC, Schematron, and SOX. There's only one XML Schema (capitalization!), though, the one that's the subject of this column. The similarity of these two terms motivates us to cut through any confusion by calling the language "XSD".

An abundance of on-line tutorials, dead-trees introductions, conference-hosted training, and other treatments of XSD is already available. As valuable as all these are, few adequately emphasize a couple of "low-hanging fruits" we've seen in commercial reliance on XML. This month, therefore, we attend to "restrictions" and import-based refactoring.

Healthy Limits

XML is the base language for all sorts of work in a modern information technology (IT) department. It builds in a variety of capabilities to permit expression of quite general content from all sorts of human and computing domains.

When you're working on a particular task, though, your aim needs to be precision. As charming as it is to be able to write about chemical formulae in Farsi, what's important when you have responsibility for an invoicing system is that the right people are billed the right amounts. All those other possibilities are just distractions. That's where schema languages enter: they give a basis for restriction of all XML's potential so that XML is focused on just your problem. With the right schema, significant problems of parsing and semantics become matters of syntax that a validator checks for you.

Examples will help make this clear: suppose you represent an accounts-payable department which receives invoices from a wide range of suppliers as XML. To be accepted by your processing system, an invoice must come from an enrolled supplier, must bill for an authorized amount of money, and so on. The remarkable fact, unknown even to many long-time XML coders, is that many of these constraints admit syntactic expression.

Automatic schema generators often turn out schemata for such problems that look like

  <?xml version="1.0" encoding="UTF-8"?>
  <xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
    <xsd:annotation>
      <xsd:documentation>invoice1.xsd</xsd:documentation>
    </xsd:annotation>
    <xsd:element name = "STANDARD_INVOICE">
      <xsd:complexType mixed="true">
        <xsd:sequence>
          <xsd:element name = "STANDARD_TIMESTAMP"    type = "xsd:dateTime"/>
          <xsd:element name = "JOB_CODE"              type = "xsd:string"/>
            ...
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>
  

This is an idealization; in practice, superfluous declarations clutter most generated schemata. JOB_CODE, for example, might introduce a named type, which is only later resolved as simply xsd:string. Set aside these distractions for now.

A schema such as this requires an input invoice to look something like

        <?xml version="1.0" encoding="UTF-8"?>
        <!-- "invoice1.xml" -->
        <STANDARD_INVOICE>
          <STANDARD_TIMESTAMP>2009-05-01T08:20:00.000-05:00</STANDARD_TIMESTAMP>
          <JOB_CODE>AB12345</JOB_CODE>
            ...
        </STANDARD_INVOICE>
      

With these items in place, you can use standard tools to validate candidate invoices. One good one, XSV, is even on-line, and requires no installation; if you apply it to this invoice1.xml and invoice1.xsd, it returns the result, "Attempt to load a schema document ... succeeded".

Your invoice-consuming program doesn't have to take responsibility for checking that an invoice has a STANDARD_TIMESTAMP, which might well be a legal matter involving hundreds of thousands of dollars of negotiations and contingencies. More to the point, you don't have to announce the defect. Error messages challenge programmers in several ways, but the easiest ones are those you don't have to program at all. With STANDARD_TIMESTAMP in the schema, it becomes far easier for everyone to agree that, if the XML is invalid, the invoice can't be accepted.
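
To make this concrete, here is a sketch of an invoice that the invoice1.xsd shown earlier would turn away. The file name invoice1_bad.xml is our invention for illustration; the point is only that a validator, not your program, notices the missing element, because elements in an xsd:sequence are required unless the schema says otherwise.

        <?xml version="1.0" encoding="UTF-8"?>
        <!-- "invoice1_bad.xml" (hypothetical file name) -->
        <STANDARD_INVOICE>
          <!-- STANDARD_TIMESTAMP is absent, so the sequence is not satisfied -->
          <JOB_CODE>AB12345</JOB_CODE>
            ...
        </STANDARD_INVOICE>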

Restrictions: XML Pattern-Matching

Look what this means for JOB_CODE. When defined simply as an xsd:string, anything, including "Transfer all funds to my Swiss account" or a buffer-vulnerability-threatening binary string, might appear in JOB_CODE. If you know, though, that JOB_CODE must consist of a prefix of one or two uppercase letters, followed by a five-digit code, you can have XML enforce that restriction for you:

  <?xml version="1.0" encoding="UTF-8"?>
  <xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
    <xsd:annotation>
      <xsd:documentation>invoice2.xsd</xsd:documentation>
    </xsd:annotation>
    <xsd:element name = "STANDARD_INVOICE">
      <xsd:complexType mixed="true">
        <xsd:sequence>
          <xsd:element name = "STANDARD_TIMESTAMP"    type = "xsd:dateTime"/>
          <!-- JOB_CODE: one or two uppercase letters, then five digits -->
          <xsd:element name = "JOB_CODE">
            <xsd:simpleType>
              <xsd:restriction base = "xsd:string">
                <xsd:pattern value = "[A-Z]{1,2}[0-9]{5}"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:element>
            ...
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>
  

A schema such as this rejects JOB_CODEs such as

    ab12345
    AA2345
    ABC1234
  

while allowing

    A12345
    AA02345
    AC01234
  

XSD's restriction patterns are powerful, and capable of considerable precision. Automatic XSD generators often don't use them at all, or, when they do restrict a type, simple-mindedly translate from degenerate type systems common in object-oriented languages and relational database management systems (RDBMSs). Such systems generate restrictions such as

            <xsd:element name = "JOB_CODE"              >
              <xsd:simpleType>
                <xsd:restriction base = "xsd:string">
                  <xsd:maxLength value = "7"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>
    

With such a schema, JOB_CODE can include any characters, in any order, as long as the length of the whole string doesn't exceed seven; an example might be "a1*$Q!". Even counting only 8-bit characters, such a definition passes tens of quadrillions of distinct JOB_CODEs that invoice2.xsd correctly rejects.

XSD restrictions can do even more than we've shown here. With enough cleverness, and usually with help from tooling outside the schema itself, they can even support comparison of candidate values against validation tables stored in databases, or perform calculations. These are more advanced topics, though, beyond the scope of today's column. The point for now is that even a little bit of sophistication with xsd:restriction pays big returns.
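
As one modest step in that direction, the "enrolled supplier" and "authorized amount" requirements mentioned earlier might be sketched with XSD's enumeration and range facets. The element names SUPPLIER_ID and AMOUNT, the supplier codes, and the 10000.00 ceiling are all hypothetical, chosen only for illustration; they are not part of the invoice schemata above.

            <!-- Hypothetical: only enrolled suppliers are listed -->
            <xsd:element name = "SUPPLIER_ID">
              <xsd:simpleType>
                <xsd:restriction base = "xsd:string">
                  <xsd:enumeration value = "ACME-001"/>
                  <xsd:enumeration value = "GLOBEX-002"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>
            <!-- Hypothetical: positive, at most 10000.00, no more than two decimal places -->
            <xsd:element name = "AMOUNT">
              <xsd:simpleType>
                <xsd:restriction base = "xsd:decimal">
                  <xsd:minExclusive   value = "0"/>
                  <xsd:maxInclusive   value = "10000.00"/>
                  <xsd:fractionDigits value = "2"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>

A hard-coded enumeration suits only a short, stable list of suppliers; a list that changes often probably belongs somewhere other than the schema text.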

Import Common Definitions

We're also enthusiastic about refactoring XML. While refactoring of procedural artifacts is big business, and even RDBMS refactoring has joined the mainstream, it's largely accepted that "[o]ne cannot refactor XML ...". A full response to that claim would fill a book; in the meantime, we'll content ourselves today with illustration of one technique handy in practical refactoring: imports.

Just as C programmers #include header files that several source files share, XSD defines an import facility. Suppose you're responsible for a whole suite of XML-transacting domain-specific applications: one to receive invoices, but others to track performance, payments, regulatory compliance, and so on. JOB_CODE appears in many of them, so it deserves a common, consistent definition. With this definition maintained centrally, an individual application, such as the invoice receiver above, can be simplified to:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd= "http://www.w3.org/2001/XMLSchema"
                xmlns:common = "http://phaseit.net/COMMON"
                targetNamespace = "http://phaseit.net/tmp/xlm">
        <xsd:annotation>
              <xsd:documentation>invoice3.xsd</xsd:documentation>
        </xsd:annotation>
      <xsd:import schemaLocation="common_types.xsd"
                 namespace="http://phaseit.net/COMMON"/>

      <xsd:element name = "STANDARD_INVOICE">
        <xsd:complexType mixed="true">
          <xsd:sequence>
            <xsd:element name = "STANDARD_TIMESTAMP"  type = "xsd:dateTime"/>
            <xsd:element name = "JOB_CODE"            type = "common:JobCode"/>
              ...
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>
  

Where is the definition of JobCode? It appears in common_types.xsd:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd= "http://www.w3.org/2001/XMLSchema"
                targetNamespace = "http://phaseit.net/COMMON" >
        <xsd:annotation>
              <xsd:documentation>common_types.xsd</xsd:documentation>
        </xsd:annotation>

    <xsd:simpleType name = "JobCode">
      <xsd:restriction base = "xsd:string">
        <xsd:pattern value = "[A-Z]{1,2}[0-9]{5}"/>
      </xsd:restriction>
    </xsd:simpleType>
      ...
    </xsd:schema>
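
Because invoice3.xsd declares a targetNamespace, and leaves elementFormDefault at its default of "unqualified", a conforming invoice presumably has to qualify its root element while leaving the locally declared children unqualified. A sketch, in which the file name invoice3.xml and the prefix inv are our own choices rather than anything the schema dictates:

        <?xml version="1.0" encoding="UTF-8"?>
        <!-- "invoice3.xml" (hypothetical file name) -->
        <inv:STANDARD_INVOICE xmlns:inv = "http://phaseit.net/tmp/xlm">
          <!-- local elements stay unqualified under the default elementFormDefault -->
          <STANDARD_TIMESTAMP>2009-05-01T08:20:00.000-05:00</STANDARD_TIMESTAMP>
          <JOB_CODE>AB12345</JOB_CODE>
            ...
        </inv:STANDARD_INVOICE>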
   

For the case of just one defined type, used in only a single XSD, import is of course the opposite of a simplification. For production situations, though, with a dozen interfaces, and scores of elements, import can significantly clean up complicated definitions. Moreover, just as with class-writers and class-users in Java, this sort of partition suggests a potentially desirable division of labor between domain-specific XSD maintainers, and XML specialists who focus on optimization of type definitions.
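
To suggest how the sharing plays out, here is a sketch of a second schema from the same suite that reuses the common JobCode type. Everything specific to it is hypothetical: the file name payment1.xsd and the elements STANDARD_PAYMENT and PAID_ON are invented for illustration. Only common_types.xsd and its namespace come from the listings above.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd    = "http://www.w3.org/2001/XMLSchema"
                xmlns:common = "http://phaseit.net/COMMON">
        <xsd:annotation>
              <xsd:documentation>payment1.xsd (hypothetical)</xsd:documentation>
        </xsd:annotation>
      <!-- the same import the invoice schema uses -->
      <xsd:import schemaLocation="common_types.xsd"
                 namespace="http://phaseit.net/COMMON"/>

      <xsd:element name = "STANDARD_PAYMENT">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name = "JOB_CODE"   type = "common:JobCode"/>
            <xsd:element name = "PAID_ON"    type = "xsd:date"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>

If the job-code format ever changes, only the pattern in common_types.xsd needs editing; both the invoice and payment schemata pick up the new rule.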

Summary

If you work with XML, use an XML schema to ensure that individual messages are valid. Modern XML schema languages have a wealth of facilities that lend precision and power to their use. XSD restriction patterns and imports are two that are particularly helpful for maintaining production XSD instances.

Kathryn and Cameron run their own consultancy, Phaseit, Inc., specializing in high-reliability and high-performance applications managed by high-level languages. They write about scripting languages and related topics in their "Regular Expressions" columns. Phaseit's first contract involved construction and parsing of a DTD for SGML, an ancestor of XML.

 
