URIs and information typing

I sent this to xml-dev, but I don't know how many schema folk follow that 
busy forum.  I don't know how interested people will be in the specific 
proposal I offer here, but I hope it will at least be worthy of some thought.

----------------------------------------
Using namespace-qualified identifiers (QNames) for type identification
seems to introduce some significant difficulties while only saving a few
keystrokes. This proposal suggests using bare URIs rather than QNames to
improve interoperability and extensibility.

[I've long been a critic of the (lack of) URI structure, notably on XML-URI
last summer and on various IETF lists. While I still have plenty of
reservations about URI structure and syntax, the basic idea is more and
more intriguing, and I'm probably going to have to eat a few of my past
words in making this proposal.]

At present, the typing mechanism in W3C XML Schema is both extremely
extensible and deeply constrained. W3C XML Schema Datatypes [1] provides a
family of primitive datatypes and mechanisms for extending them through
facets for defining atomic types, while W3C XML Schema Structures [2]
allows developers to create molecules from these sets of atoms.

Types, whether built-in or created by the designer, are assigned names
which are referenced with namespace-qualified names (type="QName"). Types
have a URI component, which applications must derive from the namespace
declarations in the document. They also have a local name, separate from
the URI component, which identifies the particular type in the list of
types associated with that namespace URI. Prefixes are used as an
abbreviation mechanism.
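
To make the expansion concrete, here is a minimal sketch in Python of how a
processor might turn a type="QName" value into its URI and local name. The
function name resolve_qname, the prefix maps, and the example.com and
example.org URIs are all hypothetical, invented for illustration; this is
not code from any schema processor.

```python
def resolve_qname(qname, prefix_map):
    """Expand 'prefix:local' into a (namespace URI, local name) pair.

    prefix_map holds the namespace declarations in scope, keyed by prefix;
    the empty-string key stands for the default namespace.
    """
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No colon: the whole value is the local name.
        prefix, local = "", qname
        return prefix_map.get(""), local
    if prefix not in prefix_map:
        raise ValueError(f"undeclared namespace prefix: {prefix!r}")
    return prefix_map[prefix], local


# The same attribute value names different types in different documents,
# because only the declarations in scope give the prefix a meaning.
doc_a = {"my": "http://example.com/types"}
doc_b = {"my": "http://example.org/other"}
print(resolve_qname("my:simonSKU", doc_a))
print(resolve_qname("my:simonSKU", doc_b))
```

The point of the sketch is that the string "my:simonSKU" identifies nothing
by itself; the prefix map has to travel with it.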

This creates interesting problems for XML Schemas on several levels. The
first problem is caused by the use of namespace prefixes within attribute
values, which requires applications to maintain additional information
about prefix-namespace mapping. The Namespaces in XML spec [3] certainly
allows this, but it extends the capability that spec provides, and the
support isn't entirely "natural" to some views of the namespace
specification.
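
A rough sketch of that extra bookkeeping, using Python's standard
xml.etree.ElementTree module: the parser expands prefixes on element and
attribute names by itself, but not inside attribute values, so the
application has to keep its own stack of in-scope declarations. The sample
schema, the http://example.com/types URI, and the function name
collect_types are assumptions made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative schema fragment; the "my" namespace URI is hypothetical.
SCHEMA = b"""<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:my="http://example.com/types">
  <xsd:element name="sku" type="my:simonSKU"/>
  <xsd:element name="qty" type="xsd:integer"/>
</xsd:schema>"""


def collect_types(xml_bytes, attr="type"):
    """Return (namespace URI, local name) for every QName-valued attribute."""
    in_scope = []  # stack of (prefix, uri) pairs, innermost declaration last
    found = []
    events = ("start-ns", "start", "end-ns")
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "start-ns":
            in_scope.append(payload)   # payload is a (prefix, uri) tuple
        elif event == "end-ns":
            in_scope.pop()             # a declaration has gone out of scope
        else:
            value = payload.get(attr)  # payload is the element just opened
            if value is None:
                continue
            prefix, sep, local = value.partition(":")
            if not sep:
                prefix, local = "", value
            # Search the innermost declarations first.
            uri = next((u for p, u in reversed(in_scope) if p == prefix), None)
            found.append((uri, local))
    return found


for uri, local in collect_types(SCHEMA):
    print(uri, local)
```

None of this is difficult, but it is state the application must carry
purely because the type identifier is a QName in an attribute value.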

The second problem may not appear to be a problem when type structures are
viewed entirely within the context of W3C XML Schema. Defining a type
requires the use of W3C XML Schema syntax, and the declaration must be
included within the schema so that both its namespace URI and its local
name can be assimilated into the larger schema. This creates a barrier to
other schema approaches which choose to rely on W3C XML Schema Datatypes
for convenience and interoperability reasons.

RELAX [4], for instance, uses W3C XML Schema Datatypes within RELAX
descriptions, but restricts users to the built-in types defined within that
specification. This allows RELAX developers to focus on RELAX, without
having to harness RELAX implementations to W3C XML Schema implementations
which can process W3C XML Schema type declarations. It also allows RELAX
to avoid the URI+local name issues involved in W3C XML Schema processing,
as it relies solely on the name portion of the datatypes.
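
As a hedged illustration of what relying solely on the name can look like,
the Python sketch below keys a small table of checks on the bare datatype
name. It is not taken from any RELAX implementation; the table
BUILT_IN_CHECKS, the function check_builtin, and the three lexical checks
are simplified stand-ins I have assumed, not the full definitions from the
Datatypes spec.

```python
import re

# A fixed table keyed on the datatype name alone: no namespace URI, no
# prefix mapping, and no way to add user-defined types.
BUILT_IN_CHECKS = {
    "string": lambda s: True,
    "boolean": lambda s: s in ("true", "false", "1", "0"),
    "integer": lambda s: re.fullmatch(r"[+-]?[0-9]+", s) is not None,
}


def check_builtin(type_name, value):
    """Check value against a built-in datatype identified only by name."""
    try:
        check = BUILT_IN_CHECKS[type_name]
    except KeyError:
        raise ValueError(f"not a built-in datatype name: {type_name!r}") from None
    return check(value)


print(check_builtin("integer", "42"))
print(check_builtin("boolean", "maybe"))
```

The lookup is trivial, which is the attraction, but the closed table is
also the limitation: anything outside it cannot be named at all.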

Although RELAX has chosen the (human-friendly) approach of relying on the
names of built-in datatypes, I'd like to suggest that a slightly different
approach might be simpler, far more extensible, and still workable. Rather
than rely on a combination of a namespace URI and a local name to identify
types, the use of a bare URI would allow processors to include data typing
information created in a number of different frameworks without mandating
the use of a particular syntax for information type definition.

For example, I might create a datatype defining a 'simonSKU' identified by
the URI http://simonstl.com/dt/simonSKU. At that location I'd have a RDDL
[5] document, which would provide a human-readable description as well as
links to a W3C XML Schema definition of the data type, perhaps a Perl
regular expression which can be used to check my SKU, a Java class which
can be used to check it, etc. There could also be some RDF around
describing relationships between this type and other types, or additional
properties of the type like creator, projects in which it's used, etc.
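
Here is one way an application might use such a type, sketched in Python.
Everything beyond the URI itself is an assumption for illustration: the
SKU pattern is invented, the TYPE_RESOURCES table is hard-coded where a
real application would presumably discover these resources from the RDDL
document, and check_value is a hypothetical helper.

```python
import re

SIMON_SKU = "http://simonstl.com/dt/simonSKU"

# Resources known for each type, keyed by the bare URI. A type may offer
# several resources; an application picks whichever one it can use.
TYPE_RESOURCES = {
    SIMON_SKU: {
        "description": "Stock-keeping unit (illustrative description)",
        "regex": r"[A-Z]{3}-[0-9]{4}",  # hypothetical SKU format
    },
}


def check_value(type_uri, value):
    """Check value against the type named by type_uri.

    Returns True or False when a usable checker is known, and None when
    the application has no resource it can process for that type.
    """
    resources = TYPE_RESOURCES.get(type_uri)
    if resources is None or "regex" not in resources:
        return None
    return re.fullmatch(resources["regex"], value) is not None


print(check_value(SIMON_SKU, "ABC-1234"))
print(check_value(SIMON_SKU, "not a sku"))
print(check_value("http://example.com/dt/unknown", "anything"))
```

The type identifier is a single string that needs no prefix map, and the
None result leaves room for types whose resources this application does
not understand.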

It would be my responsibility to make sure all of these things worked
consistently, of course (and maybe a testing resource in RDDL would be
cool), but applications could use my datatype processing as appropriate,
and humans could have a full set of documentation as well.

I'm well aware that this approach would involve potentially substantial
changes in both W3C XML Schema and RELAX to implement, so I'm not exactly
expecting it to happen. (RDF Schema [6] already uses a similar URI-based
approach.) It may well have been considered and rejected at a prior
date. I suspect it isn't necessary to meet the requirements of W3C XML
Schema within its own worldview, but might simplify the implementation of
certain aspects of W3C XML Schema and provide future extensibility in new
directions.

Also, URIs could point quite easily to locations within a single W3C XML
Schema document - this doesn't require schema fragmentation, so long as
only a single processing context is needed.
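
A small Python sketch of the idea, assuming types inside a schema document
are addressed with fragment identifiers: the standard urllib.parse.urldefrag
function separates the document from the fragment, so a processor can fetch
each document once and look up several types in it. The order.xsd URL, the
fragment names, and both helper functions are hypothetical.

```python
from urllib.parse import urldefrag


def split_type_uri(type_uri):
    """Split a type URI into (document URI, fragment or None)."""
    document, fragment = urldefrag(type_uri)
    return document, fragment or None


def group_by_document(type_uris):
    """Group type URIs by the document that would have to be retrieved."""
    groups = {}
    for uri in type_uris:
        document, fragment = split_type_uri(uri)
        groups.setdefault(document, []).append(fragment)
    return groups


types = [
    "http://example.com/schemas/order.xsd#simonSKU",
    "http://example.com/schemas/order.xsd#quantity",
    "http://simonstl.com/dt/simonSKU",
]
for document, fragments in group_by_document(types).items():
    print(document, fragments)
```

How a fragment maps onto a declaration inside the schema is left open
here; the sketch only shows that one document can serve many type URIs.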

This approach might also simplify future projects which handle type
information as metadata, not necessarily as part of a validation process.

[1] - XML Schema Part 2: Datatypes
(http://w3.org/TR/2000/CR-xmlschema-2-20001024/)
[2] - XML Schema Part 1: Structures
(http://w3.org/TR/2000/CR-xmlschema-1-20001024/)
[3] - Namespaces in XML (http://w3.org/TR/1999/REC-xml-names-19990114)
[4] - Regular Language Expressions (http://www.xml.gr.jp/relax/)
[5] - Resource Directory Description Language (http://www.rddl.org)
[6] - Resource Description Framework Schema
(http://w3.org/TR/2000/CR-rdf-schema-20000327)


Simon St.Laurent
Associate Editor
O'Reilly and Associates

Received on Wednesday, 7 March 2001 17:58:24 UTC