Names, Namespaces and Languages

Henry S. Thompson

24 June 2005

1. Introduction

This is very much a work-in-progress, something I would have blogged except I don't have a blog. Please bear this in mind when responding -- there's very little here, particularly in the more speculative sections towards the end, which I'm firmly convinced of. So feedback is very much in order.

2. Background

TAG issues namespaceDocument-8 and abstractComponentRefs-37 were the topic of extended discussion at the last TAG f2f. There is considerable overlap between these two issues, and both are related to Dan Connolly's comment on the recently published Last Call Working Draft of XML Schema: Component Designators. Although a number of prior misunderstandings were identified and overcome in the discussion, more work is needed to make explicit the background assumptions about what problems we're trying to solve and what the space of possible solutions is. This note is an attempt to begin that work.

3. XML Namespaces: An evolving understanding

The recent discussion about whether the xml:id spec. 'changes' the XML namespace by 'adding' a new name to it helped clarify that the minimalist reading of the XML Namespaces REC has achieved dominance in the intellectual marketplace. By "the minimalist reading" I mean the reading on which an XML namespace is primarily a syntactic mechanism for distinguishing one class of uses of a particular simple name from all other uses thereof. This means a namespace is not a finite set of names, nor a more complex structured object as suggested by the (in)famous now-deleted non-normative Appendix A: The Internal Structure of XML Namespaces of version 1.0.

The minimalist reading is the only one consistent with actual usage -- people mint new namespaces by simply using them in an expanded name or namespace declaration, without thereby incurring any obligation to define the boundaries of some set. You could say that a namespace springs into life the first time anyone uses a URI as a namespace name, but on balance I prefer an understanding which doesn't reify a namespace as such at all. I don't object to using phrases such as "[some name] in the [some URI] namespace", but that's just another way of saying "the expanded name < some_URI, some_name >".

On this account it makes sense to ask questions about namespace names, e.g. "What namespace name will XSLT 2.0 use?" and about expanded names, e.g. "Does XSLT 2.0 change the definition of the element named < http://www.w3.org/1999/XSL/Transform, output >?", but questions about namespaces as such are rarely if ever useful (unless of course they're understood as questions about namespace names or about some otherwise-defined set of expanded names with a namespace name in common).
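As a minimal sketch of this view, an expanded name can be modelled as nothing more than a pair of namespace name and local name. The Python below relies on the standard library's xml.etree.ElementTree, which reports element names in the form {namespace-name}local-name. The ExpandedName type, the expanded_name helper and the http://example.org/ns namespace name are illustrative assumptions, not part of any specification.

```python
from typing import NamedTuple
import xml.etree.ElementTree as ET


class ExpandedName(NamedTuple):
    """An expanded name: just a (namespace name, local name) pair."""
    namespace_name: str
    local_name: str


def expanded_name(tag: str) -> ExpandedName:
    """Split ElementTree's '{namespace}local' tag form into a pair."""
    if tag.startswith("{"):
        namespace_name, local_name = tag[1:].split("}", 1)
        return ExpandedName(namespace_name, local_name)
    return ExpandedName("", tag)  # a name in no namespace


# Hypothetical document: the namespace name is 'minted' simply by using it.
doc = ET.fromstring('<ex:thing xmlns:ex="http://example.org/ns"/>')
name = expanded_name(doc.tag)
```

Nothing in the sketch represents a namespace as an object in its own right; only namespace names and expanded names appear.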

4. From namespaces to languages

Taking the argument one step further, it is a necessary consequence of the position outlined above that it is incoherent to understand e.g. "Such-and-such a type is defined in the XML Schema namespace" to mean that the XML Schema namespace contains types (or type definitions). Considering things carefully, we must understand this sentence as meaning that the XML Schema language assigns the expanded name < http://www.w3.org/2001/XMLSchema, such-and-such > to some type definition. This perspective actually works well with our overall understanding of XML Schema: a schema document for a particular target namespace corresponds to a schema which assigns element declarations, type definitions, etc. to expanded names, all of which have that target namespace as their namespace name.

So it's languages (or as we used to say, applications, in the SGML sense) which assign expanded names to things. That assignment may be unique and unequivocal, but evidently it is often one-to-many. And of course it's the language which determines what there is to be named, its own little (or large) ontology.

Many languages of course do provide only one thing to be named using a particular namespace name (e.g. XQuery Functions and Operators), and others, although naming more than one sort of thing, constrain their use of names to be unambiguous (e.g. SVG, RDF). In both these cases, just an expanded name is sufficient to identify something, and constructing a URI for something is therefore straightforward.

On the other hand there are many examples of languages where the mapping is one-to-many. The most immediate example is XML itself. The low-level syntax of XML distinguishes two sorts of things which are identified by expanded name: elements and attributes. Since there is no prohibition on using the same expanded name for both an element and an attribute, an expanded name is not sufficient to uniquely identify a named aspect of an XML document (or document type, in the ordinary language sense) -- you need to know what I've been calling the sort as well, i.e. element or attribute. For example, all of the following names:

abbr
cite
code
dir
label
link
object
span
style
title

can be used for either elements or attributes in XHTML 1.0 (transitional) documents, and at least three of these (abbr, cite and title) survive as ambiguous in XHTML Basic 1.0.
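To illustrate the general point in a namespace-qualified setting, here is a small Python sketch. It assumes a hypothetical namespace name http://example.org/ns and a made-up document in which the same expanded name is used for both an element and a (prefixed) attribute; the helper name sorts_by_expanded_name is likewise illustrative. It collects, for each expanded name, the set of sorts it is used for.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document: 'label' in a made-up namespace is used both as an
# element name and as a prefixed (hence namespace-qualified) attribute name.
SAMPLE = """
<ex:form xmlns:ex="http://example.org/ns">
  <ex:label ex:label="short">Full text</ex:label>
</ex:form>
"""


def sorts_by_expanded_name(root):
    """Map each expanded name in the tree to the set of sorts it names."""
    seen = defaultdict(set)
    for element in root.iter():
        seen[element.tag].add("element")
        for attribute_name in element.attrib:
            seen[attribute_name].add("attribute")
    return dict(seen)


root = ET.fromstring(SAMPLE.strip())
sorts = sorts_by_expanded_name(root)

# Expanded names that need the sort as well to identify anything uniquely.
ambiguous = sorted(name for name, kinds in sorts.items() if len(kinds) > 1)
```

Any name that ends up in ambiguous can only be resolved by a key that includes the sort, e.g. a (sort, expanded name) pair.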

When we expand our scope to XML validation, we suddenly get a much more complex situation, in which there are in principle an unbounded number of things which share a name, distinguishable only by context: we have element declarations (max. one per expanded name), and attribute declarations (max. as many as there are element declarations). For example, there are four distinct attribute definitions called align and five distinct attribute definitions called type in the XHTML transitional DTD. W3C XML Schema not only has a richer set of what it calls "symbol spaces", so that there are seven things whose definitions can be named (it adds types, attribute and element groups, notations and identity-constraints alongside elements and attributes), it also allows elements as well as attributes to be defined in context.

Finally we should note that a language may encompass quite a range of variation in terms of the things it assigns a particular expanded name to. There can be variation over time, as new versions of a language are released, and even alternative variants released at the same time. The HTML P element has a long and complex history, and even the XHTML p element has three distinct variants in version 1.0 (strict, transitional and basic), none of which is exactly the same as the one in version 1.1.

None of this should come as a surprise. Ordinary language uses names in ways which are both ambiguous and context-determined, and whose use changes over time. But its consequences for the Web are more serious, particularly as we consider the use of names for things on the Web intended for automatic processing, where appeal to context for disambiguation may not be straightforward at all. At the very least it is clear that it is no longer trivial to specify an approach to constructing URIs for things which will cover all the cases just discussed.

5. What abstractions to choose

Broadly speaking there are three ways one could respond to the situation outlined above:

Only expect to have a systematic approach to naming things with URIs when the language or application involved has a single flat story about naming (e.g. SVG, RDF). Abstract over variations. We might call this the simple (or simplistic) view.
Demand a systematic approach in all cases, and over all variations, but acknowledge that this means that in complex cases (e.g. WSDL, XML Schema) the resulting URIs will themselves be complex, requiring new media types and/or using new XPointer schemes. We might call this the rich (or overkill) view, exemplified by XML Schema: Component Designators.
Look for a middle ground, which adopts the simple view wherever possible, otherwise an approximation to it which abstracts over all variation and as much application-specific detail as possible, with the option to fall back to the rich view as and when this is necessary. We might call this the middle (or 80/20) view.

It's important to note that there's an unspoken common assumption to all three of the above views: We're going to construct the URI for some named thing by adding some variety of fragment identifier to the namespace name of its expanded name. There is no space here for the possibility that two distinct languages might use the same expanded name for two evidently distinct things. This is intimately bound up with another assumption with respect to variation, namely that it's possible to tell reliably when a change in something counts as a variation, as opposed to a fundamental change of identity. If I change the named definition of a type by nudging its min or max a bit, that pretty clearly just produces a variant of the same type. But if I change the definition assigned to a name from being an integer to being a date, it's equally pretty clear that that's no longer the same type at all. Those are the easy cases; there will be many which are much harder to call.

I expect that both of these assumptions will want to be recast as Good Practice notes going forward (e.g. "Don't use the same expanded name for two different things of the same sort in different languages under your control"; "As a language evolves, use new expanded names for new things, don't recycle old ones").

6. More details on the middle ground

Without more detailed examination of real usage scenarios, it's hard to be sure of what general principles to establish here, but on the basis of my limited experience to date it seems likely that something along the following lines is a reasonable starting point.

It's up to the owner of a language, for each of the namespaces involved in that language, to provide a constructive definition of the way in which things which have expanded names can also be named with URIs. I've identified the following guidelines for such definitions:

Use the namespace URI as the basis of the constructed name;
Where part of the complexity of a language's name structure comes from giving expanded names to more than one sort of thing, include the sort in the URI;
Where evolution over time and/or simultaneous language variants are a possibility, be clear that simple URIs are not capable of capturing this;
Try to provide retrievable representations so that the namespace URI(s) you construct a) have a widely used media type and b) yield a useful result when the fragment identifier is resolved.

7. The W3C XML Schema example

The position that emerged at the end of the recent TAG f2f is consistent with the above guidelines, but obviously lacking in detail. On balance my preferred approach would look something like this:

URI names are provided for everything defined or declared by name at the top level which has some conceptual identity independent of the details of W3C XML Schema, i.e. elements, attributes and simple and complex types.

The URI name for something of one of the above four sorts is constructed by concatenating the namespace name of its expanded name, a / if that does not already end with one, its sort (i.e. attribute, complexType, element or simpleType), a /# and the local name of its expanded name.

URI names for languages which don't use namespaces are based on a URI designated for the purpose in the language specification, e.g. http://www.w3.org/2002/xmlspec/ for the W3C's 'specprod' language.

It would be the responsibility of language owners to provide retrievable representations of resources at each sort-determined sub-URI of the namespace URI to make this work (but see httpRange-14 below under Outstanding issues).

So for example the URI for the W3C XML Schema's own dateTime type would be

http://www.w3.org/2001/XMLSchema/simpleType/#dateTime

and perhaps, for the DAML+OIL example cited in Dan Connolly's feedback, we would get the following ('perhaps' because there's no namespace involved in the example as published):

http://www.w3.org/TR/2001/NOTE-daml+oil-walkthru-20011218/simpleType/#over12

(My inspiration for this approach is at least in part the IANA structuring of their registry of media types, which gives us e.g.

http://www.iana.org/assignments/media-types/application/mathematica

for application/mathematica (although irritatingly it gives us nothing for e.g. text/html).)
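The construction rule described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name component_uri and the SORTS tuple are hypothetical, and the sketch does no escaping of the local name, which a real definition would probably need to address.

```python
# The four sorts for which URI names are proposed in this section.
SORTS = ("attribute", "complexType", "element", "simpleType")


def component_uri(namespace_name: str, sort: str, local_name: str) -> str:
    """Build a URI name: namespace name, '/', sort, '/#', local name."""
    if sort not in SORTS:
        raise ValueError(f"unsupported sort: {sort!r}")
    # Add a '/' only if the namespace name does not already end with one.
    base = namespace_name if namespace_name.endswith("/") else namespace_name + "/"
    return f"{base}{sort}/#{local_name}"


schema_datetime = component_uri(
    "http://www.w3.org/2001/XMLSchema", "simpleType", "dateTime"
)
daml_over12 = component_uri(
    "http://www.w3.org/TR/2001/NOTE-daml+oil-walkthru-20011218",
    "simpleType",
    "over12",
)
```

With these inputs the function reproduces the two example URIs given above for dateTime and over12.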

8. Outstanding issues

This is by no means a fully-baked story. Some things I know are shaky are:

httpRange-14: The TAG's recent resolution of this issue leaves the question of what sort of resource a namespace URI identifies, and whether you should be able to retrieve any representation of it at all, very much up in the air. The knock-on implications of this wrt fragment identifiers, sub-URIs, etc. are even more unclear.
Schema Component Designators: As presented there is a complete disconnect between this story and SCDs. Maybe that's the best that we can do, but it would certainly be better if we could get a solution which shared more.
Languages vs. namespaces: This notion of a language as distinct from a namespace is only just (at least for me) in the process of being worked out. It may yet be the case that we would do better to use some kind of 'language URIs' as the base, rather than namespace URIs. The continued widespread use of languages such as DocBook which don't use namespaces shouldn't be ignored.