[Fwd: A plea for Sanity] from Dan Connolly on 2002-05-24 (xml-names-editor@w3.org from May 2002)

Forwarded message 1

From: Joe English <jenglish@flightlab.com>
Date: Fri, 5 Apr 2002 11:44:30 -0500 (EST)
Subject: [Moderator Action] A plea for Sanity
To: xml-names-editor@w3.org
Message-Id: <200204051643.g35Ghbs10412@dragon.flightlab.com>
[ Also sent to xml-dev@lists.xml.org ]

"Namespaces in XML 1.1 Requirements" cites the ability to "undeclare"
a namespace as the principal (only?) new feature needed, because
of the case where:

| information items [...] from another document [...] may
| have fewer in-scope namespaces than their parent.  There is
| no mechanism for accurately serializing this situation. If
| the infoset is naively serialized and reparsed, the children
| will end up with additional namespace information items which
| serve no useful purpose.

I believe that this requirement is ill-considered.

Under SGML and XML 1.0, applications can treat generic
identifiers as atomic strings; with XML 1.0 + Namespaces,
element and attribute names become compound objects consisting
of a URI and a local name.  This complicates applications a bit,
but by itself is not an onerous burden: toolkits like SAX can
provide namespace processors that keep track of the namespace
environment, map GIs to {URI+localname} pairs, and throw away
the original namespace declarations.

The real complexity starts to show up in applications which
themselves need to keep track of the namespace environment
(e.g., XSLT).  This is usually required for applications that
need to reserialize an Infoset as XML and wish to retain
the original namespace prefixes on output.  (It gets hairier
for markup vocabularies that include QNames in content, but that's
a different issue.)

But the new requirement implies that the *exact set of in-scope
namespaces at each node* is an essential part of the Infoset.
This is the part that I think is ill-considered.  This property
should be deemed inessential, just as whitespace in tags and the
order of attribute value specifications are deemed inessential.
XML-related specifications should not expect or demand that it be
preserved; any set of namespace declarations that produces the same
{URI+localname} pairs after namespace processing should be considered
equivalent.
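
To make that notion of equivalence concrete, here is a minimal sketch
in Python (the language is chosen only for illustration).  It relies on
xml.etree.ElementTree, which reports names in its own "{URI}localname"
notation after namespace processing.  The two sample documents and the
urn:example:* URIs are invented for the example; the sketch compares
element and attribute names only and does not look at QNames in content.

```python
import xml.etree.ElementTree as ET

def expanded_names(xml_text):
    """List (element name, sorted attribute names) in document order.

    ElementTree has already applied namespace processing, so every
    namespaced name comes back as '{URI}localname'.
    """
    root = ET.fromstring(xml_text)
    return [(el.tag, sorted(el.attrib)) for el in root.iter()]

# Hypothetical documents: same names, different declarations.
doc_a = '<a:doc xmlns:a="urn:example:one"><a:item a:id="1"/></a:doc>'
doc_b = ('<doc xmlns="urn:example:one" xmlns:unused="urn:example:extra">'
         '<x:item xmlns:x="urn:example:one" x:id="1"/></doc>')

# Different prefixes, one unused declaration, same {URI+localname} pairs.
print(expanded_names(doc_a) == expanded_names(doc_b))  # prints True
```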

In particular, "additional namespace information items which
serve no useful purpose" -- and hence do not affect the interpretation
of QNames in markup or content -- should not matter.  Applications
should be free to insert or discard them as they see fit without
changing the meaning of the Infoset.

 * * *

Now a plea for sanity.

(This is for people who design XML vocabularies and applications;
xml-names-editor, I know you're busy, so you can stop reading here.)

There are certain practices which, if avoided, can make life
simpler for application and toolkit developers.  These are
all legal according to the Namespaces REC, and I don't suggest
that they be disallowed in XML 1.1, but it may be beneficial
for individual applications to disallow them.

Some definitions:

Let's say that an XML document is _neurotic_ if it maps the same
namespace prefix to two different namespace URIs at different
points.  Neurosis makes it necessary for XML processors to
work with {URI+localname} pairs instead of GIs, and to keep
track of the namespace environment at each point in the tree
if there are QNames-in-content.  If it weren't for neurosis,
applications could use a single namespace map that applied to
the entire document.

Conversely, a document is _borderline_ if it maps two different
namespace prefixes to the same namespace URI.  Borderline documents
complicate reserialization: the choice of which prefix to
use for a particular {URI+localname} pair depends on its
position in the tree.

A document is _psychotic_ if it maps two different namespace prefixes
to the same URI _in the same scope_.  Psychosis presents an even
bigger difficulty for reserialization: now applications must keep
track of the original prefix as well as the {URI+localname} pair.

A document is _normal_ (or _in namespace-normal form_) if all
namespace declarations appear on the root element and it is
not psychotic.  (A borderline document with all namespace 
declarations in the same place is automatically psychotic;
a neurotic document with this property would be illegal according
to the Namespaces REC.)

Normal documents are the easiest to process: the application can
determine the global namespace environment at the beginning of the
parse, and can use it throughout processing.

It's not always possible to produce normal documents -- the producer
might not know all of the relevant namespaces at the time it emits
the root element start-tag -- so a weaker definition is useful:
A document is _sane_ if it is neither neurotic nor borderline.
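
The definitions above can be checked mechanically.  What follows is a
rough sketch in Python, not a reference implementation: classify() is
a hypothetical helper built on xml.etree.ElementTree's pull parser.  It
assumes that "in the same scope" means "both bindings are in scope at
the same element", counts the default namespace as a prefix (the empty
string), and simply drops xmlns="" from the in-scope map.

```python
import xml.etree.ElementTree as ET

def classify(xml_text):
    """Return a dict of flags for the definitions above."""
    parser = ET.XMLPullParser(events=("start-ns", "start", "end"))
    parser.feed(xml_text)
    parser.close()

    uris_of = {}        # prefix -> every URI it was ever bound to
    prefixes_of = {}    # URI -> every prefix ever bound to it
    scopes = [{}]       # stack of in-scope prefix -> URI maps
    pending = []        # declarations awaiting their element's start
    depth = 0
    all_on_root = True
    psychotic = False

    for event, payload in parser.read_events():
        if event == "start-ns":
            pending.append(payload)          # a (prefix, uri) tuple
        elif event == "start":
            depth += 1
            scope = dict(scopes[-1])
            for prefix, uri in pending:
                if depth > 1:
                    all_on_root = False
                if not uri:                  # xmlns="" undeclares the default
                    scope.pop(prefix, None)
                    continue
                scope[prefix] = uri
                uris_of.setdefault(prefix, set()).add(uri)
                prefixes_of.setdefault(uri, set()).add(prefix)
            pending.clear()
            # Two prefixes in scope here that share one URI?
            if len(set(scope.values())) < len(scope):
                psychotic = True
            scopes.append(scope)
        else:                                # "end"
            depth -= 1
            scopes.pop()

    neurotic = any(len(uris) > 1 for uris in uris_of.values())
    borderline = any(len(ps) > 1 for ps in prefixes_of.values())
    return {
        "neurotic": neurotic,
        "borderline": borderline,
        "psychotic": psychotic,
        "normal": all_on_root and not psychotic,
        "sane": not neurotic and not borderline,
    }
```

For instance, classify('<r><p:a xmlns:p="urn:x:1"/><p:b xmlns:p="urn:x:2"/></r>')
reports the document as neurotic and therefore not sane.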

Document producers should be designed to emit sane documents.

This is not hard to do -- the serializer just needs to maintain
a monotonic, bijective URI/prefix map and reuse the same prefix
whenever a namespace URI leaves and comes back into scope.
("Bijective": there is precisely one URI for each prefix and
one prefix for each URI; by "monotonic" I mean that prefix+URI
pairs may be added to the map but not removed.)
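
A minimal sketch of such a map, in Python, might look like the
following.  The class name PrefixMap, the method prefix_for() and the
numbering scheme for resolving prefix clashes are all assumptions made
for the example; any scheme works as long as entries are never removed
or rebound.

```python
class PrefixMap:
    """Monotonic, bijective URI/prefix map for a serializer."""

    def __init__(self):
        self._prefix_for = {}   # URI -> prefix
        self._uri_for = {}      # prefix -> URI

    def prefix_for(self, uri, preferred="ns"):
        """Return the one prefix for `uri`, allocating it on first use."""
        if uri in self._prefix_for:
            # The URI has been seen before: reuse its prefix, even if
            # the namespace went out of scope in the meantime.
            return self._prefix_for[uri]
        candidate, n = preferred, 0
        while candidate in self._uri_for:
            # The preferred prefix belongs to another URI; pick a fresh
            # one rather than rebinding it.
            n += 1
            candidate = f"{preferred}{n}"
        self._prefix_for[uri] = candidate
        self._uri_for[candidate] = uri
        return candidate
```

A serializer would call prefix_for() each time it needs to write a
name, and emit a declaration only when the returned prefix is not
currently in scope.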

A sane document can be transformed into a normal document simply
by moving all namespace declarations to the root element and
filtering out duplicates.  (This can't be done in streaming
mode, but it might be an appropriate technique for XML databases.)
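
As a rough illustration of that transformation, the Python sketch below
uses the standard xml.sax module.  The names DeclarationCollector,
Hoister and to_normal_form() are hypothetical.  It makes two passes
over the input, which is exactly why it cannot run in streaming mode:
the root start-tag cannot be written until every declaration has been
seen.  It rejects input that is not sane, and also input containing
xmlns="", which cannot be hoisted.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces
from xml.sax.saxutils import XMLGenerator

class DeclarationCollector(xml.sax.ContentHandler):
    """First pass: gather every declaration, refusing insane input."""

    def __init__(self):
        super().__init__()
        self.uri_for = {}      # prefix -> URI (prefix None = default)
        self.prefix_for = {}   # URI -> prefix

    def startPrefixMapping(self, prefix, uri):
        if not uri:
            raise ValueError('xmlns="" cannot be hoisted to the root')
        if self.uri_for.get(prefix, uri) != uri:
            raise ValueError("neurotic: prefix %r is rebound" % prefix)
        if self.prefix_for.get(uri, prefix) != prefix:
            raise ValueError("borderline: %r has two prefixes" % uri)
        self.uri_for[prefix] = uri      # duplicates collapse here
        self.prefix_for[uri] = prefix

class Hoister(XMLGenerator):
    """Second pass: re-emit the document, declaring all on the root."""

    def __init__(self, out, declarations):
        super().__init__(out, encoding="utf-8")
        self._declarations = declarations
        self._seen_root = False

    def startPrefixMapping(self, prefix, uri):
        pass    # suppressed: already declared on the root

    def endPrefixMapping(self, prefix):
        pass

    def startElementNS(self, name, qname, attrs):
        if not self._seen_root:
            self._seen_root = True
            for prefix, uri in self._declarations.items():
                XMLGenerator.startPrefixMapping(self, prefix, uri)
        super().startElementNS(name, qname, attrs)

def _parse(data, handler):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))

def to_normal_form(xml_text):
    """Return xml_text with all namespace declarations on the root."""
    data = xml_text.encode("utf-8")
    collector = DeclarationCollector()
    _parse(data, collector)
    out = io.StringIO()
    _parse(data, Hoister(out, collector.uri_for))
    return out.getvalue()
```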

Now, general-purpose XML consumers cannot expect to receive sane
documents.  However, *special-purpose* consumers, designed to work
with specific markup vocabularies, can be a lot simpler if the
markup vocabulary includes namespace sanity as a requirement.

As an application developer, I'd prefer not to have to worry
about namespace nodes or {URI+localname} pairs.  I'd rather be
able to give the parser an internal namespace map describing
all the namespace URIs I'm interested in, and have the parser
translate QNames in markup to use my prefixes.  Then the application
can work with GIs instead of {URI+localname} pairs.  If the source
document is sane, then it's possible to preserve the original prefixes
on reserialization simply by remembering the original namespace map;
it's not necessary to keep track of namespace nodes during processing.
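
No particular parser API is implied by the paragraph above, but the
idea can be approximated on top of an ordinary namespace-aware parser.
The Python sketch below is one such approximation: APP_PREFIXES, the
urn:example:* URIs and the helper names are invented for the example,
and names in namespaces the application did not list are left in
ElementTree's "{URI}localname" form.

```python
import xml.etree.ElementTree as ET

# The application's own namespace map (hypothetical URIs and prefixes).
APP_PREFIXES = {
    "urn:example:invoice": "inv",
    "urn:example:address": "addr",
}

def to_gi(expanded, prefixes):
    """Rewrite '{URI}local' as 'prefix:local' using the app's prefixes."""
    if not expanded.startswith("{"):
        return expanded                  # name is in no namespace
    uri, local = expanded[1:].split("}", 1)
    if uri not in prefixes:
        return expanded                  # not a namespace we care about
    prefix = prefixes[uri]
    return f"{prefix}:{local}" if prefix else local

def element_gis(xml_text, prefixes):
    """List the GIs of all elements, in document order."""
    root = ET.fromstring(xml_text)
    return [to_gi(el.tag, prefixes) for el in root.iter()]

# The source document uses its own prefixes; the application never
# sees them.
source = ('<x:invoice xmlns:x="urn:example:invoice">'
          '<y:city xmlns:y="urn:example:address"/></x:invoice>')
print(element_gis(source, APP_PREFIXES))  # ['inv:invoice', 'addr:city']
```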

QNames in content are a lot easier to process in a sane document.
Sanity guarantees that a given QName means the same thing wherever
it appears.  Any future markup vocabulary which uses QNames in content
should include sanity as an application requirement.

A requirement for sanity shifts part of the burden onto document
producers, where it's easy to handle.  The alternative is maddening
complexity for document consumers.


--Joe English

  jenglish@flightlab.com