The XML conformance level that dare not speak its name (was Re: Request for Erratum to XML 1.0 and 1.1 Specs)

Norman Walsh wrote:

> | In my view, adding another XML conformance level below well formed is
> | not an erratum. Its a major change to the language.
> |
> | Encouraging XHTML (and MathML) processors to deal with non well formed
> | documents strikes me as highly dangerous and damaging; it could kill
> | off the already precarious position of client-side XML and relegate
> | XML to back-end processing only while perpetuating the 'non wellformed
> | but looks a bit like XML' mess. Pages purporting to be XHTML are
> | already the second highest type of non wellformed document. Lets not
> | encourage this practice.
> 
> What Chris said. In spades.

This is just to restate (in case it wasn't clear) that I have already
withdrawn that first request. This is partly in response to Chris'
comments, which prompted me to go through the spec again.

In its place I have sent an amended request, which is that the XML
REC should more clearly state that an application is allowed to
substitute default values when it finds an Unexpanded Entity Reference
Information Item in the infoset. And further, that the XML WG should
recommend to the XHTML, XML Schema and XSLT WGs that this is the
appropriate way to handle references to standard characters.

(I.e. in their specs, they should say "An XHTML/XS/XSLT processor
may/should expand Unexpanded Entity Reference Information Items
which have the same name as one of W3C/ISO's standard character entities
with the replacement text for that entity before processing." And the
XML spec should add a note clarifying that this is licit. The intent
is still the same: that SAX processors can expand standard references
in certain situations.)
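
As a rough illustration of the kind of application-level expansion being
requested, here is a minimal Python sketch. It is not taken from any spec
or existing processor. The UnexpandedEntityRef class and the
expand_standard_refs function are hypothetical names invented for this
example, and Python's html.entities.name2codepoint table (HTML 4 entity
names) is only assumed as a stand-in for the W3C/ISO standard character
entity sets.

```python
from dataclasses import dataclass
from html.entities import name2codepoint


@dataclass(frozen=True)
class UnexpandedEntityRef:
    """Hypothetical stand-in for an Unexpanded Entity Reference Information Item."""
    name: str


def expand_standard_refs(children):
    """Replace references to standard character entities with their text.

    `children` is a list mixing plain strings and UnexpandedEntityRef items,
    as an application might see them after a non-validating parse.
    References whose names are not in the standard table are passed through
    untouched, so the application can still decide what to do with them.
    """
    out = []
    for item in children:
        if isinstance(item, UnexpandedEntityRef) and item.name in name2codepoint:
            # Assumption: HTML 4 names approximate the standard entity sets.
            out.append(chr(name2codepoint[item.name]))
        else:
            out.append(item)
    return out
```

For example, expand_standard_refs(["caf", UnexpandedEntityRef("eacute")])
yields the two strings "caf" and "é", while a reference to an unknown name
such as UnexpandedEntityRef("mine") is left in place.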

Please note that this does not create a new XML conformance level: it is
defined in terms of the infoset. That the Infoset REC has created, de 
facto, a conformance level less than well-formed (because the infoset
can come from the result of parsing a non-WF document with a
non-validating parser) is not a problem, except perhaps to our heads:
but it is just reflecting what is nascent in the XML spec.

This de facto conformance level currently exists as an unnamed gap:
all documents that can be represented in the infoset minus all WF 
documents. It is fine if the WG does not want to muddy the waters by 
giving it a name: but it exists nonetheless, and rather than deprecating 
it, I believe it gives a good way out from the entity mess.

So I think it is wrong to think I am inventing some new conformance
level. People talk as if the XML spec specified the following orthogonal
kinds of documents:
   [valid]
   [WF]
   [not-XML]
but this is incorrect. It actually defines four kinds:
  [WF, parsed validated]
  [WF, parsed non-validated]
  [non-WF, parsed non-validated]
  [non-WF, parse failed]

The Infoset takes in all of the first three. This is why I think
too much fear of "creating" a new conformance class is misguided
(not that I am asking for such a thing!) because such a class
is already there, unnamed but common, lurking but ready to be
tamed.
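
The four kinds and the "unnamed gap" can be written down as simple sets.
This is only a sketch of the classification argued for in this message;
the Kind enum and the set names are hypothetical labels chosen for the
example, not terms from the XML or Infoset RECs.

```python
from enum import Enum


class Kind(Enum):
    WF_VALIDATED = "WF, parsed validated"
    WF_NONVALIDATED = "WF, parsed non-validated"
    NONWF_NONVALIDATED = "non-WF, parsed non-validated"
    NONWF_FAILED = "non-WF, parse failed"


WELL_FORMED = {Kind.WF_VALIDATED, Kind.WF_NONVALIDATED}

# As argued above: the Infoset takes in the first three kinds.
INFOSET_REPRESENTABLE = WELL_FORMED | {Kind.NONWF_NONVALIDATED}

# The unnamed gap: everything the infoset can represent, minus all WF documents.
UNNAMED_GAP = INFOSET_REPRESENTABLE - WELL_FORMED
```

Under this reading, UNNAMED_GAP contains exactly one kind, the non-WF
document that a non-validating parser nevertheless accepts.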

I think it comes down to the question: is the class "XML document"
defined by all things that are accepted by an XML parser without
error?  I think that should be so. Talking of WF as the bottom line
of what XML is may be nice, but it is a fiction: there are non-WF
documents accepted by non-validating XML processors. We need to start
to downplay WF as the bottom line, and start playing up
"non-DTD-validated XML" as the bottom line, in order to reflect reality
and provide a workable, simple, convenient, non-disruptive way out for
XHTML/XS/XSL.


Cheers
Rick Jelliffe

Received on Thursday, 30 October 2003 01:19:25 UTC