Re: XML namespaces on the Web from Lachlan Hunt on 2009-11-18 (public-html@w3.org from November 2009)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Wed, 18 Nov 2009 14:16:29 +0100
To: Liam Quin <liam@w3.org>
Cc: public-html@w3.org, public-xml-core-wg@w3.org
Message-ID: <4B03F3AD.9080605@lachy.id.au>
Liam Quin wrote:
> On Tue, Nov 17, 2009 at 08:26:16PM +0100, Lachlan Hunt wrote:
>> Liam Quin wrote:
>>> To amplify a little... the XML Spec says (in essence)
>>> that software that takes something (anything at all)
>>> that is not well-formed XML, can turn it into XML, but,
>>> if it does, it must not claim that the original input
>>> was XML.
>>
>> If that is really the case, then that is a problem because of the lack
>> of defined error recovery behaviour.
>
> No, not at all. The standard XML behaviour is that if it's got
> well-formedness errors in it, it's not XML.  It's a fatal error
> to try and process such a "document" as XML.
>
> But that doesn't mean you can't fix the error.

This seems to be turning into a circular argument.  The issue is not 
about whether or not they could fix the error, but rather *how* to fix 
the error.

I've been trying to figure out where exactly the disagreement between us 
lies, but I think we can all agree on the following:

1. There are applications that have the need and/or desire to implement
    non-draconian error recovery for documents created with the
    intention of being XML, but for whatever reason are not well-formed.

2. In order to achieve interoperability among such applications, it is
    necessary to have a specification that clearly defines how to parse
    documents intended to be XML and recover from any fatal errors.

3. The XML 1.0 specification only defines the format of a well-formed
    XML document.  Anything else is left undefined, and the spec takes no
    position on how to process documents that are not well-formed,
    beyond requiring that the error be reported to the application and
    giving a vague requirement about not continuing normal processing.
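
As a rough illustration of point 3, here is a minimal sketch using
Python's standard xml.etree.ElementTree as a stand-in for a typical
draconian XML parser.  The helper name parse_strict and the sample
strings are my own assumptions for illustration; nothing here is taken
from the XML 1.0 specification itself.

```python
import xml.etree.ElementTree as ET


def parse_strict(text):
    """Return (root, None) on success, or (None, message) on a fatal error.

    parse_strict is a hypothetical helper name used only for this sketch.
    """
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        # The parser reports the error and produces no document at all.
        return None, str(err)


well_formed = "<p>Hello <b>world</b></p>"
not_well_formed = "<p>Hello <b>world</p>"  # <b> is never closed

for sample in (well_formed, not_well_formed):
    root, error = parse_strict(sample)
    if error is None:
        print("parsed, root element:", root.tag)
    else:
        print("fatal error:", error)
```

The second input yields only an error report.  What a recovering
application should build from it instead is exactly the part that is
left undefined.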

I think the source of disagreement comes from a much deeper 
philosophical difference here between the approaches taken by XML and HTML.

The approach taken by the XML specifications is to define what
constitutes a well-formed document, while leaving undefined the question
of what the data is if, during parsing, it turns out not to be
well-formed; it is simply not XML.  From a document format and
conformance perspective alone, I can understand the logic behind this.
However, it makes less sense from an implementation perspective, where
an XML parser needs to process, in some way, any input passed to it on
the presumption that it is XML.

This differs from the approach taken by HTML5, which simply makes a
distinction between conforming and non-conforming HTML documents, while
still accepting that non-conforming documents are, for all intents and
purposes, HTML.

This table roughly illustrates the difference:

Intended Resource Type | No Errors       | Syntax Errors
=======================+=================+======================
HTML                   | Conforming HTML | Non-conforming HTML
-----------------------+-----------------+----------------------
XML                    | Well-formed XML | Undefined
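
The HTML row of the table can be sketched with Python's standard
html.parser module.  Note the assumptions: html.parser is only a
lenient tokenizer, not an implementation of the HTML5 parsing
algorithm, and the names EventCollector and tokenize_html are
hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record tokenizer events; hypothetical helper for this sketch."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize_html(text):
    """Feed text to the lenient tokenizer and return the recorded events."""
    collector = EventCollector()
    collector.feed(text)
    collector.close()
    return collector.events


# The same input that a strict XML parser rejects: <b> is never closed.
for event in tokenize_html("<p>Hello <b>world</p>"):
    print(event)
```

No exception is raised for the unclosed element; the input is still
processed as HTML, merely non-conforming HTML.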

It seems that those people supporting the XML philosophy consider it 
more of a feature that XML leaves non-well-formed data undefined, 
whereas others, including myself, consider it to be a flaw in the design 
of the XML specification, which the XML5 proposal is attempting to rectify.

The current XML5 proposal focusses entirely on the parsing issue, 
leaving the definition of what's considered to be a conforming, 
well-formed XML document to XML 1.0.  So, in this sense, it is fully 
compatible with XML 1.0, and any conforming XML 1.0 parser will also be 
a conforming XML5 parser, as the algorithm allows for either aborting or 
applying the defined recovery procedure upon encountering a fatal error.
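
To make the "either abort or recover" choice concrete, here is a toy
sketch.  It is emphatically not the XML5 algorithm: the recovery rule
(implicitly closing open elements), the tiny attribute-free tag syntax,
and the names parse, FatalError and recover are all assumptions
invented for this example.

```python
import re

# Toy grammar: <name>, </name>, <name/> and character data; no attributes.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)\s*(/?)>|([^<]+)")


class FatalError(Exception):
    """Raised when the parser aborts on a well-formedness error."""


def parse(text, recover=False):
    """Return a flat event list; abort or recover on mismatched tags."""
    events, open_tags = [], []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            # This toy never recovers from input it cannot tokenize.
            raise FatalError(f"unparseable input at offset {pos}")
        pos = match.end()
        closing, name, empty, data = match.groups()
        if data is not None:
            events.append(("text", data))
        elif not closing:
            events.append(("start", name))
            if empty:
                events.append(("end", name))
            else:
                open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()
            events.append(("end", name))
        elif not recover:
            raise FatalError(f"mismatched end tag </{name}>")
        elif name in open_tags:
            # Assumed recovery rule: close intervening elements first.
            while open_tags[-1] != name:
                events.append(("end", open_tags.pop()))
            open_tags.pop()
            events.append(("end", name))
        # Otherwise (recover mode): a stray end tag is ignored.
    if open_tags:
        if not recover:
            raise FatalError(f"unclosed element <{open_tags[-1]}>")
        while open_tags:
            events.append(("end", open_tags.pop()))
    return events


print(parse("<p>Hello <b>world</p>", recover=True))
```

For well-formed input both modes return identical events, which is the
sense in which a parser that always aborts remains compatible with one
that recovers.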

However, there have also been some suggestions to extend the list of 
pre-defined entity references to all of those defined in HTML5 (which 
includes the XHTML and MathML sets).  If this were done, then conforming 
XML 1.0 parsers would need to be updated to recognise these entities in 
order to become conforming XML5 parsers.
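
The entity point can be sketched with Python's standard library, again
only as an illustration.  xml.etree.ElementTree stands in for an XML
1.0 parser and html.unescape for a processor that knows the HTML named
character references; the helper name xml_accepts is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(text):
    """Return True if the strict XML parser accepts text (sketch helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# &amp; is one of the five entities predefined by XML 1.0.
print(xml_accepts("<p>a&amp;b</p>"))
# &nbsp; is not predefined in XML 1.0 and no DTD declares it here.
print(xml_accepts("<p>a&nbsp;b</p>"))
# A processor with the HTML entity table resolves it to U+00A0.
print(repr(html.unescape("a&nbsp;b")))
```

An XML 1.0 parser treats the undeclared reference as a fatal error,
which is why extending the predefined set would require existing
parsers to change.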

-- 
Lachlan Hunt - Opera Software
http://lachy.id.au/
http://www.opera.com/
Received on Wednesday, 18 November 2009 13:17:11 UTC