Re: Suggested revised text for HTML/XML report intro from John Cowan on 2011-08-16 (public-html-xml@w3.org from August 2011)

From: John Cowan <cowan@mercury.ccil.org>
Date: Tue, 16 Aug 2011 10:09:00 -0400
To: Anne van Kesteren <annevk@opera.com>
Cc: Noah Mendelsohn <nrm@arcanedomain.com>, "public-html-xml@w3.org" <public-html-xml@w3.org>, Larry Masinter <LMM@acm.org>
Message-ID: <20110816140900.GA8113@mercury.ccil.org>

Anne van Kesteren scripsit:

> Your principle is wrong. HTML is not repaired; processing just does
> not stop.

As the author of TagSoup, I'm well aware of that.  However, it's
tangential to my point.

> Doing the same for XML is fairly trivial.

There are many proposals in the literature for repairing XML (or, if you
like, specifying the processing of non-well-formed XML), ranging from
XML5 and a 2004 proposal by Siefkes
<http://conferences.idealliance.org/extreme/html/2004/Siefkes01/EML2004Siefkes01.html>,
which are independent of
any schema, to TagSoup, which depends on a specially written schema
in its own schema language, to Blažević's 2010 implementation, which
employs a RELAX NG schema and hints in the form of PIs.

The problem is that there is no compelling reason to prefer one approach
to any other.  Without such a justification, all we end up doing is
complicating the description of XML further: instead of being able to say
"report a fatal error", we must specify in detail exactly what infoset to
produce for violations of each of the 83 productions, 12 well-formedness
constraints, and 8 miscellaneous fatal-error specifications in XML 1.0
(Fifth Edition).
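
To make the contrast concrete, here is a minimal Python sketch (an
illustration only, not any of the proposals named above).  It assumes
the standard-library xml.etree.ElementTree module as a stand-in for an
XML processor that reports a fatal error, and html.parser as a stand-in
for a reader whose processing "just does not stop"; html.parser is a
lenient tokenizer, not an HTML5 tree builder.  The names BROKEN,
xml_outcome, EventLog, and html_events are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: the <b> element is never closed.
BROKEN = "<p><b>bold</p>"


def xml_outcome(text):
    """Return "ok" if text parses as XML, else "fatal error"."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The XML parser stops at the first well-formedness violation.
        return "fatal error"
    return "ok"


class EventLog(HTMLParser):
    """Record start/end tag events; no repair is attempted."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(text):
    """Feed text to the lenient reader and return the events it saw."""
    log = EventLog()
    log.feed(text)
    log.close()
    return log.events


if __name__ == "__main__":
    print(xml_outcome(BROKEN))
    print(html_events(BROKEN))
```

The XML side needs one sentence of specification: stop and report.
The lenient side merely emits whatever events it sees; deciding what
tree (or infoset) those events should produce is the part that would
have to be specified case by case.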

-- 
John Cowan          http://www.ccil.org/~cowan         cowan@ccil.org
The native charset of SMS messages supports English, French, mainland
Scandinavian languages, German, Italian, Spanish with no accents, and
GREEK SHOUTING.  Everything else has to be Unicode, which means you get
only 70 16-bit characters in a text instead of 160 7-bit characters.

Received on Tuesday, 16 August 2011 14:09:27 UTC