Proposal: @parsing="loose | strict" from Doug Schepers on 2009-07-14 (public-html@w3.org from July 2009)

From: Doug Schepers <schepers@w3.org>
Date: Tue, 14 Jul 2009 03:16:07 -0400
To: "public-html@w3.org" <public-html@w3.org>
Message-ID: <4A5C30B7.30403@w3.org>

Hi, HTML WG-

There are advantages and disadvantages to both the strict ("draconian") 
and error-correcting parsing of markup.  HTML evolved to have loose 
parsing with undefined and browser-specific error correction, and XML 
was designed to have well-defined, strict parsing (probably as a 
reaction to the chaotic HTML approach).

We have come full circle on the matter, and the HTML5 spec marries many 
of the advantages of both approaches, by offering a well-defined 
error-correction model.  This has the advantage that it is sometimes 
easier to author (though it can make debugging more difficult), the more 
profound advantage that it hides problems from the reader, and the even 
more important advantage that it is more or less how browsers already 
parse HTML documents.

However, it cannot gracefully address all the situations in which strict 
parsing is an advantage:

* For authoring, it is often useful to know when you have validity or 
well-formedness errors, which helps debug script and CSS, and doing this 
on the fly in the browser is faster and easier while developing than 
reiterative validation with a separate tool;

* Strict markup works predictably for mashups and mixtures of different 
markup languages;

* Draconian error handling enforces structure and content models for 
mission-critical applications, such as the canonical "financial 
transactions" example, where the reader *wants* to know about problems 
in the markup [1], and for use cases with a low tolerance for 
potential errors (such as in government and some industries).

To meet this need, I propose a new attribute, 'parsing', which, when 
placed on the document root, defines the type of parsing which a UA must 
use when parsing the document.  The values would be "loose" and 
"strict", with loose parsing as the default (an omitted @parsing 
attribute would result in loose parsing).

When the parsing is loose, the error-correction algorithms defined in 
HTML5 must be applied; when the parsing is strict, there must be no 
error-correction (as is commonly the case for XHTML in most browsers).
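
As a rough illustration of the dispatch described above (a sketch only, 
not part of any specification), a consumer could inspect the root 
element and choose a parser accordingly.  The names parsing_mode, 
parse_document and TagCollector are hypothetical, and Python's 
html.parser stands in for a tolerant parser; it does not implement the 
HTML5 error-correction algorithms.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical detection of parsing="loose|strict" on the <html> start tag.
ROOT_PARSING = re.compile(
    r"<html\b[^>]*?\bparsing\s*=\s*[\"'](loose|strict)[\"']", re.IGNORECASE
)


class TagCollector(HTMLParser):
    """Tolerant stand-in parser: records start tags, never raises on bad markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parsing_mode(source):
    """Return 'strict' or 'loose'; an omitted attribute means 'loose'."""
    match = ROOT_PARSING.search(source)
    return match.group(1).lower() if match else "loose"


def parse_document(source):
    """Parse source in the mode requested on its root element."""
    if parsing_mode(source) == "strict":
        # Draconian: ET.fromstring raises ParseError on a well-formedness error.
        root = ET.fromstring(source)
        return "strict", [element.tag for element in root.iter()]
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    return "loose", collector.tags
```

With this sketch, a document with an unclosed <p> is accepted when the 
attribute is omitted, while the same markup with parsing="strict" on the 
root raises xml.etree.ElementTree.ParseError instead of being repaired.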

This way, authors could optionally enforce strictness when they want or 
need to, and then change/remove the value when they are ready for 
publication, or when the needs change.  It is possible that there would 
be instances where strict parsing makes it out of development and into 
production code, but this would have relatively few negative 
consequences (the kind of author who uses this would probably produce 
strict code anyway, and would know it if they didn't), and would be 
easily corrected.  And, quite frankly, some people simply prefer 
stricter parsing for aesthetic or other reasons, and this would provide 
them with that option while not imposing it on others.


Had this option been available in XML from the beginning, many problems 
and community schisms might have been avoided.  I believe that presenting 
the option for strict parsing may change how the various communities 
approach HTML5, and avoid further schisms.  I see this as having 
relatively low costs for the specification, and very little 
implementation cost, since browsers will already have both modes (even 
IE has a built-in XML parser, though it doesn't use it for XHTML). 
Please correct me if my assumption here is wrong.

I also believe that this is backwards-compatible, since the default will 
be loose parsing as is already applied, and forwards-compatible, since 
any alternate future parsing models (such as the proposed XML2 or XML5, 
or some use case we don't see today) can be specified as the value for 
@parsing in a later specification without changing how it would be used 
as defined in HTML5.  It may lay the groundwork for a new formulation of 
error-correcting XML, as Anne proposed.


I'm hoping that the dust has sufficiently settled about the parsing 
debate that we can hold a logical discussion of this proposal on its merits.


(Meta: I chose the keywords of the attribute and values for brevity, and 
I'm not at all married to them; treat them as placeholders for the 
purposes of discussing this proposal; another option might be something 
like @error-correction="true | false".  Please don't suggest different 
names quite yet unless they represent a functional difference to this 
proposal.  Also, I've BCC'ed the TAG just so they know.)

[1] http://www.tbray.org/ongoing/When/200x/2004/01/11/PostelPilgrim

Regards-
-Doug Schepers
W3C Team Contact, SVG and WebApps WGs
Received on Tuesday, 14 July 2009 07:17:20 UTC