- From: Kurt Cagle <kurt.cagle@gmail.com>
- Date: Wed, 22 Dec 2010 15:11:17 -0500
- To: Noah Mendelsohn <nrm@arcanedomain.com>
- Cc: John Cowan <cowan@mercury.ccil.org>, David Carlisle <davidc@nag.co.uk>, Henri Sivonen <hsivonen@iki.fi>, public-html-xml@w3.org
- Message-ID: <AANLkTikSFn79Zpre=HBF30J-kAMki1cPDgihwrJjv5f8@mail.gmail.com>
Not on the TC, but a thought here about well-formedness vs. validation:

The challenge that I see XML5 introducing is that it requires a change not only in validation behavior, but also in what is considered well-formedness, and I would argue that it is the latter issue that needs to be of bigger concern to both the HTML and XML groups.

At heart is this fundamental conflict: HTML's mandate is to provide a markup language that is fault tolerant, based on at least the assumption that the authors of such HTML are likely not programmers, and as such may introduce code that would break in a stricter environment. XML's mandate is to provide a markup language that is fault intolerant, because fault tolerance may end up introducing ambiguous or even erroneous assertions that can prove difficult to resolve, especially when you are processing thousands or even millions of such documents.

Perhaps one solution to this particular dilemma is to ask whether such tolerance should reside not within the language itself but within the parser and serializer. Establish a parseLevel of #strict or #lax as a property on the relevant parsers, which would interpret the content strictly as XML 1.0 when set to #strict, or as HTXML when set to #lax. Serialization would similarly follow an XML or HTXML model. This is a pre-validation step; it only handles the parsing.

I think this would resolve a lot of things. Because of the conflicting well-formedness mandates, I don't necessarily see any resolution on the XML/HTXML issue any time soon, and I'm increasingly wondering whether that's all that good an idea anyway. It means that XML parsers can in fact consume HTML content that is ill-formed from their perspective and not choke, while at the same time working consistently with well-formed XML. This then becomes a case of caveat emptor from the developer's perspective: if you use lax parsing, expect the unexpected. As an added benefit, it resolves the reams upon reams of bad RSS2 content.

It would require reworking the parsers, of course, but I see this in many ways as an easier step than dealing with billions of files of legacy XML and HTML.

Kurt Cagle
XML Architect
*Lockheed / US National Archives ERA Project*

On Wed, Dec 22, 2010 at 1:06 PM, Noah Mendelsohn <nrm@arcanedomain.com> wrote:

> On 12/20/2010 4:25 PM, John Cowan wrote:
>
>> Noah Mendelsohn scripsit:
>>
>>> * Being liberal in what you accept has arguably proven useful on the
>>> Web, but we may offer better value in helping users to be conservative
>>> in what they send. FWIW: I find that XML validation of my (X)HTML
>>> sometimes trips on errors I wouldn't need to fix in practice, but
>>> often it catches errors that would cause a browser to skip significant
>>> content when rendering. So, I find XML validation to be valuable;
>>> maybe or maybe not a good HTML5 validator would meet the need instead.
>>> Anyway, I think we need to think about the right mix of XML and HTML
>>> validation, in cases where users wish to ensure that generated or
>>> hand-authored content is correct.
>>
>> Validation is important, and I'm not arguing against it. What I don't
>> think matters is XML *validity*. There are now many other useful ways
>> to validate documents that are not XML-valid.
>
> Good catch. I said XML validation. I mostly meant well-formedness
> checking. I didn't mean to suggest one way or the other whether
> schema-level validation might also be useful, and if so, using what schema
> languages.
>
> Noah
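A minimal sketch of the strict/lax switch Cagle describes, assuming Python and lxml's recovering parser as a stand-in for a lax (HTXML-style) mode; the parseLevel name and the #strict/#lax values come from the proposal above, while the mapping onto lxml and everything else here is illustrative rather than any actual parser API:

```python
# Sketch only: parseLevel/#strict/#lax are from the proposal above;
# lxml's recover flag is used here as a stand-in for a lax HTXML-style parse.
from lxml import etree

def parse(source, parse_level="#strict"):
    """Parse `source` strictly as XML 1.0 (#strict) or tolerantly (#lax)."""
    parser = etree.XMLParser(recover=(parse_level == "#lax"))
    return etree.fromstring(source, parser=parser)

well_formed = b"<p>hello <b>world</b></p>"
ill_formed = b"<p>hello <b>world</p>"   # unclosed <b>: not well-formed XML

print(etree.tostring(parse(well_formed)))          # parses under either mode
print(etree.tostring(parse(ill_formed, "#lax")))   # lax mode repairs the markup

try:
    parse(ill_formed)                              # strict mode rejects it
except etree.XMLSyntaxError as err:
    print("strict parse rejected it:", err)
```

Under this reading, the tolerance lives entirely in the parser configuration: the same document model comes out either way, and validation remains a separate, later step, which matches the "pre-validation step, it only handles the parsing" framing above.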
Received on Wednesday, 22 December 2010 20:12:21 UTC