Re: Draft from Jeni Tennison on 2012-02-23 (public-xml-er@w3.org from February 2012)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Thu, 23 Feb 2012 09:24:38 +0000
To: Innovimax W3C <innovimax+w3c@gmail.com>
Cc: Norman Walsh <ndw@nwalsh.com>, W3C XML-ER Community Group <public-xml-er@w3.org>
Message-Id: <27B4ADD9-F2F4-4791-A587-537480B7AB5F@jenitennison.com>

On 22 Feb 2012, at 23:30, Innovimax W3C wrote:
> On Wed, Feb 22, 2012 at 2:51 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
>> On 21 Feb 2012, at 16:30, Norman Walsh wrote:
>> I'd suggest that in cases where the input really doesn't look anything like XML (i.e. its first non-whitespace character isn't a <), an XML-ER parser does whatever it is that HTML does. HTML is as good a vocabulary as any for representing such content and the rules are already defined and implemented, particularly in the key places where we expect XML-ER to be used.
>> 
>> That would effectively limit the scope of what we have to define for XML-ER parsing, which is a good thing. The side-effect of course is that something like:
>> 
>>  I forgot my document element but I'll still
>>  have a <table><p>containing a paragraph!</p>
>>  <tr><td>just because I can</td></tr></table>
>> 
>> would lead to all sorts of strange HTML-specific fix-up taking place, but any documents that are that badly munged are almost bound to actually be HTML anyway :)
>> 
> 
> Really interesting idea...
> 
> But one nasty consequence is that an XML-ER parser will have to contain
> an HTML5 parser...

Not really. It could be specced along the lines of "if there's no document element (however that's defined) then the XML-ER parser should report that the document is not something that can be handled by XML-ER parsing; what the application then does with the document is up to the application."
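
To make that concrete, here is a rough sketch of what "report and hand back to the application" might look like. Everything in it is an assumption for illustration: the names NotXmlErInput, has_document_element and xml_er_parse are made up, and the "first non-whitespace character is a <" test is only the heuristic floated earlier in this thread, not an agreed definition of "document element".

```python
class NotXmlErInput(Exception):
    """Hypothetical signal: the input has no recognisable document element."""


def has_document_element(text: str) -> bool:
    # Assumed heuristic from this thread: the first non-whitespace
    # character must be '<'. The real definition is still open.
    return text.lstrip().startswith("<")


def xml_er_parse(text: str) -> str:
    # Report the problem instead of guessing; the caller decides what next.
    if not has_document_element(text):
        raise NotXmlErInput("no document element; not handled by XML-ER parsing")
    # Actual XML-ER error recovery would go here; out of scope for this sketch.
    return "xml-er"
```

The point of raising rather than recovering is that the parser itself carries no HTML machinery; the application catches the report and picks its own fallback.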

I'd suspect that for most browsers, editors and applications ingesting random rubbish off the internet, the appropriate fallback would be to treat it as HTML, but the XML-ER spec wouldn't have to mandate that. I just think that, particularly in the browser case, treating it as HTML is probably a more useful recovery for this kind of egregious content than either ignoring the text or slapping on a dummy document element.
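
As an illustration of an application making that choice for itself, something along these lines would do. This is a sketch under assumptions: ingest and TagCollector are hypothetical names, the '<' check stands in for whatever "has a document element" ends up meaning, and Python's html.parser is only a tolerant tokenizer, so it does not perform the HTML5 tree-building fix-ups a browser would.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records start tags in document order (illustration only)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def ingest(text, fallback="html"):
    # Stand-in check: input that starts with '<' goes down the XML-ER path.
    if text.lstrip().startswith("<"):
        return ("xml-er", [])
    # The fallback is the application's policy, not something the spec mandates.
    if fallback == "html":
        collector = TagCollector()
        collector.feed(text)
        collector.close()
        return ("html", collector.tags)
    return ("rejected", [])
```

A browser-like application would call ingest(doc) and get the HTML path for the "forgot my document element" example, while a stricter tool could pass fallback=None and simply refuse the input.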

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com

Received on Thursday, 23 February 2012 09:25:00 UTC