Re: Draft from Innovimax W3C on 2012-02-22 (public-xml-er@w3.org from February 2012)

From: Innovimax W3C <innovimax+w3c@gmail.com>
Date: Thu, 23 Feb 2012 00:30:12 +0100
To: Jeni Tennison <jeni@jenitennison.com>
Cc: Norman Walsh <ndw@nwalsh.com>, W3C XML-ER Community Group <public-xml-er@w3.org>
Message-ID: <CAAK2GfEnFCPVt6T1us7EU2D+AtRheoR-m-FEJ+H-FRKwsQpTyQ@mail.gmail.com>

On Wed, Feb 22, 2012 at 2:51 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
>
> On 21 Feb 2012, at 16:30, Norman Walsh wrote:
>> The things that are not XML are well defined. We get to decide what
>> things are not XML-ER.
>>
>> I'm not sure what the right answer is. Some things seem clearly not to
>> be XML-ER. For example, if I feed a JPEG image to the XML-ER parser,
>> it's hard to imagine any value coming from any "document" produced by
>> parsing that "successfully".
>>
>> OTOH, a plain text document is less clearly "not XML-ER" to me. This is
>> one place where a schema-agnostic parser is at a disadvantage. If you hand
>>
>>  The quick brown fox
>>
>> to an HTML parser, it can manufacture a bunch of wrapper elements.
>>
>> I was just thinking about this the other day. I wonder if XML-ER
>> "documents" that don't have a clear root element should get one:
>>
>>  <er:document xmlns:er="whateverwedecide">The quick brown fox</er:document>
>>
>
> I'd suggest that in cases where the input really doesn't look anything like XML (i.e. its first non-whitespace character isn't a <), an XML-ER parser does whatever it is that HTML does. HTML is as good a vocabulary as any for representing such content and the rules are already defined and implemented, particularly in the key places where we expect XML-ER to be used.
>
> That would effectively limit the scope of what we have to define for XML-ER parsing, which is a good thing. The side-effect of course is that something like:
>
>  I forgot my document element but I'll still
>  have a <table><p>containing a paragraph!</p>
>  <tr><td>just because I can</td></tr></table>
>
> would lead to all sorts of strange HTML-specific fix-up taking place, but any documents that are that badly munged are almost bound to actually be HTML anyway :)
>

Really interesting idea...

But one nasty consequence is that an XML-ER parser will have to contain
an HTML5 parser...

http://software.hixie.ch/utilities/js/live-dom-viewer/?I%20forgot%20my%20document%20element%20but%20I'll%20still%C2%A0have%20a%20%3Ctable%3E%3Cp%3Econtaining%20a%20paragraph!%3C%2Fp%3E%3Ctr%3E%3Ctd%3Ejust%20because%20I%20can%3C%2Ftd%3E%3C%2Ftr%3E%3C%2Ftable%3E
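
For concreteness, the dispatch rule Jeni describes could be sketched
roughly as below. This is only an illustration under assumptions: the
function names and the "xml-er" / "html" labels are hypothetical, the
whitespace set is the XML 1.0 one (space, tab, CR, LF), and a leading
BOM is not handled.

```python
# Hypothetical sketch of the rule "first non-whitespace character isn't a <".
# Names and return labels are made up for illustration; nothing here is
# defined by XML-ER.

XML_WHITESPACE = " \t\r\n"  # whitespace characters as defined by XML 1.0


def first_non_whitespace(text: str) -> str:
    """Return the first non-whitespace character, or "" if there is none."""
    return text.lstrip(XML_WHITESPACE)[:1]


def choose_parser(text: str) -> str:
    """Pick which parsing rules to apply to the input.

    Input that starts with "<" (after whitespace) is treated as
    XML-looking; everything else, including empty input, is handed to
    the HTML rules.
    """
    return "xml-er" if first_non_whitespace(text) == "<" else "html"


if __name__ == "__main__":
    print(choose_parser("The quick brown fox"))
    print(choose_parser("  <doc>hello</doc>"))
```

Note that under this rule Jeni's example is routed to the HTML parser
even though it contains markup, because its first non-whitespace
character is "I" and not "<".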

Mohamed

-- 
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 9 52 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr
RCS Paris 488.018.631
SARL au capital de 10.000 €

Received on Wednesday, 22 February 2012 23:30:40 UTC