Re: Draft from Jeni Tennison on 2012-02-22 (public-xml-er@w3.org from February 2012)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Wed, 22 Feb 2012 13:51:54 +0000
To: Norman Walsh <ndw@nwalsh.com>
Cc: W3C XML-ER Community Group <public-xml-er@w3.org>
Message-Id: <359825D2-40D4-4831-BD05-3553614768BA@jenitennison.com>

On 21 Feb 2012, at 16:30, Norman Walsh wrote:
> The things that are not XML are well defined. We get to decide what
> things are not XML-ER.
> 
> I'm not sure what the right answer is. Some things seem clearly not to
> be XML-ER. For example, if I feed a JPEG image to the XML-ER parser,
> it's hard to imagine any value coming from any "document" produced by
> parsing that "successfully".
> 
> OTOH, a plain text document is less clearly "not XML-ER" to me. This is
> one place where a schema-agnostic parser is at a disadvantage. If you hand
> 
>  The quick brown fox
> 
> to an HTML parser, it can manufacture a bunch of wrapper elements.
> 
> I was just thinking about this the other day. I wonder if XML-ER
> "documents" that don't have a clear root element should get one:
> 
>  <er:document xmlns:er="whateverwedecide">The quick brown fox</er:document>
> 

I'd suggest that in cases where the input really doesn't look anything like XML (i.e. its first non-whitespace character isn't a <), an XML-ER parser does whatever an HTML parser does. HTML is as good a vocabulary as any for representing such content, and the rules are already defined and implemented, particularly in the key places where we expect XML-ER to be used.

That would effectively limit the scope of what we have to define for XML-ER parsing, which is a good thing. The side effect, of course, is that something like:

  I forgot my document element but I'll still
  have a <table><p>containing a paragraph!</p>
  <tr><td>just because I can</td></tr></table>

would lead to all sorts of strange HTML-specific fix-up taking place, but any documents that are that badly munged are almost bound to actually be HTML anyway :)
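
To make the suggested rule concrete, here is a minimal sketch of the dispatch step only. The function names and the "xml-er" / "html" labels are hypothetical, and treating "whitespace" as the four XML whitespace characters (space, tab, CR, LF) is an assumption; nothing here is a defined part of XML-ER.

```python
# Hypothetical sketch: decide which parser should handle an input,
# based only on its first non-whitespace character.

XML_WHITESPACE = " \t\r\n"  # assumption: XML's definition of whitespace


def looks_like_xml(text: str) -> bool:
    """Return True if the first non-whitespace character is '<'."""
    return text.lstrip(XML_WHITESPACE).startswith("<")


def choose_parser(text: str) -> str:
    """Return a label for the parser that would take the input.

    "xml-er" and "html" are placeholder labels, not real parser names.
    """
    return "xml-er" if looks_like_xml(text) else "html"


if __name__ == "__main__":
    for sample in ["The quick brown fox", "  \n<doc>text</doc>"]:
        print(repr(sample), "->", choose_parser(sample))
```

Under this sketch both "The quick brown fox" and the table example above would be handed to the HTML parsing rules, since neither starts with a <.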

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com

Received on Wednesday, 22 February 2012 13:52:21 UTC