RE: Draft from Derek Read on 2012-02-23 (public-xml-er@w3.org from February 2012)

From: Derek Read <derek.read@justsystems.com>
Date: Wed, 22 Feb 2012 16:27:02 -0800
To: "Derek Read" <derek.read@justsystems.com>, "W3C XML-ER Community Group" <public-xml-er@w3.org>
Message-ID: <BECDDDED92C3B949A38F5BC4BF56D21F04B2027B@van-mail.jena.local>

To clarify what I meant by "illegal" vs. "should not be used", see:

http://www.w3.org/TR/xml/#charsets


The Recommendation actually uses the term "legal" (it phrases character usage in a positive way, listing what is allowed) and separately provides lists of characters that authors are encouraged to "avoid".
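
To make the distinction concrete, here is a minimal Python sketch of the two classes of characters. The ranges are my reading of the Char production and the "discouraged" note in section 2.2 of XML 1.0 (Fifth Edition); the function names are hypothetical and not taken from any library.

```python
def is_legal_xml_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production (a "legal" character)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_discouraged_xml_char(ch: str) -> bool:
    """True if ch is legal but falls in a range the Recommendation discourages."""
    if not is_legal_xml_char(ch):
        return False  # illegal characters are a separate category
    cp = ord(ch)
    if 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F:
        return True  # control characters, except U+0085 (NEL)
    if 0xFDD0 <= cp <= 0xFDEF:
        return True  # noncharacters in the BMP
    # Last two code points of each supplementary plane (xFFFE and xFFFF).
    return cp >= 0x1FFFE and (cp & 0xFFFE) == 0xFFFE
```

Under this reading, U+0000 is illegal, while U+007F is legal but discouraged.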

Derek Read
Program Manager, XMetaL
JustSystems Canada Inc.


-----Original Message-----
From: Derek Read [mailto:derek.read@justsystems.com] 
Sent: Wednesday, February 22, 2012 4:08 PM
To: W3C XML-ER Community Group
Subject: RE: Draft

[Warning: I'm wary of the whole XML-ER idea in general. Wouldn't the writer of an XML document who is concerned about things at this level of detail write a well-formed XML document in the first place? That is, I'm all for the idea that an XML parser should halt after encountering an error.]

However, taking this one step further to the extreme (and back to the current thread): should a "document" consisting of text only be transformed into a well-formed document when the text consists of either of the following?

One or more whitespace characters and nothing else?
An empty (zero length) file?

In some contexts an element containing only whitespace could be significant. This would be pretty extreme, but such a file might be embedded as a chunk inside some other XML stream at some point, such as <root>some text<ws> </ws>more text</root>, where the space inside <ws> is "significant" to somebody.

Keeping with the idea of automatically wrapping a document without a root element in an element, perhaps specific rules should be created for these possibilities:

1) zero length document handling -- clearly not XML?
2) documents containing only whitespace -- could go either way I think
3) documents containing non-whitespace characters (and optionally whitespace) -- this is the case originally highlighted (wrap the text with a default element)
4) documents that match #3 but in which one or more characters are illegal -- two options: reject the document or replace the character with some kind of placeholder (PI or other)
5) documents that match #3 but contain one or more characters that "should not be used" (not including illegal characters) -- three options: reject the document, replace the character with some kind of placeholder (PI or other), leave the character as is

For cases 4 and 5, I think we could deal with wrapping the text in an element after the illegal/"should not be used" character handling is done.
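
The five cases and the "handle characters first, then wrap" ordering could be sketched roughly as follows in Python. This is only an illustration under several assumptions: the er:document element and its "whateverwedecide" namespace are the placeholders from Norman's message, U+FFFD stands in for "some kind of placeholder" (a PI would work equally well), whitespace-only input is wrapped (it could go either way), and "should not be used" characters are left as is. All names are hypothetical.

```python
from enum import Enum
from xml.sax.saxutils import escape

XML_WHITESPACE = " \t\r\n"


class Case(Enum):
    EMPTY = 1              # zero-length input
    WHITESPACE_ONLY = 2    # only XML whitespace
    TEXT = 3               # ordinary text, nothing problematic
    ILLEGAL_CHARS = 4      # at least one character outside the Char production
    DISCOURAGED_CHARS = 5  # legal, but contains "should not be used" characters


def _legal(cp: int) -> bool:
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def _discouraged(cp: int) -> bool:
    return (0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F
            or 0xFDD0 <= cp <= 0xFDEF
            or (cp >= 0x1FFFE and (cp & 0xFFFE) == 0xFFFE))


def classify(text: str) -> Case:
    if text == "":
        return Case.EMPTY
    if all(ch in XML_WHITESPACE for ch in text):
        return Case.WHITESPACE_ONLY
    code_points = [ord(ch) for ch in text]
    if any(not _legal(cp) for cp in code_points):
        return Case.ILLEGAL_CHARS
    if any(_discouraged(cp) for cp in code_points):
        return Case.DISCOURAGED_CHARS
    return Case.TEXT


def wrap(text: str, placeholder: str = "\ufffd") -> str:
    """Replace illegal characters first, then wrap the text in a root element."""
    if classify(text) is Case.EMPTY:
        raise ValueError("zero-length input is treated as not XML-ER")
    cleaned = "".join(ch if _legal(ord(ch)) else placeholder for ch in text)
    return ('<er:document xmlns:er="whateverwedecide">'
            + escape(cleaned) + "</er:document>")
```

For example, wrap("The quick brown fox") produces the wrapped form from Norman's message, and any "<" or "&" in the text is escaped so the result stays well-formed.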

Derek Read
Program Manager, XMetaL
JustSystems Canada Inc.


-----Original Message-----
From: innovimax@gmail.com [mailto:innovimax@gmail.com] On Behalf Of Innovimax W3C
Sent: Wednesday, February 22, 2012 3:30 PM
To: Jeni Tennison
Cc: Norman Walsh; W3C XML-ER Community Group
Subject: Re: Draft

On Wed, Feb 22, 2012 at 2:51 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
>
> On 21 Feb 2012, at 16:30, Norman Walsh wrote:
>> The things that are not XML are well defined. We get to decide what
>> things are not XML-ER.
>>
>> I'm not sure what the right answer is. Some things seem clearly not to
>> be XML-ER. For example, if I feed a JPEG image to the XML-ER parser,
>> it's hard to imagine any value coming from any "document" produced by
>> parsing that "successfully".
>>
>> OTOH, a plain text document is less clearly "not XML-ER" to me. This is
>> one place where a schema-agnostic parser is at a disadvantage. If you hand
>>
>>  The quick brown fox
>>
>> to an HTML parser, it can manufacture a bunch of wrapper elements.
>>
>> I was just thinking about this the other day. I wonder if XML-ER
>> "documents" that don't have a clear root element should get one:
>>
>>  <er:document xmlns:er="whateverwedecide">The quick brown fox</er:document>
>>
>
> I'd suggest that in cases where the input really doesn't look anything like XML (ie whose first non-whitespace character isn't a <), an XML-ER parser does whatever it is that HTML does. HTML is as good a vocabulary as any for representing such content and the rules are already defined and implemented, particularly in the key places where we expect XML-ER to be used.
>
> That would effectively limit the scope of what we have to define for XML-ER parsing, which is a good thing. The side-effect of course is that something like:
>
>  I forgot my document element but I'll still
>  have a <table><p>containing a paragraph!</p>
>  <tr><td>just because I can</td></tr></table>
>
> would lead to all sorts of strange HTML-specific fix-up taking place, but any documents that are that badly munged are almost bound to actually be HTML anyway :)
>

Really interesting idea...

But one nasty consequence is that an XML-ER parser would have to contain
an HTML5 parser...
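
The dispatch test itself (is the first non-whitespace character a "<"?) is cheap; the cost is all in the HTML5 parser behind it. A minimal Python sketch of just the test, with a hypothetical function name and return values, and assuming the input is already decoded with any byte order mark removed:

```python
XML_WHITESPACE = " \t\r\n"


def choose_parser(text: str) -> str:
    """Return "xml-er" if the first non-whitespace character is "<", else "html"."""
    stripped = text.lstrip(XML_WHITESPACE)
    # Empty or whitespace-only input has no first non-whitespace character;
    # this sketch sends it to the HTML path, which is an arbitrary choice.
    return "xml-er" if stripped.startswith("<") else "html"
```

With this rule, "The quick brown fox" and Jeni's "I forgot my document element..." example both go to the HTML parser, while "  <root/>" stays with XML-ER.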

http://software.hixie.ch/utilities/js/live-dom-viewer/?I%20forgot%20my%20document%20element%20but%20I'll%20still%C2%A0have%20a%20%3Ctable%3E%3Cp%3Econtaining%20a%20paragraph!%3C%2Fp%3E%3Ctr%3E%3Ctd%3Ejust%20because%20I%20can%3C%2Ftd%3E%3C%2Ftr%3E%3C%2Ftable%3E

Mohamed

-- 
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 9 52 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr

RCS Paris 488.018.631
SARL au capital de 10.000 €
Received on Thursday, 23 February 2012 00:27:26 UTC