Re: What problem is this task force trying to solve and why? from David Carlisle on 2010-12-23 (public-html-xml@w3.org from December 2010)

From: David Carlisle <davidc@nag.co.uk>
Date: Thu, 23 Dec 2010 13:30:18 +0000
To: Kurt Cagle <kurt.cagle@gmail.com>
CC: public-html-xml@w3.org
Message-ID: <4D134EEA.8060703@nag.co.uk>

On 23/12/2010 03:18, Kurt Cagle wrote:
> ...s to the data. Such heuristics might include the following:
>
> 1) If a default namespace is not defined globally but an explicit
> namespace is, and the child elements of that namespace are in the
> default namespace, then put them into the explicit namespace:
>
> <ns1:foo xmlns:ns1="myFooNS">
> <bar/>
> <bat/>
> </ns1:foo>
>
> would map to
>
> <ns1:foo xmlns:ns1="myFooNS">
> <ns1:bar/>
> <ns1:bat/>
> </ns1:foo>

That's taking something that is already namespace-well-formed and
transforming it into another document. That is a job for XSLT, not a parser.
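
To illustrate that this is a tree-to-tree transformation rather than a
parsing step, here is a minimal sketch in Python (standing in for XSLT)
that applies the rule to an already parsed document. The function name
adopt_children is hypothetical, and the input assumes the prefix is
declared as xmlns:ns1 so that it parses at all.

```python
import xml.etree.ElementTree as ET

def adopt_children(elem):
    """Move un-namespaced direct children of elem into elem's namespace.

    Hypothetical helper: it only touches direct children, which is one
    possible reading of the proposed rule.
    """
    if not elem.tag.startswith("{"):
        return  # parent is in no namespace, nothing to inherit
    ns = elem.tag[1:].split("}", 1)[0]
    for child in elem:
        # ElementTree stores namespaced names as "{uri}local"
        if isinstance(child.tag, str) and not child.tag.startswith("{"):
            child.tag = "{%s}%s" % (ns, child.tag)

# Assumed input: the quoted example with the prefix declared as ns1.
root = ET.fromstring('<ns1:foo xmlns:ns1="myFooNS"><bar/><bat/></ns1:foo>')
adopt_children(root)
child_tags = [child.tag for child in root]
```

The parser accepts the input unchanged; the renaming happens afterwards,
on the tree.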
>
> 2) if you have an element that repeats without being terminated between
> repeats, then that element will be considered a sibling:
>
> <foo>
> <bar>ABC
> <bar>123
> </foo>
>
> becomes:
>
> <foo>
> <bar>ABC</bar>
> <bar>123</bar>
> </foo>

I'd be very worried about suggesting any such fixup in the absence of
schema-driven rules. I think the only generic fixup for non-well-formed
XML for general elements would be to close any open elements on the
stack when encountering a close tag, until a matching name is found.
That would match HTML5 foreign content parsing and XML5, and produce
<foo>
<bar>ABC
<bar>123
</bar></bar></foo>
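
A minimal sketch of that stack-based fixup in Python, for illustration
only. It is not the HTML5 or XML5 algorithm itself: the regex tokenizer
is a simplification, and two behaviours are my assumptions rather than
anything stated above, namely that a close tag with no matching open
element is dropped, and that anything still open at end of input is
closed. The names TOKEN and close_open_elements are hypothetical.

```python
import re

# Simplified tokenizer: a start/end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)([^<>]*?)(/?)>|([^<]+)")

def close_open_elements(text):
    out = []
    stack = []  # names of currently open elements
    for m in TOKEN.finditer(text):
        closing, name, _attrs, self_closing, chars = m.groups()
        if chars is not None:
            out.append(chars)
        elif closing:
            if name in stack:
                # Close open elements until the matching name is found.
                while True:
                    top = stack.pop()
                    out.append("</%s>" % top)
                    if top == name:
                        break
            # Assumption: a close tag that matches nothing is dropped.
        else:
            out.append(m.group(0))
            if not self_closing:
                stack.append(name)
    # Assumption: close whatever is still open at end of input.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)

fixed = close_open_elements("<foo>\n<bar>ABC\n<bar>123\n</foo>")
```

Note that the second bar ends up nested inside the first, not as its
sibling, because nothing short of a schema says bar cannot contain bar.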

SGML could do more, as it always had a DTD to hand to specify, for
individual elements, what the rules were.


>
> 3) An element with mixed content will be considered to contain that
> mixed content until another element of the same name is encountered:

In the absence of a schema you can't tell whether it is mixed content or not.

>
> 4) Entities would be matched to the HTML core set and converted into
> their equivalent numeric entity codes.

Perhaps.
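
For what it's worth, a rough sketch of what such a mapping might look
like in Python. I am assuming that "the HTML core set" means the HTML 4
named entities, which is what the standard library's
html.entities.name2codepoint table covers; the function name
to_numeric_refs is hypothetical. The five entities predefined by XML and
any unknown names are left untouched.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_numeric_refs(text):
    """Replace known HTML named entities with numeric character references."""
    def repl(m):
        name = m.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return m.group(0)  # leave XML's own entities and unknowns alone
        return "&#%d;" % name2codepoint[name]
    return ENTITY.sub(repl, text)

converted = to_numeric_refs("caf&eacute;&nbsp;&amp;&bogus;")
```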
>
> And so forth. As the parser works through these cases, it assigns a
> weight that indicates the likelihood that a given heuristic rule
> determines the correct configuration. After the parsing is done, these
> are used to calculate a confidence level for the XML document - the
> likelihood that the document that is reproduced in the parsing
> corresponds to the intent of the creator of this content. In the case of
> well-formed XML this confidence is 1.

But your first rule took well-formed content and changed it.


David

Received on Thursday, 23 December 2010 13:30:47 UTC