Re: David's less simple example from David Carlisle on 2012-02-28 (public-xml-er@w3.org from February 2012)

From: David Carlisle <davidc@nag.co.uk>
Date: Tue, 28 Feb 2012 20:09:31 +0000
To: Jeni Tennison <jeni@jenitennison.com>
CC: "public-xml-er@w3.org Community Group" <public-xml-er@w3.org>
Message-ID: <4F4D347B.2000906@nag.co.uk>

On 28/02/2012 18:46, Jeni Tennison wrote:

> Yes, I am arguing that the editor use case is an overwhelming
> objection.

Well, OK, no objection here; as I said, I don't really care what is
produced in these cases so long as we say what is produced.

> I would also point out that Oxygen (and probably other editors)
> employ algorithms over non-well-formed content that produce trees,
> and presumably do so in a deterministic fashion (unless they have
> somehow found a way to deliberately insert heisenbugs).

Well, it seems that we have some editor developers here, which is good.
>
> I am told that, similarly, MarkLogic (and I assume other ingesters)
> perform fixup (in their case based on the DTD/schema for the XML). I
>  know that John Cowan has similarly worked on similar algorithms in
> the past.

Yes, JC's TagSoup is quite good; my htmlparse XSLT does similar things as
well.

>
> My point is that HTML5's algorithm is not the only deterministic
> algorithm that could be used.

Well, clearly not, but it's there and has overlapping potential clients...

> Some of these other algorithms could produce "better" results
> (always subjective, yes, but if it weren't we'd have nothing to argue
> about). It may be that these algorithms are hideously complicated, I
> don't know. I think we should find out.

Agreed.
>
> And to be specific, my suggestion is that when in the Tag name state
> [2], if the next character is < then this is a Parse Error, and the
> parser emits the current token and reprocesses the current input
> character (<) in the data state.

Now you're talking.
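
A minimal sketch of that suggestion, to make the behaviour concrete. It is
a toy with only a data state and a tag name state, not the draft's
tokenizer: the function name, the token tuples and the error list are all
hypothetical, attributes are not parsed (everything up to the next > or <
is taken as the name), and treating end of input inside a tag name as the
same error is an assumption of the sketch.

```python
def tokenize(text):
    """Toy tokenizer with a data state and a tag name state only."""
    tokens = []  # ("data", s), ("start", name) or ("end", name)
    errors = []  # input offsets at which a parse error was recorded
    i, n = 0, len(text)
    while i < n:
        if text[i] != "<":
            # Data state: collect characters up to the next "<".
            j = text.find("<", i)
            if j == -1:
                j = n
            tokens.append(("data", text[i:j]))
            i = j
            continue
        # A "<" opens a tag; a following "/" makes it an end tag.
        i += 1
        kind = "start"
        if i < n and text[i] == "/":
            kind = "end"
            i += 1
        # Tag name state: consume characters until ">" or "<".
        start = i
        while i < n and text[i] not in "<>":
            i += 1
        tokens.append((kind, text[start:i].strip()))
        if i < n and text[i] == ">":
            i += 1  # the tag was closed normally
        else:
            # "<" (or, by assumption, end of input) inside the tag name:
            # record a parse error, keep the token just emitted, and do
            # not advance, so the "<" is reprocessed from the data state.
            errors.append(i)
    return tokens, errors
```

With this rule, tokenize("<a<b>text</b>") gives start tags for a and b,
the text, and an end tag for b, with one parse error recorded at the
second <.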

> Does that throw everything else in Anne's algorithm out somehow?

Anne?

>> It would be interesting to know how a more "declarative" fix up
>> would fix that example (to any result) rather than just saying the
>>  result is whatever comes out of the parsing algorithm.
>
>
> I guess you mean that as a challenge?
>

Would I? ;-) Not really. I just meant it would be interesting. I think
the "fixup" part, fixing a token stream (missing end tags and the like),
could probably be specified in a much less procedural style than the
current draft. But I can't yet imagine how you can do the hard part (the
tokenisation) without just basically specifying a grammar, which is
why I thought it useful to look at an example where it's reasonable to
disagree about where the tags begin and end, rather than an example that
just has missing end tags.
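
For the easy half, here is a rough sketch of the sort of fixup I mean,
working on an already tokenised stream. The rules are my own assumption
for illustration, not taken from the draft: an end tag closes every
element opened after its matching start tag, an end tag with no matching
start tag is dropped, and anything still open at the end is closed. The
code itself is procedural, but those three rules are all it implements.
The function name and token tuples are hypothetical.

```python
def fix_up(tokens):
    """Balance a token stream of ("start"|"end"|"data", value) tuples."""
    open_names = []  # names of elements currently open, innermost last
    out = []
    for kind, value in tokens:
        if kind == "start":
            open_names.append(value)
            out.append((kind, value))
        elif kind == "end":
            if value in open_names:
                # Close everything opened after the matching start tag,
                # then the matching element itself.
                while open_names:
                    top = open_names.pop()
                    out.append(("end", top))
                    if top == value:
                        break
            # Otherwise it is a stray end tag and is dropped.
        else:
            out.append((kind, value))
    # Close whatever is still open at end of input.
    while open_names:
        out.append(("end", open_names.pop()))
    return out
```

Given start a, start b, some text, end a, this adds the missing end b
before the end a. None of that helps with the tokenisation question, of
course.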

David

Received on Tuesday, 28 February 2012 20:09:56 UTC