
Re: David's less simple example

From: Jeni Tennison <jeni@jenitennison.com>
Date: Tue, 28 Feb 2012 18:46:32 +0000
Cc: "public-xml-er@w3.org Community Group" <public-xml-er@w3.org>
Message-Id: <14E7A14F-E145-4440-85E0-5ED2880ED699@jenitennison.com>
To: David Carlisle <davidc@nag.co.uk>

On 28 Feb 2012, at 18:09, David Carlisle wrote:
> On 28/02/2012 17:57, Jeni Tennison wrote:
>> I think it is much much more reasonable
> 
> It is more reasonable in that case, but it's a slippery slope. Once you
> get things that might be attributes and might be elements and might be
> some evil geek just making up bad examples on purpose, then it's
> virtually impossible to design a deterministic algorithm that meets
> human intuition (even if you restrict to just one human). You just have
> to prove to yourself that the algorithm _is_ deterministic and does do
> something sensible in at least the well-formed XML input case, and for
> the rest, just accept what came out.

Ah, a 'slippery slope' argument [1], I love those!

> The editor use case might be an "overwhelming objection" to quote
> myself, that says we should be more different from HTML5, but unlike
> "well-formed XML" it's a rather vague, under-specified set of documents
> for which we want to ensure a "reasonable" parse.

Yes, I am arguing that the editor use case is an overwhelming objection. I would also point out that Oxygen (and probably other editors) employ algorithms over non-well-formed content that produce trees, and presumably do so in a deterministic fashion (unless they have somehow found a way to deliberately insert heisenbugs).

I am told that MarkLogic (and I assume other ingesters) similarly performs fixup (in its case based on the DTD/schema for the XML). I know that John Cowan has worked on similar algorithms in the past.

My point is that HTML5's algorithm is not the only deterministic algorithm that could be used. Some of these other algorithms could produce "better" results (always subjective, yes, but if it weren't we'd have nothing to argue about). It may be that these algorithms are hideously complicated; I don't know. I think we should find out.

And to be specific, my suggestion is that when in the Tag name state [2], if the next character is < then this is a Parse Error, and the parser emits the current token and reprocesses the current input character (<) in the data state. Does that throw everything else in Anne's algorithm out somehow?
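To make the suggestion concrete, here is a toy sketch of what that rule would do. This is deliberately simplified and entirely my own illustration, not Anne's actual state machine: the state names, the reduced set of states, and the Token class are invented for the example, and real tag tokens would of course carry attributes and more.

```python
# Toy tokenizer sketch illustrating the proposed rule: in the tag name
# state, seeing '<' is a parse error; emit the current tag token and
# reprocess the '<' in the data state. All names here are illustrative,
# not taken from the XML-ER draft.
from dataclasses import dataclass


@dataclass
class Token:
    name: str


class Tokenizer:
    DATA, TAG_OPEN, TAG_NAME = range(3)

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.state = self.DATA
        self.current = None
        self.out = []

    def run(self):
        while self.pos < len(self.text):
            c = self.text[self.pos]
            if self.state == self.DATA:
                if c == "<":
                    self.state = self.TAG_OPEN
                self.pos += 1
            elif self.state == self.TAG_OPEN:
                # First character of the tag name.
                self.current = Token(name=c)
                self.state = self.TAG_NAME
                self.pos += 1
            elif self.state == self.TAG_NAME:
                if c == "<":
                    # Parse error: emit the current token and reprocess
                    # '<' in the data state (note: pos is NOT advanced).
                    self.out.append(self.current)
                    self.state = self.DATA
                elif c == ">":
                    self.out.append(self.current)
                    self.state = self.DATA
                    self.pos += 1
                else:
                    self.current.name += c
                    self.pos += 1
        return self.out
```

On input like `<foo<bar>`, this yields two tag tokens, `foo` and `bar`, rather than a tag named `foo<bar`, and on well-formed input like `<a><b>` it behaves exactly as before, which is the deterministic-but-more-intuitive behaviour I'm after.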

> It would be interesting to know how a more "declarative" fix up would
> fix that example (to any result) rather than just saying the result is
> whatever comes out of the parsing algorithm.
I guess you mean that as a challenge?

Jeni

[1] https://en.wikipedia.org/wiki/Slippery_slope#As_a_fallacy
[2] https://dvcs.w3.org/hg/xml-er/raw-file/d4b6debf3eed/Overview.html#tag-name-state
-- 
Jeni Tennison
http://www.jenitennison.com
Received on Tuesday, 28 February 2012 18:47:00 GMT
