Re: Online error recovery implementation from James Clark on 2012-12-05 (public-microxml@w3.org from December 2012)

From: James Clark <jjc@jclark.com>
Date: Wed, 5 Dec 2012 14:38:19 +0700
To: Chris Lahey <clahey@clahey.net>
Cc: David Carlisle <davidc@nag.co.uk>, "public-microxml (public-microxml@w3.org)" <public-microxml@w3.org>
Message-ID: <-4577173907001907834@unknownmsgid>

On Dec 5, 2012, at 11:50 AM, Chris Lahey <clahey@clahey.net> wrote:

I've run into a couple of issues with the spec.  Is this a good forum for
the discussion?


As good as any, I guess.

Specifically, when in Main Tokenization Mode, the first listed
possible parse is DATA_CHAR with a default handler, but all possible
strings match this rule, so if you apply the rules in order, the whole
document will just be parsed as a list of DataChars (this is what my
code is doing right now, but I already changed the order, so that's
just debugging that has to happen on my end.)  I think the spec should
specify the order in which matches take precedence?


The rule is to use the longest match.  When this doesn't resolve the
ambiguity, choose the token that is not DATA_CHAR.  There's a paragraph in
the spec that deals with this:

"Recognizing the next lexical token. This consists of finding the longest
initial subsequence of the input that matches one of the lexical tokens
recognized in the current tokenization mode. It is possible for there to be
two choices for the longest matching token (eg S and DATA_CHAR in
UnquoteAttributeValue mode): in this case, the choice that is not DATA_CHAR
must be recognized."
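
A minimal sketch of that rule in Python, for illustration only. The token
table below is hypothetical (its names and regular expressions are
assumptions, not the spec's grammar); the point is that the selection logic
does not depend on the order in which the tokens are listed.

```python
import re

# Hypothetical token table for a single tokenization mode. The patterns are
# illustrative assumptions, not the productions from the spec.
EXAMPLE_TOKENS = [
    ("DATA_CHAR", re.compile(r".", re.DOTALL)),          # any single character
    ("S", re.compile(r"[ \t\n]")),                       # one whitespace character
    ("NUMERIC_CHAR_REF", re.compile(r"&#x[0-9A-Fa-f]+;")),
]


def next_token(tokens, text, pos=0):
    """Return (name, lexeme) for the longest match at pos, or None.

    On a tie in length, a token other than DATA_CHAR wins.
    """
    best = None
    for name, pattern in tokens:
        m = pattern.match(text, pos)
        if m is None:
            continue
        lexeme = m.group(0)
        if best is None or len(lexeme) > len(best[1]):
            best = (name, lexeme)
        elif len(lexeme) == len(best[1]) and best[0] == "DATA_CHAR":
            # Same length: prefer the choice that is not DATA_CHAR.
            best = (name, lexeme)
    return best
```

With this shape, listing DATA_CHAR first in the table is harmless, because
precedence comes from match length and the tie-break, not from rule order.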

You also don't specify what happens when you get to end of stream when
in Main mode (or a bunch of other modes, actually).  My guess is you
stop outputting things, but I think that should be specified.


Yes, and in some cases also emit a StartTagClose abstract token.  The spec
says:

"The tokenization process starts with Main as the current tokenization
mode, and the input to the tokenization process as the current input, and
repeats the tokenization step until the current input is empty. At this
point, if the current tokenization mode is one of Tag, StartAttributeValue,
UnquoteAttributeValue, SingleQuoteAttributeValue or
DoubleQuoteAttributeValue, then a StartTagClose abstract token is emitted."
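
The driver loop this describes might be sketched as follows. The `step`
callback and its return shape are assumptions made for the example: it is
taken to perform one tokenization step, consume at least one character, and
return the new mode, the remaining input and any abstract tokens emitted.

```python
# Modes in which end of input implies a pending start-tag, per the quoted text.
START_TAG_CLOSE_MODES = {
    "Tag",
    "StartAttributeValue",
    "UnquoteAttributeValue",
    "SingleQuoteAttributeValue",
    "DoubleQuoteAttributeValue",
}


def tokenize(text, step):
    """Run tokenization steps until the input is empty.

    `step(mode, text)` is a hypothetical callback returning
    (new_mode, remaining_text, emitted_tokens).
    """
    mode = "Main"
    out = []
    while text:
        mode, text, emitted = step(mode, text)
        out.extend(emitted)
    # End of input: close a start-tag that was left open.
    if mode in START_TAG_CLOSE_MODES:
        out.append("StartTagClose")
    return out
```

In Main mode nothing extra is emitted at end of input; the loop simply stops.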

Also, the default handling rule for NUMERIC_CHAR_REF requires the
original character data if the integer is over 10FFFF, but the
associated data for a NUMERIC_CHAR_REF is the integer.


Good catch.  The spec should make the associated data be a string.
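
A rough sketch of why a string works better as the associated data. It
assumes the reference has the form `&#x…;` and that the fallback is to pass
the original characters through as data; both are assumptions for the
example, and `handle_numeric_char_ref` is a hypothetical name. Other
disallowed code points are not handled here.

```python
MAX_CODE_POINT = 0x10FFFF


def handle_numeric_char_ref(lexeme):
    """Return the characters to emit for a reference such as "&#x41;".

    `lexeme` is the original text of the token, kept as a string so that
    it is still available when the value is out of range.
    """
    value = int(lexeme[3:-1], 16)  # strip the assumed "&#x" prefix and ";" suffix
    if value > MAX_CODE_POINT:
        # Out of range: fall back to the original character data.
        return list(lexeme)
    return [chr(value)]
```

If the token carried only the integer, the fallback branch would have no way
to recover the characters as they were written.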

I got my Tokenization code to compile, which is a pretty good step.
There's still a fair amount of work to do, but I'm pretty happy with
the spec so far.


Great. There's a test suite (in tests.json) that you can use, though it
still has some way to go.

James


Thanks much,
     Chris


On Mon, Nov 26, 2012 at 9:53 AM, James Clark <jjc@jclark.com> wrote:

Yes, you're right, thanks.  Fixed now.


James



On Mon, Nov 26, 2012 at 9:48 PM, David Carlisle <davidc@nag.co.uk> wrote:


On 26/11/2012 13:47, James Clark wrote:


The write-up is here:



newlines are normalized by replacing any #xA character or #xD/#xA
character sequence, by a #xA character.


I think that first xA should be xD
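
With that correction applied, the rule replaces any #xD character or #xD/#xA
sequence by a single #xA. A minimal sketch (the function name is made up for
the example):

```python
def normalize_newlines(text):
    """Replace each CR LF pair and each lone CR with a single LF."""
    # Handle the two-character sequence first so it yields one LF, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```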


David

Received on Wednesday, 5 December 2012 07:39:01 UTC