Re: Online error recovery implementation from Chris Lahey on 2012-12-05 (public-microxml@w3.org from December 2012)

From: Chris Lahey <clahey@clahey.net>
Date: Tue, 4 Dec 2012 23:50:19 -0500
To: James Clark <jjc@jclark.com>
Cc: David Carlisle <davidc@nag.co.uk>, "public-microxml (public-microxml@w3.org)" <public-microxml@w3.org>
Message-ID: <CACy+m56yacA_ED0O2PUoAtdP99jun3uYGt=mEMXGu+D46kq+RQ@mail.gmail.com>

I've run into a couple of issues with the spec.  Is this a good forum for
the discussion?

Specifically, in Main Tokenization Mode, the first listed possible
parse is DATA_CHAR with a default handler, but all possible strings
match this rule.  If you apply the rules in order, the whole document
is parsed as a list of DataChars.  (This is what my code is doing
right now, even though I already changed the order, so that part is
just debugging I need to do on my end.)  I think the spec should
specify the order in which matches take precedence.
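
To illustrate the ordering problem, here is a minimal Python sketch.
It is not the spec's grammar: the two regular expressions are
simplified assumptions, and tokenize() is a hypothetical helper that
always takes the first rule that matches at the current position.

```python
import re

# Simplified, assumed patterns; the spec's real token grammar is richer.
DATA_CHAR = ("DATA_CHAR", re.compile(r".", re.DOTALL))  # matches any one character
NUMERIC_CHAR_REF = ("NUMERIC_CHAR_REF", re.compile(r"&#x[0-9A-Fa-f]+;"))

CATCH_ALL_FIRST = [DATA_CHAR, NUMERIC_CHAR_REF]
CATCH_ALL_LAST = [NUMERIC_CHAR_REF, DATA_CHAR]


def tokenize(text, rules):
    """Return (name, lexeme) pairs, taking the first rule that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in rules:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
    return tokens


# With the catch-all rule first, every character becomes a DATA_CHAR.
all_data = tokenize("a&#x41;", CATCH_ALL_FIRST)

# With the catch-all rule last, the reference is recognised as one token.
mixed = tokenize("a&#x41;", CATCH_ALL_LAST)
```

Here all_data holds seven DATA_CHAR tokens, while mixed holds one
DATA_CHAR followed by one NUMERIC_CHAR_REF.  Trying the catch-all last
is only one possible precedence rule; longest match would be another.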

You also don't specify what happens when you reach the end of the
stream in Main mode (or in a bunch of other modes, actually).  My
guess is that you just stop producing output, but I think that should
be specified.

Also, the default handling rule for NUMERIC_CHAR_REF requires the
original character data if the integer is over #x10FFFF, but the
associated data for a NUMERIC_CHAR_REF is only the integer, so the
original text is not available at that point.
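
A minimal Python sketch of one way around this, assuming the token
is allowed to carry the source text alongside the integer.  The
NumericCharRef class, its raw field, and both helper functions are
hypothetical and not taken from the spec; the fallback to the raw text
is only a reading of the rule described above.

```python
from dataclasses import dataclass

MAX_CODE_POINT = 0x10FFFF


@dataclass
class NumericCharRef:
    value: int  # the integer associated with the token
    raw: str    # assumed extra field: the reference exactly as written


def parse_ref(raw):
    """Build a token from text of the assumed form '&#x...;'."""
    return NumericCharRef(value=int(raw[3:-1], 16), raw=raw)


def default_handle(token):
    """Return the character, or the original text if the value is too large."""
    if token.value > MAX_CODE_POINT:
        return token.raw
    return chr(token.value)


in_range = default_handle(parse_ref("&#x41;"))
too_large = default_handle(parse_ref("&#x110000;"))
```

With only the integer available, the second case could not reproduce
the text "&#x110000;" exactly, because leading zeros and the case of
the hex digits would already be lost.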

I got my Tokenization code to compile, which is a pretty good step.
There's still a fair amount of work to do, but I'm pretty happy with
the spec so far.

Thanks much,
      Chris

On Mon, Nov 26, 2012 at 9:53 AM, James Clark <jjc@jclark.com> wrote:
> Yes, you're right, thanks.  Fixed now.
>
> James
>
>
> On Mon, Nov 26, 2012 at 9:48 PM, David Carlisle <davidc@nag.co.uk> wrote:
>>
>> On 26/11/2012 13:47, James Clark wrote:
>>>
>>> The write-up is here:
>>
>>
>> newlines are normalized by replacing any #xA character or #xD/#xA
>> character sequence, by a #xA character.
>>
>> I think that first xA should be xD
>>
>> David
>>
>>
>

Received on Wednesday, 5 December 2012 04:51:16 UTC