- From: James Clark <jjc@jclark.com>
- Date: Wed, 5 Dec 2012 14:38:19 +0700
- To: Chris Lahey <clahey@clahey.net>
- Cc: David Carlisle <davidc@nag.co.uk>, "public-microxml (public-microxml@w3.org)" <public-microxml@w3.org>
- Message-ID: <-4577173907001907834@unknownmsgid>
On Dec 5, 2012, at 11:50 AM, Chris Lahey <clahey@clahey.net> wrote:

> I've run into a couple issues with the spec. Is this a good forum for
> the discussion?

As good as any, I guess.

> Specifically, when in Main Tokenization Mode, the first listed possible
> parse is DATA_CHAR with a default handler, but all possible strings
> match this rule, so if you apply the rules in order, the whole document
> will just be parsed as a list of DataChars (this is what my code is
> doing right now, but I already changed the order, so that's just
> debugging that has to happen on my end.) I think the spec should
> specify the order in which matches take precedence?

The rule is to use the longest match. When this doesn't resolve the ambiguity, choose the token that is not DATA_CHAR. There's a paragraph in the spec that deals with this:

"Recognizing the next lexical token. This consists of finding the longest initial subsequence of the input that matches one of the lexical tokens recognized in the current tokenization mode. It is possible for there to be two choices for the longest matching token (eg S and DATA_CHAR in UnquoteAttributeValue mode): in this case, the choice that is not DATA_CHAR must be recognized."

> You also don't specify what happens when you get to end of stream when
> in Main mode (or a bunch of other modes, actually). My guess is you
> stop outputting things, but I think that should be specified.

Yes and also in some cases emit a StartTagClose abstract token. The spec says:

"The tokenization process starts with Main as the current tokenization mode, and the input to the tokenization process as the current input, and repeats the tokenization step until the current input is empty. At this point, if the current tokenization mode is one of Tag, StartAttributeValue, UnquoteAttributeValue, SingleQuoteAttributeValue or DoubleQuoteAttributeValue, then a StartTagClose abstract token is emitted."
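
As a rough illustration of the two rules quoted above (longest match with the DATA_CHAR tie-break, and the StartTagClose emitted at end of input), here is a minimal Python sketch. It is not taken from the spec or from any implementation: the token table TOY_MODE, its three patterns, and the function names are all assumptions made up for the example. Only the mode names in TAG_LIKE_MODES come from the spec text quoted above.

```python
import re

# Hypothetical, simplified token table for a single mode. The real
# MicroXML modes define more tokens, with their own patterns.
# DATA_CHAR is deliberately listed first to show that list order
# does not decide the result.
TOY_MODE = [
    ("DATA_CHAR", re.compile(r".", re.DOTALL)),
    ("S", re.compile(r"[ \t\n]")),
    ("CHAR_REF", re.compile(r"&#x[0-9A-Fa-f]+;")),
]

# Mode names as listed in the quoted spec paragraph.
TAG_LIKE_MODES = {
    "Tag",
    "StartAttributeValue",
    "UnquoteAttributeValue",
    "SingleQuoteAttributeValue",
    "DoubleQuoteAttributeValue",
}


def next_token(text, pos, mode):
    """Return (name, length) of the longest match at pos, or None."""
    best = None
    for name, pattern in mode:
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            continue  # no match, or an empty match
        length = m.end() - pos
        if best is None or length > best[1]:
            best = (name, length)
        elif length == best[1] and best[0] == "DATA_CHAR":
            # Tie on length: the token that is not DATA_CHAR wins.
            best = (name, length)
    return best


def tokenize(text, mode):
    """Repeat the tokenization step until the input is empty."""
    pos, out = 0, []
    while pos < len(text):
        found = next_token(text, pos, mode)
        if found is None:
            raise ValueError("no token matches at offset %d" % pos)
        name, length = found
        out.append((name, text[pos:pos + length]))
        pos += length
    return out


def tokens_at_end_of_input(current_mode):
    """Abstract tokens to emit once the input is empty."""
    return ["StartTagClose"] if current_mode in TAG_LIKE_MODES else []
```

With this table, a single space matches both DATA_CHAR and S at length 1 and is reported as S, while "&#x41;" is reported as one CHAR_REF because that match is longer than the one-character DATA_CHAR match. The sketch keeps the mode fixed for the whole input, whereas a real tokenizer switches modes as it goes.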
> Also, the default handling rule for NUMERIC_CHAR_REF requires the
> original character data if the integer is over 10FFFF, but the
> associated data for a NUMERIC_CHAR_REF is the integer.

Good catch. The spec should make the associated data be a string.

> I got my Tokenization code to compile, which is a pretty good step.
> There's still a fair amount of work to do, but I'm pretty happy with
> the spec so far.

Great. There's a test suite (in tests.json) that you can use, though it still has some way to go.

James

> Thanks much,
> Chris
>
> On Mon, Nov 26, 2012 at 9:53 AM, James Clark <jjc@jclark.com> wrote:
>
>> Yes, you're right, thanks. Fixed now.
>>
>> James
>>
>> On Mon, Nov 26, 2012 at 9:48 PM, David Carlisle <davidc@nag.co.uk> wrote:
>>
>>> On 26/11/2012 13:47, James Clark wrote:
>>>
>>>> The write-up is here:
>>>>
>>>> newlines are normalized by replacing any #xA character or #xD/#xA
>>>> character sequence, by a #xA character.
>>>
>>> I think that first xA should be xD
>>>
>>> David
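
For reference, the newline rule as corrected in the quoted exchange (any #xD character, or any #xD #xA sequence, becomes a single #xA) can be sketched in Python as below. The function name normalize_newlines is an assumption for illustration, not something defined by the spec.

```python
def normalize_newlines(text):
    """Replace each #xD #xA pair, and each remaining lone #xD, with #xA."""
    # Handle the two-character sequence first so that a CR LF pair
    # yields one #xA rather than two.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Existing #xA characters are left untouched, so the function is idempotent.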
Received on Wednesday, 5 December 2012 07:39:01 UTC