Re: [whatwg] Bogus comment state and CDATA section state do not stylistically fit in the tokenizer from Adam Barth on 2014-06-09 (public-whatwg-archive@w3.org from June 2014)

From: Adam Barth <w3c@adambarth.com>
Date: Sun, 8 Jun 2014 22:11:29 -0700
To: Geoffrey Sneddon <foolistbar@googlemail.com>
Cc: WHATWG <whatwg@whatwg.org>
Message-ID: <CAJE5ia_ccHKR74kckm_hKBTzvbWKgFAEcTq6EVJ=1NqncE5fkQ@mail.gmail.com>

In Blink's implementation, we actually use two additional tokenizer
states for CDATA:

CDATASectionRightSquareBracketState,
CDATASectionDoubleRightSquareBracketState,
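
For illustration, here is a minimal sketch of how a three-state scan for the end of a CDATA section might look. It is not Blink's actual code: only the two bracket state names come from the message above, while `CDATA_SECTION`, `State`, and `scan_cdata` are hypothetical names, and the handling of pending brackets at EOF is an assumption.

```python
from enum import Enum, auto


class State(Enum):
    CDATA_SECTION = auto()  # hypothetical name for the base state
    CDATA_SECTION_RIGHT_SQUARE_BRACKET = auto()
    CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET = auto()


def scan_cdata(text, pos=0):
    """Return (characters emitted, position after the section ends)."""
    state = State.CDATA_SECTION
    out = []
    while pos < len(text):
        c = text[pos]
        if state is State.CDATA_SECTION:
            if c == "]":
                state = State.CDATA_SECTION_RIGHT_SQUARE_BRACKET
            else:
                out.append(c)
            pos += 1
        elif state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
            if c == "]":
                state = State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
                pos += 1
            else:
                # The single "]" was ordinary data; reconsume c in the base state.
                out.append("]")
                state = State.CDATA_SECTION
        else:  # CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
            if c == ">":
                return "".join(out), pos + 1
            if c == "]":
                # "]]]": the oldest bracket is data, two are still pending.
                out.append("]")
                pos += 1
            else:
                # "]]" was ordinary data; reconsume c in the base state.
                out.append("]]")
                state = State.CDATA_SECTION
    # EOF: flush any brackets that were held back (assumed behaviour).
    if state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
        out.append("]")
    elif state is State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET:
        out.append("]]")
    return "".join(out), pos
```

The point of the extra states is that no lookahead is needed: every step inspects exactly one character, which suits a tokenizer that may run out of buffered input at any point.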

Adam


On Sun, Jun 8, 2014 at 6:24 PM, Geoffrey Sneddon
<foolistbar@googlemail.com> wrote:
> It would aid programmatic conversion of the spec, and confuse me less
> when reading it (thereby avoiding bugs like 25871), if these states
> matched the model of the rest of the tokenizer.
>
> Thus I propose the bogus comment state becomes:
>
>> Consume the next input character:
>>
>> U+003E GREATER-THAN SIGN (>):
>>
>> Switch to the data state. Emit the comment token.
>>
>> U+0000 NULL:
>>
>> Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
>>
>> EOF:
>>
>> Switch to the data state. Emit the comment token. Reconsume the EOF character.
>>
>> Anything else:
>>
>> Append the current input character to the comment token's data.
>
> This also necessitates creating a new comment token prior to entering
> the bogus comment state.
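
A rough sketch of the proposed per-character bogus comment state, in Python. `CommentToken` and `bogus_comment_state` are hypothetical names, and returning the token stands in for "emit the comment token". As noted above, the caller is assumed to create the comment token before entering the state.

```python
from dataclasses import dataclass


@dataclass
class CommentToken:
    data: str = ""


def bogus_comment_state(text, pos, token):
    """Run the state from pos; return (emitted token, next position)."""
    while True:
        if pos >= len(text):
            # EOF: emit the token. pos is left at EOF so that the data
            # state reconsumes it.
            return token, pos
        c = text[pos]
        pos += 1
        if c == ">":
            # Switch to the data state and emit the comment token.
            return token, pos
        if c == "\x00":
            # U+0000 NULL becomes U+FFFD REPLACEMENT CHARACTER.
            token.data += "\ufffd"
        else:
            token.data += c
```

For example, `bogus_comment_state("?php x>tail", 0, CommentToken())` stops just after the `>` with `?php x` as the comment data.
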
>
> The CDATA section state should become:
>
>> Consume the next input character:
>>
>> U+005D RIGHT SQUARE BRACKET (]):
>>
>> If the three characters starting from the current input character are U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN (]]>), then consume those characters and switch to the data state. Otherwise, emit the current input character as a character token.
>>
>> EOF:
>>
>> Switch to the data state. Reconsume the EOF character.
>>
>> Anything else:
>>
>> Emit the current input character as a character token.
>
> No changes are needed elsewhere for this. (There is no consistent style
> for lookahead — and most cases are ASCII case-insensitive words — so I
> went with what seems sane here!)
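
The lookahead form of the CDATA section state proposed above could be sketched like this. `cdata_section_state` is a hypothetical name, the emitted character tokens are collected into one string for brevity, and the "anything else" branch is assumed to emit a character token, since no comment token exists in this state.

```python
def cdata_section_state(text, pos=0):
    """Return (characters emitted, position where the data state resumes)."""
    out = []
    while pos < len(text):
        c = text[pos]
        # Lookahead: three characters starting from the current input character.
        if c == "]" and text.startswith("]]>", pos):
            return "".join(out), pos + 3  # consume "]]>", switch to data state
        out.append(c)  # a lone "]" or anything else is emitted as a character
        pos += 1
    # EOF: switch to the data state, which reconsumes EOF.
    return "".join(out), pos
```

Unlike a design with dedicated bracket states, this version needs up to two characters of lookahead at every `]`.
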
>
> /Geoffrey

Received on Monday, 9 June 2014 05:12:27 UTC