
Re: [whatwg] Bogus comment state and CDATA section state do not stylistically fit in the tokenizer

From: Adam Barth <w3c@adambarth.com>
Date: Sun, 8 Jun 2014 22:11:29 -0700
Message-ID: <CAJE5ia_ccHKR74kckm_hKBTzvbWKgFAEcTq6EVJ=1NqncE5fkQ@mail.gmail.com>
To: Geoffrey Sneddon <foolistbar@googlemail.com>
Cc: WHATWG <whatwg@whatwg.org>
In Blink's implementation, we actually use two additional tokenizer
states for CDATA:

CDATASectionRightSquareBracketState,
CDATASectionDoubleRightSquareBracketState,
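
To illustrate how two extra states can replace lookahead for "]]>", here is a minimal Python sketch. It is not Blink's code: the State enum, the tokenize_cdata helper, and the handling of a third "]" (treated as one literal "]" with two brackets still pending) are assumptions made for this example.

```python
from enum import Enum, auto


class State(Enum):
    # Names mirror the states listed above; DATA stands in for the data state.
    CDATA_SECTION = auto()
    CDATA_SECTION_RIGHT_SQUARE_BRACKET = auto()
    CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET = auto()
    DATA = auto()


def tokenize_cdata(text):
    """Consume the text that follows '<![CDATA['.

    Returns (characters emitted, unconsumed remainder). Hypothetical helper.
    """
    state = State.CDATA_SECTION
    out = []
    i = 0
    while i < len(text) and state is not State.DATA:
        ch = text[i]
        if state is State.CDATA_SECTION:
            if ch == "]":
                state = State.CDATA_SECTION_RIGHT_SQUARE_BRACKET
            else:
                out.append(ch)
            i += 1
        elif state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
            if ch == "]":
                state = State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
                i += 1
            else:
                # The pending ']' was ordinary data; reconsume ch.
                out.append("]")
                state = State.CDATA_SECTION
        else:  # CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
            if ch == ">":
                state = State.DATA
                i += 1
            elif ch == "]":
                # Assumption: one ']' becomes data, two remain pending.
                out.append("]")
                i += 1
            else:
                # Both pending brackets were ordinary data; reconsume ch.
                out.append("]]")
                state = State.CDATA_SECTION
    # At EOF, flush any brackets that were still pending.
    if state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
        out.append("]")
    elif state is State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET:
        out.append("]]")
    return "".join(out), text[i:]
```

For example, tokenize_cdata("a]b]]c]]>rest") returns ("a]b]]c", "rest"). No character is ever inspected before it is consumed, which is the point of spending two states on the brackets.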

Adam


On Sun, Jun 8, 2014 at 6:24 PM, Geoffrey Sneddon
<foolistbar@googlemail.com> wrote:
> It would aid programmatic conversion of the spec, and make the spec less
> confusing to read (thereby avoiding bugs like 25871), if these states
> matched the model of the rest of the tokenizer.
>
> Thus I propose the bogus comment state becomes:
>
>> Consume the next input character:
>>
>> U+003E GREATER-THAN SIGN (>):
>>
>> Switch to the data state. Emit the comment token.
>>
>> U+0000 NULL:
>>
>> Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
>>
>> EOF:
>>
>> Switch to the data state. Emit the comment token. Reconsume the EOF character.
>>
>> Anything else:
>>
>> Append the current input character to the comment token's data.
>
> This also necessitates creating a new comment token prior to entering
> the bogus comment state.
>
> The CDATA section state should become:
>
>> Consume the next input character:
>>
>> U+005D RIGHT SQUARE BRACKET (]):
>>
>> If the three characters starting from the current input character are U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN (]]>), then consume those characters and switch to the data state. Otherwise, emit the current input character as a character token.
>>
>> EOF:
>>
>> Switch to the data state. Reconsume the EOF character.
>>
>> Anything else:
>>
>> Emit the current input character as a character token.
>
> No changes are needed elsewhere for this. (There is no consistent style
> for lookahead — and most cases are ASCII case-insensitive words — so I
> went with what seems sane here!)
>
> /Geoffrey
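
The two proposed states quoted above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: run_bogus_comment and run_cdata_section are hypothetical helpers, EOF is modelled as running off the end of the string, each helper runs its state until the tokenizer would switch back to the data state, and the CDATA "anything else" case is taken to emit a character token, consistent with the "]" branch.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def run_bogus_comment(text, pos, data=""):
    """Run the proposed bogus comment state starting at text[pos].

    `data` is the data of the comment token created before entering the
    state. Returns (comment data, position where the data state resumes).
    """
    comment = list(data)
    while True:
        if pos >= len(text):
            # EOF: emit the comment; the EOF is reconsumed in the data state.
            return "".join(comment), pos
        ch = text[pos]
        if ch == ">":
            # Switch to the data state and emit the comment token.
            return "".join(comment), pos + 1
        # NULL becomes U+FFFD; anything else is appended unchanged.
        comment.append(REPLACEMENT if ch == "\0" else ch)
        pos += 1


def run_cdata_section(text, pos):
    """Run the proposed CDATA section state starting at text[pos].

    Returns (characters emitted, position where the data state resumes).
    """
    emitted = []
    while True:
        if pos >= len(text):
            # EOF: switch to the data state, which reconsumes the EOF.
            return "".join(emitted), pos
        if text.startswith("]]>", pos):
            # Lookahead matched: consume all three characters.
            return "".join(emitted), pos + 3
        # A lone ']' and any other character are emitted as character tokens.
        emitted.append(text[pos])
        pos += 1
```

For instance, run_bogus_comment("foo\0bar>x", 0) returns ("foo\ufffdbar", 8), and run_cdata_section("a]b]]>c", 0) returns ("a]b", 6). The only lookahead left is the single startswith check in the CDATA helper.
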
Received on Monday, 9 June 2014 05:12:27 UTC
