[whatwg] Bogus comment state and CDATA section state do not stylistically fit in the tokenizer from Geoffrey Sneddon on 2014-06-09 (public-whatwg-archive@w3.org from June 2014)

From: Geoffrey Sneddon <foolistbar@googlemail.com>
Date: Mon, 09 Jun 2014 02:24:22 +0100
To: WHATWG <whatwg@whatwg.org>
Message-ID: <53950CC6.9070004@googlemail.com>

It would aid programmatic conversion of the spec, and make the spec less
confusing for me to read (thereby avoiding bugs like 25871), if these
states matched the model of the rest of the tokenizer.

Thus I propose the bogus comment state becomes:

> Consume the next input character:
> 
> U+003E GREATER-THAN SIGN (>):
> 
> Switch to the data state. Emit the comment token.
> 
> U+0000 NULL:
> 
> Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
> 
> EOF:
> 
> Switch to the data state. Emit the comment token. Reconsume the EOF character.
> 
> Anything else:
> 
> Append the current input character to the comment token's data.

This also necessitates creating a new comment token prior to entering
the bogus comment state.
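
As a rough illustration only, the proposed bogus comment state could be
sketched in Python as below. The function name, the use of end-of-string
to stand in for EOF, and the (data, position) return value are all
assumptions made for this sketch and are not part of the spec. The
`initial_data` argument models the comment token that must already exist
on entry.

```python
def bogus_comment_state(chars, pos, initial_data=""):
    """Hypothetical sketch of the proposed bogus comment state.

    chars: the input stream as a str; pos: index of the next input character.
    initial_data: data of the comment token created before entering the state.
    Returns (comment_data, new_pos); the caller is then in the data state.
    """
    data = [initial_data]
    while True:
        if pos >= len(chars):
            # EOF: switch to the data state, emit the comment token, and
            # reconsume EOF (modeled here by not advancing pos).
            return "".join(data), pos
        c = chars[pos]
        pos += 1  # consume the next input character
        if c == ">":
            # U+003E: switch to the data state and emit the comment token.
            return "".join(data), pos
        if c == "\x00":
            # U+0000: append U+FFFD REPLACEMENT CHARACTER instead.
            data.append("\ufffd")
        else:
            # Anything else: append the current input character.
            data.append(c)
```

Each branch handles exactly one consumed character, which is the
per-character shape used by the rest of the tokenizer.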

The CDATA section state should become:

> Consume the next input character:
> 
> U+005D RIGHT SQUARE BRACKET (]):
> 
> If the three characters starting from the current input character are U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN (]]>), then consume those characters and switch to the data state. Otherwise, emit the current input character as a character token.
> 
> EOF:
> 
> Switch to the data state. Reconsume the EOF character.
> 
> Anything else:
> 
> Emit the current input character as a character token.

No changes are needed elsewhere for this. (There is no consistent style
for lookahead, and most cases are ASCII case-insensitive words, so I
went with what seems sane here!)
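
A minimal Python sketch of the CDATA section state, under the same
caveats: the function name, the end-of-string stand-in for EOF, and the
returned list of character tokens are assumptions for illustration, not
spec text. It treats every character that does not begin "]]>" as a
character token to emit.

```python
def cdata_section_state(chars, pos):
    """Hypothetical sketch of the proposed CDATA section state.

    chars: the input stream as a str; pos: index of the next input character.
    Returns (character_tokens, new_pos); the caller is then in the data state.
    """
    tokens = []
    while True:
        if pos >= len(chars):
            # EOF: switch to the data state and reconsume EOF
            # (modeled here by not advancing pos).
            return tokens, pos
        c = chars[pos]
        if c == "]" and chars.startswith("]]>", pos):
            # Lookahead matched "]]>": consume it and switch to the data state.
            return tokens, pos + 3
        # A lone "]" or anything else: emit it as a character token.
        tokens.append(c)
        pos += 1
```

The lookahead runs only when the current character is "]", so a run such
as "]]]>" emits one "]" before the terminator is recognized.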

/Geoffrey

Received on Monday, 9 June 2014 01:24:54 UTC