Re: [css2.1] tokenizer syntax - handling escaped null in badstring

On Wed, Oct 10, 2012 at 3:37 PM, Simon Sapin <simon.sapin@kozea.fr> wrote:

> Le 10/10/2012 01:58, Glenn Adams a écrit :
>
>> it would seem a bit easier to not have to admit < \\, NUL > for
>> implementation reasons; there is really no loss of functionality if this
>> is not supported, since if the author really wants a NUL, they can just
>> use < \\, 0, SPACE > or perhaps < \\, 0 > if the context permits.
>>
>
> What exactly would it mean to "not admit" a sequence of codepoints? Abort
> the tokenizer completely and throw away the rest of the stylesheet, have
> the usual error-recovery, or maybe something else?


At present, the CSS2.1 tokenizer grammar specifies

escape {unicode}|\\[^\n\r\f0-9a-f]

which accepts the following inputs as legal escapes, each of which contains
a Unicode C0 control character

U+005C U+0000
U+005C U+0001
...
U+005C U+0008
U+005C U+000B
U+005C U+000E
...
U+005C U+001F
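
A minimal Python sketch of this check, assuming a direct translation of the
CSS2.1 {unicode} and {escape} macros into Python's re syntax. The names
UNICODE, ESCAPE and accepted are mine, not the spec's. It lists the C0 code
points that form a complete escape after U+005C; under this translation
U+0009 is accepted as well, since only \n, \r, \f and hex digits are excluded.

```python
import re

# Assumed translation of the CSS2.1 lex macros:
#   unicode  \\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?
#   escape   {unicode}|\\[^\n\r\f0-9a-f]
UNICODE = r'\\[0-9a-f]{1,6}(?:\r\n|[ \t\r\n\f])?'
ESCAPE = re.compile(UNICODE + r'|\\[^\n\r\f0-9a-f]', re.IGNORECASE)

# C0 code points that form a complete escape when preceded by U+005C.
accepted = [cp for cp in range(0x20) if ESCAPE.fullmatch('\\' + chr(cp))]

print(' '.join(f'U+{cp:04X}' for cp in accepted))
```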

I'm suggesting that the sequence

U+005C U+0000

should *not* be accepted as an escape, which would mean that if it were
encountered, it would be handled just like other syntax errors in CSS2.1,
e.g., the longest matching rule would exclude such an escape when
attempting to read a non-terminal that contains such an escape

to take an example, let's say we are trying to match badstring1 as follows

badstring1 \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?

and our input string is

< U+0022 (QUOTATION MARK), U+005C (REVERSE SOLIDUS), U+0000 (NULL) >

we would match only the following (if we don't accept escaped NULL)

< U+0022 (QUOTATION MARK), U+005C (REVERSE SOLIDUS) >

which would then leave U+0000 as the next unconsumed input character
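
as a rough illustration, here is a Python sketch of that match, assuming the
CSS2.1 macros translate directly into Python's re syntax. The names
ESCAPE_CURRENT, ESCAPE_NO_NUL and the badstring1() helper are hypothetical.
Python's re is not a longest-match engine like lex, but for this input the
result is the same.

```python
import re

# Assumed translations of the CSS2.1 lex macros {nl} and {unicode}.
NL = r'(?:\n|\r\n|\r|\f)'
UNICODE = r'\\[0-9a-f]{1,6}(?:\r\n|[ \t\r\n\f])?'

# {escape} as currently specified, and a hypothetical variant that
# additionally excludes U+0000 after the backslash.
ESCAPE_CURRENT = rf'(?:{UNICODE}|\\[^\n\r\f0-9a-f])'
ESCAPE_NO_NUL = rf'(?:{UNICODE}|\\[^\n\r\f0-9a-f\x00])'

def badstring1(escape):
    # badstring1  \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
    return re.compile(rf'"(?:[^\n\r\f\\"]|\\{NL}|{escape})*\\?', re.IGNORECASE)

def show(s):
    return [f'U+{ord(c):04X}' for c in s]

text = '"\\\x00'  # < U+0022, U+005C, U+0000 >

current = badstring1(ESCAPE_CURRENT).match(text).group()
proposed = badstring1(ESCAPE_NO_NUL).match(text).group()

print('current :', show(current))
print('proposed:', show(proposed))
print('leftover:', show(text[len(proposed):]))
```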

but we also have a related problem, which is whether to accept U+0000 as an
unescaped input character

let's say our input were instead

< U+0022 (QUOTATION MARK), U+0000 (NULL), U+005C (REVERSE SOLIDUS), U+0000
(NULL) >

we will now match badstring1 as

< U+0022 (QUOTATION MARK), U+0000 (NULL), U+005C (REVERSE SOLIDUS) >

this anomaly (of accepting unescaped NULL but not accepting escaped NULL)
is due to the expression

[^\n\r\f\\"]

which matches all C0 code points except for U+000A (\n), U+000C (\f),
U+000D (\r), and thus matches an unescaped U+0000.

to summarize, the current syntax (for badstring1) matches (consumes) both
the escaped form U+005C U+0000 and the unescaped U+0000;

so if we were to remove the escaped form, we would probably want to remove
the unescaped form
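
a hedged Python sketch of the difference, again assuming a direct translation
of the CSS2.1 macros into Python's re syntax. The badstring1() helper and the
two variant names are hypothetical. Both variants drop the escaped NUL; only
the second also drops U+0000 from the unescaped character class.

```python
import re

# Assumed translations of the CSS2.1 lex macros {nl} and {unicode}.
NL = r'(?:\n|\r\n|\r|\f)'
UNICODE = r'\\[0-9a-f]{1,6}(?:\r\n|[ \t\r\n\f])?'
# Hypothetical {escape} that no longer accepts < U+005C, U+0000 >.
ESCAPE_NO_NUL = rf'(?:{UNICODE}|\\[^\n\r\f0-9a-f\x00])'

def badstring1(plain_char):
    # badstring1 with the unescaped-character class passed in as a parameter.
    return re.compile(rf'"(?:{plain_char}|\\{NL}|{ESCAPE_NO_NUL})*\\?',
                      re.IGNORECASE)

# Only the escaped form removed: the class [^\n\r\f\\"] still matches U+0000.
ESCAPED_REMOVED_ONLY = badstring1(r'[^\n\r\f\\"]')
# Both forms removed: U+0000 is also excluded from the class.
BOTH_REMOVED = badstring1(r'[^\n\r\f\\"\x00]')

text = '"\x00\\\x00'  # < U+0022, U+0000, U+005C, U+0000 >

anomaly = ESCAPED_REMOVED_ONLY.match(text).group()
strict = BOTH_REMOVED.match(text).group()

print([f'U+{ord(c):04X}' for c in anomaly])
print([f'U+{ord(c):04X}' for c in strict])
```

with only the escaped form removed, the match still consumes the unescaped
NUL and the following backslash; with both removed, it stops right after the
quotation mark.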

i can't personally think of any reason to admit either of these in a CSS
input stream when, if the author really wishes to include a U+0000, they can
do so simply by using the unicode escape form, i.e., \0
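
for illustration, a small Python sketch of reading that form, assuming the
{unicode} macro translates directly into Python's re syntax and that the hex
digits are simply read as a code point number. The helper name
decode_unicode_escape is hypothetical, and what a user agent then does with
the resulting U+0000 is outside this sketch.

```python
import re

# Assumed translation of: unicode  \\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?
UNICODE = re.compile(r'\\([0-9a-f]{1,6})(?:\r\n|[ \t\r\n\f])?', re.IGNORECASE)

def decode_unicode_escape(text):
    """Return (code point, characters consumed) for a {unicode} escape
    at the start of text, or None if there is no such escape."""
    m = UNICODE.match(text)
    if m is None:
        return None
    return int(m.group(1), 16), m.end()

print(decode_unicode_escape('\\0 '))   # < \, 0, SPACE >
print(decode_unicode_escape('\\0;'))   # < \, 0 > followed by a non-hex character
print(decode_unicode_escape('\\\x00'))  # < \, NUL > is not a {unicode} escape
```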

Received on Wednesday, 10 October 2012 12:36:40 UTC