Re: [css3-syntax] Null bytes and U+0000 from Boris Zbarsky on 2012-10-23 (www-style@w3.org from October 2012)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Tue, 23 Oct 2012 18:23:38 -0400
To: "Tab Atkins Jr." <jackalmage@gmail.com>
CC: www-style@w3.org
Message-ID: <508718EA.1080905@mit.edu>

On 10/23/12 5:57 PM, Tab Atkins Jr. wrote:
>> In any case once Gecko reaches end-of-escape it looks at the resulting hex
>> value.  If that value is 0, it outputs as many '0' as there were hex digit chars.
>
> Ah, that violates the "only emit one token per call" invariant that I
> was told was important.

I don't see why.  The "output" there is not tokens.  It's a string. 
Specifically, the string the escape expands to.

So as a concrete example, say you have something like this:

   abc\0000 def

Gecko would tokenize this as a single token: the identifier "abc0000def".

Just like given:

   abc\61 def

Gecko would tokenize as the single identifier token "abcadef".
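To make the two examples concrete, here is a minimal sketch of the escape expansion as described above. It is not Gecko's code: the names unescapeIdent, appendUtf8 and hexValue are made up for illustration, only hex escapes are handled, and out-of-range values are mapped to U+FFFD purely as an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>

// Append code point cp to out as UTF-8. Values above U+10FFFF are replaced
// with U+FFFD; that choice is an assumption of this sketch.
static void appendUtf8(std::string& out, unsigned long cp) {
    if (cp > 0x10FFFF) cp = 0xFFFD;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Value of a hex digit, or -1 if c is not one.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Expand hex escapes in an identifier. A zero value expands to one '0'
// per hex digit seen, which is the behavior described in this message.
std::string unescapeIdent(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool hexEscape = in[i] == '\\' && i + 1 < in.size() &&
                         hexValue(in[i + 1]) >= 0;
        if (!hexEscape) {
            out += in[i++];  // ordinary character; other escapes not handled
            continue;
        }
        ++i;  // skip the backslash
        unsigned long value = 0;
        std::size_t digits = 0;
        while (i < in.size() && digits < 6 && hexValue(in[i]) >= 0) {
            value = value * 16 + static_cast<unsigned long>(hexValue(in[i]));
            ++i;
            ++digits;
        }
        // A single whitespace character after the digits ends the escape.
        if (i < in.size() && (in[i] == ' ' || in[i] == '\t' || in[i] == '\n')) {
            ++i;
        }
        if (value == 0) {
            out.append(digits, '0');
        } else {
            appendUtf8(out, value);
        }
    }
    return out;
}
```

With this sketch, unescapeIdent("abc\\0000 def") gives "abc0000def" and unescapeIdent("abc\\61 def") gives "abcadef". Since an escape reads up to six hex digits, the space after the zeros matters: without it, a following "de" would be read as part of the escape.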

Note that in the typical string encodings people use (UTF-16 and UTF-8), 
a CSS escape can easily expand to multiple code units in general, so the 
only special thing about the \0 stuff is that it can expand into up to 6 
code units, whereas most Unicode chars expand into at most 2 in UTF-16 
and at most 4 in UTF-8.
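The code unit counts can be sketched like this. The function names are hypothetical, and escapeUtf8Units just encodes the zero-value rule described above, not anything taken from Gecko's source.

```cpp
#include <cstddef>

// UTF-16 code units needed for one Unicode code point.
std::size_t utf16Units(unsigned long cp) {
    return cp < 0x10000 ? 1 : 2;
}

// UTF-8 code units (bytes) needed for one Unicode code point.
std::size_t utf8Units(unsigned long cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Code units a hex escape expands to in UTF-8 when a zero value is
// expanded to one '0' per hex digit (at most 6 digits per escape).
std::size_t escapeUtf8Units(unsigned long value, std::size_t hexDigits) {
    return value == 0 ? hexDigits : utf8Units(value);
}
```

So escapeUtf8Units(0, 6) is 6, while any real code point stays at or below utf8Units' maximum of 4 and utf16Units' maximum of 2.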

> A thought occurs to me, though - maybe it makes sense to be consistent
> with my preferred treatment of literal nulls, and make \0 return
> U+FFFD as well?

I can probably live with that too.

> I've reproduced a slightly better testcase as
> http://www.xanthir.com/etc/css-null-testing/escaped-null-in-selector.html
>
> Here's a repro of what I get out of the CSSOM in FF:
>
> p { background-color: red; color: white; }
> .one { background-color: green; }
> .two { background-color: green; }
> \0 .three { background-color: green; }
> .four { background-color: green; }

Ah, you're seeing a bug in the serializer there, looks like.  Parsing 
the original text tokenizes an escaped null as a single identifier char, 
and puts a null in as the tag name in that third selector.  But then 
when you serialize an identifier, Gecko does:

   // Escape all characters below 0x20

And proceeds to snprintf with a format string of "\\%hX ", which is 
broken for null given how Gecko parses \0: the output is "\0 ", and 
that reads back as the character "0", not as a null.
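A small sketch of what that format string produces. The function escapeControlChar is a made-up name for illustration; only the format string itself comes from the serializer described above.

```cpp
#include <cstdio>
#include <string>

// Serialize one character below 0x20 as a CSS hex escape, using the
// format string quoted above. The longest output for a 16-bit value is
// "\FFFF " (6 chars), so an 8-byte buffer is enough.
std::string escapeControlChar(unsigned short ch) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\%hX ", ch);
    return std::string(buf);
}
```

For U+001F this gives "\1F ", which round-trips. For a null it gives "\0 ", which matches the "\0 .three" in the CSSOM output, and which a parser that expands a zero escape to '0' digits will read back as "0".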

> Heh, not quite.  If FF encounters an escaped literal NULL inside of a
> string or unquoted url, it truncates the string or url at that point.
> It doesn't treat it as invalid, and otherwise parses the token
> normally - it just throws away the contents of the token from the
> escape onward.

Ah, this is amusing. The parser actually keeps the null just fine.  But 
for strings the object model stores just a pointer with no length and 
relies on strlen to get the length, which of course truncates at the 
first embedded null whenever someone (which includes the rendering 
code) asks the object model for anything.
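The truncation is easy to reproduce in isolation. PointerOnlyString and observedLength below are hypothetical stand-ins for the object model's string storage, not the real Gecko types.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for storage that keeps a pointer but no length.
struct PointerOnlyString {
    const char* data;
    // The length is recomputed with strlen, so it stops at the first null.
    std::size_t length() const { return std::strlen(data); }
};

// Length a consumer would see after the parsed string (which may contain
// an embedded null) is handed to pointer-only storage.
std::size_t observedLength(const std::string& parsed) {
    PointerOnlyString stored{parsed.c_str()};
    return stored.length();
}
```

Given a parsed string built as std::string("ab\0cd", 5), the parser side still sees 5 characters, while observedLength reports 2.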

-Boris

Received on Tuesday, 23 October 2012 22:24:07 UTC