
Re: [css3-syntax] Null bytes and U+0000

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Tue, 23 Oct 2012 18:23:38 -0400
Message-ID: <508718EA.1080905@mit.edu>
To: "Tab Atkins Jr." <jackalmage@gmail.com>
CC: www-style@w3.org

On 10/23/12 5:57 PM, Tab Atkins Jr. wrote:
>> In any case once Gecko reaches end-of-escape it looks at the resulting hex
>> value.  If that value is 0, it outputs as many '0' as there were hex digit chars.
>
> Ah, that violates the "only emit one token per call" invariant that I
> was told was important.

I don't see why.  The "output" there is not tokens.  It's a string. 
Specifically, the string the escape expands to.

So as a concrete example, say you have something like this:

   abc\0000 def

Gecko would tokenize this as a single token: the identifier "abc0000def".

Just like given:

   abc\61 def

Gecko would tokenize as the single identifier token "abcadef".

Note that in the typical string encodings people use (UTF-16 and UTF-8), 
a CSS escape can easily expand to multiple code units in general, so the 
only special thing about the \0 stuff is that it can expand into up to 6 
code units, whereas most Unicode chars expand into at most 2 in UTF-16 
and at most 4 in UTF-8.
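
As a rough illustration of the expansion described above, here is a minimal
Python sketch. It is not Gecko's code: the names `expand_css_escapes` and
`code_units` are hypothetical, and mapping out-of-range values to U+FFFD is
an assumption. It models only what this message states: a hex escape whose
value is 0 expands to one '0' per hex digit, and any other value expands to
that code point.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
CSS_WHITESPACE = " \t\n\r\f"


def expand_css_escapes(text):
    """Expand backslash escapes in an identifier (illustrative model only)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        i += 1  # skip the backslash
        start = i
        # A hex escape is at most 6 hex digits long.
        while i < len(text) and i - start < 6 and text[i] in HEX_DIGITS:
            i += 1
        digits = text[start:i]
        if not digits:
            # Not a hex escape: the next character stands for itself.
            if i < len(text):
                out.append(text[i])
                i += 1
            continue
        value = int(digits, 16)
        if value == 0:
            # Behavior described in this message: one '0' per hex digit seen.
            out.append("0" * len(digits))
        elif value > 0x10FFFF:
            # Assumption: out-of-range values become U+FFFD.
            out.append("\ufffd")
        else:
            out.append(chr(value))
        # One whitespace character after a hex escape is part of the escape.
        if i < len(text) and text[i] in CSS_WHITESPACE:
            i += 1
    return "".join(out)


def code_units(s):
    """Return (UTF-16 code units, UTF-8 code units) for a string."""
    return len(s.encode("utf-16-le")) // 2, len(s.encode("utf-8"))
```

Under this model, `expand_css_escapes(r"abc\61 def")` gives "abcadef".
`code_units(expand_css_escapes(r"\1F600"))` gives (2, 4), while
`code_units(expand_css_escapes(r"\000000"))` gives (6, 6), which is the
six-code-unit case mentioned above.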

> A thought occurs to me, though - maybe it makes sense to be consistent
> with my preferred treatment of literal nulls, and make \0 return
> U+FFFD as well?

I can probably live with that too.

> I've reproduced a slightly better testcase as
> http://www.xanthir.com/etc/css-null-testing/escaped-null-in-selector.html
>
> Here's a repro of what I get out of the CSSOM in FF:
>
> p { background-color: red; color: white; }
> .one { background-color: green; }
> .two { background-color: green; }
> \0 .three { background-color: green; }
> .four { background-color: green; }

Ah, you're seeing a bug in the serializer there, looks like.  Parsing 
the original text tokenizes an escaped null as a single identifier char, 
and puts a null in as the tag name in that third selector.  But then 
when you serialize an identifier, Gecko does:

   // Escape all characters below 0x20

And proceeds to snprintf with a format string of "\\%hX ", which is 
broken for null given how Gecko parses \0: the serialized "\0 " reads 
back as the character "0", not as a null.
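
The round-trip problem can be sketched as follows. This is a hypothetical
Python model, not Gecko's serializer: `serialize_ident` and
`reparse_hex_escape` are made-up names, and only the "\\%hX " format string
and the 0x20 threshold come from the text above. Python's `%` formatting
accepts the `h` length modifier and ignores it.

```python
def serialize_ident(ident):
    """Escape every character below 0x20 as backslash, hex digits, space."""
    out = []
    for ch in ident:
        if ord(ch) < 0x20:
            out.append("\\%hX " % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


def reparse_hex_escape(escape):
    """Model of the parsing rule from this thread for a single hex escape.

    Expects text of the form: backslash, 1-6 hex digits, optional space.
    """
    digits = escape[1:].rstrip(" ")
    value = int(digits, 16)
    if value == 0:
        # A zero value expands to one '0' per hex digit, not to a null.
        return "0" * len(digits)
    return chr(value)
```

With these definitions, U+001F survives the trip
(`reparse_hex_escape(serialize_ident("\x1f"))` is "\x1f" again), but a null
serializes to `\0 ` and reads back as "0".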

> Heh, not quite.  If FF encounters an escaped literal NULL inside of a
> string or unquoted url, it truncates the string or url at that point.
> It doesn't treat it as invalid, and otherwise parses the token
> normally - it just throws away the contents of the token from the
> escape onward.

Ah, this is amusing. The parser actually keeps the null just fine.  But 
for strings the object model stores them as a pointer and no length and 
relies on strlen to get the length, which of course truncates at 
embedded null whenever someone (which includes the rendering code) asks 
the object model for anything.
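
The truncation is ordinary C-string behavior and can be reproduced with
Python's `ctypes` module. This is only an illustration of storing a pointer
without a length, not Gecko's object model, and the sample value is made up.

```python
import ctypes

# A parsed string value containing an embedded null (made-up sample).
parsed = b"foo\x00bar"

# create_string_buffer copies the bytes and appends a terminating null.
buf = ctypes.create_string_buffer(parsed)

# Reading with an explicit length keeps everything the parser produced.
with_length = ctypes.string_at(buf, len(parsed))

# Reading as a null-terminated C string stops at the first null,
# which is what a strlen-based consumer would see.
as_c_string = ctypes.string_at(buf)
```

Here `with_length` is the full `b"foo\x00bar"`, while `as_c_string` is just
`b"foo"`: the data is still in memory, but any reader that relies on the
terminator never sees it.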

-Boris
Received on Tuesday, 23 October 2012 22:24:07 GMT