Re: [css3-syntax] Null bytes and U+0000 from Boris Zbarsky on 2012-10-23 (www-style@w3.org from October 2012)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Mon, 22 Oct 2012 23:02:39 -0400
To: www-style@w3.org
Message-ID: <508608CF.3080200@mit.edu>

On 10/22/12 8:28 PM, Tab Atkins Jr. wrote:
> * Firefox forces "\0" to be interpreted as escaping a literal "0",
> even if more digits follow it.  (More testing shows that they simply
> refuse to tokenize \0 as a hex escape, regardless of how many 0s there
> are.  This means that I was lied to - they have to do 6-char lookahead
> when parsing stylesheets, not 3-char. ^_^)

There is no 6-char lookahead here.  Pure look-behind.  The code sees a 
backslash, then starts reading chars one at a time, counting how many 
hex digits it has read, and stops when it reaches 6 digits or a 
non-hex-digit char.  Note that you have to keep track of how many chars 
you read, because if you read all 6 chars you still have to go ahead and 
swallow the following whitespace, if any.

In any case, once Gecko reaches end-of-escape it looks at the resulting 
hex value.  If that value is 0, it outputs as many '0' chars as it read 
hex digit chars.
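
To make the description above concrete, here is a rough sketch of that kind of scanner.  It is not Gecko's code: the function name, the return shape, the choice to swallow whitespace even when the value is zero, and the U+FFFD fallback for out-of-range values are all assumptions made for illustration, and the special cases for backslash-newline and CRLF are left out.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
WHITESPACE = set(" \t\n\r\f")


def consume_escape(text, pos):
    """Hypothetical helper; `pos` is the index just after the backslash.

    Returns (decoded_text, new_pos).
    """
    value = 0
    count = 0
    # Read one char at a time, counting hex digits, and stop at 6 digits
    # or at the first non-hex-digit char.  No char is examined twice.
    while count < 6 and pos < len(text) and text[pos] in HEX_DIGITS:
        value = value * 16 + int(text[pos], 16)
        pos += 1
        count += 1

    if count == 0:
        # Not a hex escape: the backslash just cancels the meaning of
        # the next char, which is kept as-is (the \w case).
        if pos < len(text):
            return text[pos], pos + 1
        return "", pos

    # Swallow a single whitespace char after the digits, even when all
    # 6 digits were read.  (Assumption: done regardless of the value.)
    if pos < len(text) and text[pos] in WHITESPACE:
        pos += 1

    if value == 0:
        # Zero is not a valid escape value, so emit as many literal '0'
        # chars as hex digits were read.
        return "0" * count, pos
    if value > 0x10FFFF:
        # Assumption: out-of-range values become U+FFFD in this sketch.
        return "\ufffd", pos
    return chr(value), pos
```

For instance, consume_escape("000000 x", 0) yields six '0' chars and leaves the position at the "x", while consume_escape("41 b", 0) yields "A".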

All of this never requires more than 2-char lookahead that I can see. 
Maybe even 1-char; it's a bit hard to tell from this code.

Note that \0 or \000000 are not valid hex escapes in CSS2.1, which is 
why Gecko never treats them as hex escapes, and I'm pretty surprised 
that WebKit does so.  Guess we never had a test in the test suite for 
little details like section 4.1.3?  ;)

There's a nice code comment here about that:

     // "[at most six hexadecimal digits following a backslash] stand
     // for the ISO 10646 character with that number, which must not be
     // zero. (It is undefined in CSS 2.1 what happens if a style sheet
     // does contain a character with Unicode codepoint zero.)"
     //   -- CSS2.1 section 4.1.3
     //
     // Silently deleting \0 opens a content-filtration loophole (see
     // bug 228856), so what we do instead is pretend the "cancels the
     // meaning of special characters" rule applied.

> Next, I tested an actual escaped null, that is, a \ followed by a null.
...
> * Firefox appears to convert it into a \0, and then act normally.

That's ... odd.  I would expect \ followed by an actual null, assuming 
the null gets down to the CSS parser, to just keep the null as a 
character in the tokenization, the same way that \w would work.  Link to 
the testcase you were using here?

> * Firefox treats it as an invalid value and drops the declaration.
> Otherwise, acts as normal.

Wouldn't this depend on the value?  I'm pretty sure inside a string, 
say, the null would just be preserved....  But of course if you do 
something like this C string: "color: \\\0" then it's not a valid color 
and will be dropped.  Again, what was the actual test here?
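
For anyone squinting at that C string literal: the same escapes happen to mean the same thing in a Python string literal, so a quick check like the one below shows what the declaration text actually contains.  The variable name is just for illustration.

```python
# "\\" is one backslash and "\0" is one U+0000, exactly as in the C literal.
declaration = "color: \\\0"

# Seven chars for "color: ", then the backslash, then the null.
length = len(declaration)
last_two = [hex(ord(ch)) for ch in declaration[-2:]]
```

So the value the parser sees after "color: " is a backslash followed by a real null character, not the two characters "\0".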

> 1. Browsers are remarkably divergent in behavior here, so I can
> probably just spec something sane and be done with it.

Agreed.

> 2. Nobody does anything *useful* with nulls, so getting rid of them in
> the input string is almost certainly just fine.

Modulo issues like https://bugzilla.mozilla.org/show_bug.cgi?id=228856 
cited in the above code comment.

> 3. I'd like to know why Firefox refuses to allow a hex-escaped null.

Because that's what CSS2.1 specs, afaict.

> 1. Go ahead and replace nulls in the input stream with U+FFFD.  Most
> browsers do stupid, stupid things with nulls, and the one good browser
> (FF) should act the same with U+FFFD as it does with U+0000.  Avoiding
> the problem seems to be the easiest path to convergence.
> 2. Unless Firefox has a good reason to disallow \0 (like, the person
> who authored their grammar was just overzealous), I'll allow \0 as a
> valid hex escape.

I can probably live with both of these, though implementing #1 is 
strictly more code than just doing #2 and not using silly 
null-terminated strings in your string representation (which you can't 
do anyway, if you do #2, of course).
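
A minimal sketch of what #1 amounts to, assuming the stylesheet has already been decoded to a string; the function name is made up for illustration and this is not taken from any browser.

```python
REPLACEMENT_CHAR = "\ufffd"


def replace_nulls(css_text: str) -> str:
    """Hypothetical preprocessing pass: map every U+0000 to U+FFFD."""
    return css_text.replace("\x00", REPLACEMENT_CHAR)


sheet = "a { content: 'x\x00y' }"
cleaned = replace_nulls(sheet)
```

Python strings carry an explicit length, so the embedded null in `sheet` is just another character; that is the property #2 needs from the string representation, and with it the extra pass above is unnecessary.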

-Boris
Received on Tuesday, 23 October 2012 03:03:08 UTC