[csswg-drafts] [css-syntax] Which trailing whitespace characters are allowed after escaped hex digits? (#5835) from Boris Dalstein via GitHub on 2021-01-05 (public-css-archive@w3.org from January 2021)

From: Boris Dalstein via GitHub <sysbot+gh@w3.org>
Date: Tue, 05 Jan 2021 09:24:39 +0000
To: public-css-archive@w3.org
Message-ID: <issues.opened-778796734-1609838678-sysbot+gh@w3.org>

dalboris has just created a new issue for https://github.com/w3c/csswg-drafts:

== [css-syntax] Which trailing whitespace characters are allowed after escaped hex digits? ==
Hi everyone! I'm implementing a CSS parser, and I found a strange edge case in the "consume escaped code point" routine which I haven't seen documented anywhere on the Internet. So I'm asking for clarification, and this could serve as documentation for future implementors.

Here is the relevant part of the specification:

https://www.w3.org/TR/css-syntax-3/#consume-escaped-code-point

When consuming an escaped code point that is expressed as hex digits, the specification says that a single trailing whitespace character should also be consumed. The rationale is well explained in [this article](https://mathiasbynens.be/notes/css-escapes): this makes it possible to express the string `foo©bar` as `foo\A9 bar`. Otherwise, if we write `foo\A9bar`, the `b` and `a` characters would also be considered hex digits, and the escape would be read as `\A9ba` (U+A9BA) followed by `r`.
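
For illustration, here is a minimal Python sketch of that routine as I read the specification. It is not taken from any browser, and the names `consume_escaped_code_point` and `unescape` are made up for this example. It assumes the input has already been preprocessed (so a newline is always U+000A), and `unescape` skips the "valid escape" check that a real tokenizer performs before calling the routine.

```python
from typing import Tuple

HEX_DIGITS = set("0123456789abcdefABCDEF")
# Whitespace per css-syntax after preprocessing: newline, tab, space.
WHITESPACE = {"\n", "\t", " "}


def consume_escaped_code_point(text: str, pos: int) -> Tuple[str, int]:
    """Decode one escape; `pos` is the index just after the backslash.

    Returns the decoded character and the index of the next unread character.
    """
    if pos >= len(text):
        # EOF right after the backslash: parse error, U+FFFD.
        return "\ufffd", pos
    if text[pos] not in HEX_DIGITS:
        # Non-hex escape such as \" or \g: the character stands for itself.
        return text[pos], pos + 1

    # Consume at most six hex digits in total.
    end = pos
    while end < len(text) and end - pos < 6 and text[end] in HEX_DIGITS:
        end += 1
    value = int(text[pos:end], 16)

    # Consume ONE trailing whitespace character, whichever kind it is.
    if end < len(text) and text[end] in WHITESPACE:
        end += 1

    # Zero, surrogates, and out-of-range values map to U+FFFD.
    if value == 0 or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        return "\ufffd", end
    return chr(value), end


def unescape(text: str) -> str:
    """Hypothetical helper: decode every backslash escape in `text`."""
    out = []
    pos = 0
    while pos < len(text):
        if text[pos] == "\\":
            char, pos = consume_escaped_code_point(text, pos + 1)
            out.append(char)
        else:
            out.append(text[pos])
            pos += 1
    return "".join(out)
```

With this sketch, `unescape(r"foo\A9 bar")` gives `foo©bar`, while `unescape(r"foo\A9bar")` gives `foo` followed by U+A9BA and `r`. Only one whitespace character is swallowed, so two spaces after `\A9` leave one space in the result.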

However, it wasn't completely clear to me which whitespace characters should be considered here:
- Should we just consume a trailing U+0020 SPACE?
- Or should we also consume a trailing U+000A LINE FEED ("newline") or trailing U+0009 CHARACTER TABULATION ("tab")?

The specification as written seems to imply the latter, that is, a trailing newline or trailing tab should also be consumed. However, in the context of [consuming a string token](https://www.w3.org/TR/css-syntax-3/#consume-string-token), the presence of a newline is normally considered a parse error, and a `<bad-string-token>` should be returned. So it seemed strange to me to allow the presence of a newline when it appears after escaped hex digits.

So I checked in Chrome and Firefox, and both agree with what the specification seems to imply: a trailing newline after escaped hex digits in a string is valid, whereas a bare newline in a string is otherwise a parse error.

For example, in the following HTML, we only see "World" rendered (because "Hello Beautiful " contains a newline, so it's a bad string, so it's not rendered):

```
<!DOCTYPE html>
<html>
<head>
<style>
p:before {
    content: "Hello
 Beautiful ";
}
</style>
</head>
<body>
<p>World</p>
</body>
</html>
```

But if we escape the letter `o` (U+006F):

```
    content: "Hell\6f
 Beautiful ";
```

Then the "Hello Beautiful " string is now valid despite containing a newline, and the whole "Hello Beautiful World" is rendered.

Since browsers and the specification are consistent, the answer to the question in the title is clear: all three whitespace characters (space, newline, and tab) are allowed as a trailing character after escaped hex digits.
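
To make the contrast concrete, here is a rough Python sketch of string-token consumption under my reading of the specification. It is only an illustration, not how any browser implements it: the function name `consume_string_token`, the `HEX_ESCAPE` pattern, and the `(kind, value, next_position)` return shape are assumptions made for this example, and the input is assumed to be preprocessed so that every newline is U+000A.

```python
import re

# Up to six hex digits, optionally followed by ONE whitespace character
# (space, tab, or newline), which is swallowed as part of the escape.
HEX_ESCAPE = re.compile(r"([0-9a-fA-F]{1,6})[ \t\n]?")


def consume_string_token(css, pos):
    """Consume a string starting at the opening quote css[pos].

    Returns (kind, value, next_position), where kind is "string" or
    "bad-string".
    """
    quote = css[pos]
    pos += 1
    chars = []
    while pos < len(css):
        ch = css[pos]
        if ch == quote:
            return "string", "".join(chars), pos + 1
        if ch == "\n":
            # Bare newline: parse error; the newline itself is not consumed.
            return "bad-string", "", pos
        if ch != "\\":
            chars.append(ch)
            pos += 1
            continue
        pos += 1  # skip the backslash
        if pos >= len(css):
            break  # backslash at EOF: nothing to append
        if css[pos] == "\n":
            pos += 1  # backslash-newline is a line continuation
            continue
        match = HEX_ESCAPE.match(css, pos)
        if match:
            value = int(match.group(1), 16)
            invalid = value == 0 or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF
            chars.append("\ufffd" if invalid else chr(value))
            pos = match.end()  # includes the trailing whitespace, if any
        else:
            chars.append(css[pos])  # non-hex escape stands for itself
            pos += 1
    # EOF before the closing quote: parse error, but still a string token.
    return "string", "".join(chars), pos
```

Fed the two `content` values from the examples above, this sketch returns a `bad-string` for `"Hello` followed by a newline, but a `string` with the value `Hello Beautiful ` once the `o` is written as `\6f`, because the newline is consumed as part of the escape before the string loop ever sees it.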

However, this behavior is surprising to me. Is it intentional? Would it be useful to add a clarification note in the specification?

Thank you!

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/5835 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Tuesday, 5 January 2021 09:24:42 UTC