Re: [I18N Core Response][CSS21] out of range unicode escapes from Mark Davis on 2007-06-26 (www-style@w3.org from June 2007)

From: Mark Davis <mark.davis@icu-project.org>
Date: Tue, 26 Jun 2007 14:16:43 -0700
To: "Addison Phillips" <addison@yahoo-inc.com>
Cc: www-style@w3.org, member-i18n-core@w3.org
Message-ID: <30b660a20706261416w1a913c88s23af4b22d8c684d4@mail.gmail.com>

The choice of whether to do #2 or #3 in parsing depends on the environment.
In the case of CSS, it'd be nice to see some specific examples of what
happens. For example, for a programming language, the difference between #2
and #3 is mostly that under #2 literals would continue to work but contain
U+FFFD.
That is, let's suppose that [X] represents a defective byte sequence
(ill-formed Unicode or escape). Then take the following examples:

Stri[X]ng x = "abcdef"; // line 1
String x = "abc[X]def"; // line 2
String x = "abcdef"; // line 3[X]

With strategy #2: In line 1, we'd get a compile error, and stop (since
U+FFFD isn't allowed in variable names). In line 2, we'd continue, but with
a literal that contained U+FFFD. In line 3, the [X] is in a comment, so it
wouldn't make any difference at all.
With strategy #3: the compiler could just bail when the defective sequence
is encountered. The advantage of this is that the programmer can find and
fix it.
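
As a rough sketch of the two strategies for line 2 above, Python's built-in
UTF-8 decoder can stand in for the compiler's input stage. The byte 0xFF
plays the role of [X]; the variable names are illustrative only, and a real
compiler would of course do more than decode.

```python
# Hypothetical stand-in for line 2: 0xFF is never valid in UTF-8,
# so it plays the role of the defective sequence [X].
source = b'String x = "abc\xffdef";'

# Strategy #2: substitute U+FFFD and keep going. The literal survives,
# but now contains the replacement character.
replaced = source.decode("utf-8", errors="replace")

# Strategy #3: stop at the defect and report where it is, so the
# programmer can find and fix it.
try:
    source.decode("utf-8", errors="strict")
    failure_offset = None
except UnicodeDecodeError as exc:
    failure_offset = exc.start  # byte offset of the defective sequence

print(repr(replaced))
print(failure_offset)
```

With #2 the only trace of the problem is a U+FFFD inside the literal; with #3
the decoder hands back the exact byte offset of the failure.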

In CSS the situation is somewhat different. An error just means that some
declaration doesn't work; it is silently disabled. The user has to notice that
the CSS isn't doing what it is supposed to, and try to debug it. I could see
circumstances in which #2 might be better, just because it may be clearer to
a user what is going on when there is a failure.

But I'm a bit fuzzy on what happens in either case. Take the following:

h1 {
 col[X]or: #990000;
 background-color: #FC9804;
 background-image: url("butter[X]fly.gif");
}

Does #3 mean that all of the declarations for h1 are suppressed in the above?
Or only the first and third?
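
One way to make the question concrete is a toy declaration parser. This is
only a sketch under stated assumptions: the out-of-range escape \110000
stands in for [X], the function names are hypothetical, declarations are
split naively on ";", and strategy #3 is read the way the proposed text
below words it (ignore any declaration containing the bad escape). It does
not claim to show what any user agent actually does.

```python
import re

MAX_CODE_POINT = 0x10FFFF
# A CSS 2.1 escape: backslash plus one to six hex digits.
# Simplification: optional trailing whitespace after the digits is ignored.
ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})")


class OutOfRangeEscape(ValueError):
    """Raised under strategy #3 when an escape exceeds U+10FFFF."""


def decode_escapes(text, strategy):
    """Decode hex escapes; handle out-of-range values per the strategy."""
    def substitute(match):
        code_point = int(match.group(1), 16)
        if code_point > MAX_CODE_POINT:
            if strategy == 2:
                return "\ufffd"  # strategy #2: replace with U+FFFD
            raise OutOfRangeEscape(match.group(0))  # strategy #3
        return chr(code_point)
    return ESCAPE.sub(substitute, text)


def parse_declarations(block, strategy):
    """Return the declarations that survive, as a name -> value dict."""
    kept = {}
    for declaration in block.split(";"):
        if ":" not in declaration:
            continue
        try:
            decoded = decode_escapes(declaration, strategy)
        except OutOfRangeEscape:
            continue  # drop only the declaration containing the bad escape
        name, value = decoded.split(":", 1)
        kept[name.strip()] = value.strip()
    return kept


block = r'''
 col\110000or: #990000;
 background-color: #FC9804;
 background-image: url("butter\110000fly.gif");
'''

print(parse_declarations(block, 2))
print(parse_declarations(block, 3))
```

Under this reading, #3 keeps only background-color. With #2 all three
declarations parse: the first would presumably still be ignored later as an
unknown property name, while the third would point at an image name
containing U+FFFD.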

Mark

On 6/26/07, Addison Phillips <addison@yahoo-inc.com> wrote:
>
>
> Hi,
>
> I'm writing on behalf of the Internationalization Core Working Group. In
> our most recent teleconference, we discussed this issue again.
> Basically, the options for handling out of range Unicode escapes were:
>
> - do nothing/permit the invalid code point
> - replace with U+FFFD
> - generate a parse error
>
> The first option is a security risk and shouldn't be seriously
> considered. Either of the other options could potentially be a valid
> choice.
>
> We note that this issue has to do with an escape sequence representing a
> Unicode character. It shouldn't be associated with transcoding errors
> from legacy encodings, although it could result from a bug in an escape
> generator. That is, such malformed sequences are generated purposefully.
>
> We feel that the best response to this issue is to generate a parse
> error. Use of the replacement character might mask errors in the style
> sheet (since there is no obvious failure or failure location), while it
> is unlikely that the resulting sequence would produce the desired
> stylistic behavior anyway. Therefore, we recommend that the CSS working
> group, for clarity, add this text to 4.1.3 in CSS 2.1 at about
>
>    http://www.w3.org/TR/CSS21/syndata.html#q6
>
>      If the number is outside the range allowed by Unicode (e.g.,
>      "\110000" is above 0x10FFFF, the largest Unicode code point),
>      then the parser should treat this as a parse error and a user agent
>      must ignore any declaration containing this invalid property name
>      or value.
>
> Note that this text is slightly revised from a previous proposal.
>
> We welcome any comments you might have on this issue.
>
> Best Regards,
>
> Addison
>
> --
> Addison Phillips
> Globalization Architect -- Yahoo! Inc.
> Chair -- W3C Internationalization Core WG
>
> Internationalization is an architecture.
> It is not a feature.
>
>

Received on Tuesday, 26 June 2007 21:16:48 UTC