RE: [CSS21] out of range unicode escapes from Paul Nelson (ATC) on 2007-05-31 (public-i18n-core@w3.org from April to June 2007)

From: Paul Nelson (ATC) <paulnel@winse.microsoft.com>
Date: Thu, 31 May 2007 02:07:33 -0700
To: David Clarke <d.r.clarke@sheffield.ac.uk>, Mark Davis <mark.davis@icu-project.org>
CC: <www-style@w3.org>, <public-i18n-core@w3.org>
Message-ID: <49C257E2C13F584790B2E302E021B6F9137C2B79@winse-msg-01.segroup.winse.corp.micros>

Of course the issue is how one is consuming the stream of text coming in.

 

For example, if the text is going to be displayed, the invalid sequence needs to be replaced. Thus, an error in an inline CSS property would already have been replaced during the initial parsing and conversion of the .html file to Unicode. If, however, the text is a .css file that is not displayed and is only read by the CSS property parser, it is easy to throw a parse error and move on.

 

When it comes time to render in the UA, who cares about trying to render correctly if there are invalid Unicode escapes? Whether the character is converted or turned into a replacement character, the result is the same…something other than what the author intended…unless they were a malicious person trying to crash your UA.

 

Regards,

 

Paul

 

From: www-style-request@w3.org [mailto:www-style-request@w3.org] On Behalf Of David Clarke
Sent: Thursday, May 31, 2007 4:25 PM
To: Mark Davis
Cc: www-style@w3.org; public-i18n-core@w3.org
Subject: Re: [CSS21] out of range unicode escapes

 

Mark et al,

I stand corrected on the options for handling ill-formed Unicode source sequences and on the use of the replacement character in general.

As a personal opinion, it would seem logical to treat any unexpected character, or sequence of characters, in CSS in the same way: a CSS parser should treat it as a parse error. This would provide a consistent approach without adding special-case complexity to the parser.

I really feel that an invalid Unicode source sequence in a block of CSS is of the same nature as any other invalid sequence of characters. Replacing an invalid Unicode sequence with another character is likely to hide errors and produce an unintended result.

Mark Davis wrote: 

This may be based on a mistaken premise. While the primary use of U+FFFD is as stated, it is also used as a replacement for ill-formed Unicode. See http://www.unicode.org/reports/tr22/ for example.

"In the case of illegal source sequences, a conversion routine will typically provide three options. It may stop with an error (or throw an exception). Secondly, it may skip the source sequence. While this is commonly an option, it can also hide corruption problems in the source text. Lastly, it may map to a substitution character such as the Unicode REPLACEMENT CHARACTER (U+FFFD)."
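
As a rough illustration of those three options (not taken from TR22 itself), Python's built-in bytes.decode exposes them through its "errors" argument: "strict" stops with an exception, "ignore" skips the bad sequence, and "replace" maps it to U+FFFD. The sample input and the helper name decode_three_ways are made up for this sketch.

```python
# An ill-formed UTF-8 input: the byte 0xFF can never appear in UTF-8.
raw = b"colo\xffr"


def decode_three_ways(data):
    """Return (strict, skipped, replaced) decodings of the same bytes."""
    # Option 1: stop with an error (reported here as None).
    try:
        strict = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        strict = None
    # Option 2: skip the illegal source sequence.
    skipped = data.decode("utf-8", errors="ignore")
    # Option 3: map it to U+FFFD REPLACEMENT CHARACTER.
    replaced = data.decode("utf-8", errors="replace")
    return strict, skipped, replaced


results = decode_three_ways(raw)
```

Here the "ignore" result is the valid-looking string "color", which is exactly the hidden-corruption risk the quoted passage warns about; the "replace" result keeps a visible U+FFFD where the bad byte was.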

Mark

 
 This behaviour is not appropriate because U+FFFD is specified as a
 Replacement Character to be "used as a substitute for an uninterpretable
 character *from another encoding*". 
 see: http://unicode.org/glossary/#replacement_character .
 
 The correct response to any invalid Unicode escape should be to treat it
 as a parse error (see section 4.1.8), in the same way that any other
 invalid or unexpected character would be.
 
 For clarity, add this text to section 4.1.3 of CSS 2.1
 http://www.w3.org/TR/CSS21/syndata.html#q6 :
 
     If the number is outside the range allowed by Unicode (e.g.,
     "\110000" is above the maximum 10FFFF allowed in current Unicode),
     then the parser should treat this as a parse error, and a user
     agent must ignore a declaration containing this invalid property
     name or value.
 
 see: http://www.w3.org/TR/CSS21/syndata.html#ignore
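
A minimal sketch of the proposed rule, assuming a much simplified escape syntax (a backslash followed by one to six hex digits, ignoring the optional trailing whitespace that CSS 2.1 also permits). The names CSSParseError, unescape and keep_valid_declarations are hypothetical and do not come from any real CSS parser; the "replace" policy is included only for contrast with the U+FFFD behaviour debated in this thread.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode allows

# Simplified: backslash plus 1-6 hex digits; real CSS escapes have more rules.
_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})")


class CSSParseError(ValueError):
    """Hypothetical error type for this sketch."""


def unescape(text, policy="error"):
    """Expand hex escapes; out-of-range ones raise or become U+FFFD."""
    def expand(match):
        code_point = int(match.group(1), 16)
        if code_point > MAX_CODE_POINT:
            if policy == "error":
                raise CSSParseError("escape out of range: " + match.group(0))
            return "\ufffd"  # the alternative "replace" policy
        return chr(code_point)
    return _ESCAPE.sub(expand, text)


def keep_valid_declarations(declarations):
    """Drop any declaration whose escapes cause a parse error."""
    kept = []
    for declaration in declarations:
        try:
            kept.append(unescape(declaration))
        except CSSParseError:
            continue  # ignore the whole declaration, as the proposal asks
    return kept


declarations = [r"content: '\41'", r"content: '\110000'"]
kept = keep_valid_declarations(declarations)
replaced = unescape(declarations[1], policy="replace")
```

Under the "error" policy only the first declaration survives, with \41 expanded to "A"; under the "replace" policy the second declaration is kept but its content silently becomes U+FFFD.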

 
 ----
 David Clarke 
 
 
 
 
 




-- 
Mark

Received on Thursday, 31 May 2007 09:07:00 UTC