Re: [RC5] character-encoding-038 invalid from Linss, Peter on 2011-01-16 (public-css-testsuite@w3.org from January 2011)

From: Linss, Peter <peter.linss@hp.com>
Date: Sun, 16 Jan 2011 20:47:20 +0000
To: Alan Gresley <alan@css-class.com>
CC: Ms2ger <ms2ger@gmail.com>, fantasai <fantasai.lists@inkedblade.net>, Public CSS test suite mailing list <public-css-testsuite@w3.org>, Ian Hickson <ian@hixie.ch>
Message-ID: <C5A91A74-BB9D-4C02-AE81-412ED3636726@hp.com>

On Jan 16, 2011, at 5:25 AM, Alan Gresley wrote:

On 16/01/2011 10:38 PM, Ms2ger wrote:
{snip}

To clarify something I noticed in Peter's reply to me, in case it was
missed.

I wrote this in reply to Peter's initial message.

The external stylesheet CSS [1] has this

.t�st { color: white; background: green; }

_has ? within black diamond_

Peter replied and said this.

The external stylesheet is:
.tést { color: white; background: green; }

_has Latin small letter e with acute_

and then this.

So I presume the stylesheet should be updated to be:
.t�st { color: yellow; background: red; }

_has ? within black diamond_

So are Peter and I seeing different characters (Peter sees 'e' with acute
and I'm seeing '?' within black diamond)?

We're seeing the same characters. I explicitly sent the 'small e with acute' to show how the stylesheet looks if interpreted as ISO-8859-1 encoding. I sent the '? in diamond' to mean the stylesheet as it is, with only the properties changed.

In the external stylesheet on the original test on Hixie's server I see
this.

.tést { color: white; background: green; }

_has Latin small letter e with acute_

Right, that depends on how your browser is interpreting the encoding. Neither Hixie's original server nor our test server is sending an explicit encoding at the moment. According to the CSS spec, the browser should be interpreting the stylesheet as UTF-8. If you're seeing the 'e with acute', it's using ISO-8859-1.
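
For anyone who wants to reproduce the two views outside a browser, here is a rough sketch in Python. It assumes the selector is stored with a single 0xE9 octet for the accented letter, and it uses Python's own codecs, which only approximate what a browser does (Python's 'replace' error handler stands in for the browser's substitution behaviour).

```python
# The accented letter as a single octet, as an ISO-8859-1 editor would save it.
raw = ".t".encode("ascii") + bytes([0xE9]) + "st".encode("ascii")

# Reading the octets as ISO-8859-1: every octet maps to exactly one character.
as_latin1 = raw.decode("iso-8859-1")

# Reading the same octets as UTF-8: 0xE9 does not start a valid sequence here,
# so the 'replace' handler substitutes U+FFFD for it.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

The octets never change between the two calls; only the decoding rule does, which is why two people can look at the same file and report different characters.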

So if the é (letter e with acute) is (U+00E9), what is the Unicode for �
(? within black diamond)?

The � is U+FFFD, the 'Replacement Character', used to replace an unknown or unprintable character. In this case, the stylesheet contains the octets: 2e 74 e9 73 74
When the stylesheet is interpreted as UTF-8, the upper four bits of the e9 octet (1110) mean that the character in question is represented by three octets. The following two octets should each have their upper two bits set to '10', and the actual code point is defined by the lower four bits of the first octet and the lower six bits of each of the following two octets. Since the next octet (73) has its upper two bits set to '01' rather than '10', the e9 octet gets the U+FFFD instead.
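
The bit tests described above can be sketched in a few lines of Python. The helper names expected_length and is_continuation are made up for illustration, and the lead-octet check is deliberately simplified: it looks only at the bit prefix and does not reject overlong or out-of-range lead octets the way a full UTF-8 decoder would.

```python
def expected_length(lead: int) -> int:
    """Number of octets a UTF-8 sequence claims, judged only by its lead octet.

    Hypothetical helper: returns 0 for a continuation octet or an
    unrecognised prefix. Not a full validity check.
    """
    if lead >> 7 == 0b0:
        return 1          # 0xxxxxxx: single octet (ASCII)
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    return 0


def is_continuation(octet: int) -> bool:
    """True if the upper two bits are '10', as required after a lead octet."""
    return octet >> 6 == 0b10


octets = bytes.fromhex("2e 74 e9 73 74")
lead, following = octets[2], octets[3]

print(f"{lead:08b}", expected_length(lead))          # bits of e9 and claimed length
print(f"{following:08b}", is_continuation(following))  # bits of 73 and the '10' test
```

The e9 octet claims a three-octet sequence, but the 73 that follows fails the continuation test, so a decoder has nothing valid to combine it with and substitutes U+FFFD for the e9 alone.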

Received on Sunday, 16 January 2011 20:49:34 UTC