RE: How browsers display IRI's with mixed encodings from Phillips, Addison on 2011-07-22 (public-iri@w3.org from July 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Thu, 21 Jul 2011 17:30:38 -0700
To: Chris Weber <chris@lookout.net>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A947B0D13@EX-SEA31-D.ant.amazon.com>

So I'm going to ask this question with an attitude of feigned ignorance...

What's the fixation with ISO-8859-1? 

In IRI terms, there are characters and there are "random octets". When mapping to URI, percent encoding is applied to both. However, the UTF-8 sequences can be decoded back to characters. The random octets not so much.
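
A rough Python sketch of that asymmetry, using only the standard library's urllib.parse. The sample strings are borrowed from the test case quoted below; the variable names are just for illustration.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# A character reaches URI form via its UTF-8 bytes (U+00FC is "u with diaeresis").
from_char = quote("D\u00fcrst")       # 'D%C3%BCrst'
# A raw octet with no declared encoding is simply escaped as-is.
from_octet = quote(b"D\xfcrst")       # 'D%FCrst'

# The UTF-8 form decodes back to the original character.
round_trip = unquote(from_char, errors="strict")

# The raw-octet form only comes back as bytes; a strict UTF-8 decode fails.
raw = unquote_to_bytes(from_octet)
try:
    unquote(from_octet, errors="strict")
    octet_is_utf8 = True
except UnicodeDecodeError:
    octet_is_utf8 = False
```

Both strings look equally ordinary once percent-encoded; only the attempt to decode them tells the two cases apart.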

Who's to say that %FC doesn't represent U+045C (ќ) from ISO 8859-5? Or some other character value in some other encoding?
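
To illustrate, a small Python sketch that reads the single octet 0xFC under a handful of legacy encodings. The list of candidate encodings is an arbitrary choice for the example, not anything the IRI spec names.

```python
# The octet that %FC stands for, with no encoding attached.
octet = b"\xfc"

# Arbitrary sample of single-byte encodings (Python codec names).
candidates = ["iso8859_1", "cp1252", "iso8859_5", "iso8859_7", "koi8_r"]

# Map each encoding name to the character it assigns to 0xFC.
readings = {name: octet.decode(name) for name in candidates}

for name, char in readings.items():
    print(f"{name:10} U+{ord(char):04X} {char}")
```

Every one of these decodes succeeds, and nothing in the octet itself indicates which reading was intended.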

Yes, I know that Latin-1/windows-1252 is the default non-Unicode encoding for HTML. But that says nothing about the bytes in any given IRI.

In other words, leaving aside the query part for a moment, shouldn't IRI really say that valid UTF-8 sequences are interpreted as characters and invalid UTF-8 sequences are treated as bytes? Forced interpretation of bytes in an unknown encoding leads to errors. Especially since, as Leif pointed out, you can't see the difference in the wire format.
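
A minimal Python sketch of what that rule might look like, assuming a hypothetical helper named to_display and a custom codec error handler registered as "keep-percent" (both names invented here). It is deliberately simplified: it also unescapes reserved ASCII such as %2F, which a real implementation would need to leave alone.

```python
import codecs
from urllib.parse import unquote_to_bytes

def _keep_as_percent(exc):
    """Error handler: re-emit undecodable bytes as %XX instead of guessing."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(f"%{b:02X}" for b in bad), exc.end

# "keep-percent" is a made-up handler name for this sketch.
codecs.register_error("keep-percent", _keep_as_percent)

def to_display(uri):
    """Valid UTF-8 sequences become characters; everything else stays as bytes."""
    return unquote_to_bytes(uri).decode("utf-8", errors="keep-percent")

valid = to_display("/D%C3%BCrst/")            # UTF-8 sequence becomes a character
invalid = to_display("/D%FCrst/")             # lone 0xFC stays percent-encoded
mixed = to_display("/D%FCrst/?%EF%BC%A1")     # each part handled on its own
```

No legacy encoding is ever consulted, so an octet of unknown origin is never turned into a character it may not have been.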

Looking at your test page, I'm not sure how valid a test it is. The page declares an encoding of ISO 8859-1. Having a "UTF-8 encoded path" in the page is a lie. Those bytes are all valid windows-1252 characters (per HTML5, nearly all browsers treat ISO 8859-1 as windows-1252). So the path isn't actually "UTF-8 encoded". To me the test looks broken.
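
A quick Python check of that point, assuming Python's cp1252 codec is an acceptable stand-in for what browsers call windows-1252:

```python
# The three bytes the test intends as UTF-8 for U+FF21 FULLWIDTH LATIN CAPITAL LETTER A.
utf8_bytes = "\uff21".encode("utf-8")

# Under the page's declared encoding the same bytes are three ordinary characters.
as_cp1252 = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(" "), "->", [f"U+{ord(c):04X}" for c in as_cp1252])
```

The decode succeeds without error and yields U+00EF, U+00BC, U+00A1, which is the "ï¼¡" that IE shows in the quoted results.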

Addison


> -----Original Message-----
> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf
> Of Chris Weber
> Sent: Thursday, July 21, 2011 5:02 PM
> To: PUBLIC-IRI@W3.ORG
> Subject: How browsers display IRI's with mixed encodings
> 
> I'm going on a tangent from Martin's intent in the previous email, but it seems
> in the same vein overall.  I was including some mixed encoding tests -
> iso-8859-1 mixed with UTF-8 in a hyperlink on a transitional HTML page served
> with the "iso-8859-1" Content-Type.  The results are similar to Martin's test
> in the way bytes representing UTF-8 will be treated as such (most often) even
> in an iso-8859-1 page encoding.
> 
>  From the test page at <http://lookout.net/test/iri/mixenc.php> Test 3 mixes
> the raw bytes which would represent U+FF21 FULLWIDTH LATIN CAPITAL
> LETTER A in UTF-8, along with iso-8859-1 raw bytes for the "ü" in "Dürst".  The
> following hyperlink represents the test case where <0xNN> is a raw byte.
> 
> http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
> 
> The results of the display are as follows.
> 
> Opera (11.50, Win7):
>    http://www.example.com/DÃ¼rst/?%EF%BC%A1
> 
> Firefox (5.0, Win7):
>    http://www.example.com/Dürst/?Ａ
> 
> IE (8.0.7601.17514, Win7):
>    http://www.example.com/Dürst/?ï¼¡
> 
> Chrome (12.0.742.122, Win7):
>    http://www.example.com/Dürst/?Ａ
> 
> Safari (5.0.4 (7533.20.27)):
>    http://www.example.com/Dürst/?Ａ
> 
> With the exception of IE, all of the above generated the following HTTP
> request:
> 
>    GET /D%C3%BCrst/?%EF%BC%A1
> 
> IE of course does not escape the bytes in the query string.
> 
>    GET /D%C3%BCrst/?Ａ
> 
> I tried to capture some of these test results into a table form at:
> <https://spreadsheets0.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5>
> 
> A question for browser implementers - In some cases it's obvious (Opera and
> MSIE) and others not so much: Do you know if the status bar display is using the
> page encoding or has converted the URI to UTF-8 for
> display?
> 
> Best regards,
> Chris
> 
> 
>

Received on Friday, 22 July 2011 00:31:03 UTC