Re: How browsers display IRI's with mixed encodings from Martin J. Dürst on 2011-07-25 (public-iri@w3.org from July 2011)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Mon, 25 Jul 2011 19:58:14 +0900
To: "Phillips, Addison" <addison@lab126.com>
CC: Chris Weber <chris@lookout.net>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4E2D4C46.30708@it.aoyama.ac.jp>

On 2011/07/22 9:30, Phillips, Addison wrote:

> In IRI terms, there are characters and there are "random octets". When mapping to URI, percent encoding is applied to both. However, the UTF-8 sequences can be decoded back to characters. The random octets not so much.

> In other words, leaving aside the query part for a moment, shouldn't IRI really say that valid UTF-8 sequences are interpreted as characters and invalid UTF-8 sequences are treated as bytes?

Yes, it should. And RFC 3987 already does.
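
To make the distinction concrete, here is a minimal Python sketch of the URI-to-IRI direction. It is my own illustration rather than text from the spec: the name uri_to_iri and the helper are hypothetical, and it is simplified. It only unescapes non-ASCII octets that form strictly valid UTF-8 and leaves everything else percent-encoded. It ignores the additional rule that some characters (e.g. bidi controls) should stay escaped.

```python
import re

# A run of one or more consecutive percent-escapes, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    """Turn valid UTF-8 sequences into characters; keep other octets escaped."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    out = []
    i = 0
    while i < len(raw):
        decoded = None
        if raw[i] >= 0x80:  # escaped ASCII octets are deliberately left alone
            for n in (2, 3, 4):  # possible lengths of a non-ASCII UTF-8 sequence
                try:
                    decoded = raw[i:i + n].decode("utf-8")  # strict decoding
                except UnicodeDecodeError:
                    continue
                break
        if decoded is not None:
            out.append(decoded)
            i += len(decoded.encode("utf-8"))
        else:
            # A "random octet": not part of a valid UTF-8 sequence, so it
            # stays percent-encoded (hex digits normalized to upper case).
            out.append("%{:02X}".format(raw[i]))
            i += 1
    return "".join(out)


def uri_to_iri(uri):
    """Hypothetical helper: unescape only what is valid UTF-8."""
    return _PCT_RUN.sub(_decode_run, uri)


print(uri_to_iri("/caf%C3%A9"))  # /café
print(uri_to_iri("/caf%E9"))     # /caf%E9
```

In a mixed path such as "/%C3%A9%E9", only the first two octets become a character; the lone %E9 remains an escaped byte.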


> Looking at your test page, I'm not sure how valid a test it is. The page declares an encoding of ISO 8859-1. Having a "UTF-8 encoded path" in the page is a lie. Those bytes are all valid windows-1252 characters (per HTML5, nearly all browsers treat ISO8859-1 as windows-1252). So the path isn't actually "UTF-8 encoded". To me the test looks broken.

A test isn't broken if it tests weird coincidences. It may be that the description can be improved, though.
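
For readers who want to see the coincidence in question, a small Python sketch follows. The path "/é" is a made-up example, not taken from the actual test page, and the function name is hypothetical. It only shows the mechanics of reading UTF-8 bytes as windows-1252 and then converting the resulting characters to a URI. It makes no claim about what any particular browser does.

```python
from urllib.parse import quote


def misread_as_windows_1252(text):
    """Encode text as UTF-8, then decode the bytes as windows-1252.

    This models a page whose bytes are UTF-8 but whose declared encoding
    is ISO 8859-1 (treated as windows-1252, as noted in the quoted mail).
    """
    return text.encode("utf-8").decode("windows-1252")


intended = "/é"                               # hypothetical path
as_read = misread_as_windows_1252(intended)   # "/Ã©"

# Characters are converted to UTF-8 and percent-encoded when mapping to a URI.
print(quote(intended))  # /%C3%A9
print(quote(as_read))   # /%C3%83%C2%A9
```

The two bytes C3 A9 happen to be two valid windows-1252 characters, so under the declared encoding the path contains "Ã©" rather than "é", and the resulting URI differs accordingly.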

Regards,    Martin.

Received on Monday, 25 July 2011 10:59:38 UTC