Considered harmful: An Introduction to Multilingual Web Addresses from Frank Ellermann on 2007-04-04 (www-international@w3.org from April to June 2007)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Wed, 04 Apr 2007 20:57:37 +0200
To: www-international@w3.org
Message-ID: <4613F521.237C@xyzzy.claranet.de>

Richard Ishida wrote:

> See the latest version at
> http://www.w3.org/International/articles/idn-and-iri/#phishing

Hi, I think the terminology in this article is very unclear.  By
definition a URI follows the syntax specified in STD 66 (RFC 3986),
which allows only a proper subset of the ASCII characters.

If a UA decides to display the URI as an IRI, it's already making
some assumptions, e.g. treating percent-encoded octets as UTF-8
where that makes sense, or showing the "ToUnicode" form of what
appear to be IDNA labels in a domain.  The only place where
the latter is supposed to work is the host part of (most) URI
schemes, ignoring "alternate roots" making up their own rules,
or other forms of registered names not belonging to the normal
DNS.
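
To make those two display assumptions concrete, here is a minimal
Python sketch.  The names display_as_iri, _show_utf8 and _show_host
are made up for this illustration, and the "idna" codec in Python's
standard library implements IDNA 2003, so treat this as one possible
reading of "where that makes sense", not as what any particular UA
actually does.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# One or more consecutive percent-encoded octets, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _show_utf8(match):
    """Decode a run of percent-encoded octets only if it is valid,
    purely non-ASCII UTF-8; otherwise leave the run untouched."""
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        text = octets.decode("utf-8")
    except UnicodeDecodeError:
        return run  # not UTF-8, so the display assumption fails
    if any(ord(ch) < 0x80 for ch in text):
        return run  # escaped ASCII (e.g. %2F) could be a delimiter
    return text


def _show_host(host):
    """Apply ToUnicode to what appear to be IDNA labels (assumption:
    the host is a normal DNS name, not some other registered name)."""
    try:
        return host.encode("ascii").decode("idna")
    except UnicodeError:
        return host


def display_as_iri(uri):
    """Return a display form of an ASCII-only URI (hypothetical helper)."""
    parts = urlsplit(uri)
    # Only the host is eligible for ToUnicode; userinfo and port stay as is.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    netloc = userinfo + at + _show_host(host) + colon + port
    return urlunsplit((
        parts.scheme,
        netloc,
        _PCT_RUN.sub(_show_utf8, parts.path),
        _PCT_RUN.sub(_show_utf8, parts.query),
        _PCT_RUN.sub(_show_utf8, parts.fragment),
    ))
```

The URI itself stays pure ASCII; only the displayed string changes,
and octets that are not valid UTF-8 (say a lone %E9 from Latin-1) are
shown as they are.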

There's no such thing as a valid URI using any raw "non-ASCII"
octets, whether Latin-1, UTF-8, UTF-16, or EBCDIC.
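
As a rough illustration, the character repertoire of STD 66 can be
checked in a few lines of Python.  The helper name
uri_repertoire_problems is hypothetical, and passing this check is
necessary but not sufficient: it does not verify the full STD 66
grammar (scheme, authority, bracket placement and so on).

```python
import re
import string

# Characters that may appear literally in a URI per STD 66 (RFC 3986):
# unreserved, gen-delims and sub-delims.  "%" is handled separately.
ALLOWED = set(
    string.ascii_letters + string.digits
    + "-._~"          # rest of unreserved
    + ":/?#[]@"       # gen-delims
    + "!$&'()*+,;="   # sub-delims
)
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def uri_repertoire_problems(candidate):
    """Return (index, character) pairs that cannot occur in a URI."""
    problems = []
    i = 0
    while i < len(candidate):
        ch = candidate[i]
        if ch == "%":
            if _PCT.match(candidate, i):
                i += 3  # skip a well-formed percent-encoded octet
                continue
            problems.append((i, ch))  # "%" without two hex digits
        elif ch not in ALLOWED:
            problems.append((i, ch))  # e.g. any raw non-ASCII character
        i += 1
    return problems
```

For example, "http://example.org/caf%C3%A9" yields no problems, while
the same address with a raw "é" is reported at the offending position.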

If there's no validator capable of checking the URI syntax as
specified in STD 66, it's harmful to publish invalid pages like
<http://www.w3.org/International/tests/sec-iri-3>

Frank

Received on Wednesday, 4 April 2007 18:58:58 UTC