Re: why use IRIs? from Roy T. Fielding on 2012-07-05 (public-iri@w3.org from July 2012)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Thu, 5 Jul 2012 13:10:52 -0700
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-iri@w3.org
Message-Id: <2F5C9EB9-1426-4136-9311-FFFB513C8961@gbiv.com>

On Jul 4, 2012, at 12:42 AM, Bjoern Hoehrmann wrote:

> This doesn't really help me understand where you see problems with IRIs.
> Could you take a simple example like http://björn.höhrmann.de/ and tell
> me of some places where I should be unable to use that even though I can
> use http://bjoern.hoehrmann.de/ in the same place, without arguing about
> limitations of deployed protocols, software, or hardware, and without
> arguing about issues that would arise anyway when displaying URIs, and
> why I should be unable to use the non-URI IRI there?

The harm in the above example is how many aliases are created by
inconsistent encoding of the characters, how difficult we make
it for servers to route based on Host (or equivalents), and how
much risk we want to allow for less-interoperable forms.  These
are all trade-offs, not hard rules.

The main problem with IRIs as protocol elements is aliasing and invalid
characters, not spoofing.  Aliases create security holes if various
routines within the server + OS normalize them in different ways; they
also reduce cache efficiency and interfere with page rank.  Invalid UTF-8
sometimes results in the whole code sequence being ignored and other times
results in only the valid part of the sequence being ignored (leaving the
next byte to be misinterpreted by the next round of parsing).
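
As a rough illustration of both points (not taken from any particular
server), the Python sketch below shows two spellings of the same visible
label turning into two different identifiers, and one invalid byte
sequence decoding to different strings depending on the error policy.
The names aliases, raw and decode_with are made up for this example.

```python
import unicodedata
from urllib.parse import quote

# Two spellings that render identically but differ at the code point level.
nfc = unicodedata.normalize("NFC", "bj\u00f6rn")  # precomposed U+00F6
nfd = unicodedata.normalize("NFD", "bj\u00f6rn")  # "o" + combining U+0308

# Pct-encoding the UTF-8 bytes yields two distinct identifiers (aliases).
aliases = {quote(nfc, safe=""), quote(nfd, safe="")}

# A truncated two-byte sequence (lead byte 0xC3 with no continuation byte),
# followed by the ASCII bytes "/x".
raw = b"\xc3/x"


def decode_with(policy: str) -> str:
    """Decode raw under the given error policy; "strict" raises on bad input."""
    return raw.decode("utf-8", errors=policy)


dropped = decode_with("ignore")   # the invalid byte is silently removed
marked = decode_with("replace")   # the invalid byte becomes U+FFFD

print(sorted(aliases))
print(repr(dropped), repr(marked))
```

If one layer applies the "ignore" policy and another applies "replace"
or rejects the input outright, the two layers no longer agree on which
resource was named, which is the kind of mismatch described above.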

These problems can exist with pct-encoded UTF-8 as well, but they are
usually harmless if the origin server consistently redirects non-encoded
non-ASCII to the pct-encoded form and then uses a consistent routine
to do name mapping from URI form to native labels.  In other words,
they are less of a problem because only the origin server needs to
deal with invalid or aliased pct-encodes, and intermediaries that
secure or load-balance based on the target URI can just work on the
pct-encoded patterns (leaving the UTF-8 form to be redirected by the
origin or some server-side intermediary).
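
A minimal sketch of that redirect step, assuming the request target has
already been decoded to a Python string: if it contains non-ASCII
characters, compute the pct-encoded form to redirect to; otherwise do
nothing.  The function name redirect_target and the SAFE set are
assumptions made for this example, not part of any server's API.

```python
from typing import Optional
from urllib.parse import quote

# Characters passed through unchanged: RFC 3986 reserved and unreserved
# punctuation, plus "%" so that existing pct-encodes are not re-encoded.
SAFE = "/:?#[]@!$&'()*+,;=%-._~"


def redirect_target(request_target: str) -> Optional[str]:
    """Return the pct-encoded form to redirect to, or None if the
    target is already pure ASCII and needs no redirect."""
    if request_target.isascii():
        return None
    # quote() encodes non-ASCII characters as pct-encoded UTF-8 bytes.
    return quote(request_target, safe=SAFE)


print(redirect_target("/wiki/Bj\u00f6rn"))   # non-ASCII: needs a redirect
print(redirect_target("/wiki/Bj%C3%B6rn"))   # already ASCII: None
```

With this in place, anything downstream that matches on the target only
ever has to handle ASCII patterns.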

IRIs are not used in HTML or XML.  All references in those languages
are parsed as arbitrary strings with language-specific delimiting
and then converted to either a URI or something vaguely like it.
IRIs are not used in browser Location bars -- those are just arbitrary
string parsers that occasionally spit out a URI reference as a result.
IRIs are not used in waka because they would make gateways and fast
pattern matching more difficult and error-prone, which I consider
more of a concern than the potential saving in bytes.

In short, I believe that what potential users of the IRI protocol want
is a set of consistent presentation rules for displaying arbitrary
strings that might include pct-encodes and IDNA, and a simple routine
for converting an arbitrary string reference to a URI reference.
I think the idea of treating IRIs as a separate identifier space has
been harmful to their adoption by folks who already implement non-ASCII
identifiers via presentation and conversion.  It is also confusing
to those who want to create new URI schemes but think that they also
need to define IRI schemes.
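
One possible shape for such a conversion routine, sketched in Python
under several assumptions: the host is mapped with Python's built-in
"idna" codec (which implements IDNA 2003, not IDNA 2008), the other
components are pct-encoded as UTF-8, and userinfo is dropped when the
host has to be rewritten.  The name to_uri_reference is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Reserved and unreserved punctuation, plus "%" to keep existing pct-encodes.
SAFE = "/:?#[]@!$&'()*+,;=%-._~"


def to_uri_reference(ref: str) -> str:
    """Convert an arbitrary string reference to an ASCII-only URI reference."""
    parts = urlsplit(ref.strip())
    netloc = parts.netloc
    host = parts.hostname
    if host and not host.isascii():
        # Map each non-ASCII label to its "xn--" form (IDNA 2003 rules).
        ace = host.encode("idna").decode("ascii")
        # Simplification: userinfo is not preserved on this path.
        netloc = ace if parts.port is None else f"{ace}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


print(to_uri_reference("http://bj\u00f6rn.h\u00f6hrmann.de/"))
print(to_uri_reference("/caf\u00e9?q=\u00fc"))
```

References that are already plain ASCII URIs pass through unchanged,
so the routine can be applied to any input without first deciding
whether it "is" an IRI.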

....Roy
Received on Thursday, 5 July 2012 20:11:16 UTC