- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Thu, 5 Jul 2012 13:10:52 -0700
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: public-iri@w3.org
On Jul 4, 2012, at 12:42 AM, Bjoern Hoehrmann wrote:

> This doesn't really help me understand where you see problems with IRIs.
> Could you take a simple example like http://björn.höhrmann.de/ and tell
> me of some places where I should be unable to use that even though I can
> use http://bjoern.hoehrmann.de/ in the same place, without arguing about
> limitations of deployed protocols, software, or hardware, and without
> arguing about issues that would arise anyway when displaying URIs, and
> why I should be unable to use the non-URI IRI there?

The harm in the above example is how many aliases are created by inconsistent encoding of the characters, how difficult we make it for servers to route based on Host (or equivalents), and how much risk we want to allow for less-interoperable forms. These are all trade-offs, not hard rules.

The main problem with IRIs as protocol elements is aliasing and invalid characters, not spoofing. Aliases create security holes if various routines within the server + OS normalize them in different ways, reduce cache efficiency, and interfere with page rank. Invalid UTF-8 sometimes results in the whole code sequence being ignored and other times results in only the valid part of the sequence being ignored (leaving the next byte to be misinterpreted by the next round of parsing).

These problems can exist with pct-encoded UTF-8 as well, but they are usually harmless if the origin server consistently redirects non-encoded non-ASCII to the pct-encoded form and then uses a consistent routine to do name mapping from URI form to native labels. In other words, they are less of a problem because only the origin server needs to deal with invalid or aliased pct-encodes, and intermediaries that secure or load-balance based on the target URI can just work on the pct-encoded patterns (leaving the UTF-8 form to be redirected by the origin or some server-side intermediary).

IRIs are not used in HTML or XML. All references in those languages are parsed as arbitrary strings with language-specific delimiting and then converted to either a URI or something vaguely like it. IRIs are not used in browser Location bars -- those are just arbitrary string parsers that occasionally spit out a URI reference as a result. IRIs are not used in waka because they would make gateways and fast pattern matching more difficult and error-prone, which I consider more of a concern than the potential saving in bytes.

In short, I believe that what potential users of the IRI protocol want is a set of consistent presentation rules for displaying arbitrary strings that might include pct-encodes and IDNA, and a simple routine for converting an arbitrary string reference to a URI reference.

I think the idea of treating IRIs as a separate identifier space has been harmful to its adoption by folks who already implement non-ASCII identifiers via presentation and conversion. It is also confusing to those who want to create new URI schemes but think that they also need to define IRI schemes.

....Roy
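
A minimal Python sketch of the aliasing concern described above, using only the standard library. The path "/björn" and the helper names pct_encode_utf8 and canonical_path are hypothetical, chosen for illustration; picking NFC as the single normalization form is likewise an assumption, not something the message prescribes.

```python
import unicodedata
from urllib.parse import quote


def pct_encode_utf8(text: str) -> str:
    """Percent-encode the UTF-8 bytes of text, leaving '/' unescaped."""
    return quote(text.encode("utf-8"), safe="/")


def canonical_path(text: str) -> str:
    """Hypothetical 'consistent routine': normalize first, then encode.

    NFC is an assumption made for this sketch; the point is only that
    one form is applied everywhere before pattern matching.
    """
    return pct_encode_utf8(unicodedata.normalize("NFC", text))


# The same visible string in two Unicode normalization forms.
nfc = unicodedata.normalize("NFC", "/björn")  # precomposed 'ö' (U+00F6)
nfd = unicodedata.normalize("NFD", "/björn")  # 'o' + combining U+0308

# Two byte sequences, hence two aliases for one name.
encoded_nfc = pct_encode_utf8(nfc)  # "/bj%C3%B6rn"
encoded_nfd = pct_encode_utf8(nfd)  # "/bjo%CC%88rn"

if __name__ == "__main__":
    print(encoded_nfc, encoded_nfd)
    print(canonical_path(nfc) == canonical_path(nfd))
```

An intermediary that matches on the pct-encoded pattern sees the two encodings as different targets unless some component first applies one routine such as canonical_path to both.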
Received on Thursday, 5 July 2012 20:11:16 UTC