

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Mon, 24 Sep 2001 13:39:15 +0100
To: <www-i18n-comments@w3.org>
Message-ID: <JAEBJCLMIFLKLOJGMELDCECPCCAA.jjc@hplb.hpl.hp.com>

I had a request for examples. Here are some.


Suppose we have two HTTP servers, one using UTF-8 and the other using
iso-8859-1, called

utf8.example.org and iso-8859-1.example.org

each has files called

  Duerst                (where ue represents u umlaut)

Then the following table gives the URI, the IRI, and the partially
escaped IRI (PE IRI) of the files.

The PE IRI is suitable for contexts where the unwise characters may 
not be used.


URI         http://utf8.example.org/simple 
IRI         http://utf8.example.org/simple 
PE IRI      http://utf8.example.org/simple

URI         http://utf8.example.org/braces%7B%7D
IRI         http://utf8.example.org/braces{}
PE IRI      http://utf8.example.org/braces%7B%7D

URI         http://utf8.example.org/percent%25
IRI         http://utf8.example.org/percent%25
PE IRI      http://utf8.example.org/percent%25

URI         http://utf8.example.org/D%C3%BCrst
IRI         http://utf8.example.org/D#C3#BCrst
PE IRI      http://utf8.example.org/D#C3#BCrst
  where #C3#BC stands for the two bytes with those values.
  Note: both the IRI and the PE IRI may be encoded in other
  character encodings, in which case the #C3#BC should be replaced
  by the representation of u umlaut in that encoding.
  UCS-based encodings must use the NFC representation of u umlaut
  (i.e. the single u umlaut Unicode character, U+00FC) and not a
  non-normalized form such as a u followed by a combining umlaut.
  Non-UCS-based encodings do not have this restriction.
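
  The NFC requirement can be illustrated with a small sketch. It
  assumes Python's standard unicodedata module; the variable names
  are only illustrative.

```python
import unicodedata

# 'u' followed by COMBINING DIAERESIS (U+0308): not in NFC form.
decomposed = "u\u0308"

# NFC composes the pair into the single character U+00FC.
composed = unicodedata.normalize("NFC", decomposed)

# Byte representations of the composed character in the two encodings
# used by the example servers.
utf8_bytes = composed.encode("utf-8")         # the #C3#BC of the table
latin1_bytes = composed.encode("iso-8859-1")  # the single byte #FC
```

  Without the normalization step, the UTF-8 bytes would be those of
  the u and the combining mark separately, and the resulting URI
  would differ from the one in the table.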

URI         http://utf8.example.org/D%C3%BCrst/percent%25
IRI         http://utf8.example.org/D#C3#BCrst/percent%25
PE IRI      http://utf8.example.org/D#C3#BCrst/percent%25

URI         http://utf8.example.org/D%C3%BCrst/braces%7B%7D
IRI         http://utf8.example.org/D#C3#BCrst/braces{}
PE IRI      http://utf8.example.org/D#C3#BCrst/braces%7B%7D
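
  The URI and PE IRI columns above can be reproduced with a minimal
  sketch of the encoding step. This is an illustration, not the
  algorithm text of the draft: the names to_uri and to_pe_iri are
  hypothetical, the set of unwise characters is taken from RFC 2396,
  and the input is assumed to be an NFC-normalized Unicode string.

```python
# Unwise characters as listed in RFC 2396 (assumed set for this sketch).
UNWISE = set('{}|\\^[]`')


def _escape(ch):
    """Percent-escape every UTF-8 byte of a single character."""
    return ''.join('%%%02X' % b for b in ch.encode('utf-8'))


def to_pe_iri(iri):
    """Escape only the unwise characters; non-ASCII stays as is."""
    return ''.join(_escape(c) if c in UNWISE else c for c in iri)


def to_uri(iri):
    """Escape unwise and non-ASCII characters; existing %XX is untouched."""
    return ''.join(
        _escape(c) if c in UNWISE or ord(c) > 0x7F else c for c in iri
    )


iri = "http://utf8.example.org/D\u00fcrst/braces{}"
uri = to_uri(iri)
pe_iri = to_pe_iri(iri)
```

  Because the percent sign itself is never touched, applying to_uri
  to its own output, or to the PE IRI, gives the same URI again.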


URI         http://iso-8859-1.example.org/simple 
IRI         http://iso-8859-1.example.org/simple 
PE IRI      http://iso-8859-1.example.org/simple

URI         http://iso-8859-1.example.org/braces%7B%7D
IRI         http://iso-8859-1.example.org/braces{}
PE IRI      http://iso-8859-1.example.org/braces%7B%7D

URI         http://iso-8859-1.example.org/percent%25
IRI         http://iso-8859-1.example.org/percent%25
PE IRI      http://iso-8859-1.example.org/percent%25

URI         http://iso-8859-1.example.org/D%FCrst
IRI         http://iso-8859-1.example.org/D%FCrst
PE IRI      http://iso-8859-1.example.org/D%FCrst
  Note: #FC is the byte that represents u umlaut in iso-8859-1.
  Since the IRI is UTF-8 based, this byte must stay escaped (as %FC)
  in all three forms. We also note that the decode algorithm for
  creating a human-readable form of the IRI fails here, whereas the
  server decode algorithm for the URI (which uses the
  server-specified encoding of iso-8859-1) succeeds.
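
  The difference between the two decodings can be seen in a short
  sketch, assuming Python's urllib.parse.unquote_to_bytes for the
  unescaping step; the variable names are illustrative.

```python
from urllib.parse import unquote_to_bytes

# Undo the percent-escaping of the path segment: yields the raw bytes.
raw = unquote_to_bytes("D%FCrst")

# The server knows its own encoding, so its decode succeeds.
server_view = raw.decode("iso-8859-1")

# A UTF-8 based decode cannot interpret the lone byte FC.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```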

URI         http://iso-8859-1.example.org/D%FCrst/percent%25
IRI         http://iso-8859-1.example.org/D%FCrst/percent%25
PE IRI      http://iso-8859-1.example.org/D%FCrst/percent%25

URI         http://iso-8859-1.example.org/D%FCrst/braces%7B%7D
IRI         http://iso-8859-1.example.org/D%FCrst/braces{}
PE IRI      http://iso-8859-1.example.org/D%FCrst/braces%7B%7D

  Note: the characters { and } may appear unescaped in these IRIs only
  because their encodings under ISO-8859-1 and UTF-8 are identical.
  Any instance of an IRI may be replaced by a (partially) encoded
  IRI, because the encoding algorithm is idempotent.

With all of these, the IRI encoding algorithm produces the URI (whether
applied to the URI, the IRI or the PE IRI). With all of the UTF-8 based
examples, the IRI decoding algorithm produces the IRI, whether applied
to the URI, the IRI or the PE IRI.
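
The decoding direction can be sketched in the same spirit. This is a
guess at a decode step that matches the table, not the algorithm of the
draft: the name to_iri is hypothetical, and the rule assumed here is
that an escape is undone only if it belongs to a valid UTF-8 sequence
for a non-ASCII character or stands for an unwise character (RFC 2396
set). Everything else, including %25 and the lone %FC, stays escaped.

```python
import re

UNWISE = set('{}|\\^[]`')
ESCAPED_RUN = re.compile(r'(?:%[0-9A-Fa-f]{2})+')


def _decode_run(match):
    """Decode one run of consecutive %XX escapes, byte by byte."""
    raw = bytes.fromhex(match.group(0).replace('%', ''))
    out, i = [], 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            # ASCII: unescape only the unwise characters.
            out.append(chr(b) if chr(b) in UNWISE else '%%%02X' % b)
            i += 1
            continue
        # Non-ASCII: try a UTF-8 sequence of 2, 3 or 4 bytes.
        for n in (2, 3, 4):
            try:
                ch = raw[i:i + n].decode('utf-8')
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += n
            break
        else:
            # Not valid UTF-8 (e.g. an iso-8859-1 byte): keep the escape.
            out.append('%%%02X' % b)
            i += 1
    return ''.join(out)


def to_iri(text):
    """Turn a URI, PE IRI or IRI into the IRI form used in the table."""
    return ESCAPED_RUN.sub(_decode_run, text)


utf8_iri = to_iri("http://utf8.example.org/D%C3%BCrst/braces%7B%7D")
latin1_iri = to_iri("http://iso-8859-1.example.org/D%FCrst/braces%7B%7D")
```

Characters that are already unescaped pass through unchanged, so
applying to_iri to an IRI or a PE IRI gives the same IRI as applying it
to the URI.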

Hence, no layer of software needs to specify whether its input is a
URI, an IRI or a PE IRI in order to be able to produce a URI as output.

Received on Monday, 24 September 2001 08:50:54 GMT