- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Mon, 24 Sep 2001 13:39:15 +0100
- To: <www-i18n-comments@w3.org>
I had a request for examples. Here are some.

Examples:

Suppose we have two http servers, one using UTF-8, the other using iso-8859-1, called

  utf8.example.org
  iso-8859-1.example.org

Each has files called

  simple
  braces{}
  percent%
  Duerst   (where ue represents u umlaut)
  Duerst/percent%
  Duerst/braces{}

Then the following table gives the URI, the IRI, and the partially escaped IRI (PE IRI) of the files. The PE IRI is suitable for contexts where the unwise characters may not be used.

utf8.example.org

  URI     http://utf8.example.org/simple
  IRI     http://utf8.example.org/simple
  PE IRI  http://utf8.example.org/simple

  URI     http://utf8.example.org/braces%7B%7D
  IRI     http://utf8.example.org/braces{}
  PE IRI  http://utf8.example.org/braces%7B%7D

  URI     http://utf8.example.org/percent%25
  IRI     http://utf8.example.org/percent%25
  PE IRI  http://utf8.example.org/percent%25

  URI     http://utf8.example.org/D%C3%BCrst
  IRI     http://utf8.example.org/D#C3#BCrst
  PE IRI  http://utf8.example.org/D#C3#BCrst

where #C3#BC is the two bytes with those values.

Note: both the IRI and the PE IRI may be encoded in other character encodings, in which case the #C3#BC should be replaced by the representation of u umlaut in such an encoding. UCS-based encodings must use an NFC representation of u umlaut (i.e. the single u umlaut Unicode character) and not an abnormal form such as a u and an umlaut. Non UCS-based encodings do not have this restriction.
  URI     http://utf8.example.org/D%C3%BCrst/percent%25
  IRI     http://utf8.example.org/D#C3#BCrst/percent%25
  PE IRI  http://utf8.example.org/D#C3#BCrst/percent%25

  URI     http://utf8.example.org/D%C3%BCrst/braces%7B%7D
  IRI     http://utf8.example.org/D#C3#BCrst/braces{}
  PE IRI  http://utf8.example.org/D#C3#BCrst/braces%7B%7D

iso-8859-1.example.org

  URI     http://iso-8859-1.example.org/simple
  IRI     http://iso-8859-1.example.org/simple
  PE IRI  http://iso-8859-1.example.org/simple

  URI     http://iso-8859-1.example.org/braces%7B%7D
  IRI     http://iso-8859-1.example.org/braces{}
  PE IRI  http://iso-8859-1.example.org/braces%7B%7D

  URI     http://iso-8859-1.example.org/percent%25
  IRI     http://iso-8859-1.example.org/percent%25
  PE IRI  http://iso-8859-1.example.org/percent%25

  URI     http://iso-8859-1.example.org/D%FCrst
  IRI     http://iso-8859-1.example.org/D%FCrst
  PE IRI  http://iso-8859-1.example.org/D%FCrst

Note: #FC is the representation of u umlaut in iso-8859-1. Since the IRI is UTF-8 based, this must be escaped in all representations of the URI. We also note that the decode algorithm for creating a human readable form of the IRI fails. The server decode algorithm for the URI (which uses the server-specified encoding of iso-8859-1) succeeds.

  URI     http://iso-8859-1.example.org/D%FCrst/percent%25
  IRI     http://iso-8859-1.example.org/D%FCrst/percent%25
  PE IRI  http://iso-8859-1.example.org/D%FCrst/percent%25

  URI     http://iso-8859-1.example.org/D%FCrst/braces%7B%7D
  IRI     http://iso-8859-1.example.org/D%FCrst/braces{}
  PE IRI  http://iso-8859-1.example.org/D%FCrst/braces%7B%7D

Note: the characters { and } do not need escaping only because their encodings under ISO-8859-1 and UTF-8 are identical.

Any instance of an IRI may be replaced by a (partially) encoded IRI, because the encoding algorithm is idempotent. With all of these, the IRI encoding algorithm produces the URI (whether applied to the URI, the IRI or the PE IRI).
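The idempotence claim can be illustrated with a minimal Python sketch. It is not the encoding algorithm from any draft, only one reading of the examples in this message. The function names `iri_to_uri` and `partially_escape` and the exact contents of the `UNWISE` set are assumptions made for illustration. A Python string holds characters rather than bytes, so the u umlaut written #C3#BC in the table appears here as the single character U+00FC.

```python
# Assumed set of ASCII characters that get escaped in this sketch
# (the "unwise" characters plus a few delimiters and space).
UNWISE = set('{}|\\^[]`<>" ')


def _escape(ch: str) -> str:
    """Percent-escape every UTF-8 byte of a single character."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def partially_escape(iri: str) -> str:
    """IRI -> PE IRI: escape only the unwise ASCII characters."""
    return "".join(_escape(c) if c in UNWISE else c for c in iri)


def iri_to_uri(iri: str) -> str:
    """IRI, PE IRI or URI -> URI.

    Existing %XX escapes are left untouched ('%' is never escaped here),
    which is what makes the function idempotent.
    """
    return "".join(
        _escape(c) if c in UNWISE or ord(c) > 0x7F else c for c in iri
    )


iri = "http://utf8.example.org/D\u00fcrst/braces{}"
pe_iri = partially_escape(iri)
uri = iri_to_uri(iri)
print(pe_iri)
print(uri)
print(iri_to_uri(pe_iri) == uri, iri_to_uri(uri) == uri)
```

Because a literal percent sign is never re-escaped, an input such as percent%25 passes through unchanged, which matches the table rows above. The same property means a raw, unescaped % in the input would not be caught by this sketch.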
With all of the UTF-8 based examples, the IRI-decoding algorithm produces the IRI, whether applied to the URI, the IRI or the PE IRI. Hence, no layer of software needs to state which of URI, IRI or PE IRI its input is in order to be able to produce a URI output.
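The decoding direction can be sketched the same way. This is a hedged illustration in Python, not the algorithm from any draft. The name `uri_to_iri`, the `UNWISE` set, and the rule for which decoded characters stay unescaped are all assumptions chosen so that the sketch reproduces the tables in this message. Runs of %XX escapes are decoded as UTF-8. If a run is not valid UTF-8, as with %FC from the iso-8859-1 server, it is left escaped, which corresponds to the decode failure noted earlier.

```python
import re

# Assumed set of ASCII characters that may appear unescaped in the IRI.
UNWISE = set('{}|\\^[]`<>" ')

# One or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match: "re.Match[str]") -> str:
    run = match.group(0)
    raw = bytes.fromhex(run.replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (e.g. %FC from ISO-8859-1): keep the escapes.
        return run
    out = []
    for ch in text:
        if ord(ch) > 0x7F or ch in UNWISE:
            out.append(ch)  # shown literally in the IRI
        else:
            # Other ASCII (such as '%' or '/') stays escaped.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


def uri_to_iri(text: str) -> str:
    """URI, PE IRI or IRI -> IRI, under the assumptions stated above."""
    return _ESCAPE_RUN.sub(_decode_run, text)


print(uri_to_iri("http://utf8.example.org/D%C3%BCrst/braces%7B%7D"))
print(uri_to_iri("http://iso-8859-1.example.org/D%FCrst/braces%7B%7D"))
```

Applied to the URI, the PE IRI or the IRI of a UTF-8 example, the sketch gives the same IRI each time. For the iso-8859-1 examples the %FC escape survives, so the result is the IRI listed in the table rather than a human readable form.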
Received on Monday, 24 September 2001 08:50:54 UTC