- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Mon, 24 Sep 2001 13:39:15 +0100
- To: <www-i18n-comments@w3.org>
I had a request for examples. Here are some.
Examples:
Suppose we have two HTTP servers, one using UTF-8 and the other using
ISO-8859-1, called
utf8.example.org and iso-8859-1.example.org.
Each has files called
simple
braces{}
percent%
Duerst (where ue represents u umlaut)
Duerst/percent%
Duerst/braces{}
Then the following table gives the URI, the IRI, and the partially
escaped IRI (PE IRI) of each file.
The PE IRI is suitable for contexts where the unwise characters may
not be used.
utf8.example.org
URI http://utf8.example.org/simple
IRI http://utf8.example.org/simple
PE IRI http://utf8.example.org/simple
URI http://utf8.example.org/braces%7B%7D
IRI http://utf8.example.org/braces{}
PE IRI http://utf8.example.org/braces%7B%7D
URI http://utf8.example.org/percent%25
IRI http://utf8.example.org/percent%25
PE IRI http://utf8.example.org/percent%25
URI http://utf8.example.org/D%C3%BCrst
IRI http://utf8.example.org/D#C3#BCrst
PE IRI http://utf8.example.org/D#C3#BCrst
where #C3#BC stands for the two bytes with those hexadecimal values.
Note: both the IRI and the PE IRI may be encoded in other
character encodings, in which case the #C3#BC should be replaced
by the representation of u umlaut in such an encoding.
UCS-based encodings must use an NFC representation of u umlaut
(i.e. the single u umlaut Unicode character) and not a
non-normalized form such as a u followed by a combining umlaut.
Non-UCS-based encodings do not have this restriction.
URI http://utf8.example.org/D%C3%BCrst/percent%25
IRI http://utf8.example.org/D#C3#BCrst/percent%25
PE IRI http://utf8.example.org/D#C3#BCrst/percent%25
URI http://utf8.example.org/D%C3%BCrst/braces%7B%7D
IRI http://utf8.example.org/D#C3#BCrst/braces{}
PE IRI http://utf8.example.org/D#C3#BCrst/braces%7B%7D
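The UTF-8 rows above can be reproduced with a short sketch. This is an illustration only, not the algorithm from the draft: the helper names to_uri and to_pe_iri are hypothetical, "unwise" is assumed to mean the set in RFC 2396 section 2.4.3, and only the kinds of characters that appear in these examples are handled. Existing %XX escapes are never touched, which is what makes repeated application harmless.

```python
import unicodedata

# Assumption: "unwise" means the RFC 2396 section 2.4.3 set.
UNWISE = set('{}|\\^[]`')

def _escape(ch: str) -> str:
    """Percent-escape one character via its UTF-8 bytes."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))

def to_pe_iri(iri: str) -> str:
    """Escape only the unwise characters (hypothetical helper)."""
    return "".join(_escape(c) if c in UNWISE else c for c in iri)

def to_uri(iri: str) -> str:
    """Escape unwise and non-ASCII characters (hypothetical helper)."""
    text = unicodedata.normalize("NFC", iri)  # u + combining mark becomes u umlaut
    return "".join(
        _escape(c) if c in UNWISE or ord(c) > 0x7F else c for c in text
    )

iri = "http://utf8.example.org/D\u00fcrst/braces{}"
pe_iri = to_pe_iri(iri)
uri = to_uri(iri)
print(uri)
# Same URI whether we start from the IRI, the PE IRI or the URI itself.
print(to_uri(pe_iri) == uri, to_uri(uri) == uri)
```

Because "%" is passed through unchanged, percent%25 stays as it is in all three forms, matching the table.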
iso-8859-1.example.org
URI http://iso-8859-1.example.org/simple
IRI http://iso-8859-1.example.org/simple
PE IRI http://iso-8859-1.example.org/simple
URI http://iso-8859-1.example.org/braces%7B%7D
IRI http://iso-8859-1.example.org/braces{}
PE IRI http://iso-8859-1.example.org/braces%7B%7D
URI http://iso-8859-1.example.org/percent%25
IRI http://iso-8859-1.example.org/percent%25
PE IRI http://iso-8859-1.example.org/percent%25
URI http://iso-8859-1.example.org/D%FCrst
IRI http://iso-8859-1.example.org/D%FCrst
PE IRI http://iso-8859-1.example.org/D%FCrst
Note: #FC is the representation of u umlaut in ISO-8859-1.
Since the IRI is UTF-8 based, this byte must be escaped in every
form of the identifier (URI, IRI and PE IRI). We also note that the
decode algorithm for creating a human-readable form of the IRI
fails here. The server decode algorithm for the URI (which uses
the server-specified encoding of ISO-8859-1) succeeds.
URI http://iso-8859-1.example.org/D%FCrst/percent%25
IRI http://iso-8859-1.example.org/D%FCrst/percent%25
PE IRI http://iso-8859-1.example.org/D%FCrst/percent%25
URI http://iso-8859-1.example.org/D%FCrst/braces%7B%7D
IRI http://iso-8859-1.example.org/D%FCrst/braces{}
PE IRI http://iso-8859-1.example.org/D%FCrst/braces%7B%7D
Note: the characters { and } need no escaping in the IRI only because
their encodings under ISO-8859-1 and UTF-8 are identical.
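The difference between the two decode attempts on D%FCrst can be seen in a small sketch. The helper decode_path is hypothetical; it simply undoes the percent-escapes and interprets the resulting bytes in a given encoding, which is a simplification of what a real client or server does.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes

def decode_path(path: str, encoding: str) -> Optional[str]:
    """Undo %XX escapes and decode the bytes; None if they are invalid."""
    raw = unquote_to_bytes(path)
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# The byte FC on its own is not valid UTF-8, so the UTF-8 based decode fails.
print(decode_path("D%FCrst", "utf-8") is None)
# The server knows its own encoding, so its decode succeeds.
print(decode_path("D%FCrst", "iso-8859-1") == "D\u00fcrst")
# The UTF-8 server's escaped form decodes to the same name.
print(decode_path("D%C3%BCrst", "utf-8") == "D\u00fcrst")
```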
Any instance of an IRI may be replaced by a (partially) encoded
IRI, because the encoding algorithm is idempotent.
For all of these examples, the IRI-encoding algorithm produces the URI
(whether applied to the URI, the IRI or the PE IRI). For all of the
UTF-8 based examples, the IRI-decoding algorithm produces the IRI,
whether applied to the URI, the IRI or the PE IRI.
Hence, no layer of software needs to be told whether its input is a
URI, an IRI or a PE IRI in order to be able to produce a URI as output.
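The decoding direction can be sketched in the same spirit. This is a guess at the behaviour the table implies, not the algorithm from the draft: the helper to_iri is hypothetical, and it assumes that escapes forming valid UTF-8 are unescaped when they yield a non-ASCII or unwise character (again the RFC 2396 section 2.4.3 set), while everything else, including %25 and the ISO-8859-1 byte %FC, stays escaped.

```python
import re

# Assumption: "unwise" means the RFC 2396 section 2.4.3 set.
UNWISE = set('{}|\\^[]`')
_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def _decode_run(match: "re.Match[str]") -> str:
    """Unescape one run of %XX escapes where the rules above allow it."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (e.g. %FC): leave the whole run escaped.
        return match.group(0)
    return "".join(
        ch if ord(ch) > 0x7F or ch in UNWISE
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in text
    )

def to_iri(text: str) -> str:
    """Produce the IRI from a URI, a PE IRI or an IRI (hypothetical helper)."""
    return _RUN.sub(_decode_run, text)

iri = "http://utf8.example.org/D\u00fcrst/braces{}"
print(to_iri("http://utf8.example.org/D%C3%BCrst/braces%7B%7D") == iri)  # from the URI
print(to_iri("http://utf8.example.org/D\u00fcrst/braces%7B%7D") == iri)  # from the PE IRI
print(to_iri(iri) == iri)                                                # from the IRI
```

A run that mixes valid and invalid UTF-8 bytes is left escaped as a whole here, which is cruder than a real implementation would want.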
Received on Monday, 24 September 2001 08:50:54 UTC