Re: IRI regex quiz! from Martin Duerst on 2006-01-24 (public-iri@w3.org from January 2006)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Tue, 24 Jan 2006 10:56:56 +0900
To: Jeremy Carroll <jjc@hpl.hp.com>, Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, public-iri@w3.org
Message-Id: <6.0.0.20.2.20060124103847.075a64f0@localhost>

At 06:37 06/01/24, Jeremy Carroll wrote:
 >
 >By my reading, RFC 3987, 3986 and 2616 (http) all allow %C0 in the path,
 >query, fragment and userinfo components, and none allows it in the hostname.

Correct.

 >e.g.
 >[[
 >     For example, it is possible to have a URI reference of
 >     "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
 >     document name is encoded in iso-8859-1 based on server settings, but
 >     where the fragment identifier is encoded in UTF-8 according to
 >     [XPointer]. The IRI corresponding to the above URI would be (in XML
 >     notation)
 >     "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".
 >]]

[sorry for my mailer messing Latin-1 up]
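
The quoted RFC 3987 example can be reproduced with a small sketch. It is a simplified illustration, not the full procedure of RFC 3987 Section 3.2: it decodes only percent-escaped octet runs that form valid multi-octet UTF-8, leaves every other escape (including ASCII escapes and stray octets such as %E9 or %C0) untouched, and skips the RFC's additional checks on which characters may appear in an IRI. The names `uri_to_iri` and `_decode_run` are made up for this example.

```python
import re

# Runs of consecutive percent-encoded octets, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    """Decode valid multi-octet UTF-8 inside one run; keep the rest escaped."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    out = []
    i = 0
    while i < len(raw):
        # Try 4-, 3-, then 2-octet sequences starting at position i.
        for length in (4, 3, 2):
            chunk = raw[i:i + length]
            if len(chunk) < length:
                continue
            try:
                char = chunk.decode("utf-8")  # strict: rejects overlong forms
            except UnicodeDecodeError:
                continue
            if len(char) == 1:  # the chunk is exactly one non-ASCII character
                out.append(char)
                i += length
                break
        else:
            # Not the start of a valid multi-octet sequence: keep it escaped.
            out.append("%%%02X" % raw[i])
            i += 1
    return "".join(out)


def uri_to_iri(uri):
    """Simplified URI-to-IRI conversion (assumption: no IRI character checks)."""
    return _PCT_RUN.sub(_decode_run, uri)


print(uri_to_iri("http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9"))
```

With the URI from the quote, the iso-8859-1 escapes in the path stay as %E9 while the UTF-8 escapes in the fragment become the literal character, which matches the IRI given in the RFC text.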

 >The hostname is restricted to [a-zA-Z0-9-.] or something like that in
 >2616. 3986 allows UTF-8 encoded hostnames.

Yes, after %-encoding. Glad to see that you found that; an earlier
mail of yours indicated that you didn't know about it.

 >3987 allows non-ASCII chars, to be understood as an IDNA, for when the
 >scheme is known to support the DNS syntax (e.g. http, with its dependency
 >on the hostport production).

Slightly wrong, or misleading. 3987 permits non-ASCII chars in any scheme
(modulo scheme-specific restrictions). It allows (MAY) conversion to
IDNA (i.e. punycode) as a backwards compatibility measure when the
scheme is known to use DNS.
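
As a rough illustration of that optional conversion, the sketch below maps a non-ASCII host name to its ASCII (punycode) form. It relies on Python's built-in `idna` codec, which implements the IDNA 2003 rules of RFC 3490; the function name `host_to_ascii` and the sample host are assumptions made for this example.

```python
def host_to_ascii(host):
    """Convert a host name to its ASCII-compatible form, label by label.

    Uses Python's built-in "idna" codec (IDNA 2003, RFC 3490). Labels that
    are already ASCII pass through unchanged.
    """
    return host.encode("idna").decode("ascii")


# Hypothetical host name with one non-ASCII label.
print(host_to_ascii("b\u00fccher.example"))
```

Only the host component would be treated this way; a non-ASCII path, query, or fragment is converted by UTF-8 %-encoding instead.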

 >Neither 3986 nor 3987 restricts %escape sequences to UTF-8 except for the
 >hostname component.
 >
 >3987 is a little odd in what it says about legacy encodings. There's a
 >lot that generic IRI software is prohibited from doing; but that seems
 >overly restrictive.
 >
 >Contrast:
 >[[
 >  Conversions from URIs to IRIs MUST NOT use any character encoding other
 >than UTF-8 in steps 3 and 4, even if it might be possible to guess from the
 >context that another character encoding than UTF-8 was used in the URI
 >]]
 >With:
 >[[
 >  Second, it may include URIs constructed based on character encodings
 >other than UTF-8. These URIs may be produced by user agents that do not
 >conform to this specification and that use legacy character encodings to
 >convert non-ASCII characters to URIs. Whether this is necessary, and what
 >character encodings to cover, depends on a number of factors, such as the
 >legacy character encodings used locally and the distribution of various
 >versions of user agents. For example, software for Japanese may accept URIs
 >in Shift_JIS and/or EUC-JP in addition to UTF-8.
 >]]

There is definitely a contrast here. It's the contrast between
normative language and purely advisory language for backwards
compatibility.


 >It seems to me that a useful generic IRI implementation

Of what? An IRI generator? An IRI resolver? Some support on the
server side? Please note that Section 7.7 starts with
"Software that interprets IRIs as the names of local resources",
which clearly restricts its applicability.

 >should support setting the encoding, possibly also allowing both UTF-8
 >and one legacy encoding, with the legacy encoding used for %sequences that
 >are not legal UTF-8

I think this is a bad idea in general. In some contexts, e.g. on the
server side in some cases, it may work reasonably well. But for
other contexts, i.e. on the client side, the variety of data
that may arrive would just be too large.
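
To make the proposal under discussion concrete, here is a minimal sketch of "UTF-8 first, one legacy encoding as fallback". The function name `decode_with_fallback` and the choice of iso-8859-1 as the default legacy encoding are assumptions for illustration only; nothing like this is specified in RFC 3987.

```python
from urllib.parse import unquote_to_bytes


def decode_with_fallback(escaped, legacy="iso-8859-1"):
    """Decode %-escapes as UTF-8, else as one preconfigured legacy encoding.

    The legacy encoding must be chosen in advance. The second decode can
    itself raise UnicodeDecodeError if the octets fit neither encoding.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


# Not legal UTF-8, so the legacy encoding is used.
print(decode_with_fallback("r%E9sum%E9"))
# Legal UTF-8, so it is always read as UTF-8, even if the sender
# actually produced these two octets with the legacy encoding.
print(decode_with_fallback("%C3%A9"))
```

A server that knows how its own resource names were encoded can fix `legacy` once. A client has no such knowledge of the URIs it encounters, and any octet sequence that happens to be legal UTF-8 is never given to the fallback at all.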


Regards,    Martin.

Received on Tuesday, 24 January 2006 04:49:56 UTC