- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Tue, 24 Jan 2006 10:56:56 +0900
- To: Jeremy Carroll <jjc@hpl.hp.com>, Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, public-iri@w3.org
At 06:37 06/01/24, Jeremy Carroll wrote:

>My reading of RFC 3987, 3986 and 2616 (http) all allow %C0 in the path, query, fragment and userinfo components, and none allow it in the hostname.

Correct.

>e.g.
>[[
>   For example, it is possible to have a URI reference of
>   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
>   document name is encoded in iso-8859-1 based on server settings, but
>   where the fragment identifier is encoded in UTF-8 according to
>   [XPointer]. The IRI corresponding to the above URI would be (in XML
>   notation)
>   "http://www.example.org/r%E9sum%E9.xml#r駸u";.
>]]

[sorry for my mailer messing Latin-1 up]

>The hostname is restricted to [a-zA-Z0-9-.] or something like that in 2616. 3986 allows UTF-8 encoded hostnames.

Yes, after %-encoding. Glad to see that you found that, an earlier mail of yours indicated that you didn't know about it.

>3987 allows non-ASCII chars, to be understood as an IDNA, for when the scheme is known to support the DNS syntax (e.g. http, with its dependency on the hostport production).

Slightly wrong, or misleading. 3987 permits non-ASCII chars in any scheme (modulo scheme-specific restrictions). It allows (MAY) conversion to IDNA (i.e. punycode) as a backwards compatibility measure when the scheme is known to use DNS.

>Neither 3986 or 3987 restrict %escape sequences to UTF-8 except for the hostname component.
>
>3987 is a little odd in what it says about legacy encodings. There's a lot that generic IRI software is prohibited from doing; but that seems overly restrictive.
>
>Contrast:
>[[
>   Conversions from URIs to IRIs MUST NOT use any character encoding
>   other than UTF-8 in steps 3 and 4, even if it might be possible to
>   guess from the context that another character encoding than UTF-8 was
>   used in the URI
>]]
>With:
>[[
>   Second, it may include URIs constructed based on character encodings
>   other than UTF-8. These URIs may be produced by user agents that do
>   not conform to this specification and that use legacy character
>   encodings to convert non-ASCII characters to URIs. Whether this is
>   necessary, and what character encodings to cover, depends on a number
>   of factors, such as the legacy character encodings used locally and
>   the distribution of various versions of user agents. For example,
>   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
>   addition to UTF-8.
>]]

There is definitely a contrast here. It's the contrast between normative language and purely advisory language for backwards compatibility.

>It seems to me that a useful generic IRI implementation

Of what? An IRI generator? An IRI resolver? Some support on the server side? Please note that Section 7.7 starts with "Software that interprets IRIs as the names of local resources", which clearly restricts its applicability.

>should support setting the encoding, possibly also allowing both UTF-8 and one legacy encoding, with the legacy encoding used for %sequences that are not legal UTF-8

I think this is a bad idea in general. In some contexts, e.g. on the server side in some cases, it may work reasonably well. But for other contexts, i.e. on the client side, the variety of stuff that may come by would just be too big.

Regards, Martin.
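
The URI-to-IRI rule quoted in the message above (UTF-8 only in steps 3 and 4) can be illustrated with a short Python sketch. It is not a complete implementation of the RFC 3987 conversion: the function name `uri_to_iri` and the helper `_decode_run` are made up for this illustration, escapes of ASCII octets are deliberately left untouched, and the later step that re-escapes characters unsuitable for IRIs (such as bidi formatting characters) is omitted. What it does show is that escapes forming strictly legal UTF-8 are turned into characters, while anything else, such as the iso-8859-1 `%E9` or a stray `%C0`, stays percent-encoded and no legacy encoding is guessed.

```python
import re

# A run of one or more percent-escapes whose octets are all >= 0x80.
# Escapes of ASCII octets (e.g. %2F, %25) never match and are left alone.
_NON_ASCII_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _decode_run(match):
    """Decode one run of escapes, keeping non-UTF-8 octets percent-encoded."""
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    # "surrogateescape" maps every octet that is not part of a strictly
    # legal UTF-8 sequence to a lone surrogate U+DC80..U+DCFF instead of
    # raising an error, so invalid octets can be found again afterwards.
    text = octets.decode("utf-8", errors="surrogateescape")
    pieces = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Invalid octet: put it back as an (uppercase) percent-escape.
            pieces.append("%{:02X}".format(code - 0xDC00))
        else:
            pieces.append(ch)
    return "".join(pieces)


def uri_to_iri(uri):
    """Hypothetical helper: simplified URI-to-IRI conversion, UTF-8 only."""
    return _NON_ASCII_ESCAPES.sub(_decode_run, uri)
```

Applied to the example from the quoted text, `uri_to_iri("http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9")` returns `http://www.example.org/r%E9sum%E9.xml#résumé`: the fragment is decoded, and the iso-8859-1 path is left as it was.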
Received on Tuesday, 24 January 2006 04:49:56 UTC