- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Mon, 23 Jan 2006 13:37:25 -0800
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, public-iri@w3.org
My reading is that RFC 3987, 3986 and 2616 (http) all allow %C0 in the path, query, fragment and userinfo components, and that none allows it in the hostname, e.g.

[[
For example, it is possible to have a URI reference of "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the document name is encoded in iso-8859-1 based on server settings, but where the fragment identifier is encoded in UTF-8 according to [XPointer]. The IRI corresponding to the above URI would be (in XML notation) "http://www.example.org/r%E9sum%E9.xml#résumé".
]]

The hostname is restricted to [a-zA-Z0-9-.] or something like that in 2616. 3986 allows UTF-8 encoded hostnames. 3987 allows non-ASCII chars, to be understood as IDNA, for when the scheme is known to support the DNS syntax (e.g. http, with its dependency on the hostport production).

Neither 3986 nor 3987 restricts %escape sequences to UTF-8 except for the hostname component.

3987 is a little odd in what it says about legacy encodings. There's a lot that generic IRI software is prohibited from doing, but that seems overly restrictive.

Contrast:

[[
Conversions from URIs to IRIs MUST NOT use any character encoding other than UTF-8 in steps 3 and 4, even if it might be possible to guess from the context that another character encoding than UTF-8 was used in the URI
]]

With:

[[
Second, it may include URIs constructed based on character encodings other than UTF-8. These URIs may be produced by user agents that do not conform to this specification and that use legacy character encodings to convert non-ASCII characters to URIs. Whether this is necessary, and what character encodings to cover, depends on a number of factors, such as the legacy character encodings used locally and the distribution of various versions of user agents. For example, software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.
]]

It seems to me that a useful generic IRI implementation should support setting the encoding, possibly also allowing both UTF-8 and one legacy encoding, with the legacy encoding used for %sequences that are not legal UTF-8.

Jeremy

Bjoern Hoehrmann wrote:
> * Frank Ellermann wrote:
>> Bjoern Hoehrmann wrote:
>>
>>> I'm not sure what the actual requirement might be. Perhaps
>>> RFC 3987 defines this by now though.
>> You lost me here. 3987 explains how to transform an IRI into
>> an URI. Something like (legacy ->) NFC -> UTF-8 followed by
>> further processing for the "authority" part using IDNA.
>>
>> But it does not say "any URI with %C0 is invalid, because %C0
>> can't be UTF-8".
>
> There are two issues here,
>
>   http://bj%f6rn.example.org/
>   http://example.org/~björn/
>
> The former is not allowed per RFC 3986 and RFC 3987 but matches the ABNF
> grammar of both; the latter is not allowed per RFC 2396, RFC 2616, RFC
> 3986, but allowed per ABNF and prose of RFC 3987 except that RFC 3987
> requires in the prose to meet the constraints in RFC 2616, e.g.
>
>   When stored or transmitted in digital representation, bidirectional
>   IRIs MUST be in full logical order and MUST conform to the IRI syntax
>   rules (which includes the rules relevant to their scheme).
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> It also says
>
>   Scheme-specific restrictions are applied to IRIs by converting
>   IRIs to URIs and checking the URIs against the scheme-specific
>   restrictions.
>
> This goes along with several occurrences of the term "IRI scheme" in RFC
> 3987, one of which says 'there is no such thing as an "IRI scheme"' which
> makes the other occurrences of this term look odd. I'm not sure yet what
> to make of this. I agree that at the moment http://example.org/%C0 is
> not illegal per any RFC though.
>
>>> I would appreciate if a proposal is made to change the ABNF
>>> to fully express the constraints.
>> There are no constraints on general URIs in addition to STD 66,
>> anything more depends on the scheme. A scheme could restrict
>> e.g. the path to "MUST be percent-encoded UTF-8", and then any
>> %C0 is an error. I don't see how a 3987bis DS could do more
>> than it does now. Did I drop a ball or miss a clue somewhere?
>
> I said the ABNF, not the specification. The ABNF does not capture that
> http://bj%f6rn.example.org/ is not allowed; it's not allowed though and
> it seems this could be expressed in the ABNF. I'm not sure about the
> other issues Jeremy mentioned.
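
The suggestion at the top of this message (try UTF-8 first, then fall back to one configured legacy encoding) and the hostname case discussed in the quoted text can be sketched roughly as below. This is only an illustration under stated assumptions: the function names `decode_percent_escapes` and `host_escapes_are_utf8` and the default of iso-8859-1 are hypothetical choices, not anything defined by RFC 3986 or RFC 3987, and the fallback is applied to a whole component rather than to each %sequence individually. Note that the fallback deliberately goes beyond what the quoted "MUST NOT" text in RFC 3987 permits for URI-to-IRI conversion.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_escapes(component: str, legacy: str = "iso-8859-1") -> str:
    """Decode the %escapes of one URI component (hypothetical helper).

    UTF-8 is tried first; if the unescaped bytes are not legal UTF-8,
    the whole component is decoded with the configured legacy encoding.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: exactly one legacy encoding is configured.
        return raw.decode(legacy)


def host_escapes_are_utf8(host: str) -> bool:
    """Report whether the %escapes in a hostname unescape to legal UTF-8.

    This mirrors the constraint discussed in the thread that the ABNF
    alone does not express (hypothetical helper).
    """
    try:
        unquote_to_bytes(host).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

With these definitions, the path `r%E9sum%E9.xml` and the fragment `r%C3%A9sum%C3%A9` from the RFC 3987 example both decode to text containing "résumé", while `host_escapes_are_utf8("bj%f6rn.example.org")` is false because %F6 on its own is not legal UTF-8.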
Received on Monday, 23 January 2006 21:37:32 UTC