Re: IRI regex quiz! from Jeremy Carroll on 2006-01-23 (public-iri@w3.org from January 2006)

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Mon, 23 Jan 2006 13:37:25 -0800
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, public-iri@w3.org
Message-ID: <43D5019F.9040103@hpl.hp.com>
My reading is that RFC 3987, 3986 and 2616 (HTTP) all allow %C0 in the
path, query, fragment and userinfo components, and that none allows it
in the hostname.

e.g.
[[
     For example, it is possible to have a URI reference of
     "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
     document name is encoded in iso-8859-1 based on server settings, but
     where the fragment identifier is encoded in UTF-8 according to
     [XPointer]. The IRI corresponding to the above URI would be (in XML
     notation)
     "http://www.example.org/r%E9sum%E9.xml#résumé".
]]

The hostname is restricted to [a-zA-Z0-9-.] or something like that in 
2616. 3986 allows percent-encoded UTF-8 in hostnames. 3987 allows 
non-ASCII chars, to be understood as IDNA, when the scheme is known to 
support the DNS syntax (e.g. http, with its dependency on the hostport 
production).
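
As a rough illustration of the IDNA step for a non-ASCII hostname, here is a minimal sketch using Python's built-in "idna" codec (which follows the IDNA 2003 rules of RFC 3490). The function name to_ascii_host and the sample hostname are assumptions made for the example, not anything taken from the RFCs.

```python
def to_ascii_host(host: str) -> bytes:
    """Convert a possibly non-ASCII hostname to its ASCII (ACE) form.

    Uses Python's built-in "idna" codec, which applies nameprep and
    punycode to each label separately.
    """
    return host.encode("idna")


if __name__ == "__main__":
    # Hypothetical hostname, used only for illustration.
    ace = to_ascii_host("bj\u00f6rn.example.org")
    print(ace)                 # ASCII-only form, non-ASCII label gets an "xn--" prefix
    print(ace.decode("idna"))  # converts back to the Unicode hostname
```

Labels that are already ASCII pass through unchanged, so only the first label of the sample hostname is rewritten.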

Neither 3986 nor 3987 restricts %escape sequences to UTF-8 except in 
the hostname component.
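
To make the distinction concrete, here is a small sketch that checks whether the %escapes in the host and in the path decode to well-formed UTF-8. It is only an illustration of the reading above, not a validator for either RFC; the helper names escapes_are_utf8 and check are made up for the example.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def escapes_are_utf8(component: str) -> bool:
    """True if the component, after percent-decoding, is well-formed UTF-8."""
    try:
        unquote_to_bytes(component).decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def check(uri: str) -> dict:
    """Report UTF-8 well-formedness of the host and path escapes separately."""
    parts = urlsplit(uri)
    return {
        # Under the reading above, a False here means the host is not allowed.
        "host_ok": escapes_are_utf8(parts.hostname or ""),
        # A False here is merely informative: non-UTF-8 escapes are permitted.
        "path_is_utf8": escapes_are_utf8(parts.path),
    }


if __name__ == "__main__":
    print(check("http://bj%f6rn.example.org/"))
    print(check("http://example.org/%C0"))
```

The first URI fails only the host check and the second fails only the path check, which is the asymmetry described above.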

3987 is a little odd in what it says about legacy encodings. There's a 
lot that generic IRI software is prohibited from doing; but that seems 
overly restrictive.

Contrast:
[[
  Conversions from URIs to IRIs MUST NOT use any character encoding 
other than UTF-8 in steps 3 and 4, even if it might be possible to guess 
from the context that another character encoding than UTF-8 was used in 
the URI
]]
With:
[[
  Second, it may include URIs constructed based on character encodings 
other than UTF-8. These URIs may be produced by user agents that do not 
conform to this specification and that use legacy character encodings to 
convert non-ASCII characters to URIs. Whether this is necessary, and 
what character encodings to cover, depends on a number of factors, such 
as the legacy character encodings used locally and the distribution of 
various versions of user agents. For example, software for Japanese may 
accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.
]]

It seems to me that a useful generic IRI implementation should support 
setting the encoding, possibly also allowing both UTF-8 and one legacy 
encoding, with the legacy encoding used for %sequences that are not 
legal UTF-8.
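
A minimal sketch of that idea, assuming the fallback is decided per component rather than per escape sequence: try UTF-8 first and use the configured legacy encoding only when the bytes are not legal UTF-8. The function name decode_component and the iso-8859-1 default are assumptions for the example. Note that this is deliberately not the URI-to-IRI conversion of RFC 3987, which forbids guessing another encoding.

```python
from urllib.parse import unquote_to_bytes


def decode_component(component: str, legacy: str = "iso-8859-1") -> str:
    """Percent-decode one URI component for display.

    UTF-8 is tried first; the legacy encoding is used only if the decoded
    bytes are not well-formed UTF-8. Reserved characters such as %2F are
    decoded too, so the result is for display, not for re-parsing.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # May itself raise UnicodeDecodeError for stricter legacy encodings.
        return raw.decode(legacy)


if __name__ == "__main__":
    print(decode_component("r%E9sum%E9.xml"))    # falls back to iso-8859-1
    print(decode_component("r%C3%A9sum%C3%A9"))  # decodes as UTF-8
```

With iso-8859-1 as the legacy encoding the fallback can never fail, since every byte maps to a character; a multi-byte choice such as Shift_JIS can still reject some input, and a caller would need to handle that.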

Jeremy

Bjoern Hoehrmann wrote:
> * Frank Ellermann wrote:
>> Bjoern Hoehrmann wrote:
>>
>>> I'm not sure what the actual requirement might be. Perhaps
>>> RFC 3987 defines this by now though.
>> You lost me here.  3987 explains how to transform an IRI into
>> an URI.  Something like (legacy ->) NFC -> UTF-8 followed by
>> further processing for the "authority" part using IDNA.
>>
>> But it does not say "any URI with %C0 is invalid, because %C0
>> can't be UTF-8".  
> 
> There are two issues here,
> 
>   http://bj%f6rn.example.org/
>   http://example.org/~björn/
> 
> The former is not allowed per RFC 3986 and RFC 3987 but matches the ABNF
> grammar of both; the latter is not allowed per RFC 2396, RFC 2616, RFC
> 3986, but allowed per ABNF and prose of RFC 3987 except that RFC 3987
> requires in the prose to meet the constraints in RFC 2616, e.g.
> 
>   When stored or transmitted in digital representation, bidirectional
>   IRIs MUST be in full logical order and MUST conform to the IRI syntax
>   rules (which includes the rules relevant to their scheme).
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> It also says
> 
>   Scheme-specific restrictions are applied to IRIs by converting
>   IRIs to URIs and checking the URIs against the scheme-specific
>   restrictions. 
> 
> This goes along with several occurrences of the term "IRI scheme" in RFC
> 3987, one of which says 'there is no such thing as an "IRI scheme"' which
> makes the other occurrences of this term look odd. I'm not sure yet what
> to make of this. I agree that at the moment http://example.org/%C0 is
> not illegal per any RFC though.
> 
>>> I would appreciate if a proposal is made to change the ABNF
>>> to fully express the constraints.
>> There are no constraints on general URIs in addition to STD 66,
>> anything more depends on the scheme.  A scheme could restrict
>> e.g. the path to "MUST be percent-encoded UTF-8", and then any
>> %C0 is an error.  I don't see how a 3987bis DS could do more
>> than it does now.  Did I drop a ball or miss a clue somewhere ?
> 
> I said the ABNF, not the specification. The ABNF does not capture that 
> http://bj%f6rn.example.org/ is not allowed; it's not allowed though and
> it seems this could be expressed in the ABNF. I'm not sure about the
> other issues Jeremy mentioned.
Received on Monday, 23 January 2006 21:37:32 UTC