URIEquivalence-15: influence on IRIs (was: Re: [Minutes] 27 Jan 2003 TAG teleconf (..., IRIEverywhere-27, ...))

At 20:20 03/01/27 -0500, Ian B. Jacobs wrote:

>Minutes of the 27 Jan 2003 TAG teleconf available as
>HTML [1] and as text below.

>   2.3 IRIEverywhere-27

>      [25] http://www.w3.org/2001/tag/ilist#IRIEverywhere-27

>    [Ian]
>           CL: There is a bigger effect on IRI spec and suggestions for
>           RFC2396.

>    [Chris]
>           this has more effect on IRI comparison (which is done by
>           transformation to URI and then comparing)

[the current IRI draft does not mandate this kind of comparing IRIs]


>    [Chris]
>           it means that the *actual kanji* and the sequence of hexifyied
>           octets compare to the same
>           which helps in roundtripping a very great deal

URIEquivalence-15 and IRIs are indeed very strongly related.

In the I18N WG, we have discussed which solution would be better
for internationalization:

1) "%7e" and "%7E" and "~" are not necessarily equivalent for all
    kinds of processing.

2) "%7e" and "%7E" and "~" are equivalent in all cases.

As Chris points out above, solution 2) is better for round-tripping,
and may therefore be better for gradual acceptance and overall
interoperability.
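
To make the two options concrete, here is a minimal sketch in Python. The
helper names (normalize_escapes, equal_solution_1, equal_solution_2) are
hypothetical and come from neither the IRI draft nor RFC 2396; only the
set of unreserved characters is taken from RFC 2396. Solution 1 is modelled
as plain string comparison, which is only one possible reading of "not
necessarily equivalent".

```python
import re

# Unreserved characters per RFC 2396: alphanumerics plus the "mark" set.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)

ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(uri: str) -> str:
    """Hypothetical helper for solution 2.

    Escapes of unreserved characters are decoded; every other escape is
    kept, with its hex digits upper-cased.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        return "%" + match.group(1).upper()

    return ESCAPE.sub(repl, uri)


def equal_solution_1(a: str, b: str) -> bool:
    # Solution 1 (assumed reading): no normalization, compare as-is.
    return a == b


def equal_solution_2(a: str, b: str) -> bool:
    # Solution 2: "%7e", "%7E" and "~" all compare equal.
    return normalize_escapes(a) == normalize_escapes(b)
```

With this sketch, "http://example.org/%7euser" and "http://example.org/~user"
differ under solution 1 and are equal under solution 2, while "/" and "%2F"
stay distinct under both because "/" is reserved.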

However, there is also a strong feeling that being able to escape
in all cases without any losses will lead to a lot of downgrading,
and hopelessly confusing long sequences of %-escaping rather than
'the real thing' (i.e. the actual IRI characters). Also, while
escaping is always possible and relatively easy, un-escaping is
a bit more difficult and needs to be done carefully: to avoid
converting non-UTF-8 octet sequences, to avoid producing
characters that are not allowed in IRIs (yes, there are a few
of these), and to avoid potential security issues.
(please see
http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI
for details).
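
The care needed when un-escaping could look roughly like the following
Python sketch. It is not the procedure from the draft: the function name
unescape_carefully is hypothetical, the DISALLOWED set is only an
illustrative placeholder (a few bidi formatting characters) for whatever
the draft actually excludes, and leaving all ASCII escapes untouched is a
deliberately conservative assumption.

```python
import re

# Illustrative placeholder only; the draft defines the real exclusions.
DISALLOWED = {0x200E, 0x200F} | set(range(0x202A, 0x202F))

# One or more consecutive %XX escapes.
ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def unescape_carefully(uri: str) -> str:
    """Hypothetical helper: turn %-escapes into characters only when safe."""
    def repl(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            # Not valid UTF-8 (e.g. Latin-1 octets or overlong forms):
            # keep the whole run escaped, exactly as written.
            return run

        out = []
        pos = 0
        for ch in text:
            width = 3 * len(ch.encode("utf-8"))  # "%XX" per octet
            original = run[pos:pos + width]
            pos += width
            if ord(ch) < 0x80 or ord(ch) in DISALLOWED:
                # Keep ASCII and disallowed characters escaped, preserving
                # the original hex case.
                out.append(original)
            else:
                out.append(ch)
        return "".join(out)

    return ESCAPE_RUN.sub(repl, uri)
```

For example, "/r%C3%A9sum%c3%a9" becomes "/résumé", while "%E9" (a lone
Latin-1 octet), "%C0%AF" (an overlong sequence) and "%7e" are all left
as they are.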

So overall, from an IRI and internationalization viewpoint, it is not
clear that always comparing AFTER hex-escaping is the right way to go.

What is very clear is that the solution chosen should be consistent
across URIs/IRIs. I.e.

if '%7e' is always equal to '%7E' and '~', then '%4C' and '%4c'
and 'L' should always be equal, as should e.g. é (in HTML),
'%c3%a9', '%c3%A9', '%C3%a9', and '%C3%A9'; and vice versa (i.e.
under the alternative, all of these are different)
[the only exception being for reserved characters as listed in
  RFC 2396]
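
For the non-ASCII half of this, a minimal Python sketch of "compare by
transformation to URI" follows. The names iri_to_uri, upper_hex and
same_after_escaping are hypothetical, and the sketch deliberately covers
only the é case: it escapes non-ASCII characters as UTF-8 octets and
ignores hex case, and does nothing about ASCII escapes such as '%7e'.

```python
import re


def iri_to_uri(iri: str) -> str:
    """Percent-escape every non-ASCII character as its UTF-8 octets."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join("%%%02X" % octet for octet in ch.encode("utf-8"))
        for ch in iri
    )


def upper_hex(uri: str) -> str:
    """Upper-case the hex digits of every %XX escape."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), uri)


def same_after_escaping(a: str, b: str) -> bool:
    # Equal under the "all equivalent" option, for non-ASCII characters.
    return upper_hex(iri_to_uri(a)) == upper_hex(iri_to_uri(b))


# The five spellings from the text; "\u00e9" is precomposed é.
forms = ["\u00e9", "%c3%a9", "%c3%A9", "%C3%a9", "%C3%A9"]
all_equal = all(same_after_escaping(forms[0], f) for f in forms)
all_distinct_as_strings = len(set(forms)) == len(forms)
```

Here all_equal is true under the first option, and all_distinct_as_strings
is true under the alternative of plain string comparison; the point is that
one of the two should hold everywhere, not a mixture.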


Regards,    Martin.

Received on Monday, 3 February 2003 14:34:34 UTC