- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 03 Feb 2003 14:33:20 -0500
- To: "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
At 20:20 03/01/27 -0500, Ian B. Jacobs wrote:
>Minutes of the 27 Jan 2003 TAG teleconf available as
>HTML [1] and as text below.
> 2.3 IRIEverywhere-27
> [25] http://www.w3.org/2001/tag/ilist#IRIEverywhere-27
> [Ian]
> CL: There is a bigger effect on IRI spec and suggestions for
> RFC2396.
> [Chris]
> this has more effect on IRI comparison (which is done by
> transformation to URI and then comparing)
[the current IRI draft does not mandate this way of comparing IRIs]
> [Chris]
> it means that the *actual kanji* and the sequence of hexifyied
> octets compare to the same
> which helps in roundtripping a very great deal
URIEquivalent-15 and IRIs are indeed very strongly related.
In the I18N WG, we have discussed which solution would be better
for internationalization:
1) "%7e" and "%7E" and "~" are not necessarily equivalent for all
kinds of processing.
2) "%7e" and "%7E" and "~" are equivalent in all cases.
As Chris points out above, solution 2) is better for round-tripping,
and may therefore be better for gradual acceptance and overall
interoperability.
However, there is also a strong feeling that being able to escape
in all cases without any loss will lead to a lot of downgrading,
and to hopelessly confusing long sequences of %-escapes rather than
'the real thing' (i.e. the actual IRI characters). Also, while
escaping is always possible and relatively easy, un-escaping is
a bit more difficult and needs to be done carefully: to avoid
converting non-UTF-8 octet sequences, to avoid converting to
characters that are not allowed in IRIs (yes, there are a few
of these), and to avoid potential security issues.
(please see
http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI
for details).
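
To make the un-escaping concerns concrete, here is a minimal Python
sketch, not the procedure from the draft. It assumes three rules: only
runs of escapes that decode as strict UTF-8 are converted, escapes of
ASCII characters are left alone, and a small deny-list of characters
stays escaped. The names uri_to_iri and FORBIDDEN are hypothetical, and
the deny-list is an illustrative subset, not the draft's actual list.

```python
import re

# Hypothetical, incomplete deny-list (bidi formatting characters), used
# only for illustration; the IRI draft defines the real restrictions.
FORBIDDEN = {"\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}

# One or more consecutive %HH escapes.
ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _unescape_run(match):
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        # Strict decoding rejects legacy encodings, overlong forms and
        # surrogates, so such octets are never turned into characters.
        text = octets.decode("utf-8")
    except UnicodeDecodeError:
        # Conservative choice: leave the whole run untouched.
        return run
    out, pos = [], 0
    for ch in text:
        width = len(ch.encode("utf-8"))
        original = run[3 * pos:3 * (pos + width)]
        # ASCII escapes (which may hide reserved characters) and
        # deny-listed characters keep their original escaped spelling.
        keep_escaped = ord(ch) < 0x80 or ch in FORBIDDEN
        out.append(original if keep_escaped else ch)
        pos += width
    return "".join(out)


def uri_to_iri(uri):
    """Convert %-escapes to characters only where the rules above allow."""
    return ESCAPE_RUN.sub(_unescape_run, uri)
```

Under these assumptions, uri_to_iri("caf%C3%A9") yields "café", while
"caf%E9" (a Latin-1 octet) and "a%2Fb" come back unchanged.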
So overall, from an IRI and internationalization viewpoint, it is not
clear that always comparing AFTER hex-escaping is the right way to go.
What is very clear is that the solution chosen should be consistent
across URIs/IRIs. I.e. if '%7e' is always equal to '%7E' and '~',
then '%4C' and '%4c' and 'L' should always be equal, as well as
e.g. é (in HTML), '%c3%a9', '%c3%A9', '%C3%a9', and '%C3%A9'.
Conversely, if the former are treated as different, then all the
latter should be treated as different, too.
[the only exception being for reserved characters as listed in
RFC 2396]
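
As an illustration of what the "always equivalent" alternative would
imply, here is a minimal Python sketch of one possible normalization
for comparison. It is an assumption-laden example, not something taken
from the IRI draft or RFC 2396. The function names are hypothetical.
Raw non-ASCII characters are escaped as UTF-8 octets, hex digits are
upper-cased, and escapes of RFC 2396 unreserved characters are decoded.
Escapes of reserved characters (and of anything else) stay escaped.

```python
import re

# Unreserved characters per RFC 2396: alphanumerics plus the "mark" set.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)

ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _canonical_escape(match):
    octet = int(match.group(1), 16)
    ch = chr(octet)
    if ch in UNRESERVED:
        return ch               # '%7e' and '%7E' both become '~'
    return "%%%02X" % octet     # otherwise keep escaped, upper-case hex


def normalize(iri):
    """Hypothetical canonical form used only for comparison."""
    # Step 1: escape every raw non-ASCII character as UTF-8 octets.
    escaped = "".join(
        ch if ord(ch) < 0x80
        else "".join("%%%02X" % b for b in ch.encode("utf-8"))
        for ch in iri
    )
    # Step 2: canonicalize all escapes, decoding only unreserved ones.
    return ESCAPE.sub(_canonical_escape, escaped)


def equivalent(a, b):
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    print(equivalent("%7e", "~"))      # unreserved: treated as equal
    print(equivalent("%c3%A9", "é"))   # UTF-8 escapes vs. the character
    print(equivalent("a%2Fb", "a/b"))  # reserved '/': stays different
```

The opposite alternative (all forms different) would simply be plain
string comparison with no normalization at all.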
Regards, Martin.
Received on Monday, 3 February 2003 14:34:34 UTC