Re: [IRIEverywhere-27] (was: Re: [Minutes] 7 Apr 2003 TAG teleconf (arch doc, contentTypeOverride-24, namespaceDocument-8, IRIEverywhere-27)) from Martin Duerst on 2003-04-09 (www-tag@w3.org from April 2003)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 09 Apr 2003 19:14:53 -0400
To: "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org, Tim Bray <tbray@textuality.com>
Cc: public-iri@w3.org
Message-Id: <4.2.0.58.J.20030409163233.039997e0@localhost>
At 18:41 03/04/07 -0400, Ian B. Jacobs wrote:

>Hello,
>
>The minutes of the 7 Apr 2003 TAG teleconf are
>available as HTML [1] and as text below.

>2. Technical
>
>     1. [15]Architecture document
>     2. [16]contentTypeOverride-24
>     3. [17]namespaceDocument-8
>     4. [18]IRIEverywhere-27

>   2.4 IRIEverywhere-27
>
>      * [34]IRIEverywhere-27
>           + Action CL 2003/03/31: Revised position statement on use of
>             IRIs. See [35]email from CL on next steps.
>
>      [34] http://www.w3.org/2001/tag/open-summary.html#IRIEverywhere-27
>      [35] http://lists.w3.org/Archives/Public/www-tag/2003Apr/0033.html

>    [Ian]
>           CL: I'd like more discussion this week; deadline 21 April? Also
>           need to convey to Martin

The conveying has been done.

>           the desirability of seeing
>           [36]http://www.w3.org/International/iri-edit/ updated to
>           include iDNS and published as draft 4, and to move to RFC soon.

I have read that in the TAG minutes here, a week ago, and before.
But I'm wondering how we are supposed to go to RFC if the TAG
hasn't made up its mind, and seems to make very slow progress,
if not going in circles.


>    [TBray]
>           Kanji example in
>           [37]http://www.tbray.org/ongoing/When/200x/2003/03/31/IRI

Very good piece. While we are at it, two comments:
- Google wouldn't need the additional parameters in the search URI
   (&ie=UTF-8&oe=UTF-8). They could just do a check on byte patterns,
   and recognize the extremely specific UTF-8 byte pattern, and then
   assume UTF-8 as the default. If we assume that 'ie' stands for
   input encoding and 'oe' for output encoding, they could also
   at least assume that they are the same, and do the right thing
   if only 'ie' is present.

- I don't understand "This has the potential to play hell with caches
   and proxies and webcrawlers and XML namespaces, since you can have
   the same IRI ending up as a bunch of wildly different looking URIs."

   The only things that can end up differently is the case of the
   hex characters after the %. And caches and webcrawlers take that
   into account. As for XML namespaces, just don't change anything,
   and you are fine.


>    [Ian]
>           TBL: I think there is a fundamental question that has not been
>           resolved, and is causing a lot of tension.: Whether,
>           fundamentally, the fundamental string is the URI, or whether we
>           are dispensing with URIs in favor of Unicode strings instead.
>
>    [Chris]
>           position A says IRI is a way of talking about URI
>           position B says IRI is the real thing and URI is a subset that
>           should go away

And the current position is to use A for resolution, and B for
identification (namespaces,...).


>    [Ian]
>           TB: "Is any URI an IRI?" I think the answer is No;

I don't understand that. It has always been the case that any URI
is an IRI. I don't see a need to change that.

>           lots of sloppiness about hexifying.

>    [Ian]
>           TB: I think IRIs need to be rigid and deterministic about when
>           hexifying is used, and that the mapping (to URI) process be
>           well-characterized.

Do you want to say that unnecessarily hexified IRIs should be
outlawed? I wouldn't mind to discourage unnecessary hexification,
but there is no reason to outlaw it. This is very parallel to URIs,
for example, http://www.tbray.org/ongoing/When/200x/200%33/0%33/%331/IRI
(replacing every '3' by '%33') isn't a particularly good URI, but
it's not outlawed.

About 'when to hexify', the current draft is pretty clear:
"as late as possible, but not later" (in some more words).

About the mapping from IRIs to URIs: That's described in all
detail. Some changes are needed re. IDNs, but these are already
in the draft as an alternative. If you find any problem with
the description, please tell us.


>    [Ian]
>           TBL:

>           Suppose
>           that identifiers are unicode strings and you don't have to
>           canonicalize them to compare them. There is ten years of
>           software using 7-bit fields for this quantity.

>    [Ian]
>           TBL: If suddenly you allow namespaces that can differ in 7-bit
>           systems but not in 8-bit systems.

First, this is not fundamentally about 7-bit systems vs. 8-bit systems.
There are some cases where URIs get transmitted over 7-bit channels
(e.g. in an email labeled as charset=US-ASCII), email these days
can handle 8-bit messages, and systems that really only handle 7 bits
are fortunately exceptions. Squeezing special information into the
8th bit is rarely done these days.

It is fundamentally a question of character repertoire (ASCII vs.
Unicode). And it is important to observe that the cases where
URIs are used as real *identifiers* (namespaces and RDF), thanks
to W3Cs work over the past close to 10 years, there is really no
problem with using Unicode as a repertoire. This is not true for
resolution, but then for resolution, more equivalences than
strict string equivalence have always been used, and we are
not proposing to change that.


>           TB: I'm not sure that CL and I disagree that much. In terms of
>           URIs, it's a fact that everyone uses string comp in namespace
>           applications. We have ample evidence that this is prone and
>           shakey to failure, but people do it anyway.

Could you please provide this 'ample evidence that this is prone
and shakey to failure'?


>    [Ian]
>           TB: I think the anyURI type is underspecified. Last time I
>           looked, there were very few syntactic restrictions on what you
>           can put in an anyURI. I don't think that that's a good
>           solution.

There are very few restrictions because there are actually very few
restrictions on URI (and IRI) syntax. And there is also explicit
text that implementations should not check agressively. This is
one way to ensure extensibility and independence of different
components, and is in line with the principle that URIs
are opaque.


>           TBL: How about if we tell people to refer to the IRI draft.
>           Then we explain to people what that means: When you refer to
>           the IRI spec, you take on that %7e and %7E are the same.

Why should IRIs have to fix what URIs didn't manage to keep?


Regards,     Martin.
Received on Thursday, 10 April 2003 10:07:09 UTC