RE: TAG scheme - some comments from McDonald, Ira on 2004-10-25 (uri@w3.org from October 2004)

From: McDonald, Ira <imcdonald@sharplabs.com>
Date: Mon, 25 Oct 2004 09:22:41 -0700
To: "'Tim Kindberg'" <timothy@hpl.hp.com>, "McDonald, Ira" <imcdonald@sharplabs.com>
Cc: "Hammond, Tony" <T.Hammond@nature.com>, uri@w3.org, sandro hawke <sandro@w3.org>
Message-ID: <CFEE79A465B35C4385389BA5866BEDF00C7936@mailsrvnt02.enet.sharplabs.com>

Hi Tim,

Yes - problems of transcription by users as well as
problems of two TAG URIs that _appear_ EXACTLY identical,
but in fact have (for instance) their diacritical marks
not in canonical order (i.e., according to Unicode std).
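
A minimal sketch of that second problem, assuming Python's standard
unicodedata module and a made-up pair of strings (they are not taken
from any real tag): the same base letter carries the same two
combining marks in different orders, so the strings render alike but
compare unequal until both are normalized.

```python
import unicodedata

# Hypothetical example strings, chosen only for illustration.
# U+0323 COMBINING DOT BELOW has canonical combining class 220.
# U+0302 COMBINING CIRCUMFLEX ACCENT has canonical combining class 230.
marks_in_canonical_order = "a\u0323\u0302"
marks_in_reverse_order = "a\u0302\u0323"


def nfc(text: str) -> str:
    """Return the NFC form, which reorders and composes combining marks."""
    return unicodedata.normalize("NFC", text)


# Raw comparison sees two different code point sequences.
raw_equal = marks_in_canonical_order == marks_in_reverse_order

# Comparison after normalization sees one canonical sequence.
normalized_equal = nfc(marks_in_canonical_order) == nfc(marks_in_reverse_order)

print(raw_equal, normalized_equal)
```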

A better starting point than Nameprep (RFC 3491) would be:

"String Profile for Internet Small Computer Systems Interface
(iSCSI) Names", RFC 3722, April 2004.

...which is not limited to I18N of host names.

I'd suggest stealing ideas and text from RFC 3722.

I think that TAG URIs are inherently fragile when deployed
without some such normalization (and disallowed characters).
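
As a rough sketch of what such a comparison could look like (this is
not a Stringprep profile and not anything the TAG draft specifies),
the Python below percent-decodes two tag URIs as UTF-8 and compares
their NFC forms. The function name tags_equivalent and the
example.com tags are hypothetical. Decoding every escape is a
simplification: it would also treat an escaped reserved character
such as %2F as equal to a literal "/", which a real rule would need
to decide on explicitly.

```python
import unicodedata
from urllib.parse import unquote


def tags_equivalent(tag_a: str, tag_b: str) -> bool:
    """Compare two tag URIs after percent-decoding and NFC normalization.

    Hypothetical helper for illustration only; the TAG draft itself
    treats tags that differ as strings as different.
    """
    def canonical(tag: str) -> str:
        # Decode %XX escapes as UTF-8, failing loudly on invalid bytes.
        decoded = unquote(tag, encoding="utf-8", errors="strict")
        return unicodedata.normalize("NFC", decoded)

    return canonical(tag_a) == canonical(tag_b)


# Made-up tags: precomposed U+00E9 versus "e" followed by U+0301.
precomposed = "tag:example.com,2004:caf%C3%A9"
decomposed = "tag:example.com,2004:cafe%CC%81"

print(precomposed == decomposed)                 # plain string comparison
print(tags_equivalent(precomposed, decomposed))  # normalized comparison
```

Disallowed characters, which RFC 3722 also covers, are left out of
this sketch entirely.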

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Blue Roof Music / High North Inc
PO Box 221  Grand Marais, MI  49839
phone: +1-906-494-2434
email: imcdonald@sharplabs.com

-----Original Message-----
From: uri-request@w3.org [mailto:uri-request@w3.org] On Behalf Of Tim Kindberg
Sent: Monday, October 25, 2004 9:58 AM
To: McDonald, Ira
Cc: Hammond, Tony; uri@w3.org; sandro hawke
Subject: Re: TAG scheme - some comments

Hi Ira,

Thanks for your message. If I understand you correctly, you're talking 
about normalisation w.r.t. some of the subtle differences (e.g. 
nearly-identical visual appearance, ordering of diacritical marks) that 
sometimes occur between characters in the Universal Character Set. Maybe 
that's exactly what Tony meant too -- in which case I'm sorry for 
dispatching his point over-hastily.

I'm so unused to thinking about internationalisation issues that I 
hadn't thought about the problem of *people* producing subtle 
differences when they transcribe tags. (I don't see a problem otherwise, 
since tags, once minted, are not meant to be "deconstructed" by machines.)

But what I don't want to have to do is define our own "tagprep" 
derivation of stringprep -- and get embroiled in another error-prone 
excursion. I'd like to borrow something good enough from elsewhere -- 
like nameprep, which I would have thought would be suitable except that 
it says it's specifically for IDNs.

Does anyone out there have any advice?

Cheers,

Tim.

McDonald, Ira wrote:

> Hi,
> 
> 
>>><Tony Hammond wrote...>
>>>6. Note that normalization issues are ducked. :) Probably wisely too. Not
>>>sure what the ramifications of this might be especially wrt TAG processors
>>>and %-encoding.
> 
> 
>><Tim Kindberg replied...>
>>Yes, we decided that tags that are different as strings (with same 
>>character encoding) are different, full stop. It's nice and easy to 
>>understand and there's no compelling need for a more sophisticated 
>>criterion for equality.
> 
> 
> While neither RFC 2717 nor draft RFC 2717bis addresses it,
> most existing URI scheme RFCs actually do identify rules for
> "comparison of two XXX URIs".  Since TAG values can be UTF-8
> (percent-encoded), there are certainly string comparison
> issues to be addressed (like underlying UTF-8 normalization
> to NFC or NFKC forms).  Using a Stringprep profile (RFC 3454)
> is a good approach.  I suggest looking at:
> 
> "Nameprep: A Stringprep Profile for Internationalized Domain Names"
> RFC 3491, March 2003
> 
> 
> Cheers,
> - Ira
> 
> Ira McDonald (Musician / Software Architect)
> Blue Roof Music / High North Inc
> PO Box 221  Grand Marais, MI  49839
> phone: +1-906-494-2434
> email: imcdonald@sharplabs.com

-- 

Tim Kindberg
hewlett-packard laboratories
filton road
stoke gifford
bristol bs34 8qz
uk

purl.org/net/TimKindberg
timothy@hpl.hp.com
voice +44 (0)117 312 9920
fax +44 (0)117 312 8003

Received on Monday, 25 October 2004 16:30:56 UTC