Re: proposal for Issue #23 (relax requirement for NFC transcoding) from Martin J. Dürst on 2010-10-01 (public-iri@w3.org from October 2010)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Fri, 01 Oct 2010 17:32:46 +0900
To: "Phillips, Addison" <addison@lab126.com>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4CA59CAE.5090200@it.aoyama.ac.jp>

Hello Addison, Björn,

On 2010/10/01 13:42, Phillips, Addison wrote:
> (personal response)
>
> Bjoern wrote:
>
>>> 1. Change the text above to read:
>>>
>>>    If the IRI or IRI reference is an octet stream in some known
>>>    non-Unicode character encoding, convert the IRI to a sequence of
>>>    characters from the UCS.
>>>
>>>    In other cases (written on paper, read aloud, or otherwise
>>>    represented independent of any character encoding) represent
>>>    the IRI as a sequence of characters from the UCS.
>>
>> IRIs are by definition a sequence of characters from the UCS.

Well, they can also be on paper or in sound. The "from the UCS" is
relevant even for paper and sound because even there, IRIs cannot
contain characters that aren't part of Unicode (e.g. logo-like symbols).
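
As an illustration of the first case in the proposed text (an octet
stream in a known non-Unicode encoding), here is a minimal Python
sketch. The IRI, the choice of Shift_JIS, and the helper name
iri_from_octets are assumptions made up for the example; nothing here
comes from the draft itself.

```python
# Hypothetical IRI as it might arrive: octets in a legacy (non-Unicode) encoding.
legacy_octets = "http://example.org/日本語".encode("shift_jis")


def iri_from_octets(octets: bytes, encoding: str) -> str:
    """Decode octets in a known encoding into a sequence of Unicode characters.

    Python's codecs do not apply Unicode normalization while decoding,
    so this is a non-normalizing transcoding.
    """
    return octets.decode(encoding)


iri = iri_from_octets(legacy_octets, "shift_jis")

# The IRI is now a sequence of UCS characters, whatever the original octets were.
code_points = [f"U+{ord(ch):04X}" for ch in iri[-3:]]
print(iri)
print(code_points)
```

The point is only that processing starts from the decoded character
sequence, not from the octets of the original encoding.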

>> With
>> the requirement gone, I do not think there is a point in having this
>> section in the document.
>
> 3.1 is not superfluous! The first step in processing an IRI is to obtain it as a sequence of Unicode characters. It will not occur to all users (or implementers) that the IRI is the UCS sequence, not necessarily the octets you find floating in your tag soup.
>
>>
>>> 2. Add the following text just after the second paragraph above:
>>>
>>> NOTE: Some character encodings or transcriptions can be converted
>>> to or represented by more than one sequence of Unicode characters.
>>> Ideally the resulting IRI would use a normalized form, such as
>>> Unicode Normalization Form C (NFC, [UTR15]), since that ensures a
>>> stable, consistent representation that is most likely to produce
>>> the intended results. Implementers and users are cautioned that,
>>> while denormalized character sequences are valid, they might be
>>> difficult for other users or processes to guess and might produce
>>> unexpected results.
>>
>> Normalization is already discussed in 5.3.2.2 "Character
>> Normalization", any discussion of it should be moved there if it's not already
>> covered.

Reading section 5.3.2.2, it contains MUSTs that are very much predicated 
on the assumption of normalizing transcoders and use of NFC when 
creating IRIs. I think this section requires some rework.
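
To make "more than one sequence of Unicode characters" concrete, here
is a small Python sketch using the standard unicodedata module. The two
example IRIs and the helper is_nfc are hypothetical; the sketch only
shows that both sequences are valid character sequences that differ
unless one is normalized to NFC.

```python
import unicodedata

# Two hypothetical IRIs that look identical when displayed.
precomposed = "http://example.org/r\u00E9sum\u00E9"    # U+00E9 (e with acute)
decomposed = "http://example.org/re\u0301sume\u0301"   # "e" + U+0301 (combining acute)


def is_nfc(iri: str) -> bool:
    """Return True if the character sequence is already in NFC."""
    return unicodedata.normalize("NFC", iri) == iri


# Compared character by character, the two sequences are different...
print(precomposed == decomposed)

# ...but normalizing the decomposed form to NFC yields the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

print(is_nfc(precomposed), is_nfc(decomposed))
```

A non-normalizing transcoder would pass either sequence through
unchanged, which is why MUSTs that assume NFC on creation need a
second look.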

> We could do that. However, it is probably worth pointing out that there are normalizing and non-normalizing transcodings. Perhaps something more terse and aimed at the normalization section:
>
> NOTE: Sometimes a particular character encoding or transcription can be represented by more than one sequence of Unicode characters. To help ensure interoperability, ideally the resulting IRI would use a normalized form. See Character Normalization [5.3.2.2].

I don't understand "character encoding or transcription". "Character
encoding" refers to things such as UTF-8, Shift_JIS, etc.
Of course I can also represent 'UTF-8' by 'utf-8', but that's not what
we are talking about :-).

Regards,   Martin.


-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Friday, 1 October 2010 08:33:31 UTC