Re: draft-duerst-iri-07.txt: 2 week mailing list last call

From: Graham Klyne <gk@ninebynine.org>
Date: Wed, 12 May 2004 13:02:49 +0100
Message-Id: <5.1.0.14.2.20040512130226.02dbdea8@127.0.0.1>
To: Martin Duerst <duerst@w3.org>, public-iri@w3.org

I'm entirely satisfied with this response.

#g
--

At 17:17 12/05/04 +0900, Martin Duerst wrote:
>Hello Graham,
>
>I have removed the uri list for this issue, because it's
>really iri-specific.
>This is issue
>
>At 12:02 04/05/10 +0100, Graham Klyne wrote:
>
>>Section 3.1:
>>
>>There is a subtlety here that is not obvious to one not well-versed in 
>>Unicode specifics:
>>[[
>>       Variant B) If the IRI is in some digital representation (e.g. an
>>          octet stream) in some known non-Unicode character encoding:
>>          Convert the IRI to a sequence of characters from the UCS
>>          normalized according to NFC.
>>
>>       Variant C) If the IRI is in an Unicode-based character encoding
>>          (for example UTF-8 or UTF-16): Do not normalize. Move directly
>>          to Step 2.
>>]]
>>
>>This raises two questions in my mind:
>>
>>(a) what is the implication of this NFC stuff;  I think a brief example 
>>would help.
>
>Non-Unicode encodings are more or less prone to variability when
>transcoding. For example, when transcoding from the windows-1258
>charset (Vietnamese), you can either transcode codepoint-by-codepoint,
>or you can normalize. For example, Vietnam is written
>     Vi&#x1EC7;t Nam
>i.e. a single "LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW" in
>Unicode (in particular NFC/NFKC), whereas in windows-1258, you have
>to use the following characters:
>     Vi&#xEA;&#x323;t Nam
>i.e. "LATIN SMALL LETTER E WITH CIRCUMFLEX" followed by
>"COMBINING DOT BELOW", because the character &#x1EC7; just
>cannot be encoded in windows-1258. Similar issues exist
>with all other 8-bit encodings for Vietnamese. Encodings
>for other languages are also affected, but to a lesser extent.
>
>I have added a note using this example.
>
>
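The Vietnamese example above can be reproduced with a short sketch. It is only an illustration, assuming Python's standard "cp1258" codec as a stand-in for windows-1258. The variable names are made up for this example and do not come from the draft.

```python
import unicodedata

# "Viet Nam" as windows-1258 has to store it: U+00EA (e with circumflex)
# followed by U+0323 (COMBINING DOT BELOW).
decomposed = "Vi\u00ea\u0323t Nam"
raw = decomposed.encode("cp1258")

# Codepoint-by-codepoint transcoding keeps the two-character sequence.
transcoded = raw.decode("cp1258")

# Normalizing to NFC combines the pair into the single character U+1EC7.
normalized = unicodedata.normalize("NFC", transcoded)

# The precomposed character has no windows-1258 representation at all.
try:
    "\u1ec7".encode("cp1258")
    precomposed_encodable = True
except UnicodeEncodeError:
    precomposed_encodable = False

print([f"U+{ord(c):04X}" for c in transcoded[2:4]])
print(f"U+{ord(normalized[2]):04X}")
```

The two strings look identical when displayed but differ in length and in code points, which is why Variant B asks for NFC after transcoding.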
>>(b) by saying "Move directly to Step 2" it sounds as if this is saying 
>>that step 2 should be operated directly on the "Unicode-based character 
>>encoding" rather than on the UCS characters, which I don't think is what 
>>you intend.  I think something like this is intended:
>>[[
>>       Variant C) If the IRI is in an Unicode-based character encoding
>>          (for example UTF-8 or UTF-16): Do not normalize.  Apply step 2
>>          directly to the encoded Unicode character sequence.
>>]]
>
>This is a helpful clarification, and a good catch, which I have
>integrated (capitalizing 'Step' in 'Step 2').
>
>
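The difference between Variant B and Variant C can be sketched as follows. This is a simplified illustration, not the draft's algorithm. It assumes that Step 2 percent-encodes the UTF-8 octets of each non-ASCII character, it treats every non-ASCII character alike, and it ignores any special handling of host names. The function names and the example.org IRI are hypothetical.

```python
import codecs
import unicodedata

def step2_percent_encode(chars: str) -> str:
    # Simplified Step 2 (assumption): replace each non-ASCII character
    # with the percent-encoded octets of its UTF-8 encoding.
    out = []
    for ch in chars:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)

def iri_octets_to_uri(data: bytes, charset: str) -> str:
    # Step 1: obtain a sequence of Unicode characters from the octets.
    chars = data.decode(charset)
    is_unicode_based = codecs.lookup(charset).name.startswith("utf-")
    if not is_unicode_based:
        # Variant B: non-Unicode encoding, so normalize to NFC.
        chars = unicodedata.normalize("NFC", chars)
    # Variant C: Unicode-based encoding, so do not normalize.
    return step2_percent_encode(chars)

iri = "http://example.org/Vi\u00ea\u0323t"
from_1258 = iri_octets_to_uri(iri.encode("cp1258"), "cp1258")
from_utf8 = iri_octets_to_uri(iri.encode("utf-8"), "utf-8")
print(from_1258)
print(from_utf8)
```

In both variants Step 2 operates on characters, not on the original octets. The two results differ only because the windows-1258 input was normalized and the UTF-8 input was not.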
>I have tentatively closed this issue; please tell me if the
>above changes address your issue.
>
>Regards,    Martin.

------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact
Received on Wednesday, 12 May 2004 09:26:24 GMT
