Re: draft-duerst-iri-07.txt: 2 week mailing list last call from Martin Duerst on 2004-05-12 (public-iri@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 12 May 2004 17:17:01 +0900
To: Graham Klyne <GK@ninebynine.org>, public-iri@w3.org
Message-Id: <4.2.0.58.J.20040512163152.05e87988@localhost>

Hello Graham,

I have removed the uri list for this issue, because it's
really iri-specific.
This is issue

At 12:02 04/05/10 +0100, Graham Klyne wrote:

>Section 3.1:
>
>There is a subtlety here that is not obvious to one not well-versed in 
>Unicode specifics:
>[[
>       Variant B) If the IRI is in some digital representation (e.g. an
>          octet stream) in some known non-Unicode character encoding:
>          Convert the IRI to a sequence of characters from the UCS
>          normalized according to NFC.
>
>       Variant C) If the IRI is in an Unicode-based character encoding
>          (for example UTF-8 or UTF-16): Do not normalize. Move directly
>          to Step 2.
>]]
>
>This raises two questions in my mind:
>
>(a) what is the implication of this NFC stuff;  I think a brief example 
>would help.

Non-Unicode encodings are more or less prone to variability when
transcoding. For example, when transcoding from the windows-1258
charset (Vietnamese), you can either transcode codepoint-by-codepoint,
or you can normalize. For example, Vietnam is written
     Vi&#x1EC7;t Nam
i.e. a single "LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW" in
Unicode (in particular NFC/NFKC), whereas in windows-1258, you have
to use the following characters:
     Vi&#xEA;&#x323;t Nam
i.e. "LATIN SMALL LETTER E WITH CIRCUMFLEX" followed by
"COMBINING DOT BELOW", because the character &#x1EC7; just
cannot be encoded in windows-1258. Similar issues exist
with all other 8-bit encodings for Vietnamese. Encodings
for other languages are also affected, but to a lesser extent.

I have added a note using this example.
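
For anyone who wants to try the Vietnamese example, here is a minimal
sketch in Python. It assumes Python's built-in "cp1258" codec as a
stand-in for windows-1258; the variable names and the can_encode helper
are only illustrative and do not come from the draft.

```python
import unicodedata

# "Viet Nam" as windows-1258 has to spell it:
# U+00EA (e with circumflex) followed by U+0323 (combining dot below).
decomposed = "Vi\u00EA\u0323t Nam"

# Round-trip through the legacy charset to get the octets a
# windows-1258 document would actually contain.
raw = decomposed.encode("cp1258")

# Codepoint-by-codepoint transcoding: one character per octet, no composition.
transcoded = raw.decode("cp1258")

# Variant B: transcode, then normalize to NFC.
normalized = unicodedata.normalize("NFC", transcoded)


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


print([f"U+{ord(c):04X}" for c in transcoded])
print([f"U+{ord(c):04X}" for c in normalized])
print(can_encode("\u1EC7", "cp1258"))
```

The two printed code point lists differ only at the third position:
the transcoded string keeps two code points there, while the NFC form
has the single precomposed U+1EC7.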


>(b) by saying "Move directly to Step 2" it sounds as if this is saying 
>that step 2 should be operated directly on the "Unicode-based character 
>encoding" rather than on the UCS characters, which I don't think is what 
>you intend.  I think something like this is intended:
>[[
>       Variant C) If the IRI is in an Unicode-based character encoding
>          (for example UTF-8 or UTF-16): Do not normalize.  Apply step 2
>          directly to the encoded Unicode character sequence.
>]]

This is a helpful clarification, and a good catch, which I have
integrated (capitalizing 'Step' in 'Step 2').
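
To illustrate the distinction, here is a rough sketch of Variant C
followed by Step 2, under the assumption that Step 2 means "convert each
non-ASCII character to UTF-8 and percent-encode the resulting octets".
The function name is hypothetical, and the sketch treats every non-ASCII
character as eligible, which is a simplification of the draft's rules.

```python
def percent_encode_non_ascii(chars: str) -> str:
    """Simplified Step 2: percent-encode the UTF-8 octets of each
    non-ASCII character; ASCII characters pass through unchanged."""
    out = []
    for ch in chars:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


# Variant C: the IRI arrives in a Unicode-based encoding (UTF-8 here).
octets = "Vi\u00EA\u0323t".encode("utf-8")

# Recover the character sequence, but do NOT normalize it.
chars = octets.decode("utf-8")

# Step 2 operates on the characters, not on the raw encoded octets.
from_decomposed = percent_encode_non_ascii(chars)

# For comparison: the same word in precomposed (NFC) form.
from_precomposed = percent_encode_non_ascii("Vi\u1EC7t")

print(from_decomposed)
print(from_precomposed)
```

Because Variant C does not normalize, the two inputs give different
results, even though they display identically.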


I have tentatively closed this issue; please tell me if the
above changes address your issue.

Regards,    Martin.

Received on Wednesday, 12 May 2004 05:27:32 UTC