- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 12 May 2004 17:17:01 +0900
- To: Graham Klyne <GK@ninebynine.org>, public-iri@w3.org
Hello Graham,
I have removed the uri list for this issue, because it's
really iri-specific.
This is issue
At 12:02 04/05/10 +0100, Graham Klyne wrote:
>Section 3.1:
>
>There is a subtlety here that is not obvious to one not well-versed in
>Unicode specifics:
>[[
> Variant B) If the IRI is in some digital representation (e.g. an
> octet stream) in some known non-Unicode character encoding:
> Convert the IRI to a sequence of characters from the UCS
> normalized according to NFC.
>
> Variant C) If the IRI is in an Unicode-based character encoding
> (for example UTF-8 or UTF-16): Do not normalize. Move directly
> to Step 2.
>]]
>
>This raises two questions in my mind:
>
>(a) what is the implication of this NFC stuff; I think a brief example
>would help.
Non-Unicode encodings are more or less prone to variability when
transcoding. For example, when transcoding from the windows-1258
charset (Vietnamese), you can either transcode codepoint-by-codepoint,
or you can normalize. To illustrate, Vietnam is written
Việt Nam
i.e. a single "LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW" in
Unicode (in particular NFC/NFKC), whereas in windows-1258, you have
to use the following characters:
Việt Nam
i.e. "LATIN SMALL LETTER E WITH CIRCUMFLEX" followed by
"COMBINING DOT BELOW", because the character ệ just
cannot be encoded in windows-1258. Similar issues exist
with all other 8-bit encodings for Vietnamese. Encodings
for other languages are also affected, but to a lesser extent.
I have added a note using this example.
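To make the difference concrete, here is a minimal sketch in Python
(not part of the draft; the variable names and the helper can_encode
are only illustrative, and it assumes Python's built-in "cp1258" codec
as a stand-in for windows-1258). It transcodes the windows-1258 bytes
codepoint-by-codepoint and then applies NFC, which is what Variant B
asks for:

```python
import unicodedata

# Decomposed spelling, as windows-1258 requires it:
# U+00EA (e with circumflex) followed by U+0323 (COMBINING DOT BELOW).
decomposed = "Vi\u00ea\u0323t Nam"
# Precomposed spelling, as NFC produces it: U+1EC7.
precomposed = "Vi\u1ec7t Nam"


def can_encode(text, encoding):
    """Illustrative helper: True if text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The octets as they would appear in a windows-1258 document.
raw = decomposed.encode("cp1258")

# Codepoint-by-codepoint transcoding keeps the combining sequence.
decoded = raw.decode("cp1258")

# Variant B: normalize the transcoded characters to NFC.
normalized = unicodedata.normalize("NFC", decoded)

print([f"U+{ord(c):04X}" for c in decoded])
print([f"U+{ord(c):04X}" for c in normalized])
print(can_encode(precomposed, "cp1258"))
```

Without the NFC step, the same visible IRI would end up as two
different character sequences depending on the transcoder used.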
>(b) by saying "Move directly to Step 2" it sounds as if this is saying
>that step 2 should be operated directly on the "Unicode-based character
>encoding" rather than on the UCS characters, which I don't think is what
>you intend. I think something like this is intended:
>[[
> Variant C) If the IRI is in an Unicode-based character encoding
> (for example UTF-8 or UTF-16): Do not normalize. Apply step 2
> directly to the encoded Unicode character sequence.
>]]
This is a helpful clarification, and a good catch, which I have
integrated (capitalizing 'Step' in 'Step 2').
I have tentatively closed this issue; please tell me if the
above changes address your issue.
Regards, Martin.
Received on Wednesday, 12 May 2004 05:27:32 UTC