- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 12 May 2004 17:17:01 +0900
- To: Graham Klyne <GK@ninebynine.org>, public-iri@w3.org
Hello Graham,

I have removed the uri list for this issue, because it's really iri-specific. This is issue

At 12:02 04/05/10 +0100, Graham Klyne wrote:

>Section 3.1:
>
>There is a subtlety here that is not obvious to one not well-versed in
>Unicode specifics:
>[[
>    Variant B) If the IRI is in some digital representation (e.g. an
>       octet stream) in some known non-Unicode character encoding:
>       Convert the IRI to a sequence of characters from the UCS
>       normalized according to NFC.
>
>    Variant C) If the IRI is in an Unicode-based character encoding
>       (for example UTF-8 or UTF-16): Do not normalize. Move directly
>       to Step 2.
>]]
>
>This raises two questions in my mind:
>
>(a) what is the implication of this NFC stuff; I think a brief example
>would help.

Non-Unicode encodings are more or less prone to variability when transcoding. For example, when transcoding from the windows-1258 charset (Vietnamese), you can either transcode codepoint-by-codepoint, or you can normalize. For example, Vietnam is written

    Việt Nam

i.e. with a single "LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW" in Unicode (in particular NFC/NFKC), whereas in windows-1258, you have to use the following characters:

    Việt Nam

i.e. "LATIN SMALL LETTER E WITH CIRCUMFLEX" followed by "COMBINING DOT BELOW", because the character ệ just cannot be encoded in windows-1258. Similar issues exist with all other 8-bit encodings for Vietnamese. Encodings for other languages are also affected, but to a lesser extent.

I have added a note using this example.

>(b) by saying "Move directly to Step 2" it sounds as if this is saying
>that step 2 should be operated directly on the "Unicode-based character
>encoding" rather than on the UCS characters, which I don't think is what
>you intend. I think something like this is intended:
>[[
>    Variant C) If the IRI is in an Unicode-based character encoding
>       (for example UTF-8 or UTF-16): Do not normalize. Apply step 2
>       directly to the encoded Unicode character sequence.
>]]

This is a helpful clarification, and a good catch, which I have integrated (capitalizing 'Step' in 'Step 2').

I have tentatively closed this issue; please tell me if the above changes address your issue.

Regards, Martin.
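For readers who want to see the windows-1258 example and the Variant B / Variant C distinction in concrete terms, here is a minimal Python sketch. It is only an illustration, not text from the draft: the function names to_ucs_variant_b and to_ucs_variant_c are hypothetical, and the sketch assumes Python's standard "cp1258" codec as the implementation of windows-1258 and unicodedata.normalize for NFC.

```python
import unicodedata


def to_ucs_variant_b(octets: bytes, encoding: str) -> str:
    """Variant B sketch (hypothetical helper): decode a legacy,
    non-Unicode encoding, then normalize the result to NFC."""
    return unicodedata.normalize("NFC", octets.decode(encoding))


def to_ucs_variant_c(octets: bytes, encoding: str = "utf-8") -> str:
    """Variant C sketch (hypothetical helper): decode a Unicode-based
    encoding and do not normalize."""
    return octets.decode(encoding)


# "Viet Nam" spelled the way windows-1258 requires it:
# U+00EA (e with circumflex) followed by U+0323 (combining dot below).
decomposed = "Vi\u00ea\u0323t Nam"
legacy_octets = decomposed.encode("cp1258")

# Codepoint-by-codepoint transcoding keeps the two-character spelling.
naive = legacy_octets.decode("cp1258")

# Variant B composes the pair into the single character U+1EC7.
normalized = to_ucs_variant_b(legacy_octets, "cp1258")

# Variant C leaves UTF-8 input as it arrived, even when it is not in NFC.
untouched = to_ucs_variant_c(decomposed.encode("utf-8"))

print(len(naive), len(normalized), len(untouched))  # prints: 9 8 9
```

The two spellings render identically but differ in length and in code points, which is why the choice between transcoding codepoint-by-codepoint and normalizing matters when IRIs are later compared.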
Received on Wednesday, 12 May 2004 05:27:32 UTC