- From: Graham Klyne <gk@ninebynine.org>
- Date: Wed, 12 May 2004 13:02:49 +0100
- To: Martin Duerst <duerst@w3.org>, public-iri@w3.org
I'm entirely satisfied with this response.

#g
--

At 17:17 12/05/04 +0900, Martin Duerst wrote:
>Hello Graham,
>
>I have removed the uri list for this issue, because it's
>really iri-specific.
>This is issue
>
>At 12:02 04/05/10 +0100, Graham Klyne wrote:
>
>>Section 3.1:
>>
>>There is a subtlety here that is not obvious to one not well-versed in
>>Unicode specifics:
>>[[
>>    Variant B) If the IRI is in some digital representation (e.g. an
>>       octet stream) in some known non-Unicode character encoding:
>>       Convert the IRI to a sequence of characters from the UCS
>>       normalized according to NFC.
>>
>>    Variant C) If the IRI is in an Unicode-based character encoding
>>       (for example UTF-8 or UTF-16): Do not normalize. Move directly
>>       to Step 2.
>>]]
>>
>>This raises two questions in my mind:
>>
>>(a) what is the implication of this NFC stuff;  I think a brief example
>>would help.
>
>Non-Unicode encodings are more or less prone to variability when
>transcoding. For example, when transcoding from the windows-1258
>charset (Vietnamese), you can either transcode codepoint-by-codepoint,
>or you can normalize. For example, Vietnam is written
>     Việt Nam
>i.e. a single "LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW" in
>Unicode (in particular NFC/NFKC), whereas in windows-1258, you have
>to use the following characters:
>     Việt Nam
>i.e. "LATIN SMALL LETTER E WITH CIRCUMFLEX" followed by
>"COMBINING DOT BELOW", because the character ệ just
>cannot be encoded in windows-1258. Similar issues exist
>with all other 8-bit encodings for Vietnamese. Encodings
>for other languages are also affected, but to a lesser extent.
>
>I have added a note using this example.
>
>
>>(b) by saying "Move directly to Step 2" it sounds as if this is saying
>>that step 2 should be operated directly on the "Unicode-based character
>>encoding" rather than on the UCS characters, which I don't think is what
>>you intend.  I think something like this is intended:
>>[[
>>    Variant C) If the IRI is in an Unicode-based character encoding
>>       (for example UTF-8 or UTF-16): Do not normalize. Apply step 2
>>       directly to the encoded Unicode character sequence.
>>]]
>
>This is a helpful clarification, and a good catch, which I have
>integrated (capitalizing 'Step' in 'Step 2').
>
>
>I have tentatively closed this issue; please tell me if the
>above changes address your issue.
>
>Regards,    Martin.

------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact
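
The windows-1258 example quoted above can be reproduced with a minimal sketch, assuming Python's built-in "cp1258" codec and the standard unicodedata module. The two helper function names are hypothetical and are not taken from the draft. The sketch shows that a codepoint-by-codepoint transcoding keeps the two-character sequence, while the Variant B approach (transcode, then normalize to NFC) yields the single precomposed character.

```python
import unicodedata

# Decomposed spelling, the only one windows-1258 can represent:
# U+00EA (e with circumflex) followed by U+0323 (combining dot below).
decomposed = "Vi\u00ea\u0323t Nam"

# Precomposed spelling: U+1EC7 (e with circumflex and dot below).
precomposed = "Vi\u1ec7t Nam"

# Simulate an IRI fragment arriving as windows-1258 octets.
raw = decomposed.encode("cp1258")


def transcode_codepoint_by_codepoint(data: bytes) -> str:
    """Hypothetical helper: map each windows-1258 byte to its code point."""
    return data.decode("cp1258")


def transcode_with_nfc(data: bytes) -> str:
    """Hypothetical helper: transcode, then normalize to NFC (Variant B)."""
    return unicodedata.normalize("NFC", data.decode("cp1258"))


plain = transcode_codepoint_by_codepoint(raw)
normalized = transcode_with_nfc(raw)

print([f"U+{ord(c):04X}" for c in plain[:4]])
print([f"U+{ord(c):04X}" for c in normalized[:3]])

# The precomposed character has no windows-1258 representation at all.
try:
    precomposed.encode("cp1258")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

Comparing `plain` with `normalized` makes the variability visible: both render as the same text, but only the NFC result compares equal to the precomposed string.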
Received on Wednesday, 12 May 2004 09:26:24 UTC