- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 16 Apr 2003 13:52:07 -0400
- To: Paul Hoffman / IMC <phoffman@imc.org>, public-iri@w3.org
At 19:48 03/04/15 -0700, Paul Hoffman / IMC wrote: >>At 08:14 03/04/08 -0700, Paul Hoffman / IMC wrote: >> >>>Technical issues: >> >>>I do not understand the logic of having Variants (B) and (C) in step 1 >>>in section 3.1. One is normalized, the other one isn't. Doesn't this >>>sound like a recipe for disaster? Why did you differentiate between >>>these two cases? >> >>This is listed as issue >>http://www.w3.org/International/iri-edit/Overview.html#legacyNFC-06 >> >>This is carefully based on the principle of early uniform normalization >>as described in the W3C Character Model. The assumption is that >>Unicode-based encodings are for the most part already in NFC >>(and where they are not, this may be on purpose). However, >>for non-Unicode encodings, normalization when converting is >>sometimes necessary (the most obvious example is windows-1258, >>for Vietnamese). > >I guess this goes back to my question from the previous message: how do >you know what encoding you are looking at? The encoding issues in http://lists.w3.org/Archives/Public/public-iri/2003Apr/0013.html (this is later in the archives, but I'm assuming this is the message you are talking about) is on a different level than the encoding we are talking here. What we are talking about here is that e.g. you receive an email from Vietnam encoded in windows-1258, and this email contains an IRI with some Vietnamese characters. Then to convert this IRI into an URI, you have to use variant B) of step 1) in section 3.1, which will apply NFC when converting to Unicode in order to convert the decompositions that occur in windows-1258 into precomposed characters before then converting into UTF-8 and using %-escaping. Regards, Martin.
Received on Wednesday, 16 April 2003 15:09:02 UTC