- From: Giovanni Campagna <scampa.giovanni@gmail.com>
- Date: Sun, 29 Mar 2009 14:31:38 +0100
2009/3/29 Anne van Kesteren <annevk at opera.com>:
> On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna
> <scampa.giovanni at gmail.com> wrote:
>>
>> 2009/3/29 Anne van Kesteren <annevk at opera.com>:
>>>
>>> I'm not sure if you're correct about those differences, but even if you
>>> are they are not the only differences. E.g. LEIRIs perform normalization if
>>> the input encoding is non-Unicode. URLs do not. URLs can encode their query
>>> component per the input encoding (and do so for HTML and some APIs).
>>> LEIRIs cannot.
>>
>> What is the problem with normalization? Is there a standard for
>> conversion to non-Unicode to Unicode?
>> I guess no, so normalization (which should always be done) is perfectly
>> legal.
>
> It's about Unicode Normalization. (And it should not always be done.)

If I convert from ISO-8859-1 and find "À" (decimal 192), I can emit
"À" U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
or
"A" U+0041 LATIN CAPITAL LETTER A followed by " ̀" U+0300 COMBINING GRAVE ACCENT
The first is NFC, the second is NFD, and both are legal and simple.

>> In addition, IRIs are defined as a sequence of Unicode codepoints. It
>> does not matter how those codepoints are stored (ASCII, ISO-8859-1,
>> UTF-8), only the Unicode version of them.
>
> Please read the IRI specification again. Specifically section 3.1.

The specification says that an IRI must be in normalized UCS when it is
created from user input; otherwise, if it is not already in Unicode, it
must be converted to Unicode (and the conversion should be normalizing);
otherwise it must be converted from UTF-8 / 16 / 32 to UCS but not
normalized.
I don't see a particular problem in this.

>> This is the same as URL5s, by the way, because none of them is defined
>> on octets and both use the RFC3986 method for percent-encoding (using
>> UTF-8)
>
> No, it's not always using UTF-8.

RFC3986 never creates percent-encoding itself (percent-encoding is used
for unspecified binary data), but it says that text components should be
encoded as UTF-8 and that the rules are established by scheme-specific
syntaxes.

> --
> Anne van Kesteren
> http://annevankesteren.nl/
>

Giovanni
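
As a rough illustration of the two points discussed above (NFC versus NFD output when converting from ISO-8859-1, and how the chosen character encoding changes the percent-encoded result), here is a minimal Python sketch. It assumes only Python's standard unicodedata and urllib.parse modules. The variable names and the codepoints helper are made up for this example and are not taken from RFC3986 or the IRI specification.

```python
import unicodedata
from urllib.parse import quote

# Byte 192 (0xC0) in ISO-8859-1 is LATIN CAPITAL LETTER A WITH GRAVE.
raw = bytes([192])
decoded = raw.decode("iso-8859-1")

# Both normalization forms are valid Unicode for the same character.
nfc = unicodedata.normalize("NFC", decoded)  # one code point
nfd = unicodedata.normalize("NFD", decoded)  # base letter + combining accent


def codepoints(text):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(nfc))  # ['U+00C0']
print(codepoints(nfd))  # ['U+0041', 'U+0300']

# Percent-encoding depends on which octets the character is turned into first.
utf8_nfc = quote(nfc, safe="", encoding="utf-8")          # UTF-8 octets of U+00C0
utf8_nfd = quote(nfd, safe="", encoding="utf-8")          # "A" plus UTF-8 of U+0300
latin1_nfc = quote(nfc, safe="", encoding="iso-8859-1")   # single ISO-8859-1 octet

print(utf8_nfc)    # %C3%80
print(utf8_nfd)    # A%CC%80
print(latin1_nfc)  # %C0
```

The same character therefore yields three different percent-encoded strings depending on the normalization form and the encoding applied before escaping, which is why the choice matters when comparing identifiers.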
Received on Sunday, 29 March 2009 06:31:38 UTC