- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 03 Jul 2007 11:24:31 +0200
- To: Addison Phillips <addison@yahoo-inc.com>
- Cc: Martin Duerst <duerst@it.aoyama.ac.jp>, public-i18n-core@w3.org, public-iri@w3.org
* Addison Phillips wrote:
>I don't understand this comment, however. I'm not sure what disputes
>could arise here, since this section specifies a process for mapping
>IRIs to URIs. It doesn't specify that any particular Unicode encoding
>(or that it use one at all), but it does require that the text be a
>sequence of characters in the Unicode character set. I note that the
>whole of XML, for example, is based on this exact same idea [1].

My application receives the following document over HTTP:

  <?xml version="1.0" encoding="encoding-a"?>
  <!DOCTYPE foo SYSTEM "part2.dtd" [
    <!ENTITY part1 "...">
  ]>
  <foo bar="&part1;&part2;" />

where part2.dtd is:

  <?xml version="1.0" encoding="encoding-b"?>
  <!ENTITY part2 "...">

The 'bar' attribute holds an IRI reference and my application needs to
convert it into a URI. What does the resulting URI look like? If the
answer depends on information I've not given, please list all possible
results and why they are consistent with RFC 3987 and the W3C Character
Model. A pseudo-code implementation of the process would be best. I
believe the possible answers are

  1. concat($part1, $part2)
  2. nfc(concat($part1, $part2))
  3. concat(nfc($part1), $part2)
  4. concat($part1, nfc($part2))

where each option has two possible outcomes, depending on whether you
resolve character references before or after NFC normalization.

Note in particular that even if encoding-a and encoding-b are both UTF-8,
I understand Martin Dürst to argue the result would be 1. (you do not
normalize because UTF-8 is a Unicode encoding), while e.g. Henry S.
Thompson argues the result would be 2. (in the "Infoset", all values are
sequences of Unicode code points, so you must normalize, cf. 3.1 step 1,
option a). A third position is that you don't normalize because the
application operates on a DOM where everything is UTF-16 encoded. You can
find proponents of all three positions in the W3C list archive.

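As a rough illustration of why the four candidates listed above can disagree, here is a minimal Python sketch. It is not taken from RFC 3987 or from any implementation discussed on the list. The names `iri_to_uri` and `candidate_uris` are hypothetical, the percent-encoding step is a simplification that escapes every non-ASCII character as UTF-8, and the sample entity values are invented for the example (the entity values in the message are elided and stay unknown). Character references are assumed to be resolved already.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize a string to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (an assumption, not the full RFC 3987 rules):
    keep ASCII characters, percent-encode the UTF-8 bytes of everything else."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


def candidate_uris(part1: str, part2: str) -> dict:
    """Return the URI produced by each of the four normalization choices."""
    return {
        "1": iri_to_uri(part1 + part2),            # no normalization
        "2": iri_to_uri(nfc(part1 + part2)),       # normalize after concatenation
        "3": iri_to_uri(nfc(part1) + part2),       # normalize part1 only
        "4": iri_to_uri(part1 + nfc(part2)),       # normalize part2 only
    }


if __name__ == "__main__":
    # Hypothetical entity values: a base letter in one entity and a
    # combining acute accent (U+0301) at the start of the other.
    for option, uri in candidate_uris("e", "\u0301").items():
        print(option, uri)
```

With these sample values, options 1, 3 and 4 all yield `e%CC%81`, while option 2 yields `%C3%A9`, because the combining sequence only composes once the two entity values have been joined. Values that are denormalized inside a single entity would separate options 3 and 4 from option 1 as well.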
The only appropriate resolution to this problem is to define that IRIs
are inherently a sequence of Unicode code points and you never normalize
them. Where normalization of data formats containing IRIs is important,
normalizing transcoders are used before you see any IRI in instances of
the data formats.

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Tuesday, 3 July 2007 09:24:47 UTC