- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 13 Dec 2005 14:50:47 +0100
- To: "Felix Sasaki" <fsasaki@w3.org>
- Cc: www-tag@w3.org
* Felix Sasaki wrote:
>> As XML and most formats based on XML allow use of non-Unicode encodings,
>> allowing IRIs in such formats would make the formats inconsistent with
>> the architectural requirements set forth in the reference processing
>> model http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefProcModel
>> and http://www.w3.org/TR/2005/REC-charmod-20050215/#C014 in particular.
>
>Could you please elaborate why - in your opinion - the use of IRIs is
>against the reference processing model?

  Specifications MAY choose to disallow or deprecate some character
  encodings and to make others mandatory. Independent of the actual
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^
  character encoding, the specified behavior MUST be the same as if
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  the processing happened as follows:
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    * The character encoding of any textual data object received by
      the application implementing the specification MUST be
      determined and the data object MUST be interpreted as a
      sequence of Unicode characters - this MUST be equivalent to
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
      transcoding the data object to some Unicode encoding form,
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      adjusting any character encoding label if necessary, and
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      receiving it in that Unicode encoding form.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which is to say, if you have

  <?xml version="1.0" encoding="us-ascii"?>
  <foo bar="Björn" />

processing must be as if you process

  <?xml version="1.0" encoding="utf-8"?>
  <foo bar="Björn" />

Implementations of RFC 3987 must violate this constraint if the bar
attribute contains an IRI Reference:

  Applications MUST map IRIs to URIs by using the following two steps.

  Step 1.  Generate a UCS character sequence from the original IRI
           format.  This step has the following three variants,
           depending on the form of the input:

     ...

     b. If the IRI is in some digital representation (e.g., an
        octet stream) in some known non-Unicode character
        encoding, convert the IRI to a sequence of characters from
        the UCS normalized according to NFC.

     c. If the IRI is in a Unicode-based character encoding (for
        example, UTF-8 or UTF-16), do not normalize (see section
        5.3.2.2 for details).  Apply step 2 directly to the
        encoded Unicode character sequence.

While this does not really define processing in trivial cases like

foo.ent:

  <?xml version="1.0" encoding="us-ascii"?>
  <!ENTITY bar "Björn">

foo.xml:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE foo SYSTEM "foo.ent">
  <foo bar="&bar;" />

or

foo.dtd:

  <?xml version="1.0" encoding="us-ascii"?>
  <!ATTLIST foo bar CDATA #FIXED "Björn">

foo.xml:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE foo SYSTEM "foo.dtd">
  <foo bar="&bar;" />

it is clear that RFC 3987 requires encoding-dependent text processing
behavior, which is prohibited by the reference processing model [1].
This aspect of the reference processing model is very important; you
can't really implement anything else in a sane manner.

[1] Unless you try to argue that text processing occurs only at e.g.
    some octets-to-Infoset level and IRI-to-URI processing is thus not
    constrained by C014, or that the requirement does not apply to XML
    at all, because it's all read into a DOM and thus all text is in a
    Unicode encoding before IRI-to-URI processing can occur. This isn't
    really news.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
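
To illustrate the encoding dependence in steps 1b and 1c quoted above, here
is a minimal Python sketch. It is not taken from RFC 3987 or from any
implementation: the function name iri_to_uri and the source_is_unicode flag
are hypothetical, and step 2 is simplified to percent-encoding the UTF-8
octets of every non-ASCII character. The input is the same character
sequence in both calls, a decomposed "o" followed by U+0308; only the
claimed source encoding differs.

```python
import unicodedata


def iri_to_uri(chars: str, source_is_unicode: bool) -> str:
    """Hypothetical, simplified model of the RFC 3987 IRI-to-URI mapping."""
    # Step 1b: a non-Unicode source is converted to UCS and normalized to NFC.
    # Step 1c: a Unicode-based source is passed through without normalization.
    if not source_is_unicode:
        chars = unicodedata.normalize("NFC", chars)
    # Step 2 (simplified assumption): percent-encode the UTF-8 octets of
    # each non-ASCII character and leave ASCII characters untouched.
    out = []
    for c in chars:
        if ord(c) < 128:
            out.append(c)
        else:
            out.append("".join(f"%{b:02X}" for b in c.encode("utf-8")))
    return "".join(out)


decomposed = "Bjo\u0308rn"  # "o" followed by COMBINING DIAERESIS

print(iri_to_uri(decomposed, source_is_unicode=True))   # Bjo%CC%88rn
print(iri_to_uri(decomposed, source_is_unicode=False))  # Bj%C3%B6rn
```

The two results differ although the character sequence is identical, which
is the kind of encoding-dependent behavior the message argues C014 rules out.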
Received on Tuesday, 13 December 2005 13:50:51 UTC