- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 13 Dec 2005 14:50:47 +0100
- To: "Felix Sasaki" <fsasaki@w3.org>
- Cc: www-tag@w3.org
* Felix Sasaki wrote:
>> As XML and most formats based on XML allow use of non-Unicode encodings,
>> allowing IRIs in such formats would make the formats inconsistent with
>> the architectural requirements set forth in the reference processing
>> model http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefProcModel
>> and http://www.w3.org/TR/2005/REC-charmod-20050215/#C014 in particular.
>
>Could you please elaborate why - in your opinion - the use of IRIs is
>against the reference processing model?
Specifications MAY choose to disallow or deprecate some character
encodings and to make others mandatory. Independent of the actual
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^
character encoding, the specified behavior MUST be the same as if
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the processing happened as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* The character encoding of any textual data object received
  by the application implementing the specification MUST be
  determined and the data object MUST be interpreted as a
  sequence of Unicode characters - this MUST be equivalent to
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  transcoding the data object to some Unicode encoding form,
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  adjusting any character encoding label if necessary, and
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  receiving it in that Unicode encoding form.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Which is to say, if you have
<?xml version="1.0" encoding="us-ascii"?>
<foo bar="Björn" />
processing must be as if you process
<?xml version="1.0" encoding="utf-8"?>
<foo bar="Björn" />
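As a rough illustration of that requirement, here is a minimal Python sketch. It assumes ISO-8859-1 as the non-Unicode encoding (an assumption made only because the sketch needs an encoding that can carry "ö" as an octet), and to_unicode is a hypothetical helper name, not something taken from any specification.

```python
def to_unicode(octets: bytes, encoding: str) -> str:
    """Interpret a received data object as a sequence of Unicode characters."""
    return octets.decode(encoding)


# The same element serialized in two different character encodings.
# ISO-8859-1 is an assumption chosen for illustration.
element = '<foo bar="Bj\u00f6rn" />'
as_latin1 = element.encode("iso-8859-1")
as_utf8 = element.encode("utf-8")

# The octets on the wire differ ...
octets_differ = as_latin1 != as_utf8

# ... but under the reference processing model, everything after decoding
# has to see the same character sequence.
same_characters = to_unicode(as_latin1, "iso-8859-1") == to_unicode(as_utf8, "utf-8")
```

Both octets_differ and same_characters come out true here: the serializations differ, but any processing defined on the decoded text should not be able to tell them apart.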
Implementations of RFC 3987 must violate this constraint if the bar
attribute contains an IRI reference, because the RFC requires:
Applications MUST map IRIs to URIs by using the following two steps.
Step 1. Generate a UCS character sequence from the original IRI
format. This step has the following three variants,
depending on the form of the input:
...
b. If the IRI is in some digital representation (e.g., an
octet stream) in some known non-Unicode character
encoding, convert the IRI to a sequence of characters
from the UCS normalized according to NFC.
c. If the IRI is in a Unicode-based character encoding (for
example, UTF-8 or UTF-16), do not normalize (see section
5.3.2.2 for details). Apply step 2 directly to the
encoded Unicode character sequence.
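To make the encoding dependence concrete, here is a hedged Python sketch of variants b and c followed by a much simplified step 2. The names iri_step1, iri_step2, iri_to_uri and UNICODE_ENCODINGS are hypothetical, the set of Unicode-based encoding labels is an assumption, and step 2 here merely percent-encodes the UTF-8 octets of every non-ASCII character (it ignores the special handling of host names). Shift_JIS and U+212B ANGSTROM SIGN are chosen only because that character exists in the legacy encoding and is changed by NFC.

```python
import unicodedata

# Assumption: a small illustrative set of Unicode-based encoding labels.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-16-le", "utf-16-be", "utf-32"}


def iri_step1(octets: bytes, encoding: str) -> str:
    """Step 1: generate a UCS character sequence from the original format."""
    chars = octets.decode(encoding)
    if encoding.lower() in UNICODE_ENCODINGS:
        return chars  # variant c: do not normalize
    return unicodedata.normalize("NFC", chars)  # variant b: normalize to NFC


def iri_step2(chars: str) -> str:
    """Simplified step 2: percent-encode the UTF-8 octets of non-ASCII characters."""
    out = []
    for ch in chars:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def iri_to_uri(octets: bytes, encoding: str) -> str:
    return iri_step2(iri_step1(octets, encoding))


# One and the same character sequence, received in two encodings.
iri = "http://example.org/\u212b"
from_sjis = iri_to_uri(iri.encode("shift_jis"), "shift_jis")
from_utf8 = iri_to_uri(iri.encode("utf-8"), "utf-8")
```

Under these assumptions from_sjis ends in %C3%85 (NFC turned U+212B into U+00C5) while from_utf8 ends in %E2%84%AB, so the resulting URI depends on the character encoding the IRI happened to arrive in.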
While this does not really define processing in trivial cases like
foo.ent:
<?xml version="1.0" encoding="us-ascii"?>
<!ENTITY bar "Björn">
foo.xml:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo SYSTEM "foo.ent">
<foo bar="&bar;" />
or
foo.dtd:
<?xml version="1.0" encoding="us-ascii"?>
<!ATTLIST foo bar CDATA #FIXED "Björn">
foo.xml:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo SYSTEM "foo.dtd">
<foo />
it is clear that RFC 3987 requires encoding-dependent text processing
behavior, which is prohibited by the reference processing model [1]. This
aspect of the reference processing model is very important; you cannot
really implement anything else in a sane manner.
[1] Unless you argue that text processing occurs only at, e.g., some
octets-to-Infoset level, so that IRI-to-URI processing is not
constrained by C014, or that the requirement does not apply to XML
at all because everything is read into a DOM and thus all text is
in a Unicode encoding before IRI-to-URI processing can occur.
This isn't really news.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 13 December 2005 13:50:51 UTC