Re: IRIEverywhere-27 (was: Re: Agenda of 13 December 2005 TAG teleconference) from Bjoern Hoehrmann on 2005-12-13 (www-tag@w3.org from December 2005)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 13 Dec 2005 14:50:47 +0100
To: "Felix Sasaki" <fsasaki@w3.org>
Cc: www-tag@w3.org
Message-ID: <qfitp19u9qs4eib2ss166gbphht88481ki@hive.bjoern.hoehrmann.de>
* Felix Sasaki wrote:
>> As XML and most formats based on XML allow use of non-Unicode encodings,
>> allowing IRIs in such formats would make the formats inconsistent with
>> the architectural requirements set forth in the reference processing
>> model http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefProcModel
>> and http://www.w3.org/TR/2005/REC-charmod-20050215/#C014 in particular.
>
>Could you please elaborate why - in your opinion - the use of IRIs is  
>against the reference processing model?

  Specifications MAY choose to disallow or deprecate some character
  encodings and to make others mandatory. Independent of the actual
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^
  character encoding, the specified behavior MUST be the same as if
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  the processing happened as follows: 
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    * The character encoding of any textual data object received
      by the application implementing the specification MUST be
      determined and the data object MUST be interpreted as a
      sequence of Unicode characters - this MUST be equivalent to
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
      transcoding the data object to some Unicode encoding form,
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      adjusting any character encoding label if necessary, and
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      receiving it in that Unicode encoding form.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which is to say, if you have

  <?xml version="1.0" encoding="us-ascii"?>
  <foo bar="Bjo&#x308;rn" />

processing must be as if you process

  <?xml version="1.0" encoding="utf-8"?>
  <foo bar="Bjo&#x308;rn" />
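
The equivalence of those two documents can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are made up for illustration, and it only shows that the attribute value comes out as the same character sequence whichever encoding is declared.

```python
import xml.etree.ElementTree as ET

# The two documents from the message; they differ only in the declared encoding.
ascii_doc = b'<?xml version="1.0" encoding="us-ascii"?>\n<foo bar="Bjo&#x308;rn" />'
utf8_doc = b'<?xml version="1.0" encoding="utf-8"?>\n<foo bar="Bjo&#x308;rn" />'

# After parsing, the attribute value is a sequence of Unicode characters.
ascii_value = ET.fromstring(ascii_doc).get("bar")
utf8_value = ET.fromstring(utf8_doc).get("bar")

# Both values are "Bjo" + U+0308 + "rn"; the declared encoding left no trace.
print(ascii_value == utf8_value)
print([hex(ord(ch)) for ch in ascii_value])
```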

Implementations of RFC 3987 must violate this constraint if the bar
attribute contains an IRI reference:

  Applications MUST map IRIs to URIs by using the following two steps.

  Step 1.  Generate a UCS character sequence from the original IRI
           format.  This step has the following three variants,
           depending on the form of the input:
  ...
           b. If the IRI is in some digital representation (e.g., an
              octet stream) in some known non-Unicode character
              encoding, convert the IRI to a sequence of characters
              from the UCS normalized according to NFC.

           c. If the IRI is in a Unicode-based character encoding (for
              example, UTF-8 or UTF-16), do not normalize (see section
              5.3.2.2 for details).  Apply step 2 directly to the
              encoded Unicode character sequence.

While this does not really define processing in trivial cases like

  foo.ent:
  <?xml version="1.0" encoding="us-ascii"?>
  <!ENTITY bar "Bjo&#x308;rn">

  foo.xml:
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE foo SYSTEM "foo.ent">
  <foo bar="&bar;" />

or

  foo.dtd:
  <?xml version="1.0" encoding="us-ascii"?>
  <!ATTLIST foo bar CDATA #FIXED "Bjo&#x308;rn">

  foo.xml:
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE foo SYSTEM "foo.dtd">
  <foo />

it is clear that RFC 3987 requires encoding-dependent text processing
behavior, which is prohibited by the reference processing model [1]. This
aspect of the reference processing model is very important; you can't
really implement anything else in a sane manner.
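
To make the encoding dependence concrete, here is a rough sketch of the two-step mapping quoted above. The function name iri_to_uri and its from_unicode_encoding flag are invented for this illustration and are not part of RFC 3987, and step 2 is simplified to percent-encoding every non-ASCII character as UTF-8. Under the reading given in this message, the same attribute value maps to two different URIs depending only on the encoding of the document it came from.

```python
import unicodedata


def iri_to_uri(iri: str, from_unicode_encoding: bool) -> str:
    """Hypothetical helper modelling the quoted steps of RFC 3987."""
    # Step 1b: input from a non-Unicode encoding is normalized to NFC.
    # Step 1c: input from a Unicode-based encoding is left unnormalized.
    if not from_unicode_encoding:
        iri = unicodedata.normalize("NFC", iri)

    # Step 2 (simplified assumption): percent-encode each non-ASCII
    # character using the octets of its UTF-8 representation.
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


# The attribute value from the examples: "o" followed by U+0308.
value = "Bjo\u0308rn"

from_ascii_document = iri_to_uri(value, from_unicode_encoding=False)
from_utf8_document = iri_to_uri(value, from_unicode_encoding=True)

print(from_ascii_document)  # Bj%C3%B6rn
print(from_utf8_document)   # Bjo%CC%88rn
```

The two results identify different resources in general, which is exactly the kind of encoding-dependent outcome that C014 rules out.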

[1] Unless you try to argue that text processing occurs only at, e.g.,
    some octets-to-Infoset level and IRI-to-URI processing is thus not
    constrained by C014, or that the requirement does not apply to XML
    at all, because it's all read into a DOM and thus all text is in a
    Unicode encoding before IRI-to-URI processing can occur.

This isn't really news.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 13 December 2005 13:50:51 UTC