Re: IRIEverywhere-27 (was: Re: Agenda of 13 December 2005 TAG teleconference)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 13 Dec 2005 14:50:47 +0100
To: "Felix Sasaki" <fsasaki@w3.org>
Cc: www-tag@w3.org
Message-ID: <qfitp19u9qs4eib2ss166gbphht88481ki@hive.bjoern.hoehrmann.de>

* Felix Sasaki wrote:
>> As XML and most formats based on XML allow use of non-Unicode encodings,
>> allowing IRIs in such formats would make the formats inconsistent with
>> the architectural requirements set forth in the reference processing
>> model http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefProcModel
>> and http://www.w3.org/TR/2005/REC-charmod-20050215/#C014 in particular.
>Could you please elaborate why - in your opinion - the use of IRIs is  
>against the reference processing model?

  Specifications MAY choose to disallow or deprecate some character
  encodings and to make others mandatory. Independent of the actual
  character encoding, the specified behavior MUST be the same as if
  the processing happened as follows: 

    * The character encoding of any textual data object received
      by the application implementing the specification MUST be
      determined and the data object MUST be interpreted as a
      sequence of Unicode characters - this MUST be equivalent to
      transcoding the data object to some Unicode encoding form,
      adjusting any character encoding label if necessary, and
      receiving it in that Unicode encoding form.

Which is to say, if you have

  <?xml version="1.0" encoding="us-ascii"?>
  <foo bar="Bjo&#x308;rn" />

processing must be as if you process

  <?xml version="1.0" encoding="utf-8"?>
  <foo bar="Bjo&#x308;rn" />
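To make that concrete, here is a minimal sketch using Python's standard
xml.etree.ElementTree parser (my choice for illustration, nothing charmod
prescribes; the helper name attribute_code_points is hypothetical). It
parses both documents from bytes and shows that the attribute value is
the same sequence of Unicode characters either way, with U+0308 left as
a separate combining character.

```python
import xml.etree.ElementTree as ET

# The two documents from above, differing only in the declared encoding.
ASCII_DOC = b'<?xml version="1.0" encoding="us-ascii"?>\n<foo bar="Bjo&#x308;rn" />'
UTF8_DOC = b'<?xml version="1.0" encoding="utf-8"?>\n<foo bar="Bjo&#x308;rn" />'


def attribute_code_points(document: bytes) -> list:
    """Return the code points of the bar attribute on the root element."""
    value = ET.fromstring(document).get("bar")
    return [f"U+{ord(ch):04X}" for ch in value]


if __name__ == "__main__":
    print(attribute_code_points(ASCII_DOC))
    print(attribute_code_points(UTF8_DOC))
    # Both lines list the same six code points, including U+0308.
```

Nothing downstream of the parser can tell which encoding the bytes
arrived in, which is exactly what the reference processing model asks for.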

Implementations of RFC 3987 must violate this constraint if the bar
attribute contains an IRI reference, since RFC 3987 requires:

  Applications MUST map IRIs to URIs by using the following two steps.

  Step 1.  Generate a UCS character sequence from the original IRI
           format.  This step has the following three variants,
           depending on the form of the input:
           b. If the IRI is in some digital representation (e.g., an
              octet stream) in some known non-Unicode character
              encoding, convert the IRI to a sequence of characters
              from the UCS normalized according to NFC.

           c. If the IRI is in a Unicode-based character encoding (for
              example, UTF-8 or UTF-16), do not normalize (see section
              for details).  Apply step 2 directly to the
              encoded Unicode character sequence.

While this does not really define processing in trivial cases like

  <?xml version="1.0" encoding="us-ascii"?>
  <!ENTITY bar "Bjo&#x308;rn">

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE foo SYSTEM "foo.ent">
  <foo bar="&bar;" />


  <?xml version="1.0" encoding="us-ascii"?>
  <!ATTLIST foo bar CDATA #FIXED "Bjo&#x308;rn">

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE foo SYSTEM "foo.dtd">
  <foo />

it is clear that RFC 3987 requires encoding-dependent text processing
behavior, which is prohibited by the reference processing model [1]. This
aspect of the reference processing model is very important; you can't
really implement anything else in a sane manner.
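The divergence can be sketched in a few lines of Python. This is a rough
illustration under stated assumptions, not an RFC 3987 implementation:
the function name iri_to_uri_component and its boolean parameter are
hypothetical, and step 2 is reduced to percent-encoding the UTF-8 octets
of a single path segment via urllib.parse.quote.

```python
import unicodedata
from urllib.parse import quote


def iri_to_uri_component(text: str, source_is_unicode_encoding: bool) -> str:
    """Map one IRI component to its URI form, following variants b and c."""
    # Variant b: content from a non-Unicode encoding is normalized to NFC.
    # Variant c: content from UTF-8 or UTF-16 is left as it is.
    if not source_is_unicode_encoding:
        text = unicodedata.normalize("NFC", text)
    # Step 2 (simplified): percent-encode the UTF-8 octets. With safe=""
    # reserved characters are escaped too, so this only suits one segment.
    return quote(text, safe="")


if __name__ == "__main__":
    value = "Bjo\u0308rn"  # what an XML parser reports for Bjo&#x308;rn
    print(iri_to_uri_component(value, source_is_unicode_encoding=True))
    # prints Bjo%CC%88rn
    print(iri_to_uri_component(value, source_is_unicode_encoding=False))
    # prints Bj%C3%B6rn
```

The same character sequence yields two different URIs depending only on
the encoding label of the document it came from.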

[1] Unless you'd try to argue that text processing occurs only at e.g.
    some octets-to-Infoset level and IRI-to-URI processing is thus not
    constrained by C014, or if you argue that the requirement does not
    apply to XML at all, because it's all read into a DOM and thus all
    text is in a Unicode-encoding before IRI-to-URI processing can

This isn't really news.
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Tuesday, 13 December 2005 13:50:51 UTC
