Re: IRIEverywhere-27 (was: Re: Agenda of 13 December 2005 TAG teleconference) from Felix Sasaki on 2005-12-13 (www-tag@w3.org from December 2005)

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 13 Dec 2005 23:09:21 +0900
To: "Bjoern Hoehrmann" <derhoermi@gmx.net>
Cc: www-tag@w3.org, "www-international@w3.org" <www-international@w3.org>
Message-ID: <op.s1p6xvj5x1753t@ibm-60d333fc0ec.customers.eurospot.com>
Some comments below.

On Tue, 13 Dec 2005 22:50:47 +0900, Bjoern Hoehrmann
<derhoermi@gmx.net> wrote:

>
> * Felix Sasaki wrote:
>>> As XML and most formats based on XML allow use of non-Unicode  
>>> encodings,
>>> allowing IRIs in such formats would make the formats inconsistent with
>>> the architectural requirements set forth in the reference processing
>>> model http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefProcModel
>>> and http://www.w3.org/TR/2005/REC-charmod-20050215/#C014 in particular.
>>
>> Could you please elaborate why - in your opinion - the use of IRIs is
>> against the reference processing model?
>
>   Specifications MAY choose to disallow or deprecate some character
>   encodings and to make others mandatory. Independent of the actual
>                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
>   character encoding, the specified behavior MUST be the same as if
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   the processing happened as follows:
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>     * The character encoding of any textual data object received
>       by the application implementing the specification MUST be
>       determined and the data object MUST be interpreted as a
>       sequence of Unicode characters - this MUST be equivalent to
>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
>       transcoding the data object to some Unicode encoding form,
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       adjusting any character encoding label if necessary, and
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       receiving it in that Unicode encoding form.
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Which is to say, if you have
>
>   <?xml version="1.0" encoding="us-ascii"?>
>   <foo bar="Bjo&#x308;rn" />
>
> processing must be as if you process
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <foo bar="Bjo&#x308;rn" />
>
> Implementations of RFC 3987 must violate this constraint if the bar
> attribute contains an IRI Reference,
>
>   Applications MUST map IRIs to URIs by using the following two steps.
>
>   Step 1.  Generate a UCS character sequence from the original IRI
>            format.  This step has the following three variants,
>            depending on the form of the input:
>   ...
>            b. If the IRI is in some digital representation (e.g., an
>               octet stream) in some known non-Unicode character
>               encoding, convert the IRI to a sequence of characters
>               from the UCS normalized according to NFC.
>
>            c. If the IRI is in a Unicode-based character encoding (for
>               example, UTF-8 or UTF-16), do not normalize (see section
>               5.3.2.2 for details).  Apply step 2 directly to the
>               encoded Unicode character sequence.
>
> While this does not really define processing in trivial cases like
>
>   foo.ent:
>   <?xml version="1.0" encoding="us-ascii"?>
>   <!ENTITY bar "Bjo&#x308;rn">
>
>   foo.xml
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE foo SYSTEM "foo.ent">
>   <foo bar="&bar;" />
>
> or
>
>   foo.dtd:
>   <?xml version="1.0" encoding="us-ascii"?>
>   <!ATTLIST foo bar CDATA #FIXED "Bjo&#x308;rn">
>
>   foo.xml
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE foo SYSTEM "foo.dtd">
>   <foo bar="&bar;" />
>
> it is clear that RFC 3987 requires encoding-dependent text processing
> behavior, which is prohibited by the reference processing model [1]. This
> aspect of the reference processing model is very important; you can't
> really implement something else in a sane manner.
>
> [1] Unless you'd try to argue that text processing occurs only at e.g.
>     some octets-to-Infoset level and IRI-to-URI processing is thus not
>     constrained by C014, or if you argue that the requirement does not
>     apply to XML at all, because it's all read into a DOM and thus all
>     text is in a Unicode-encoding before IRI-to-URI processing can
>     occur.
>
> This isn't really news.

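For reference, the encoding dependence described in the quoted text can be
sketched roughly as follows. This is only an illustration of step 1,
variants b and c, as quoted above, not an implementation of RFC 3987. The
names step1, step2 and UNICODE_ENCODINGS are made up for this example, the
set of labels treated as "Unicode-based" is an assumption, and step 2 is
reduced to percent-encoding non-ASCII characters as UTF-8.

```python
import codecs
import unicodedata
from urllib.parse import quote

# Assumption: which labels count as "Unicode-based" is our own choice here.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}


def step1(chars: str, source_encoding: str) -> str:
    """Step 1, variants b and c: NFC-normalize only non-Unicode sources."""
    if codecs.lookup(source_encoding).name in UNICODE_ENCODINGS:
        return chars  # variant c: do not normalize
    return unicodedata.normalize("NFC", chars)  # variant b: normalize to NFC


def step2(chars: str) -> str:
    """Simplified step 2: percent-encode non-ASCII characters as UTF-8."""
    return quote(chars, safe="/:?#[]@!$&'()*+,;=%-._~")


# The attribute value after the parser has resolved &#x308; is the same
# character sequence in both documents; only the encoding label differs.
value = "Bjo\u0308rn"
from_utf8 = step2(step1(value, "utf-8"))      # "Bjo%CC%88rn"
from_ascii = step2(step1(value, "us-ascii"))  # "Bj%C3%B6rn"
```

The same character sequence maps to two different URIs depending only on
the encoding declaration of the document it came from.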

Yes. I asked you to elaborate to see whether you have new arguments
beyond the ones you have given over the last half year on the drafts of
CSS, XSL, .... It seems that you don't. And I am sure that you remember
Martin's answer(s) on the issue: the *must not* of the normalization step
for encodings which are already in Unicode is something he is willing to
discuss. Nevertheless, you are throwing out the baby with the bathwater.
People will not use W3C technology if they are not allowed to use IRIs
*now*. The number of people who will suffer from your proposal not to
adopt IRIs is higher than the number of people who might suffer from the
issues you mention (I am currently at a conference on localization and
internationalization). Please, please don't provide more examples of
rare cases where IRIs might have problems, but see that this standard,
which is 99.9% very well made, is deeply needed by specification
developers, technology implementers and, after all, users.

Regards, Felix.
Received on Tuesday, 13 December 2005 14:10:44 UTC