Re: IRIEverywhere-27 from Jeremy Carroll on 2005-12-13 (www-international@w3.org from October to December 2005)

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Tue, 13 Dec 2005 14:42:05 +0000
To: Felix Sasaki <fsasaki@w3.org>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <439EDDBD.5080909@hpl.hp.com>
I haven't followed Bjoern's examples in the past ... so forgive me if 
this is a repeat of someone else's comments.

I found the example uncompelling.


With
 >>   foo.ent:
 >>   <?xml version="1.0" encoding="us-ascii"?>
 >>   <!ENTITY bar "Bjo&#x308;rn">
 >>
 >>   foo.xml
 >>   <?xml version="1.0" encoding="utf-8"?>
 >>   <!DOCTYPE foo SYSTEM "foo.ent">
 >>   <foo bar="&bar;" />

A normalizing transcoder from us-ascii to Unicode is used to read the 
XML and creates a sequence of Unicode code points B, j, o, &#x308;, r, n 
(the interesting subsequence). This is already in Unicode at the point 
that it is viewed as an IRI, and so the two specifications say the same 
thing: no further normalization.
i.e. the &#x308; is an explicit Unicode character and so

 >>            c. If the IRI is in a Unicode-based character encoding (for
 >>               example, UTF-8 or UTF-16), do not normalize (see section
 >>               5.3.2.2 for details).  Apply step 2 directly to the
 >>               encoded Unicode character sequence.

must be used by the time we are interpreting the &#x308; as an NCR.
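
To make this concrete, here is a minimal Python sketch. It is an illustration only and is not taken from either specification: the helper names (read_bar, percent_encode_non_ascii, map_to_uri) are hypothetical, the two documents are the inline examples from the quoted message rather than the external-entity case, and the percent-encoding is a simplified stand-in for step 2 of the RFC 3987 mapping. It assumes a parser that resolves the character reference to a code point and does no normalization of its own, which is how Python's standard-library XML parser behaves.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The two inline documents from the quoted message; only the declared encoding differs.
ASCII_DOC = b'<?xml version="1.0" encoding="us-ascii"?><foo bar="Bjo&#x308;rn" />'
UTF8_DOC = b'<?xml version="1.0" encoding="utf-8"?><foo bar="Bjo&#x308;rn" />'


def read_bar(doc: bytes) -> str:
    """Parse the document and return the 'bar' attribute as a Unicode string."""
    return ET.fromstring(doc).attrib["bar"]


def percent_encode_non_ascii(text: str) -> str:
    """Simplified step 2: percent-encode the UTF-8 octets of non-ASCII characters."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def map_to_uri(text: str, normalize: bool) -> str:
    """normalize=True mimics variant (b) (NFC first); False mimics variant (c)."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return percent_encode_non_ascii(text)


from_ascii = read_bar(ASCII_DOC)
from_utf8 = read_bar(UTF8_DOC)

# Both parses yield the same code points: the character reference is an
# explicit U+0308, whatever the declared encoding of the file.
print([f"U+{ord(c):04X}" for c in from_ascii])
print(from_ascii == from_utf8)

# Variant (c): no normalization, the combining character survives.
print(map_to_uri(from_ascii, normalize=False))
# Variant (b): NFC composes o + U+0308 into U+00F6 before encoding.
print(map_to_uri(from_ascii, normalize=True))
```

The two parses agree, so by the time the attribute value is treated as an IRI it is already a Unicode sequence and variant (c) is the one that applies. The last two lines only show what would differ if variant (b) were applied instead.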

Jeremy


Felix Sasaki wrote:
> 
> Some comments below. On Tue, 13 Dec 2005 22:50:47 +0900, Bjoern 
> Hoehrmann <derhoermi@gmx.net> wrote:
> 
>>
>> * Felix Sasaki wrote:
>>>> As XML and most formats based on XML allow use of non-Unicode 
>>>> encodings,
>>>> allowing IRIs in such formats would make the formats inconsistent with
>>>> the architectural requirements set forth in the reference processing
>>>> model http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefProcModel
>>>> and http://www.w3.org/TR/2005/REC-charmod-20050215/#C014 in particular.
>>>
>>> Could you please elaborate why - in your opinion - the use of IRIs is
>>> against the reference processing model?
>>
>>   Specifications MAY choose to disallow or deprecate some character
>>   encodings and to make others mandatory. Independent of the actual
>>                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
>>   character encoding, the specified behavior MUST be the same as if
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>   the processing happened as follows:
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>     * The character encoding of any textual data object received
>>       by the application implementing the specification MUST be
>>       determined and the data object MUST be interpreted as a
>>       sequence of Unicode characters - this MUST be equivalent to
>>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>       transcoding the data object to some Unicode encoding form,
>>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>       adjusting any character encoding label if necessary, and
>>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>       receiving it in that Unicode encoding form.
>>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> Which is to say, if you have
>>
>>   <?xml version="1.0" encoding="us-ascii"?>
>>   <foo bar="Bjo&#x308;rn" />
>>
>> processing must be as if you process
>>
>>   <?xml version="1.0" encoding="utf-8"?>
>>   <foo bar="Bjo&#x308;rn" />
>>
>> Implementations of RFC 3987 must violate this constraint if the bar
>> attribute contains an IRI Reference,
>>
>>   Applications MUST map IRIs to URIs by using the following two steps.
>>
>>   Step 1.  Generate a UCS character sequence from the original IRI
>>            format.  This step has the following three variants,
>>            depending on the form of the input:
>>   ...
>>            b. If the IRI is in some digital representation (e.g., an
>>               octet stream) in some known non-Unicode character
>>               encoding, convert the IRI to a sequence of characters
>>               from the UCS normalized according to NFC.
>>
>>            c. If the IRI is in a Unicode-based character encoding (for
>>               example, UTF-8 or UTF-16), do not normalize (see section
>>               5.3.2.2 for details).  Apply step 2 directly to the
>>               encoded Unicode character sequence.
>>
>> While this does not really define processing in trivial cases like
>>
>>   foo.ent:
>>   <?xml version="1.0" encoding="us-ascii"?>
>>   <!ENTITY bar "Bjo&#x308;rn">
>>
>>   foo.xml
>>   <?xml version="1.0" encoding="utf-8"?>
>>   <!DOCTYPE foo SYSTEM "foo.ent">
>>   <foo bar="&bar;" />
>>
>> or
>>
>>   foo.dtd:
>>   <?xml version="1.0" encoding="us-ascii"?>
>>   <!ATTLIST foo bar CDATA #FIXED "Bjo&#x308;rn">
>>
>>   foo.xml
>>   <?xml version="1.0" encoding="utf-8"?>
>>   <!DOCTYPE foo SYSTEM "foo.dtd">
>>   <foo bar="&bar;" />
>>
>> it is clear that RFC 3987 requires encoding-dependent text processing
>> behavior, which is prohibited by the reference processing model [1]. This
>> aspect of the reference processing model is very important, you can't
>> really implement something else in a sane manner.
>>
>> [1] Unless you'd try to argue that text processing occurs only at e.g.
>>     some octets-to-Infoset level and IRI-to-URI processing is thus not
>>     constrained by C014, or if you argue that the requirement does not
>>     apply to XML at all, because it's all read into a DOM and thus all
>>     text is in a Unicode-encoding before IRI-to-URI processing can
>>     occur.
>>> This isn't really news.
> 
> 
> Yes. I asked for your elaboration to see if you have new arguments, 
> compared to the ones you gave in the last half year to the drafts of 
> CSS, XSL, .... It seems that you don't. And I am sure that you remember 
> Martin's answer(s) on the issue: The *must not* of the normalization 
> step for encodings which are already in Unicode is something which he is 
> willing to discuss. Nevertheless, you are throwing out the baby with the 
> bathwater. People will not use W3C technology if they are not allowed to 
> use IRIs *now*. The number of people who will suffer from your proposal 
> not to adopt IRI is higher than the number of people who might suffer 
> from the issues you are mentioning (/me currently on a conference on 
> localization and internationalization). Please, please don't provide 
> more examples of rare cases where IRIs might have problems, but see 
> that this standard, which has been designed very well in 99.9% of cases, is deeply 
> needed - by specification developers, technology implementers and - 
> after all - users.
> 
> Regards, Felix.
> 
>
Received on Tuesday, 13 December 2005 14:45:00 UTC