Re: [CSS21] uri() from Bjoern Hoehrmann on 2005-05-07 (www-style@w3.org from May 2005)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 07 May 2005 16:33:03 +0200
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: www-style@w3.org, public-i18n-core@w3.org
Message-ID: <4288a9d3.153550218@smtp.bjoern.hoehrmann.de>
* Martin Duerst wrote:
> >For an example of the difference, see:
> >
> >   http://lists.w3.org/Archives/Public/www-style/2005Mar/0102
>
>This example is well worked out, but quite theoretical.

That's a highly theoretical remark as implementations have to implement
this regardless of how theoretical the example might be. It's in fact
far more likely that implementations encounter cases like this one than
style sheets in windows-1258 that depend on implementations doing NFC
normalization.

>The original character encoding is part of the Infoset. See point 6 at
>http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.document.
>Given this, I don't think that claims like "nigh-on-impossible" are
>justified. Getting at that info may be a bit of a hack in some
>implementations, but "nigh-on-impossible" it is not.

There is no finite deterministic algorithm that maps each possible input
to a well-defined consistent result. It is thus by definition impossible
to interoperably implement the requirement. Such an algorithm would need
to define

  * what is considered a "non-Unicode character encoding"
  * when an IRI is considered to be in such an encoding
  * what happens if the IRI comes from a distinct textual data object
  * which version of NFC is to be used
  * etc.

none of which RFC 3987 clearly defines. Much worse, it assumes that it
is possible to get from the "IRI" to information about its digital
encoding, whereas most deployed architecture assumes that this is not
relevant. Note that Ian said it is nigh-on impossible to implement
this *sanely*, not to implement it at all. I strongly agree with that
and have yet to see evidence to the contrary. It is in my opinion highly
unrealistic to expect implementers to add this complexity to their
implementations (possibly breaking existing content) just to cater for a
few users who create content that

  * uses some legacy encoding against advice
  * is not NFC normalized against advice
  * breaks in many implementations
  * breaks in many usage scenarios
  * etc.

Anyway, maybe you could explain how adopting IRIs in CSS would not make
the specification non-conforming to http://www.w3.org/TR/charmod/#C014?
My understanding of item 3 ("this MUST be equivalent to transcoding the
data object to some Unicode encoding form, adjusting any character
encoding label if necessary, and receiving it in that Unicode encoding
form") is that processing of

  @charset "iso-8859-1";
  element { background-image: url(Bjo\000308rn) }

must be defined such that it is equivalent to processing

  @charset "utf-8";
  element { background-image: url(Bjo\000308rn) }

while my understanding of RFC 3987 is that it must not be equivalent.
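
To make the difference concrete, here is a minimal sketch of the two
possible outcomes for the path above. It assumes the reading of RFC 3987
given in this message (NFC is applied only when the IRI comes from a
non-Unicode encoding); the helper names toUriWithoutNfc and toUriWithNfc
are made up for illustration and do not come from any specification.

```javascript
// Sketch only: compares UTF-8 percent-encoding with and without NFC.
// toUriWithoutNfc / toUriWithNfc are hypothetical helper names.

// "Bjo" + U+0308 COMBINING DIAERESIS + "rn", i.e. what the escape in
// url(Bjo\000308rn) denotes after CSS unescaping.
const decomposed = "Bjo\u0308rn";

// Percent-encode the UTF-8 bytes of the characters exactly as given.
function toUriWithoutNfc(iri) {
  return encodeURI(iri);
}

// Normalize to NFC first, then percent-encode the UTF-8 bytes.
function toUriWithNfc(iri) {
  return encodeURI(iri.normalize("NFC"));
}

console.log(toUriWithoutNfc(decomposed)); // Bjo%CC%88rn
console.log(toUriWithNfc(decomposed));    // Bj%C3%B6rn
```

The two results name different resources, which is why the choice
cannot depend on the encoding label if the two style sheets are to be
processed equivalently.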

>But such feedback should not be used to throw
>out the baby with the bathwater. Making it clear that CSS interprets
>IRIs using UTF-8, rather than the encoding of the document (such
>implementations still exist, although you claim that getting at
>that encoding is "nigh-on-impossible") is a very high priority.

I do not know of any CSS implementation that does this for e.g.

  @charset "us-ascii";
  element { background-image: url(Björn) }

as that is obviously impossible: US-ASCII has no representation for
'ö'. So these implementations do something different from what you
think they do. When I tested this two years ago, my results were

  * Internet Explorer 6.0 SP 1 for Windows
    -> fails tests containing \F6 in path/query or björn in query:
       url(bj\F6rn)       => bj/F6rn
       url(björn?bj\F6rn) => bj%C3%B6rn?bj\F6rn
       url(björn?björn)   => bj%C3%B6rn?bj<F6>rn (<F6> is byte 0xF6)

  * Opera 7.11 for Windows
    -> fails tests containing unescaped/unquoted 'ö's:
       url(björn)         => ignored
       url(bj\F6rn?björn) => ignored

  * Amaya 7.0 for Windows
    -> fails all tests:
       url(björn)         => bj%f6rn
       url("björn")       => bj%f6rn
       url('björn')       => 'bj%f6rn'
       url('björn#björn') => 'bj%f6rn

  * Mozilla 1.3a for Windows
    -> passes all tests
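
For reference, the two percent-encodings that show up in these results
can be reproduced with the sketch below. It assumes that the expected
result is the UTF-8 based form (bj%C3%B6rn) and that the %f6 form comes
from encoding the character as a single ISO-8859-1 byte; the function
name percentEncodeLatin1 is made up for illustration.

```javascript
// Hypothetical helper: percent-encode each non-ASCII character as one
// ISO-8859-1 byte, using lowercase hex as in the Amaya results.
function percentEncodeLatin1(text) {
  let out = "";
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code < 0x80) {
      out += ch; // ASCII passes through unchanged in this sketch
    } else if (code <= 0xff) {
      out += "%" + code.toString(16).padStart(2, "0");
    } else {
      throw new RangeError("not representable in ISO-8859-1: " + ch);
    }
  }
  return out;
}

const name = "bj\u00F6rn"; // "björn" with a precomposed U+00F6

console.log(encodeURI(name));           // bj%C3%B6rn (UTF-8 bytes C3 B6)
console.log(percentEncodeLatin1(name)); // bj%f6rn    (ISO-8859-1 byte F6)
```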

(IIRC MacIE failed all tests as well, but for different reasons, and
Safari passed all or most tests, where "all tests" means the cited
cases in all possible variations; this did not include NFD/NFC tests,
but more recent tests indicate all of them would fail). What do the
implementations you mention do if the style sheet is changed through
scripting like

  var ss = document.styleSheets.item(0);
  var ln = ss.cssRules.length;
  ss.insertRule("#test1 { background-image: url(Björn) }", ln);
  // "\\" so that the CSS parser, not the script parser, sees the escape \308
  ss.insertRule("#test2 { background-image: url(Bjo\\308rn) }", ln);

from internal/external scripts and style sheets in different
encodings? I do not know what RFC 3987 might require in this case,
but it seems unlikely they would behave as you describe. But as you
apparently wrote tests for this, maybe you could contribute them to
the CSS Working Group?
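
The second rule relies on a CSS escape, so the string a script passes
in goes through two layers of unescaping: the script's own string
syntax first, then CSS. As a rough illustration of the CSS layer, here
is a simplified sketch of CSS 2.1 escape decoding. The function name
decodeCssEscapes is made up, and edge cases (U+0000, code points above
U+10FFFF, backslash-newline inside strings) are deliberately ignored.

```javascript
// Simplified sketch of CSS 2.1 escape decoding (hypothetical helper).
// A backslash followed by 1-6 hex digits, optionally terminated by one
// whitespace character, denotes that code point; a backslash followed
// by any other non-newline character denotes that character itself.
function decodeCssEscapes(text) {
  return text.replace(
    /\\(?:([0-9a-fA-F]{1,6})(?:\r\n|[ \t\r\n\f])?|([^\r\n\f]))/gu,
    (match, hex, literal) =>
      hex !== undefined ? String.fromCodePoint(parseInt(hex, 16)) : literal
  );
}

// In script source each CSS backslash has to be written as "\\".
console.log(decodeCssEscapes("Bjo\\308rn") === "Bjo\u0308rn");    // true
console.log(decodeCssEscapes("Bjo\\000308rn") === "Bjo\u0308rn"); // true
console.log(decodeCssEscapes("bj\\F6rn") === "bj\u00F6rn");       // true
```

Note that the first two inputs decode to the decomposed sequence
o + U+0308, while the last decodes to the precomposed U+00F6, so the
question of whether and when NFC is applied still remains after
unescaping.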

Proper IRI testing would require thousands of tests (NFC, IDN, error
handling, cross-technology tests involving at least the CSS DOM and
fragment identifier unescaping, e.g. when using some SVG fragment as
background-image, dealing with base IRIs, many character encodings,
character encoding scheme detection, e.g. when the encoding scheme is
determined by a charset attribute on an <a> element three documents
ago, etc.). It would sure help a lot to have a complete test suite, both
to implement this in browsers and to specify it properly.

Proposals for text to include in CSS 3 would help too. We need to
specify, e.g., what happens if a string is not a proper IRI: is it
considered an illegal value and ignored per CSS, or should it be
defined as in SVG, where implementations are not required to check
for malformed IRIs but rather implement random behavior instead,
or should we define generic error recovery requirements?
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Saturday, 7 May 2005 14:32:35 UTC