Re: SVG12: IRI Processing rules and xlink:href from Bjoern Hoehrmann on 2007-07-03 (public-iri@w3.org from July 2007)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 03 Jul 2007 11:24:31 +0200
To: Addison Phillips <addison@yahoo-inc.com>
Cc: Martin Duerst <duerst@it.aoyama.ac.jp>, public-i18n-core@w3.org, public-iri@w3.org
Message-ID: <9u2k83lr95ldonhp7a066kfhd5dae22gnk@hive.bjoern.hoehrmann.de>

* Addison Phillips wrote:
>I don't understand this comment, however. I'm not sure what disputes 
>could arise here, since this section specifies a process for mapping
>IRIs to URIs. It doesn't specify that any particular Unicode encoding 
>(or that it use one at all), but it does require that the text be a 
>sequence of characters in the Unicode character set. I note that the 
>whole of XML, for example, is based on this exact same idea [1].

My application receives the following document over HTTP:

  <?xml version="1.0" encoding="encoding-a"?>
  <!DOCTYPE foo SYSTEM "part2.dtd" [
    <!ENTITY part1 "...">
  ]>
  <foo bar="&part1;&part2;" />

Where part2.dtd is:

  <?xml version="1.0" encoding="encoding-b"?>
  <!ENTITY part2 "...">

The 'bar' attribute holds an IRI reference and my application needs
to convert it into a URI. What does the resulting URI look like? If
the answer depends on information I've not given, please list all
possible results and why they are consistent with RFC 3987 and the
W3C Character Model. A pseudo-code implementation of the process
would be best.

I believe the possible answers are

  1. concat($part1, $part2)
  2. nfc(concat($part1, $part2))
  3. concat(nfc($part1), $part2)
  4. concat($part1, nfc($part2))

where each option has two possible outcomes, depending on whether
you resolve character references before or after NFC normalization.
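
To make the divergence concrete, here is a minimal Python sketch. The
entity values, the helper names (nfc, resolve_char_refs, to_uri) and the
simplified percent-encoding step are assumptions chosen for illustration;
they are not taken from RFC 3987 or from the documents above. The values
are picked so that a combining mark sits exactly at the entity boundary.

```python
import re
import unicodedata
from urllib.parse import quote


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def resolve_char_refs(s: str) -> str:
    # Expands only hexadecimal references such as "&#x301;",
    # which is all this sketch needs.
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), s)


def to_uri(iri: str) -> str:
    # Simplified stand-in for the IRI-to-URI mapping: percent-encode
    # the UTF-8 bytes of every character outside the URI repertoire.
    return quote(iri, safe="/:?#[]@!$&'()*+,;=%")


# Hypothetical replacement text for the two entities.
part1 = "cafe\u0301/e"      # decomposed e-acute, then a bare "e"
part2 = "\u0301/u\u0308"    # leading combining acute, decomposed u-umlaut

candidates = {
    1: part1 + part2,
    2: nfc(part1 + part2),
    3: nfc(part1) + part2,
    4: part1 + nfc(part2),
}

for number, iri in candidates.items():
    print(number, to_uri(iri))

# Timing of character reference resolution, shown on a smaller input:
raw = "e" + "&#x301;"
refs_first = nfc(resolve_char_refs(raw))   # reference expanded, then NFC
nfc_first = resolve_char_refs(nfc(raw))    # NFC sees only ASCII text
print(to_uri(refs_first), to_uri(nfc_first))
```

With these inputs the four candidates map to four different URIs, and
the order of character reference resolution changes the result again.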

Note in particular that even if encoding-a and encoding-b are UTF-8,
I understand Martin Dürst to argue the result would be 1 (you do not
normalize because UTF-8 is a Unicode encoding) while e.g. Henry S.
Thompson argues the result would be 2 (in the "Infoset", all values
are sequences of Unicode code points, so you must normalize, cf. 3.1
step 1, option a).

A third position is that you don't normalize because the application
operates on a DOM where everything is UTF-16 encoded. You can find
proponents of all three positions in the W3C list archive.

The only appropriate resolution to this problem is to define that
IRIs are inherently a sequence of Unicode code points and you never
normalize them. Where normalization of data formats containing IRIs
is important, normalizing transcoders are used before you see any
IRI in instances of the data formats.
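
A rough Python sketch of that division of labour follows. The function
names normalizing_transcode and iri_to_uri, the example.org address and
the simplified percent-encoding are assumptions for illustration only;
the point is that NFC is applied once at the transcoding boundary and
the IRI step itself only maps code points to octets.

```python
import unicodedata
from urllib.parse import quote


def normalizing_transcode(raw: bytes, source_encoding: str) -> str:
    """Decode raw bytes and apply NFC once, at the transcoding boundary."""
    return unicodedata.normalize("NFC", raw.decode(source_encoding))


def iri_to_uri(iri: str) -> str:
    """Map code points to a URI without normalizing them (simplified)."""
    return quote(iri, safe="/:?#[]@!$&'()*+,;=%")


# Hypothetical input containing a decomposed e-acute.
decomposed = "http://example.org/cafe\u0301"
wire_bytes = decomposed.encode("utf-8")

# Data that passed through the normalizing transcoder:
normalized_uri = iri_to_uri(normalizing_transcode(wire_bytes, "utf-8"))

# Data handed to the IRI step as-is keeps its code points:
untouched_uri = iri_to_uri(decomposed)

print(normalized_uri)
print(untouched_uri)
```

Under this split, iri_to_uri gives the same answer for a given sequence
of code points regardless of where the data came from.
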
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Tuesday, 3 July 2007 09:24:44 UTC