RE: Advice on making IRI document suitable for reference by HTML (and other specs) from Phillips, Addison on 2009-12-29 (public-iri@w3.org from December 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 29 Dec 2009 11:24:15 -0800
To: "Roy T. Fielding" <fielding@gbiv.com>, Larry Masinter <masinter@adobe.com>
CC: "julian.reschke@gmx.de" <julian.reschke@gmx.de>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA412994E6E91@EX-IAD6-B.ant.amazon.com>

(this is a personal response)

Roy T. Fielding wrote:

> >> and then change HTML5 so that it uses
> >> anyString (or whatever you want to call it) as the attribute
> >> definition.
> >
> > That's what was intended by:
> > http://lists.w3.org/Archives/Public/public-html/2009Nov/att-0670/iri-rewrite-draft.html
> > Do you think this is the right direction, then?
> 
> I think it would be easier to simply define how to process
> a Web reference (not an address yet) into a Web address in
> the form of an IRI or URI.

I prefer Roy's interpretation here. Mapping a reference (such as href) to an IRI (or URI) makes more sense than trying to stretch IRI to cover all possible reference forms. If nothing else, some references turn out to be invalid.

> Furthermore,
> what do we do then for documents that are not Unicode based,
> do not have references that are Unicode based, and will not
> work with IRI conversion to UTF-8?  Should those be called
> IRIs as well?

HTML (and XML) documents are "Unicode-based" (the document character set is Unicode/ISO 10646), even if the character encoding used to serialize the document is not an encoding of Unicode. Few modern document formats are not Unicode-based. But for any document structured enough to represent an IRI, the first step is always to map to a sequence of Unicode characters. This works even when the document is a cocktail napkin or the side of a bus.

A resource may have an address that uses a legacy (non-Unicode) character encoding (or any sequence of bytes, for that matter), although the recommendation is not to do so. The existence of non-Unicode-encoded resources must be accounted for by IRI via an escaping mechanism (percent-encoding), just as the existence of non-ASCII-encoded references is accounted for by URI. That is, "http://example.com/%E0%80" is both an IRI and a URI. The byte sequence 0xE0 0x80 is not valid UTF-8 and does not represent a Unicode character. But the xRI can still work to access the resource in question.
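
A minimal Python sketch of that point, assuming only the standard library's urllib.parse (the variable names are illustrative). It checks that the reference is pure ASCII, so it is usable as either a URI or an IRI, while the bytes it escapes cannot be read as UTF-8:

```python
from urllib.parse import unquote_to_bytes

reference = "http://example.com/%E0%80"

# The reference itself contains only ASCII characters.
is_ascii = reference.isascii()

# Undo the percent-encoding of the last path segment to get the raw bytes.
raw = unquote_to_bytes(reference.rsplit("/", 1)[1])

# Try to interpret those bytes as UTF-8; 0xE0 0x80 is not a valid sequence.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

print(is_ascii, raw, is_utf8)
```

The escaped bytes travel untouched inside the reference, which is why the resource stays addressable even though no Unicode character corresponds to them.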

A document in Shift-JIS might have the bytes 0xE0 and 0x80 in a reference similar to my example. In that case, the reference is interpreted into Unicode first. You could then have: "http://example.com/烙" (that last character is U+70D9, which is encoded as 0xE0 0x80 in Windows code page 932) or "http://example.com/%E7%83%99" (mapped as a URI). If a document means "%E0%80", it had best specify that and not rely on the document's character encoding to imply it. The document could easily, for example, be converted to UTF-8 or EUC-JP, losing the important addressing information conveyed by the Shift-JIS encoding.
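
The Shift-JIS case can be sketched the same way. This is an illustration only, assuming Python's "cp932" and "euc_jp" codecs as stand-ins for the document encodings discussed; the variable names are hypothetical:

```python
from urllib.parse import quote

# The two bytes as they appear in a document serialized in code page 932.
legacy_bytes = b"\xe0\x80"

# Step 1: interpret the reference as Unicode using the document's encoding.
char = legacy_bytes.decode("cp932")

# Step 2: the IRI carries the character itself; the URI form carries the
# percent-encoded UTF-8 bytes of that character (quote() defaults to UTF-8).
iri = "http://example.com/" + char
uri = "http://example.com/" + quote(char)

# Transcoding the document changes the bytes but not the character, so a
# reference that silently depended on the original bytes would be lost.
euc_bytes = char.encode("euc_jp")

print(iri, uri, euc_bytes != legacy_bytes)
```

A reference that really means the bytes 0xE0 0x80 therefore has to be written as "%E0%80" explicitly, since nothing about the character survives as those bytes once the document is re-encoded.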

By switching to a consistently encoded referencing system (UTF-8 based), we address the problem that people would prefer graphic characters in their references but have no idea which bytes should be used to represent that reference as the actual address of the resource (I had to look up the bytes in the example above). This is a good thing, I think. And references that are broken as a result will be broken in other contexts already.

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Tuesday, 29 December 2009 19:24:49 UTC