Re: iDNR, an alternative name resolution protocol from Roy T. Fielding on 1998-09-09 (uri@w3.org from September 1998)

From: Roy T. Fielding <fielding@kiwi.ics.uci.edu>
Date: Tue, 08 Sep 1998 22:50:02 -0700
To: Sam Sun <ssun@CNRI.Reston.VA.US>
cc: "Martin J. Duerst" <duerst@w3.org>, URI distribution list <uri@Bunyip.Com>
Message-ID: <9809082250.aa10731@paris.ics.uci.edu>

>Are you suggesting that any URI reference in HTML document takes the
>encoding of the HTML document? For example, if the HTML document uses
>"shift_jis" encoding, the URI references in the document will be "shift_jis"
>encoded.
>
>If so, does this mean that URIs in "shift-jis" encoded HTML document can not
>use UTF-8 encoding? (Otherwise you get mixed encoding here.)

I mean that all of the characters in an HTML document, including
the characters that might appear within an <a href="...">, are in a
single encoding which could be anything from "shift_jis" to UTF-8,
and further that the actual data represented by those characters
might be encoded by SGML character entities (like &ouml; or &#45;).
In order to understand the href attribute, an HTML parser must read all
the characters in whatever encoding the document has, translate the
encoding to an internal representation of the document character set,
translate any SGML character entities to the actual characters they
represent within the document character set, and finally consider the
result (a string of characters in the HTML document character set of
ISO-10646) as being a URI reference.
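
To make that order of operations concrete, here is a minimal sketch in
Python. It assumes the standard library's html.parser is an acceptable
stand-in for an HTML parser; the names HrefCollector and extract_hrefs are
hypothetical and exist only for this illustration. The same document
serialized in two different encodings should yield the same string of
characters for the href.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already replaced character references in
        # attribute values by the time this method is called.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(raw_bytes, encoding):
    # Step 1: read the bytes in whatever encoding the document has,
    # producing characters in an internal (Unicode) representation.
    text = raw_bytes.decode(encoding)
    # Step 2: parse the markup, which resolves character references.
    collector = HrefCollector()
    collector.feed(text)
    collector.close()
    # Step 3: only now is each result treated as a URI reference.
    return collector.hrefs


doc = '<a href="/docs/\u65e5\u672c\u8a9e&#45;page">link</a>'
from_sjis = extract_hrefs(doc.encode("shift_jis"), "shift_jis")
from_utf8 = extract_hrefs(doc.encode("utf-8"), "utf-8")
print(from_sjis == from_utf8)  # True
```

Nothing in extract_hrefs knows anything about URI syntax; the encoding
has already been dealt with by the time a URI reference exists.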

All this translation is done before any knowledge about URI has
entered the picture, so defining a URI scheme according to how
it might appear in an HTML document will just confuse the heck out of
people who need to implement it.  This is why URI are defined in terms
of characters, not the encoding that might be used to represent those
characters within a given document, TV screen, or flying banner.

....Roy

Received on Wednesday, 9 September 1998 01:52:33 UTC