- From: Roy T. Fielding <fielding@kiwi.ics.uci.edu>
- Date: Tue, 08 Sep 1998 22:50:02 -0700
- To: Sam Sun <ssun@CNRI.Reston.VA.US>
- cc: "Martin J. Duerst" <duerst@w3.org>, URI distribution list <uri@Bunyip.Com>
>Are you suggesting that any URI reference in HTML document takes the
>encoding of the HTML document? For example, if the HTML document uses
>"shift_jis" encoding, the URI references in the document will be "shift_jis"
>encoded.
>
>If so, does this mean that URIs in "shift-jis" encoded HTML document can not
>use UTF-8 encoding? (Otherwise you get mixed encoding here.)

I mean that all of the characters in an HTML document, including the
characters that might appear within an <a href="...">, are in a single
encoding which could be anything from "shift_jis" to UTF-8, and further
that the actual data represented by those characters might be encoded
by SGML character entities (like &ouml; or &#45;).

In order to understand the href attribute, an HTML parser must read all
the characters in whatever encoding the document has, translate the
encoding to an internal representation of the document character set,
translate any SGML character entities to the actual characters they
represent within the document character set, and finally consider the
result (a string of characters in the HTML document character set of
ISO-10646) as being a URI reference.

All this translation is done before any knowledge about URI has entered
the picture, so defining a URI scheme according to how it might appear in
an HTML document will just confuse the heck out of people who need to
implement it. This is why URI are defined in terms of characters, not the
encoding that might be used to represent those characters within a given
document, TV screen, or flying banner.

....Roy
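
The parsing steps described above can be sketched roughly as follows. This is only an illustration, not how any particular browser does it: the function name `href_as_uri_reference` and the sample path are hypothetical, and Python's `html.unescape` stands in for the SGML entity translation step. The sketch assumes the raw bytes of the href value have already been isolated from the document.

```python
import html


def href_as_uri_reference(raw_href_bytes: bytes, document_encoding: str) -> str:
    """Hypothetical helper: turn the raw bytes of an href value into a URI reference."""
    # Step 1: read the bytes in whatever encoding the document declares,
    # yielding characters in the internal (ISO 10646 / Unicode) representation.
    characters = raw_href_bytes.decode(document_encoding)
    # Step 2: translate character entities (named or numeric) into the
    # characters they stand for.
    characters = html.unescape(characters)
    # Step 3: only now is the result treated as a URI reference. Nothing
    # about the document's byte encoding survives to this point.
    return characters


# The same URI reference, as it might appear in three differently encoded documents.
from_shift_jis = href_as_uri_reference("/docs/日本".encode("shift_jis"), "shift_jis")
from_utf8 = href_as_uri_reference("/docs/日本".encode("utf-8"), "utf-8")
from_entities = href_as_uri_reference(b"/docs/&#26085;&#26412;", "ascii")

print(from_shift_jis == from_utf8 == from_entities)
```

All three inputs differ at the byte level, yet the URI layer receives an identical string of characters in each case, which is why the URI definition has no need to mention the document encoding.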
Received on Wednesday, 9 September 1998 01:52:33 UTC