- From: Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@east.sun.com>
- Date: Wed, 30 Apr 1997 07:46:18 -0400
- To: Dan.Oscarsson@trab.se, masinter@parc.xerox.com, uri@bunyip.com
> From: Dan Oscarsson <Dan.Oscarsson@trab.se>
...
> > characters in the listing, and that the interpretation
> > of URLs accept both the raw UTF-8 or the hex-encoded version.
> >
>
> This is not right. A directory listing service generates a html document
> that is sent back to the web browser. All URLs within a html document
> should use the same character set as the document uses. That is,
> if the document uses iso 8859-1, the URLs will be in iso 8859-1, and
> if the document is in UTF-8, the URLs will be in UTF-8.

I have been experimenting with a mixed character set directory just to
see what would happen if a site did need to support multiple encodings.
I don't know if this would be helpful in the current discussion:

    http://www.sunlabs.com/research/ila/test_i18n/

(If your browser can't access the directory, try viewing the README.txt
or the cgi.html file in that directory first.)

I'm not sure I understand the comment above about the document
character set and the encoding of the URLs. If I start with a Unix
server with EUC-jp encoded filenames and generate a directory listing,
today the 8-bit bytes are %HH escaped for the URL references (so they
are now safe, ASCII-transmittable sequences) and the "document text" is
transmitted as 8-bit bytes assumed to be iso-8859-1 characters.

For the most part, all "import/export" type operations such as
"cut-and-paste" of document text or "save-url-in-hotlist" are
"non-destructive"; e.g., the text content is preserved and the
subsequent URL operations are still functional, even though the
document text is not rendered correctly and the URL encoding is
ambiguous.

>
> If the browser knows how to handle the character set of the html document,
> it also should know how to translate the embedded URLs into UTF-8 when
> the user follows a link.

I think it is important to keep the document text and the URL "text"
as separate entities. An EUC-jp file name is a reasonable restriction
on some platforms that might provide storage for SJIS encoded
documents. When a web server delivers the content of that document on
the web, the document content would be appropriately labeled or
converted for transport, and the URL could be encoded in any form
acceptable for retrieving the file in a later transaction.

>
> In general, URLs used without a context that defines the characters used,
> should be encoded using UTF-8. URLs used within a context where the
> meaning of the characters is defined should use the character encoding
> of the context.

I'm not sure that it is a good idea to tie the URL encoding
interpretation to its immediate context. If I attempt to "Save" a
document from the browser (or a spider agent is gathering documents
automatically from the web), then the characters of the URL are often
used to form a local file system name for the fetched object. So I
fetch a SJIS named file via an http server and save it in my
~/public_html EUC-jp file system. Using UTF-8 on the wire (if
prearranged) allows both sites to use meaningful names for their local
resources and to safely share the public handles for the information.

\ /gra
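
To make the %HH ambiguity concrete, here is a minimal Python sketch (not any server's actual behavior). The file name and the helper `local_name` are hypothetical, chosen only for illustration. It escapes the same name from three different byte encodings, then shows how an agreed UTF-8 wire form lets the receiving site re-encode the name for its own file system.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical file name (three kanji plus an ASCII extension).
name = "\u65e5\u672c\u8a9e.txt"

# Escaping the raw bytes of the server's local file-name encoding:
# the same name yields a different %HH sequence per encoding, and
# nothing in the URL says which encoding was used.
euc_url = quote(name.encode("euc_jp"))
sjis_url = quote(name.encode("shift_jis"))

# Escaping a prearranged UTF-8 form instead.
utf8_url = quote(name.encode("utf-8"))


def local_name(url_path: str, wire_encoding: str, fs_encoding: str) -> bytes:
    """Undo the %HH escapes, decode with the agreed wire encoding,
    and re-encode the name for the local file system."""
    text = unquote_to_bytes(url_path).decode(wire_encoding)
    return text.encode(fs_encoding)


if __name__ == "__main__":
    print(euc_url)
    print(sjis_url)
    print(utf8_url)
    # A site with an EUC-jp file system saving the fetched object:
    print(local_name(utf8_url, "utf-8", "euc_jp"))
```

With UTF-8 on the wire, the SJIS site and the EUC-jp site each convert at their own boundary. With raw local bytes escaped, the receiver has to guess the sender's encoding before it can do the same conversion.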
Received on Wednesday, 30 April 1997 07:47:04 UTC