- From: Dan Oscarsson <Dan.Oscarsson@trab.se>
- Date: Fri, 2 May 1997 11:52:32 +0200 (MET DST)
- To: masinter@parc.xerox.com, Gary.Adams@east.sun.com
- Cc: uri@bunyip.com
> > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> > >
> > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > > no matter what the user sees.
> >
> > If you use hex-encoding, yes. But NOT if you use the native character set
> > of the document. In that case, the 'this-is-the-URL' part must
> > use the same character set as the rest of the html document. Raw UTF-8
> > may only be used in a UTF-8 encoded html document, not in an iso 8859-1
> > encoded document.
>
> The document character set for HTML 2.0 and 3.2 was iso 8859-1.
> The document character set for HTML 4.0 and XML will be iso 10646.

As iso 8859-1 is a true subset of iso 10646, I assume that HTML 4.0 will also
handle iso 8859-1 encoded documents; otherwise it would break a lot of today's
html pages and software.

> > A large amount of html documents are hand written in a text editor. A user
> > can not be expected to use a different encoding when typing the URLs
> > in a document.
>
> But they might have to use a different encoding when saving the file
> to disk. And the document itself might be converted as it is saved
> to disk. These are common functions in a multibyte plain text editor,
> just as intelligent cut and paste functions are needed in a shared
> desktop environment.
>
> I think your point about "authoring URLs" within HTML documents with
> a "plain text editor" is that the user will have a local input
> method for entering native characters (e.g., compose key sequences,
> virtual keyboard, radical composition, etc.) which will be operating
> in the same manner for document text and for URL characters. Since the
> authoring tools did not offer a means of recording the character encoding
> information, it is not possible for a web server to make a distinction
> when a document is transmitted on the wire.

From another mail:

>> In general, URLs used without a context that defines the characters used
>> should be encoded using UTF-8. URLs used within a context where the
>> meaning of the characters is defined should use the character encoding
>> of the context.
>
> I'm not sure that it is a good idea to tie the URL encoding
> interpretation to its immediate context. If I attempt to
> "Save" a document from the browser (or a spider agent is gathering
> documents automatically from the web), then the characters of the URL
> are often used to form a local file system name for the fetched object.
> So I fetch a SJIS named file via an http server and save it in my
> ~/public_html EUC-jp file system. Using UTF8 on the wire (if
> prearranged) allows both sites to use meaningful names for their local
> resources and to safely share the public handles for the information.

Maybe I was unclear. Text that is handled on a system normally has a defined
character set. If I cut or copy text, the copied text has a known character
set and will be converted into a new one if it is pasted into a document with
a different character set (on a system that handles several character sets at
the same time). If an editor edits the characters in ISO 10646, it can save
them in a totally different character set by converting the characters to
that character set.

When I edit a html document with a text editor, it is just text. URLs
embedded in the text are written using the same character set as all the
other text; if I paste a filename from a file listing in another tool, the
filename ends up in the same character set as all other characters in the
text.
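To illustrate the point (a small Python sketch of my own; the file name
"räksmörgås.html" is an invented example, not something from this thread),
the same URL text turns into different bytes depending on the character set
of the document it is embedded in, while the hex-encoded UTF-8 form is the
same regardless of the document:

    # Illustrative sketch only; "räksmörgås.html" is an invented example.
    from urllib.parse import quote

    url_text = "räksmörgås.html"

    # The same URL characters become different bytes depending on the
    # character set the html document is saved in:
    print(url_text.encode("iso-8859-1"))  # b'r\xe4ksm\xf6rg\xe5s.html'
    print(url_text.encode("utf-8"))       # b'r\xc3\xa4ksm\xc3\xb6rg\xc3\xa5s.html'

    # Hex-encoded UTF-8, the proposed wire form of the URL
    # (urllib.parse.quote %XX-escapes the UTF-8 bytes by default):
    print(quote(url_text))                # r%C3%A4ksm%C3%B6rg%C3%A5s.html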
URLs I write will contain 8-bit characters in the same character set as the
rest of the text. When I use a web browser, it will fetch html documents
containing URLs. If I click on a link, the browser needs to extract the URL
from the text, translate it into UTF-8 and send it to a web server. If I
"Save" a document I fetched, the filename proposed will be in the character
set of my filesystem. All this if the browser is international UTF-8 URL
aware; otherwise only %XX encoded URLs will work for sure.

UTF-8 should be used on the wire when the protocol says: here is a URL. If
the protocol says: here is a html document, the document need not be in
UTF-8; it may be in iso 8859-1, UCS-2 or UCS-4, and the embedded URLs will be
in the same character set. It is a simple matter for a web browser to extract
the embedded URLs and translate them into UTF-8 for the wire; it is a very
heavy burden for a web server to parse every html document and translate the
embedded URLs into UTF-8.

I think it is important that the document text and the "text" of the URLs
embedded in the document are in the same character set. If you have a system
with SJIS encoded documents and EUC-jp file names, I assume that the editors
on that system know that when you save a document to a file they will use
EUC-jp for the filename, and that if you copy a piece of the SJIS text into
the file name dialog field, they will convert the text from SJIS to EUC-jp.
There is then no problem with extracting text from a document and using it
for something with a different character set.

Is it clear now that URLs (and file names) typed in a document need to be in
the same character set as the document?

Dan
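P.S. A rough Python sketch (my own, purely illustrative; the file name and
the character set names are example values) of the translation an
"international UTF-8 URL aware" browser would do: take the href bytes as
they stand in the document, interpret them in the document's character set,
and put the hex-encoded UTF-8 form on the wire.

    # Illustrative sketch; "räksmörgås.html" and the charsets are examples.
    from urllib.parse import quote

    def url_for_the_wire(href_bytes, document_charset):
        # document charset -> characters -> %XX-escaped UTF-8
        characters = href_bytes.decode(document_charset)
        return quote(characters, safe="/:")

    # The same link taken out of an iso 8859-1 document and out of a UCS-2
    # document gives the same URL on the wire:
    latin1_href = "räksmörgås.html".encode("iso-8859-1")
    ucs2_href   = "räksmörgås.html".encode("utf-16-be")

    print(url_for_the_wire(latin1_href, "iso-8859-1"))  # r%C3%A4ksm%C3%B6rg%C3%A5s.html
    print(url_for_the_wire(ucs2_href, "utf-16-be"))     # r%C3%A4ksm%C3%B6rg%C3%A5s.html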
Received on Friday, 2 May 1997 05:53:19 UTC