Date: Tue, 22 Apr 1997 17:13:42 +0200 (MET DST)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: Keld Jørn Simonsen <email@example.com>
Cc: John C Klensin <firstname.lastname@example.org>, Dan Oscarsson <Dan.Oscarsson@trab.se>,
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <199704221106.NAA15049@dkuug.dk>
Message-Id: <Pine.SUN.3.96.970422164228.245X-100000@enoshima>

On Tue, 22 Apr 1997, Keld Jørn Simonsen wrote:

> "Martin J. Duerst" writes:

> > In particular, the "FORM-UTF8: Yes" I proposed is very similar
> > to your proposal. To be able to label arbitrary "charset"s is
> > an extension, but I don't think it is needed at this stage of
> > ISO 10646 and Internet development. The way I put it usually
> > is that currently, we have "chaos". There is no need to proceed
> > to "labeled chaos" when we can proceed to "order" directly.
> > The Universal Character Set really shows off its strength most
> > directly for short and widely used strings such as URLs.
>
> My "URL-Charset:" header also goes along the "labelled chaos" that
> we already have with HTML,

Yes, it is similar to what we have with HTML. But there are
significant differences in the properties of HTML and URLs
that suggest that using different approaches might be a good
idea:

- Length: HTML is much longer than URLs, and tagging is
  therefore less of a burden.
- Length again: HTML can benefit from using different
  "charset"s as a kind of "compression"; this is less
  of an issue for URLs.
- Round-trip vs. one way: URLs make a round trip from the
  originator and back to it, and they have to arrive
  there safely. HTML only travels downstream, and never
  needs an exact match after many transformations.
- Transcription by paper: URLs are transcribed on paper.
  Adding a charset tag on paper is very clumsy (think
  about http:[us-ascii]//www.ibm.com printed in a newspaper).
  It may look like we don't need that tag, because the
  characters are all identified; yet if we want to use
  current URL software, which compares URLs by octet
  identity, we have to transform the characters back into
  the octets that they originated from.

> and then the coding of URLs in
> anchors etc in the HTML markup. The natural thing there is that URLs
> are encoded in the charset of the HTML document. So a request
> for the URL would then have a header with the URL and then the
> "URL-charset" of the HTML document. Straightforward. And we could
> use equivalent mechanisms whether the URL was typed in or came from
> a HTML document.

This is indeed true, and part of our proposal. But this solves
only part of the problem, namely the question: What characters
do the octets you are currently manipulating in your computer
actually represent?

So for example if I type an "o" with a "/" on a Mac, it will be
represented as 0xBF internally, and it is (implicitly and
naturally) tagged Mac-Roman. And if I cut-copy-paste that
character into another document, it will keep its identity,
but because it is in a web page editor, it might change its
representation, e.g. to 0xF8, and be (implicitly and naturally)
tagged Latin-1. Printed on paper, it will still be an "o with /",
but it doesn't need any tagging for representation because the
tagging is automatically and implicitly added when it is
input again.

The problem here starts when this "o with /" is converted to
%HH, and when it is sent to a server. Now I am no longer
interested in the encoding in which I currently keep the
character (which is usually not too difficult to know), but
in the encoding that the server is assuming the character
will arrive in. And if I don't take special measures, I have
absolutely no idea what that could be.
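
To make the "o with /" example concrete, here is a minimal sketch.
Python is used purely for illustration, and percent_encode is a
hypothetical helper name, not something from any of the proposals.
It shows that the same character gives a different %HH form
depending on which charset the converting software happens to use.

```python
from urllib.parse import quote


def percent_encode(text: str, charset: str) -> str:
    """Convert text to %HH form using the octets of the given charset.

    Hypothetical helper for illustration only.
    """
    return quote(text, safe="", encoding=charset)


o_slash = "\u00f8"  # LATIN SMALL LETTER O WITH STROKE ("o with /")

# Same character, three charsets, three different octet sequences.
for charset in ("mac_roman", "latin-1", "utf-8"):
    octets = o_slash.encode(charset)
    print(charset, octets.hex().upper(), percent_encode(o_slash, charset))
```

The loop prints the octets BF, F8 and C3B8, with the %HH forms %BF,
%F8 and %C3%B8. A server that only sees "%BF" cannot tell which of
these conventions the client applied.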

Now there are several possibilities:

1) Add another tag, this time explicit, that has to be carried
   around *all the time* and separately from the information
   that might be around implicitly. As I said above, this is
   very ugly, and no current software is prepared for it.
   Also, it introduces the problem that the browser (which
   strips the tag and converts) has to know about a large number
   of charsets, more than just those of the pages it usually
   displays and whatever else it is used to handling. Knowledge
   that could be centralized has to be widely distributed.

2) Send the URL as is, with "charset" information. The server
   would get URLs in all kinds of charsets, and would have to
   take care on its own of converting them to the charset it
   is using. Also, we can't freely convert to %HH, because then
   we need to add a tag saying what we used when converting
   to %HH.

3) Define a single encoding (this obviously is UTF-8). This means
   that when you see a URL with beyond-ASCII characters in it,
   you will know that to convert it to %HH and send it to the
   browser, you have to use UTF-8. It's like the tag above, except
   that there is only one possibility, which therefore doesn't
   have to be specified.

4) Have a knowledge database about different protocols/schemes
   and the encoding they use (if they use a single one). It is
   very clumsy to write general URL software with a nice
   interface this way.

5) Have a way to ask the server what charset it accepts. Again,
   this needs a new protocol; the tag, instead of making a round
   trip, is served by the server on demand. This gets difficult
   especially if you have various encodings in various areas of
   the same server. Also, the client needs to know about lots
   of encodings.

> Also the responsibility of handling the character
> encoding incl conversion would be at the server side, which normally
> would be the "offender" allowing strange things like non-ASCII URLs.

Your proposal is probably very close to 2) above.
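
The difference in server burden between 2) and 3) can be sketched
roughly as follows. This is only an illustration under assumptions:
Python is my choice, the function names and the path "/bl%BFd" are
hypothetical, and the charset label is simply passed in as a string
rather than parsed from any real header.

```python
from urllib.parse import unquote_to_bytes


def decode_labeled(path: str, url_charset: str) -> str:
    """Possibility 2: the server must be able to decode whatever
    charset the client labels the URL with."""
    return unquote_to_bytes(path).decode(url_charset)


def decode_utf8(path: str) -> str:
    """Possibility 3: %HH octets are always UTF-8, so no label is needed."""
    return unquote_to_bytes(path).decode("utf-8")


def to_local_octets(name: str, local_charset: str) -> bytes:
    """Either way, the server converts to whatever it uses locally."""
    return name.encode(local_charset)


# Possibility 2: one code path per charset any client might send.
from_mac = decode_labeled("/bl%BFd", "mac_roman")
from_latin1 = decode_labeled("/bl%F8d", "latin-1")

# Possibility 3: a single code path.
from_utf8 = decode_utf8("/bl%C3%B8d")

# All three name the same resource once decoded to characters.
print(from_mac == from_latin1 == from_utf8)
print(to_local_octets(from_utf8, "latin-1"))
```

With 2), the set of charsets the server has to know grows with
whatever clients decide to send; with 3), it stays at UTF-8 plus the
local charset.
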
I think that approach could probably be deployed for HTTP, but it
would put a heavier burden on the server than UTF-8 would (where
the server just has to know UTF-8 and whatever it wants to use
locally). Also, it would need to add a tag for when we convert
to %HH.

Regards, Martin.