Date: Sat, 19 Apr 1997 19:06:27 +0200 (MET DST)
From: "Martin J. Duerst" <email@example.com>
To: John C Klensin <firstname.lastname@example.org>
Cc: Harald.T.Alvestrand@uninett.no, fielding@kiwi.ICS.UCI.EDU, email@example.com,
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <SIMEON.9704161008.G@tp7.Jck.com>
Message-Id: <Pine.SUN.3.96.970419185030.708e-100000@enoshima>

On Wed, 16 Apr 1997, John C Klensin wrote:

> Harald.T.Alvestrand@uninett.no wrote:
>
> > Factoid:
> >
> > UTF-8 is not user-friendly in 8859-1; the standard coding octets for
> > putting the 8859-1 charset into UTF-8 insert one character in front of
> > each character, and also change the last character for the 4 uppermost
> > columns of the 8859-1 character table.
>
> My apologies.  I should have said something more like "more
> user-friendly for Latin-1 than it is for upper-end
> ideographic characters, where it deteriorates even more
> severely :-(

You might come to the state where you have to view UTF-8 with
a terminal emulator or editor not set to view it, where the
above effects are occurring, but this should actually be rare.
And it wouldn't be better if you looked at ideographic characters
with an 8859-1 editor or so.

First, we don't want to have UTF-8 and 8859-1 (or any other
legacy coding) mixed in the same document. Once everything is
working as envisioned, if you transport a Western European URL
in 8859-1, you transport the characters, as 8859-1. It's only
when this is changed to %HH, or to binary 8-bit URLs as such
which lack any information on character encoding, that you
change to UTF-8. So you would edit a list of 8-bit URLs with
a UTF-8 editor, and you would edit a Japanese HTML document
with some URLs e.g. with an EUC editor (the two editors may be
the same and use autodetection). If you do cut-and-paste between
the two editors (or the two windows), the characters should stay
the same, while the underlying representation will change.
That is what will be expected by all other kinds of text processing.

> Given the bad behavior *even* for 8859-1, could someone
> please remind me why we are pushing the thing rather than a
> straight 16 or 32-bit encoding with compression if needed?

Given that for URLs intended for global exchangeability, pure
ASCII is still the best choice, and that enormous amounts of
energy can be saved if we don't invent everything anew,
given that the bad behaviour described above can happen as
an accident, but is not part of what should happen, and given
that designing a compression scheme for short strings such as
URLs is not exactly easy, I think using UTF-8, which is supported
by a lot of software and used in many other places, is not the
worst thing to do.

Regards,   Martin.
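
The byte-level effects discussed in this thread can be sketched in a
few lines of Python. This is only an illustration under assumptions,
not anything proposed in the draft: the function names
(utf8_seen_as_latin1, percent_encode, encodings_of) are hypothetical,
and the sample characters were picked arbitrarily. It shows (a) what
UTF-8 looks like when misread as 8859-1, (b) the %HH form of a
character under UTF-8 versus 8859-1, and (c) that the same characters
have different underlying bytes in EUC-JP and UTF-8.

```python
from urllib.parse import quote


def utf8_seen_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then misread the bytes as ISO 8859-1.

    This mimics viewing UTF-8 in an editor set to 8859-1.
    """
    return text.encode("utf-8").decode("latin-1")


def percent_encode(text: str, encoding: str = "utf-8") -> str:
    """Return the %HH form of text, using the given character encoding."""
    return quote(text, safe="", encoding=encoding)


def encodings_of(text: str, codecs=("utf-8", "euc_jp")) -> dict:
    """Map each codec name to the bytes it produces for the same characters."""
    return {name: text.encode(name) for name in codecs}


if __name__ == "__main__":
    # U+00A3 (pound sign) and U+00E9 (e with acute) are 8859-1 characters.
    for ch in ("\u00a3", "\u00e9"):
        latin1 = ch.encode("latin-1").hex()
        utf8 = ch.encode("utf-8").hex()
        # Hex output only, so this runs on any terminal encoding.
        print(f"8859-1 byte {latin1} becomes UTF-8 bytes {utf8}")

    # %HH form of U+00E9 under the two encodings.
    print(percent_encode("\u00e9"), percent_encode("\u00e9", "latin-1"))

    # Same three Japanese characters, two different byte representations.
    for name, raw in encodings_of("\u65e5\u672c\u8a9e").items():
        print(name, raw.hex())
```

For characters in the range 0xA0 to 0xBF the sketch shows a 0xC2 byte
inserted in front of the unchanged 8859-1 byte; for the four uppermost
columns (0xC0 to 0xFF) the lead byte is 0xC3 and the second byte is the
8859-1 value minus 0x40, which matches the factoid quoted above.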