- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 27 Mar 2008 12:25:29 +0900
- To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>, ietf-http-wg@w3.org
At 02:04 08/03/27, Frank Ellermann wrote:

>Martin Dürst wrote:
>
>> [I was a co-author, but that was a long time ago.]
>
>It's still interesting for reconstructions how precisely
>the Web and the Internet at large ended up with Unicode,
>later UTF-8, now net-utf8 (in essence NFC).

Ok. This mail contains the historic bits of the answers to your mail.

>My private theory how this all happened is that Harald
>considered ISO 2022 as a hopeless case after seriously
>trying to make it work, and you consider anything where
>it is not intuitively possible to use "ü" as broken by
>design.

I don't know about Harald. I don't think that the umlaut in my name had too much to do with my interest in internationalization; I still use a mailer where I cannot write my name correctly. It's much more that I was always fascinated by foreign scripts, and when I saw the first information about Unicode, it seemed just the obviously right thing to do: listing all the characters of the world and giving them numbers.

Before I got involved with Internet/WWW internationalization, I had internationalized ET++, an application framework famous for its contributions to object-oriented design patterns. I never tried ISO 2022, except for implementing conversion from iso-2022-jp to Unicode for practical needs.

I already knew Francois Yergeau from the ET++ work, and I knew Glenn Adams from some Unicode conferences, and then Gavin Nicol came out with a paper that pointed out that if the Web was going to be one global application, we had better make sure we know what characters we deal with on it.

The biggest problem with ISO 2022, from a very broad perspective, is not that operations such as string concatenation are hopelessly complex. The main problem is that you have data in all these different codepages that you switch around, and in many cases it's the same characters for the user (think about German written in iso-8859-1 vs. written in iso-8859-2: even the bytes will be the same in this case), but there is nothing in the ISO 2022 architecture that tells you that it's actually the same data.

Gavin put this together with the problem of figuring out what SGML numeric character references should mean in an internationalized context, and came up with the idea that they should simply mean Unicode numbers. When he started explaining that on the IETF HTML WG list, in terms of "SGML document character set" and such, it seemed pretty much obvious to me (but not to too many others at the time), so together with Francois and Glenn, we worked on RFC 2070.

>>| see [NICOL2] for some details and a proposal.
>
>What was NICOL2, was that your heuristic to "sniff" UTF-8 ?

I don't remember. It definitely wasn't the UTF-8 heuristic; I started to understand that UTF-8 was extremely easy to detect during the discussion about FTP internationalization (now RFC 2640), and then worked through some examples and details and published them as http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf. My guess is that it was a description either of a scheme to use filename extensions to indicate encodings, or of something akin to Apache asis (http://httpd.apache.org/docs/2.2/mod/mod_asis.html), i.e. a file including all the HTTP headers to be sent.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Received on Thursday, 27 March 2008 03:26:49 UTC