OT: History (was: Re: Default charsets for text media types [i20])

At 02:04 08/03/27, Frank Ellermann wrote:
>
>Martin Dürst wrote:
> 
>> [I was a co-author, but that was a long time ago.]
>
>It's still interesting for reconstructions how precisely
>the Web and the Internet at large ended up with Unicode,
>later UTF-8, now net-utf8 (in essence NFC).  

Ok. This mail contains the historic bits of the answers
to your mail.


>My private theory how this all happened is that Harald
>considered ISO 2022 as a hopeless case after seriously
>trying to make it work, and you consider anything where
>it is not intuitively possible to use "ü" as broken by
>design.  

I don't know about Harald. I don't think that the Umlaut
in my name had too much to do with my interest in
internationalization. I still use a mailer where I cannot
write my name correctly.

It's much more that I was always fascinated by foreign scripts,
and when I saw the first information about Unicode, it seemed
just the obviously right thing to do: list all the characters
of the world and give them numbers. Before I got involved with
internet/WWW internationalization, I had internationalized
ET++, an application framework famous for its contributions
to object-oriented design patterns. I never tried ISO 2022
except for implementing conversion from iso-2022-jp to
Unicode for practical needs. I already knew Francois Yergeau
from the ET++ work, and I knew Glenn Adams from some Unicode
conferences, and then Gavin Nicol came out with a paper that
pointed out that if the Web is going to be one global application,
we better make sure we know what characters we deal with on it.

The biggest problem with ISO 2022, seen from a very broad
perspective, is not that operations such as string concatenation
are hopelessly complex. The main problem is that you have data
in all these different codepages that you switch around, and in
many cases it's the same characters for the user (think of German
written in iso-8859-1 vs. in iso-8859-2; in this case even the
bytes will be the same), but there is nothing in the ISO 2022
architecture that tells you that it's actually the same data.
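The identical-bytes point can be illustrated concretely (a small sketch, not from the original mail; it relies on Python's codec names for the two Latin charsets):

```python
# "schön" as legacy bytes: 0xF6 is 'ö' in BOTH iso-8859-1 and iso-8859-2,
# yet nothing in the ISO 2022 model records that the two decodings agree.
data = b"sch\xf6n"

latin1 = data.decode("iso-8859-1")
latin2 = data.decode("iso-8859-2")
print(latin1 == latin2)  # True: the same characters for the user

# In Unicode the identity is explicit: both decode to the same code point.
print(hex(ord("ö")))  # 0xf6, i.e. U+00F6, regardless of legacy charset
```

With ISO 2022, whether two byte sequences denote the same text depends on which designation escapes are in effect; with Unicode, equality of code points answers the question directly.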

Gavin put this together with the problem of figuring out what
SGML numeric character references should mean in an internationalized
context, and came up with the idea that they should just mean
Unicode numbers. When he started explaining that on the IETF
HTML WG list, in terms of "SGML document character set" and
such, it seemed pretty much obvious to me (but not too many
others at the time), so then together with Francois and Glenn,
we worked on RFC 2070.

>>| see [NICOL2] for some details and a proposal.
>
>What was NICOL2, was that your heuristic to "sniff" UTF-8 ?

I don't remember. It definitely wasn't the UTF-8 heuristic,
I started to understand that UTF-8 was extremely easy to detect
in the discussion about FTP internationalization (now RFC 2640)
and then worked through some examples and details and published
them as http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
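The detection idea rests on UTF-8's strict byte-sequence structure: legacy-encoded text containing non-ASCII bytes is very unlikely to also be structurally valid UTF-8. A rough sketch of such a check (the function name is mine, not from the paper; note that pure ASCII passes trivially, since ASCII is a subset of UTF-8):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: if the bytes form valid UTF-8 sequences, treat them as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("Dürst".encode("utf-8")))       # True
print(looks_like_utf8("Dürst".encode("iso-8859-1")))  # False: bare 0xFC byte
```

In the Latin-1 case the byte 0xFC appears without the continuation bytes UTF-8 would require, so the strict decode fails, which is what makes the heuristic so reliable in practice.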

My guess is that it was some description of either a scheme to use
extensions to indicate encodings in file names, or something akin to
Apache asis (http://httpd.apache.org/docs/2.2/mod/mod_asis.html), i.e.
a file including all the HTTP headers to be sent.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Thursday, 27 March 2008 03:26:49 UTC