RE: Charset policy - Post Munich from Martin J. Dürst on 1997-09-19 (ietf-charsets@w3.org from July to September 1997)

From: Martin J. Dürst <mduerst@ifi.unizh.ch>
Date: Fri, 19 Sep 1997 12:14:45 +0200 (MET DST)
To: Chris Weider <cweider@microsoft.com>
Cc: 'Ned Freed' <Ned.Freed@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM
Message-id: <Pine.SUN.3.96.970919114715.361l-100000@enoshima>
On Mon, 1 Sep 1997, Chris Weider wrote:

> I think Ned is completely correct here. The workshop report thought long
> and hard about requiring language tagging and mandatory UTF-8 and
> realized that this is the only way to make things work with the stupid
> machines we have now :^)

Chris - In general, I very much agree. Pushing UTF-8, and strongly
telling people to include language tagging into their protocols, is
very much what is needed. I definitely do not want to argue about
this.

However, both for UTF-8 and for language tags, as well as for
other internationalization issues, I think it is important
not to preclude further developments, and not to express
requirements in an absolute and unchangeable way that makes
protocol developers and implementers think that if only they
do X, all their internationalization problems will go away.

Specifically, I agree that UTF-8 in most cases is the best
solution, especially in text-based protocols which are the
majority of the current application area protocols. But there
are also binary protocols (the Internet printing protocol
currently under discussion is a recent example), and for
such protocols, there may be situations where 16-bit alignment
is necessary in general and so using UTF-16 is a neat choice.
In general, machines have moved from 8-bit to 16-bit to 32-bit
to 64-bit architectures, and byte sizes have moved from
5 to 6 to 7 to 8 (with some cases of 9) bits. It may well be
that in the future, we see hardware that works with 128-bit words,
but which has to use shift-and-mask techniques for 8-bit bytes,
because the smallest size it supports directly is 16-bit bytes.
Not that I think that these things will appear very soon in
great numbers, but in ten years, they could very well be around.
If we can word our policy so that it doesn't read ridiculously
shortsighted if and when this happens, that would not be a bad
thing, and wouldn't hurt at all.
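
As a rough illustration of the alignment point, here is a small
sketch in Python. The sample strings and the helper name
encoded_sizes are made up for this example and are not taken from
any protocol; the sketch only shows that UTF-8 lengths can be odd
numbers of bytes, while UTF-16 lengths are always a multiple of
two bytes.

```python
# Illustrative sketch only: compare the encoded size of the same text
# in UTF-8 and in UTF-16 (big-endian, no byte-order mark).
# The sample strings are arbitrary assumptions chosen for this example.
samples = {
    "ASCII": "charset",
    "Latin with diacritic": "D\u00fcrst",
    "Japanese": "\u65e5\u672c\u8a9e",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    # UTF-16 always uses whole 16-bit code units, so its length is even;
    # UTF-8 uses 1 to 4 bytes per character and gives no such guarantee.
    print(f"{label}: UTF-8 = {utf8_len} bytes "
          f"(even: {utf8_len % 2 == 0}), "
          f"UTF-16 = {utf16_len} bytes "
          f"(even: {utf16_len % 2 == 0})")
```

UTF-8 is more compact for the ASCII and Latin samples, UTF-16 for
the Japanese one, and only UTF-16 keeps every field on a 16-bit
boundary without padding.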

Specifically, I also agree that language tags are a big help
to current stupid machines. But if we put an absolute requirement
for language tags into our policy, a requirement that in the
extreme might say: "Every protocol has to be able to language
tag all the characters it sends around, with potentially
different tags for each character.", and we thereby give
implementors the impression that that's all they have to do,
and that text-to-speech conversion, machine translation,
spelling and grammar checks, hyphenation, high-quality display,
the subtle glyph distinctions that may be necessary for names,
and so on, will work magically and perfectly, then we clearly
create the wrong impression.

Ned Freed gave a very good example in his mail of a case
where language information can be very helpful, namely the
case where somebody receives email from a user and is not
able to find out what language it is written in, and therefore
to whom he should forward it for further processing.
But for this case, automatic language detection seems to
be extremely well suited. The idea has been around for years,
but it is interesting to see that in the last few months and
weeks (in particular at the recent Unicode conference in San
Jose), I have heard from at least four companies that are
seriously working on this issue, with good results, and
with some of them adding new languages almost every week.
In particular, Microsoft is doing such work, and Alis
Technologies gave a floppy disk with a test program to
selected users/clients. I guess that Ned could greatly
benefit e.g. from the Alis program; I would be glad to
organize the contact. Please note that this was on a
floppy disk, covering something between 20 and 50 languages,
so it's something that may well be a standard PC OS
component in a few years.

Again, let me make it very clear that I am not at all against
UTF-8 or language tagging; I guess I have enough of a track
record to make that clear (for UTF-8, have a look e.g. at
the archives of the FTPEXT WG, for language tagging, see
RFC 2070). But I think it is crucial that we word the policy
in such a way that it lets protocol developers take their
own requirements into account, and take care of the specific
internationalization needs, and not just follow a few words
blindly.


Regards,	Martin.


Received on Friday, 19 September 1997 11:35:37 UTC