Re: Suggested character set policy for the IETF from Martin J. Duerst on 1997-07-02 (ietf-charsets@w3.org from July to September 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Wed, 02 Jul 1997 18:19:38 +0200 (MET DST)
To: Ned Freed <Ned.Freed@INNOSOFT.COM>
Cc: ietf-charsets@INNOSOFT.COM, IETF Languages <ietf-languages@uninett.no>
Message-id: <Pine.SUN.3.96.970702135759.253R-100000@enoshima>

On Tue, 1 Jul 1997, Ned Freed wrote:

> > I thought it was obvious: We currently say that a charset is a mapping from a
> > series of octets to a sequence of graphic characters. UTF-8 produces a lot more
> > than graphic characters.
> 
> 
> > I suppose you could argue that US-ASCII does too, but CR and LF are
> > specifically dealt with as an exception in MIME, whereas no comparable prose
> > exists in MIME to allow, say, directionality indicators.
> 
> A small correction here: MIME part II actually does contain an exception
> that allows for directionality indicators as well. I forgot that I added
> this at the last minute.
> 
> However, given that Unicode has all sorts of control information in it besides
> directionality indicators, there is still a problem. And I don't think having
> to revise MIME every time additional sorts of control information are added to
> a character set (something the UTC is planning to do) is a good idea.

No, it's not a good idea. I think it's fair to say that MIME part II (RFC 2046)
does a good job of giving examples of what is and what is not part
of plain text, and that it can be left at that.

As an example, the "stacking of several characters in the same position"
is allowed. This takes care of cases such as Tibetan, Hebrew and Arabic
with points, Thai, and most of decomposed Latin/Greek/... Strictly
speaking, it does not take care of character inversion or surrounding,
as occurs in most Brahmi-related scripts in South Asia.
But these are not forbidden either, so it is reasonable to
assume that they are allowed: they are just the application
of the concept of plain text to these languages/scripts, and the
way these scripts have been coded for years. Similarly, the zero-width
non-joiner can be subsumed under this because it is a very similar
concept, in this case for Persian. If the MIME specification had
decided that such things are unacceptable (while stacking
is allowed), it would have said so.
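
As a rough illustration of the distinction drawn above, here is a minimal
sketch using Python's standard unicodedata module. It is not anything that
MIME or Unicode prescribes: the function name classify, the SAMPLES list and
the three labels are hypothetical choices made for this example, and the
grouping by Unicode general category is only one possible way to look at
the question.

```python
import unicodedata

# Hypothetical sample characters, chosen to match the cases discussed above.
SAMPLES = [
    "\u0301",  # COMBINING ACUTE ACCENT (decomposed Latin/Greek)
    "\u05b4",  # HEBREW POINT HIRIQ
    "\u0e34",  # THAI CHARACTER SARA I
    "\u0f90",  # TIBETAN SUBJOINED LETTER KA
    "\u093f",  # DEVANAGARI VOWEL SIGN I (displayed before its consonant)
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200f",  # RIGHT-TO-LEFT MARK (a directionality indicator)
    "\r",      # CARRIAGE RETURN
]


def classify(ch):
    """Group a character by its Unicode general category (illustrative only)."""
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Mc", "Me"):
        return "combining mark"     # stacks on or attaches to a base character
    if cat == "Cf":
        return "format character"   # invisible, affects joining or direction
    if cat == "Cc":
        return "control character"  # CR, LF and the like
    return "other"


if __name__ == "__main__":
    for ch in SAMPLES:
        name = unicodedata.name(ch, "<unnamed control>")
        print(f"U+{ord(ch):04X} {name}: {classify(ch)}")
```

Under this grouping the stacking and reordering cases fall together as
combining marks, while the zero-width non-joiner and the directionality
indicators fall together as format characters.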

So in conclusion, we can say that MIME tries to distinguish between
characters useful for plain text and characters/formatting associated
with rich text. It does a pretty good job of giving explicit examples for
both, but leaves some areas open, so that phenomena unknown to its
authors are not ruled out if they make sense. Given the variety of
phenomena that exist in writing, this is a rather sensible approach.

Regards,	Martin.

Received on Wednesday, 2 July 1997 09:31:56 UTC