Re: Suggested character set policy for the IETF from Martin J. Duerst on 1997-07-01 (ietf-charsets@w3.org from July to September 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Tue, 01 Jul 1997 13:07:47 +0200 (MET DST)
To: Ned Freed <Ned.Freed@INNOSOFT.COM>
Cc: Chris Newman <Chris.Newman@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM, IETF Languages <ietf-languages@uninett.no>
Message-id: <Pine.SUN.3.96.970701123506.253F-100000@enoshima>
On Mon, 30 Jun 1997, Ned Freed wrote:

> > > The "related presentation information" is a missing portion of the
> > > definition.  There are things like CRLF, character directionality, Unicode
> > > joiner/no-joiners, etc. which affect presentation but are not "characters"
> > > in the traditional sense.
> 
> > I see. The cases you mention are of course perfectly reasonable and
> > necessary. They are also subsumed under the term character in the
> > sense it is used in standards, which distinguishes (or should I say
> > distinguished?) between control characters and graphic characters.
> 
> I respectfully beg to differ. The definition given for "character" in RFC2130
> Appendix C is:
> 
>    Character - A single graphic symbol represented by sequence of one or
>    more bytes.
> 
> I don't know of an earlier definition of "character" in an RFC. (Nathaniel and
> I deliberately avoided having one in MIME.) There was a terminology document
> floating around some time ago that defined all this stuff but I don't think it
> ever became an RFC. And I believe it defined "character" the same way that
> RFC2130 does in any case.

I have no problem basing the discussion on the definition above.


> Now, there may be some standards group out there that uses the term "character"
> consistently to mean "graphic or control character", but if so I don't know
> what that group is. (It certainly isn't the ISO, as ISO terminology for this
> stuff has flitted all over the place over time.)

ISO terminology may not be completely consistent, and indeed it has
developed over time. However, I think it is fair to say that in the
recent documents in this area, ISO rather consistently uses the
division of characters into graphic characters and control characters.
And with the renewal of all ISO standards relating to characters, they
will all be based on ISO 10646, and so terminology will be even more
streamlined.
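
To make the graphic/control division concrete, here is a rough sketch. It assumes Python's standard unicodedata module and uses the Unicode general categories as a stand-in for the ISO terminology; the function name classify and the labels it returns are my own illustration, not taken from any standard.

```python
import unicodedata

def classify(ch: str) -> str:
    """Roughly classify one character (hypothetical helper, not from any standard)."""
    cat = unicodedata.category(ch)
    if cat == "Cc":
        return "control"   # C0/C1 controls such as CR, LF, TAB
    if cat == "Cf":
        return "format"    # joiners, directional marks and similar
    if cat.startswith("C"):
        return "other"     # unassigned, private use, surrogates
    return "graphic"       # letters, digits, punctuation, symbols, spaces

samples = {
    "CR": "\r",
    "LF": "\n",
    "ZERO WIDTH JOINER": "\u200d",
    "RIGHT-TO-LEFT MARK": "\u200f",
    "LATIN SMALL LETTER A": "a",
}
for name, ch in samples.items():
    print(f"{name}: {classify(ch)}")
```

CR and LF come out as control characters, the joiner and the directional mark as format characters, and the letter as a graphic character, so all of the cases mentioned earlier in the thread still fit under "character" in this broader sense.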


> Both because of this definition as well as other interoperability issues the
> definition of a character set in MIME pretty much has to change.
> For one thing, registering UTF-8 as a charset is technically illegal right now.

Can you explain that? What's the problem?


> > > Suggestions for making it more precise would be helpful.  It'd be nice to
> > > get this right in the next revision of the MIME specification.
> 
> > Well, in my opinion, including something like "presentation" is
> > very dangerous. Soon you have people claiming that font information,
> > or whatever, has to be part of a "charset". Making the definition
> > more precise would be nice, but would probably take too many lines.
> > Just leaving it at "characters", and maybe referring to some of the
> > ISO work in that area for somebody who really wants to check, should
> > be okay.
> 
> As far as your opinion of the term "presentation" goes, my position is that the
> term we use is largely irrelevant, and if it makes you happier I'll use "control
> information" instead. What matters is that the definition allow this sort of
> information as an output of the charset to character conversion process.

"Presentation" and "control information" share the slippery slope problems
(see below).


> We could of course do this by amending the definition of a character in RFC2130
> to mean "graphic or control character". But then we're left with the task of
> defining a "control character". Because of this I actually prefer language that
> equates "character" with "graphic symbol" and talking about the conversion
> process also producing control information as an output. I think we can get
> away with not defining "control information" specifically; I don't think the
> same is true for "control character".

I don't think that makes any difference. Quite to the contrary, "control
character" at least has a long and rather clear usage history, whereas
"control information" can be just about anything.


> One final note about all this. You and others are constantly raising the
> spectre of there being a "slippery slope" here that we have to avoid: Once we
> allow XXX (presentation information, language tags, take your pick) the doors
> will open and all of HTML will end up as a charset, and there's the seventh
> seal blown open right there.  (I'm exaggerating here, of course, although your
> tone sometimes makes me wonder.)
> 
> I must say that I for one have no difficulty believing that this is a real
> issue for, say, the UTC and the ISO. I'm sure the UTC has seen all sorts of
> proposals that attempt to turn Unicode into HTML. Or maybe even PostScript! For
> this reason I have no difficulty believing that the UTC has to fight this sort
> of stuff off constantly or there will be real trouble for them.

Well, just recently you have been participating in a still ongoing
discussion on such a case. I think there is general agreement that
a charset can also contain CR, LF, Tabs, page breaks, and spaces,
even that a charset probably SHOULD or MUST contain some of these
(CRLF is mandatory for MIME unless you want to restrict messages
to a single line, the others could probably be done without in some
circumstances).

What I definitely want to avoid, and what I think the IETF also has
some interest in avoiding (even if the danger for the IETF is smaller
than for Unicode), is that somebody comes and says: 1) A charset is
defined as containing characters and presentation information,
2) presentation information XXX is vital in my application, therefore
3) charsets have to contain this information.

Not really for fonts per se, but in the context of language tags,
claims along this line have been made.


> However, that doesn't mean it is a valid issue for the IETF. For one thing,
> history says otherwise. The IETF has had a largely uncontrolled charset
> registration process in place for well over 5 years now. And a bunch of stuff
> has been registered which at a minimum should be marked as "unsuitable for use
> in MIME text/plain". Yet in spite of this chaotic history I am unaware of anyone
> registering a charset that includes, say, general font-switching machinery.
> (And it isn't like similar machinery doesn't already exist in ANSI X3.4 under
> the general rubric of "control character", BTW.)

Well, there is iso-8859-[6|8]-[i|e], which includes bidirectionality.


> In other words, while you may believe that the IETF definition of "character"
> included "control character" all along, a fair number of other people
> effectively did not and worse, acted on this belief, and worse still, their
> actions made it into some widely used products. And the result has been serious
> trouble and serious interoperability problems -- so much so that I had to
> tighten up the prose in the last go-round on MIME to make it clear that _some_
> presentation information is present in plain text, when it is there it has to
> be acted on, and when it isn't nothing should be done. But I didn't fix the
> definition of "charset" to match this, so we now have a standard that says one
> thing in one place and another in another place, which isn't acceptable and is
> going to have to change.

Nothing against this, not at all. But it's never a bad idea to be safe
on both sides, i.e. to both say that a minimum of presentation information
is there and has to be acted upon, and say that this presentation
information is really only a minimum and not, or at least not necessarily,
more.

Regards,	Martin.



Received on Tuesday, 1 July 1997 04:31:03 UTC