- From: Ned Freed <Ned.Freed@INNOSOFT.COM>
- Date: Mon, 30 Jun 1997 17:10:28 -0700 (PDT)
- To: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
- Cc: Chris Newman <Chris.Newman@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM, IETF Languages <ietf-languages@uninett.no>
> > > The first part of your definition, "mapping from octets to characters",
> > > is very widely known and used. The second part of the definition, "related
> > > presentation information", is new to me. Is this your own definition,
> > > or where did you find it? What exactly does the term "presentation
> > > information" mean for you? How do you assure that it means the same
> > > thing for others?

> > The "related presentation information" is a missing portion of the
> > definition. There are things like CRLF, character directionality, Unicode
> > joiner/no-joiners, etc. which affect presentation but are not "characters"
> > in the traditional sense.

> I see. The cases you mention are of course perfectly reasonable and
> necessary. They are also subsumed under the term character in the
> sense it is used in standards, which distinguishes (or should I say
> distinguished?) between control characters and graphic characters.

I respectfully beg to differ. The definition given for "character" in
RFC2130 Appendix C is:

  Character - A single graphic symbol represented by sequence of one
  or more bytes.

I don't know of an earlier definition of "character" in an RFC. (Nathaniel
and I deliberately avoided having one in MIME.) There was a terminology
document floating around some time ago that defined all this stuff but I
don't think it ever became an RFC. And I believe it defined "character"
the same way that RFC2130 does in any case.

Now, there may be some standards group out there that uses the term
"character" consistently to mean "graphic or control character", but if
so I don't know what that group is. (It certainly isn't the ISO, as ISO
terminology for this stuff has flitted all over the place over time.)

Both because of this definition and because of other interoperability
issues, the definition of a character set in MIME pretty much has to
change. For one thing, registering UTF-8 as a charset is technically
illegal right now.
And I happen to despise standards that are worded to allow this sort of
clearly bogus reading, as in general they tend to weaken the standards
process.

> > Suggestions for making it more precise would be helpful. It'd be nice to
> > get this right in the next revision of the MIME specification.

> Well, in my opinion, including something like "presentation" is
> very dangerous. Soon you have people claiming that font information,
> or whatever, has to be part of a "charset". Making the definition
> more precise would be nice, but would probably take too many lines.
> Just leaving it at "characters", and maybe referring to some of the
> ISO work in that area for somebody who really wants to check, should
> be okay.

I'm sorry, but it is not OK, unless you think that not being able to
register UTF-8 under the new rules and not being able to advance MIME to
full standard is OK.

As far as your opinion of the term "presentation" goes, my position is
that the term we use is largely irrelevant, and if it makes you happier
I'll use "control information" instead. What matters is that the
definition allow this sort of information as an output of the charset to
character conversion process.

We could of course do this by amending the definition of a character in
RFC2130 to mean "graphic or control character". But then we're left with
the task of defining a "control character". Because of this I actually
prefer language that equates "character" with "graphic symbol" and talks
about the conversion process also producing control information as an
output. I think we can get away with not defining "control information"
specifically; I don't think the same is true for "control character".

One final note about all this.
You and others are constantly raising the spectre of there being a
"slippery slope" here that we have to avoid: Once we allow XXX
(presentation information, language tags, take your pick) the doors will
open and all of HTML will end up as a charset, and there's the seventh
seal blown open right there. (I'm exaggerating here, of course, although
your tone sometimes makes me wonder.)

I must say that I for one have no difficulty believing that this is a
real issue for, say, the UTC and the ISO. I'm sure the UTC has seen all
sorts of proposals that attempt to turn Unicode into HTML. Or maybe even
PostScript! For this reason I have no difficulty believing that the UTC
has to fight this sort of stuff off constantly or there will be real
trouble for them.

However, that doesn't mean it is a valid issue for the IETF. For one
thing, history says otherwise. The IETF has had a largely uncontrolled
charset registration process in place for well over 5 years now. And a
bunch of stuff has been registered which at a minimum should be marked as
"unsuitable for use in MIME text/plain". Yet in spite of this chaotic
history I am unaware of anyone registering a charset that includes, say,
general font-switching machinery. (And it isn't like similar machinery
doesn't already exist in ANSI X3.4 under the general rubric of "control
character", BTW.)

In fact the problem the IETF has had with plain text is the exact
opposite of this: We've seen widespread usage where plain text was taken
to mean "only the graphic symbols matter and the rest is trash and should
be ignored and yes, this means you have to reformat everything to fit
your display, and yes, when you then send code or tables through as plain
text this reformatting makes it look like shit".

In other words, while you may believe that the IETF definition of
"character" included "control character" all along, a fair number of
other people effectively did not and worse, acted on this belief, and
worse still, their actions made it into some widely used products. And
the result has been serious trouble and serious interoperability
problems -- so much so that I had to tighten up the prose in the last
go-round on MIME to make it clear that _some_ presentation information is
present in plain text, when it is there it has to be acted on, and when
it isn't nothing should be done. But I didn't fix the definition of
"charset" to match this, so we now have a standard that says one thing in
one place and another in another place, which isn't acceptable and is
going to have to change.

In other words, I wish you'd stop waving the "font bogey" around, as I
don't think it has any real relevance in the IETF.

				Ned
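
As an illustration of the distinction argued for in the message above, here is a minimal Python sketch of a charset conversion that yields graphic symbols plus separate control information. It is not taken from any MIME specification. The function name `split_decoded` is hypothetical, and classifying characters by the Unicode general categories Cc (control) and Cf (format) is an assumption made purely for illustration; the message deliberately leaves "control information" undefined.

```python
import unicodedata


def split_decoded(octets: bytes, charset: str = "utf-8"):
    """Decode octets, then separate graphic symbols from control information.

    Assumption for illustration only: anything in Unicode general category
    Cc (e.g. CR, LF, TAB) or Cf (e.g. joiners, directional marks) is treated
    as control information; everything else is treated as a graphic symbol.
    """
    text = octets.decode(charset)
    graphic = []
    control = []
    for position, ch in enumerate(text):
        if unicodedata.category(ch) in ("Cc", "Cf"):
            # Record where the control information occurred and what it was.
            control.append((position, f"U+{ord(ch):04X}"))
        else:
            graphic.append(ch)
    return "".join(graphic), control


if __name__ == "__main__":
    # A two-column row followed by CRLF, as might appear in a plain-text table.
    symbols, info = split_decoded(b"a\tb\r\nc")
    print(symbols)
    print(info)
```

A reader that keeps only the first return value loses the TAB and the CRLF, which is the kind of handling the message blames for mangled code and tables in plain text.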
Received on Monday, 30 June 1997 19:49:37 UTC