RE: Charset policy - Post Munich from Ned Freed on 1997-10-10 (ietf-charsets@w3.org from October to December 1997)

From: Ned Freed <Ned.Freed@INNOSOFT.COM>
Date: Fri, 10 Oct 1997 08:49:12 -0700 (PDT)
To: Martin J. Dürst <mduerst@ifi.unizh.ch>
Cc: Ned Freed <Ned.Freed@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM
Message-id: <01IOMYS9M6LC9JD2XE@INNOSOFT.COM>
> > Please note that designs like MIME encoded-words and MLSF which do allow
> > individual character tagging, do NOT qualify. The fact that these designs allow
> > individual character tagging (albeit in a very painful and artificial way) is an
> > artefact of the design constraints these things operate under.

> You guessed in the right direction, obviously. But I see a big difference
> between MIME encoded words (as extended in the pvcsc spec) and MLSF.

Two additional points need to be made here:

(1) MLSF is no longer on the table as a proposal at the present time. This
    may change, however, if the UTC and ISO do not deliver on embedded language
    tags.

(2) If MLSF needs to be revived I for one will support it. As such, I have
    no intention of putting something in a document that would prevent this
    from happening unless and until I am specifically directed to do so by
    either an AD or a WG chair with jurisdiction over the charset registration
    specification. You can object all you want, but since this is not a WG
    there's no chair, so unless you can convince either Harald or Keith
    that such language belongs in the document it simply is not going to
    happen.

> On the technical side, PVCSC does not apply to characters, it applies
> to encoded words. Encoded words have to be separated by linear white
> space (which is not removed when decoding the encoded words, as far
> as I understand), and can only have one language.  

I'm afraid your understanding is totally incorrect. From RFC2047:
 
   When displaying a particular header field that contains multiple
   'encoded-word's, any 'linear-white-space' that separates a pair of
   adjacent 'encoded-word's is ignored.  (This is to allow the use of
   multiple 'encoded-word's to represent long strings of unencoded text,
   without having to separate 'encoded-word's where spaces occur in the
   unencoded text.)

The problem is that encoded-words have length restrictions and restrictions
on how they can be presented. The two combine to produce a requirement that
it be possible to use multiple encoded-words to represent a long string. And
this of course opens the door up to having each character in a different
charset and language.
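
A minimal sketch of that rule, not taken from any product mentioned here. It
assumes the encoded-word form =?charset*language?encoding?text?= (the language
suffix is the extension under discussion, as later published in RFC 2231). The
names decode_word and decode_header_value are hypothetical.

```python
import base64
import binascii
import re

# One encoded-word: =?charset[*language]?encoding?encoded-text?=
ENCODED_WORD = re.compile(
    r"=\?([^?*\s]+)(?:\*([^?\s]+))?\?([QqBb])\?([^?\s]*)\?="
)


def decode_word(match):
    """Decode a single encoded-word into (text, charset, language)."""
    charset, language, encoding, payload = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(payload)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires.
        raw = binascii.a2b_qp(payload.encode("ascii"), header=True)
    return raw.decode(charset), charset.lower(), language


def decode_header_value(value):
    """Return (text, charset, language) runs; plain text has charset None."""
    runs = []
    pos = 0
    seen_encoded = False
    for match in ENCODED_WORD.finditer(value):
        gap = value[pos:match.start()]
        # RFC 2047: whitespace separating two adjacent encoded-words is
        # ignored, so it is dropped here rather than kept as text.
        if gap and not (seen_encoded and gap.isspace()):
            runs.append((gap, None, None))
        runs.append(decode_word(match))
        pos = match.end()
        seen_encoded = True
    if value[pos:]:
        runs.append((value[pos:], None, None))
    return runs


# Three adjacent encoded-words, each with its own charset and language.
header = "=?ISO-8859-1*fr?Q?d=E9j=E0?= =?ISO-8859-7*el?Q?=E1?= =?US-ASCII*en?Q?_ok?="
for run in decode_header_value(header):
    print(run)
```

The only space that survives in the displayed text is the one encoded as "_"
inside the third word. Nothing in the syntax stops a sender from shrinking each
encoded-word to a single character.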

I also note in passing that your fundamental misunderstanding of encoded-words
means either that you have never implemented any of this or that, if you have,
you haven't done it properly. And I must confess that I am very disappointed by
this. I had always assumed that you had substantive experience with both
charset design and implementation of charset support -- experience that far
exceeded my own, and that our present disagreement arose mostly out of a
disconnect between the way the IETF does business and what you've seen happen
in other venues. In fact I have even gone so far as to recommend you as someone
with a good grasp of these issues.

I now see that my assessment was wrong. And I hasten to add that any fault --
if fault is the right word -- is mine and mine alone -- you never
misrepresented your abilities or experience. I simply assumed too much, and now
have to revise my opinion.

But the fact of the matter is that this is one of those things where if you
haven't either implemented it yourself or tried to support an implementation in
real-world use you cannot possibly know, let alone understand, the real-world
issues that come up. I've done both -- I've done not one but two
implementations of encoded-word support, both in commercial products that are
widely used in over 50 countries, and I have done ongoing support work on both
and continue to do so at the present time. And let me tell you that the
handling of encoded words containing characters in multibyte charsets is in
fact quite tricky and difficult to get right. And language tags make it even
worse -- I have preliminary support in place for them so I think it is at least
doable, but I'm a ways away from actually trusting that I have all the details
right. (And in this case there's an entire pantheon of devils in the details.)
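
To give a flavor of the multibyte problem, here is a rough sketch rather than
production code. RFC 2047 limits an encoded-word to 75 characters including
delimiters, and each encoded-word must hold a whole number of characters, so a
long string has to be cut at character boundaries and never in the middle of a
byte sequence. The sketch assumes a stateless charset such as UTF-8 and the B
encoding; stateful charsets (ISO-2022-JP, for example) need extra care that is
not shown. The function name encode_words is hypothetical.

```python
import base64

MAX_ENCODED_WORD = 75  # RFC 2047 limit for one encoded-word, delimiters included


def encode_words(text, charset="utf-8"):
    """Split text into B-encoded words without cutting a character in half."""
    overhead = len(f"=?{charset}?B??=")
    # Base64 emits 4 output characters for every 3 input bytes.
    max_bytes = (MAX_ENCODED_WORD - overhead) // 4 * 3
    chunks, chunk = [], b""
    for ch in text:
        encoded = ch.encode(charset)
        # Start a new encoded-word before this character would overflow.
        if chunk and len(chunk) + len(encoded) > max_bytes:
            chunks.append(chunk)
            chunk = b""
        chunk += encoded
    if chunk:
        chunks.append(chunk)
    return [
        f"=?{charset}?B?{base64.b64encode(c).decode('ascii')}?=" for c in chunks
    ]


# 60 three-byte characters do not fit in one encoded-word.
words = encode_words("\u65e5\u672c\u8a9e" * 20)
print(" ".join(words))
```

Because every chunk ends on a character boundary, each encoded-word can be
decoded on its own, which is what a receiver is entitled to expect.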

> The same for the
> language specification for parameters defined in PVCSC, it is
> one language per parameter, which is not individual character tagging.

This is true only because the design space allowed it and the design was
actually simplified by imposing this restriction. Had the design space not
allowed it (as it doesn't for encoded-words) or had the design been made overly
complex by having this restriction it would not be there.
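
For concreteness, a small sketch of what one language per parameter looks like
on the wire. It assumes the extended parameter value form
charset'language'percent-encoded-text (the form later published in RFC 2231);
the function name decode_extended_parameter is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_extended_parameter(value):
    """Split charset'language'text and decode the percent-encoded text."""
    charset, language, encoded = value.split("'", 2)
    # Assumption for this sketch: an empty charset field means US-ASCII.
    text = unquote_to_bytes(encoded).decode(charset or "us-ascii")
    return text, charset.lower(), language


# One charset and one language cover the whole parameter value.
print(decode_extended_parameter("us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A"))
```

The charset and language appear once, ahead of the value, so there is simply no
place in this syntax to switch language partway through.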

And I believe this again demonstrates the essential fallacy that lies behind
your entire train of thought here. This mechanism was designed _long_ before
you started talking about the evils of per-character language tags. And yet
the design of the language extension to MIME parameter values naturally evolved
along the lines of not having a tagging granularity that's too fine. You
yourself are saying here that it is just right.

When the tagging granularity is too fine it is a result of the design
constraints of the protocol being extended. And since we cannot control these
constraints we cannot mandate a particular granularity level for tagging.

> Also, the header syntax is indeed clumsy, but encoded words without
> language tags are already clumsy enough, and the added clumsiness
> is not really much. So it's word-based tagging, somewhat painful
> but not really a big deal, and designed to the constraints of MIME
> and email.

No it isn't, as I have just demonstrated. 

> MLSF, in particular the way it was presented and defended by its
> proponent(s) on the unicore list, is (or hopefully *was*) completely
> different. First, it is pure individual character tagging. A tag
> can be put in at any place whatsoever.

Actually it is nothing of the sort. It is a tagging mechanism that can be used
on arbitrary character sequences, including but not limited to those of length 1.
It is not limited to tagging individual characters, it doesn't make sense
to deliberately use it this way, and I cannot recall any discussion where
someone advocated this sort of usage. Allowing for it as a natural consequence
of combining strings in different languages, perhaps, but not intentional
use.

> Second, it ruined the clean
> properties of UTF-8, risking to blow up a lot of converters in
> unpredictable ways.

Even supposing I agree, which I don't, please explain why this has any
relevance whatsoever to the matter at hand. This is an issue with the overall
design of MLSF, not with its ability to insert language tags at a given level
of granularity. You could use MLSF for other sorts of tagging and this
would not change.

> Third, the design constraints of ACAP didn't
> necessitate it at all.

Again, even supposing I agree, which I don't, please explain why this has any
relevance whatsoever to the matter at hand.

> It was a clear strawman. I remember naively
> proposing alternatives such as using metainformation for language,
> which was rejected on I don't remember what supposedly technical
> reasons.

All you are doing here is demonstrating that in addition to not understanding
how encoded-words work you also don't understand the design constraints ACAP
has to deal with.

> When I subscribed to the ACAP WG mailing list and had a
> look at the archives, I easily found mails discussing language as
> metainformation. But to the outside, this was withheld and denied,
> because some people felt that they *just needed* individual character
> tagging, but were not ready to discuss this technically in true
> IETF manner, but rather preferred to claim that they represented
> the IETF because they correctly assumed that most of their counterparts
> didn't have much of a clue about IETF process.

This may be your assessment of what happened. However, I was involved in all
this, and my assessment of your assessment is that it is entirely specious
and without merit.

> Fred - I don't mind having individual character tagging where it makes
> sense. It is possible with <SPAN> and LANG in HTML. It may make a lot
> of sense in other places. But I have been seriously burnt by claims,
> wrongly based on the IAB report [RFC 2130], that all text in internet
> protocols needs individual character tagging, and by upfront pseudo-
> arguments claiming technical necessities where there existed alternatives.

How exactly have you "been burned" by such claims?

> That has happened, and if I worry about it happening again, that's
> not a strawman at all.

Since you have yet to offer a counterexample that meets my criteria (which I
believe were entirely fair and even generous), I continue to label this as a
strawman. In fact any and all discussion of MLSF is by definition a strawman at
this time since the proposal is no longer even on the table!

> > And this is why I don't want to make rules about the level of granularity that
> > has to be provided: It presupposes not only that protocol designers and the
> > IESG are incompetent to decide these things themselves, it also presupposes
> > that we can at this time know all the constraints designers will be operating
> > under.

> I never proposed to make *rules* about granularity. The only thing
> I proposed was to mention granularity as such, as one (important)
> aspect of language tagging, to help designers get aware of the
> fact that this is an issue, and to avoid claims by people that
> would like to see it like this that "language tagging means
> you have to be able to tag each single character".

The entire purpose of the charset registration document is to specify rules. 
This, like it or not, is the nature of the beast. You will note that the weaker
formal IETF terms MAY and SHOULD are used nowhere in it. This is intentional,
because our experience with registration documents is that advisory text
tends to either be ignored or incorrectly interpreted as a MUST.

As such, I continue to oppose the addition of this text even as a guideline. A
reasonable inference from the statements you have made here is that even as a
guideline you will attempt to use such text to invalidate MLSF, a proposal I
will support should it become necessary to revive it in the future.

				Ned

Received on Sunday, 12 October 1997 18:05:38 UTC