- From: Ned Freed <Ned.Freed@INNOSOFT.COM>
- Date: Fri, 10 Oct 1997 08:49:12 -0700 (PDT)
- To: Martin J. Dürst <mduerst@ifi.unizh.ch>
- Cc: Ned Freed <Ned.Freed@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM
> > Please note that designs like MIME encoded-words and MLSF which do allow
> > individual character tagging, do NOT qualify. The fact that these designs allow
> > individual character tagging (albeit in a very painful and artificial way) is an
> > artefact of the design constraints these things operate under.

> You guessed in the right direction, obviously. But I see a big difference
> between MIME encoded words (as extended in the pvcsc spec) and MLSF.

Two additional points need to be made here:

(1) MLSF is no longer on the table as a proposal at the present time. This may change, however, if the UTC and ISO do not deliver on embedded language tags.

(2) If MLSF needs to be revived I for one will support it. As such, I have no intention of putting something in a document that would prevent this from happening unless and until I am specifically directed to do so by either an AD or a WG chair with jurisdiction over the charset registration specification. You can object all you want, but since this is not a WG there's no chair, so unless you can convince either Harald or Keith that such language belongs in the document it simply is not going to happen.

> On the technical side, PVCSC does not apply to characters, it applies
> to encoded words. Encoded words have to be separated by linear white
> space (which is not removed when decoding the encoded words, as far
> as I understand), and can only have one language.

I'm afraid your understanding is totally incorrect. From RFC2047:

   When displaying a particular header field that contains multiple
   'encoded-word's, any 'linear-white-space' that separates a pair of
   adjacent 'encoded-word's is ignored. (This is to allow the use of
   multiple 'encoded-word's to represent long strings of unencoded text,
   without having to separate 'encoded-word's where spaces occur in the
   unencoded text.)

The problem is that encoded-words have length restrictions and restrictions on how they can be presented.
The two combine to produce a requirement that it be possible to use multiple encoded-words to represent a long string. And this of course opens the door up to having each character in a different charset and language.

I also note in passing that your fundamental misunderstanding of encoded-words means either that you have never implemented any of this or that, if you have, you haven't done it properly.

And I must confess that I am very disappointed by this. I had always assumed that you had substantive experience with both charset design and implementation of charset support -- experience that far exceeded my own, and that our present disagreement arose mostly out of a disconnect between the way the IETF does business and what you've seen happen in other venues. In fact I have even gone so far as to recommend you as someone with a good grasp of these issues.

I now see that my assessment was wrong. And I hasten to add that any fault -- if fault is the right word -- is mine and mine alone -- you never misrepresented your abilities or experience. I simply assumed too much, and now have to revise my opinion.

But the fact of the matter is that this is one of those things where if you haven't either implemented it yourself or tried to support an implementation in real-world use you cannot possibly know, let alone understand, the real-world issues that come up.

I've done both -- I've done not one but two implementations of encoded-word support, both in commercial products that are widely used in over 50 countries, and I have done ongoing support work on both and continue to do so at the present time. And let me tell you that the handling of encoded words containing characters in multibyte charsets is in fact quite tricky and difficult to get right. And language tags make it even worse -- I have preliminary support in place for them so I think it is at least doable, but I'm a ways away from actually trusting that I have all the details right.
(And in this case there's an entire pantheon of devils in the details.)

> The same for the
> language specification for parameters defined in PVCSC, it is
> one language per parameter, which is not individual character tagging.

This is true only because the design space allowed it and the design was actually simplified by imposing this restriction. Had the design space not allowed it (as it doesn't for encoded-words) or had the design been made overly complex by having this restriction it would not be there.

And I believe this again demonstrates the essential fallacy that lies behind your entire train of thought here. This mechanism was designed _long_ before you started talking about the evils of per-character language tags. And yet the design of the language extension to MIME parameter values naturally evolved along the lines of not having a tagging granularity that's too fine. You yourself are saying here that it is just right.

When the tagging granularity is too fine it is a result of the design constraints of the protocol being extended. And since we cannot control these constraints we cannot mandate a particular granularity level for tagging.

> Also, the header syntax is indeed clumsy, but encoded words without
> language tags are already clumsy enough, and the added clumsiness
> is not really much. So it's word-based tagging, somewhat painful
> but not really a big deal, and designed to the constraints of MIME
> and email.

No it isn't, as I have just demonstrated.

> MLSF, in particular the way it was presented and defended by it's
> proponent(s) on the unicore list, is (or hopefully *was*) completely
> different. First, it is pure individual character tagging. A tag
> can be put in at any place whatsoever.

Actually it is nothing of the sort. It is a tagging mechanism that can be used on arbitrary character sequences, including but not limited to those of length 1.
It is not limited to tagging individual characters, it doesn't make sense to deliberately use it this way, and I cannot recall any discussion where someone advocated this sort of usage. Allowing for it as a natural consequence of combining strings in different languages, perhaps, but not intentional use.

> Second, it ruined the clean
> properties of UTF-8, risking to blow up a lot of converters in
> unpredictable ways.

Even supposing I agree, which I don't, please explain why this has any relevance whatsoever to the matter at hand. This is an issue with the overall design of MLSF, not with its ability to insert language tags at a given level of granularity. You could use MLSF for other sorts of tagging and this would not change.

> Third, the design constraints of ACAP didn't
> necessitate at all.

Again, even supposing I agree, which I don't, please explain why this has any relevance whatsoever to the matter at hand.

> It was a clear strawman. I remember naively
> proposing alternatives such as using metainformation for language,
> which was rejected on I don't remember what supposedly technical
> reasons.

All you are doing here is demonstrating that in addition to not understanding how encoded-words work you also don't understand the design constraints ACAP has to deal with.

> When I subscribed to the ACAP WG mailing list and had a
> look at the archives, I easily found mails discussing language as
> metainformation. But to the outside, this was withheld and denied,
> because some people felt that they *just needed* individual character
> tagging, but were not ready to discuss this technically in true
> IETF manner, but rather preferred to claim that they represented
> the IETF because they correctly assumed that most of their counterparts
> didn't have much of a clue about IETF process.

This may be your assessment of what happened. However, I was involved in all this, and my assessment of your assessment is that it is entirely specious and without merit.

> Fred - I don't mind having individual character tagging where it makes
> sense. It is possible with <SPAN> and LANG in HTML. It may make a lot
> of sense in other places. But I have been seriously burnt by claims,
> wrongly based on the IAB report [RFC 2070], that all text in internet
> protocols needs individual character tagging, and by upfront pseudo-
> arguments claiming technical necessities where there existed alternatives.

How exactly have you "been burned" by such claims?

> That has happened, and if I worry about it happening again, that's
> not a strawman at all.

Since you have yet to offer a counterexample that meets my criteria (which I believe were entirely fair and even generous), I continue to label this as a strawman. In fact any and all discussion of MLSF is by definition a strawman at this time since the proposal is no longer even on the table!

> > And this is why I don't want to make rules about the level of granularity that
> > has to be provided: It presupposes not only that protocol designers and the
> > IESG are incompetent to decide these things themselves, it also presupposes
> > that we can at this time know all the constraints designers will be operating
> > under.

> I never proposed to make *rules* about granularity. The only thing
> I proposed was to mention granularity as such, as one (important)
> aspect of language tagging, to help designers get aware of the
> fact that this is an issue, and to avoid claims by people that
> would like to see it like this that "language tagging means
> you have to be able to tag each single character".

The entire purpose of the charset registration document is to specify rules. This, like it or not, is the nature of the beast. You will note that the weaker formal IETF terms MAY and SHOULD are used nowhere in it. This is intentional, because our experience with registration documents is that advisory text tends to either be ignored or incorrectly interpreted as a MUST.
As such, I continue to oppose the addition of this text even as a guideline.

A reasonable inference from the statements you have made here is that even as a guideline you will attempt to use such text to invalidate MLSF, a proposal I will support should it become necessary to revive it in the future.

				Ned
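
For reference, the RFC 2047 display rule quoted earlier in this message (white space between two adjacent encoded-words is ignored) can be checked with a minimal sketch. It assumes Python's standard `email.header.decode_header`; the helper name `display_text` and the sample header values are hypothetical, made up purely for illustration, and this is not a description of any implementation mentioned in the message.

```python
from email.header import decode_header


def display_text(raw_header: str) -> str:
    """Decode a header value and join the pieces as they would be displayed.

    Hypothetical helper: it relies on decode_header() to drop the linear
    white space that separates two adjacent encoded-words.
    """
    pieces = []
    for chunk, charset in decode_header(raw_header):
        # Encoded parts come back as bytes plus a charset name; plain
        # parts come back as bytes (or str) with charset None.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        pieces.append(chunk)
    return "".join(pieces)


# Two adjacent encoded-words, each holding one character in a different
# charset. The space between them is not part of the displayed text.
two_charsets = "=?ISO-8859-1?Q?=E9?= =?ISO-8859-2?Q?=B3?="
print(repr(display_text(two_charsets)))

# An encoded-word followed by ordinary text keeps the separating space.
mixed = "=?ISO-8859-1?Q?a?= b"
print(repr(display_text(mixed)))
```

The first sample yields a two-character string with no space in it, one character per charset, which is the fine-grained case discussed above. The second shows that the white space is dropped only between two encoded-words, not between an encoded-word and unencoded text.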
Received on Sunday, 12 October 1997 18:05:38 UTC