- From: Martin J. Dürst <mduerst@ifi.unizh.ch>
- Date: Fri, 10 Oct 1997 22:33:57 +0100 (MET)
- To: Ned Freed <Ned.Freed@INNOSOFT.COM>
- Cc: ietf-charsets@INNOSOFT.COM
On Fri, 10 Oct 1997, Ned Freed wrote:

> > On the technical side, PVCSC does not apply to characters, it applies
> > to encoded words. Encoded words have to be separated by linear white
> > space (which is not removed when decoding the encoded words, as far
> > as I understand), and can only have one language.
>
> I'm afraid your understanding is totally incorrect. From RFC2047:
>
>    When displaying a particular header field that contains multiple
>    'encoded-word's, any 'linear-white-space' that separates a pair of
>    adjacent 'encoded-word's is ignored. (This is to allow the use of
>    multiple 'encoded-word's to represent long strings of unencoded text,
>    without having to separate 'encoded-word's where spaces occur in the
>    unencoded text.)

Many thanks for pointing this out. I confess that I was a little bit careless and didn't check RFC 2047 from front to end. And I should have known: as you say, the length restrictions on encoded words would otherwise make it impossible to encode long words, or sentences in those languages that don't use white space to separate their words.
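As a minimal sketch of the rule quoted above (not a complete RFC 2047 implementation; the function names are only for illustration), the following Python fragment decodes "B" and "Q" encoded-words and drops the linear white space between two adjacent encoded-words before display:

    import base64
    import quopri
    import re

    ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

    def _decode_one(match):
        charset, encoding, payload = match.groups()
        if encoding.lower() == "b":
            raw = base64.b64decode(payload)
        else:                              # "Q": underscore stands for space
            raw = quopri.decodestring(payload.replace("_", "=20").encode("ascii"))
        return raw.decode(charset)

    def display_header(value):
        # RFC 2047: linear white space between adjacent encoded-words is
        # ignored when the field is displayed, so strip it before decoding.
        value = re.sub(r"(\?=)\s+(=\?)", r"\1\2", value)
        return ENCODED_WORD.sub(_decode_one, value)

    print(display_header("=?ISO-8859-1?Q?Hyv=E4=E4?= =?ISO-8859-1?Q?_p=E4iv=E4=E4?="))
    # -> Hyvää päivää  (the space comes from the encoded "_", not from the
    #    white space that separated the two encoded-words)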
> I also note in passing that your fundamental misunderstanding of
> encoded-words means either you have never implemented any of this or,
> if you have, you haven't done it properly.

This is correct.

> And I must confess that I am very disappointed by this. I had always
> assumed that you had substantive experience with both charset design and
> implementation of charset support -- experience that far exceeded my own,
> and that our present disagreement arose mostly out of a disconnect between
> the way the IETF does business and what you've seen happen in other
> venues. In fact I have even gone so far as to recommend you as someone
> with a good grasp of these issues.
>
> I now see that my assessment was wrong. And I hasten to add that any
> fault -- if fault is the right word -- is mine and mine alone -- you never
> misrepresented your abilities or experience. I simply assumed too much, and
> now have to revise my opinion.

Many thanks for not blaming me! And probably you don't even have to blame yourself. It is of course difficult, and also rather immodest, to judge myself in any way, but I would not want to deny that I have a certain grasp of some of these issues, and I think you are not the first to recommend me in the way you describe above.

Where you assumed too much was in inferring, from whatever experience I seem to possess in the fields of internationalization and multilingualism, actual experience in implementing RFC 1522/2047. Because this may (or may not) be your prime point of contact with these issues, and because you are most probably the single top expert worldwide on MIME, it is not hard to see why that happened.

To help you avoid such surprises in the future, here is a list of some of the things I have done (in terms of implementation, i.e. actual programming, and not including unrelated topics):

- Implemented a general architecture for character encoding conversion, to and from Unicode, for an object-oriented application framework, including about twenty encodings and encoding guessing for Japanese, Korean, ... This was mainly for pure plain-text files, but the underlying input/output architecture, with streams and stacked filters/buffers, would make it not too difficult to use for MIME or for other kinds of in-text code switching, such as pre-Word97 RTF files.

- Implemented a general localization architecture (for the same framework) that allows menu languages to be changed on the fly, and separately for separate windows of the same application, without the programmer having to change the internal code (no need to insert "gettext" calls, ...). [This is in use in an actual product, although there it is currently limited to Latin-1.]

- Implemented a general framework for keyboard input, including input for Korean Hangul (with backtracking) and Japanese (SKK) as well as many simpler cases.

- Implemented a general framework for multilingual/multiscript text display capable of handling things such as Arabic, Tamil, and CJK glyph disambiguation, with flexible fallback mechanisms in case fonts are missing or incomplete.

(All of the above in C++.) The above framework was also used in an actual mail UA, which was developed in Montreal as a university-industry collaboration, and which reached alpha stage and is still available on the net. For proprietary reasons, I have never seen the source of that mailer.

- Built a database and manipulation software for Japanese Kanji composition/decomposition, written in Prolog so that flexible queries can easily be made.

In all of the above, there are things I would do the same way again, things I would vary depending on circumstances, things I would like to add if I had time, and of course things I would do somewhat or even completely differently.

Please judge whether the experience listed above "far exceeds your own" or not. And as for disagreements, I have to say that there is probably not a single expert in the UTC, the ISO bodies, or the IETF and W3C WGs I have been in contact with who wouldn't disagree with me on one issue or another. And that is probably the same in every technical field.

> And let me tell you that the handling of encoded words containing
> characters in multibyte charsets is in fact quite tricky and difficult to
> get right.

Given the requirement that, as far as I have read, an encoded word must contain only whole characters, this is indeed true: you have to know where the character boundaries are. (A small sketch of splitting at character boundaries follows below.)

> > The same for the language specification for parameters defined in PVCSC,
> > it is one language per parameter, which is not individual character
> > tagging.
>
> This is true only because the design space allowed it and the design was
> actually simplified by imposing this restriction. Had the design space not
> allowed it (as it doesn't for encoded-words) or had the design been made
> overly complex by having this restriction it would not be there.

No problem with that. I never said that you have to use larger granularity at all costs. The only thing I want to say is that granularity is an issue: think about what it means for each protocol, in particular when the design choices are otherwise comparable. And because more protocols will hopefully have language-tag support from the beginning, they *will* have more choices and fewer constraints.
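A small illustration of this one-language-per-parameter granularity, assuming the extended-parameter syntax of PVCSC (published as RFC 2231), for example title*=us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A: the whole parameter value carries exactly one charset and one language tag.

    from urllib.parse import unquote

    def parse_extended_param(value):
        # value looks like: charset'language'percent-encoded-text
        charset, language, encoded = value.split("'", 2)
        return language, unquote(encoded, encoding=charset or "us-ascii")

    lang, text = parse_extended_param("us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A")
    print(lang, text)        # -> en-us This is ***fun***

Whatever the value contains, the one language tag applies to it as a whole; finer-grained tagging would need a different mechanism.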
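And here is a minimal sketch of the character-boundary point above, assuming UTF-8 and the base64 "B" encoding (the helper names are hypothetical): when a long string is split over several encoded-words, each word must contain an integral number of characters, so the split points have to fall on character boundaries rather than at arbitrary byte offsets.

    import base64

    MAX_ENCODED_WORD = 75   # RFC 2047 length limit for a single encoded-word

    def make_word(charset, chunk_bytes):
        return "=?%s?B?%s?=" % (charset, base64.b64encode(chunk_bytes).decode("ascii"))

    def encode_words(text, charset="UTF-8"):
        words, chunk = [], b""
        for ch in text:                           # iterate character by character
            candidate = chunk + ch.encode(charset)
            if len(make_word(charset, candidate)) > MAX_ENCODED_WORD and chunk:
                words.append(make_word(charset, chunk))  # flush at a character boundary
                chunk = ch.encode(charset)
            else:
                chunk = candidate
        if chunk:
            words.append(make_word(charset, chunk))
        return " ".join(words)   # the separating white space is ignored on decoding

    print(encode_words("これは長い日本語のヘッダの例です" * 3))

Splitting the raw byte string at a fixed offset instead could cut a multibyte character in half, which is exactly what makes this tricky to get right.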
Regards,    Martin.

Received on Sunday, 12 October 1997 18:04:41 UTC