RE: Charset policy - Post Munich

On Fri, 10 Oct 1997, Ned Freed wrote:

> > On the technical side, PVCSC does not apply to characters, it applies
> > to encoded words. Encoded words have to be separated by linear white
> > space (which is not removed when decoding the encoded words, as far
> > as I understand), and can only have one language.  
> 
> I'm afraid your understanding is totally incorrect. From RFC2047:
>  
>    When displaying a particular header field that contains multiple
>    'encoded-word's, any 'linear-white-space' that separates a pair of
>    adjacent 'encoded-word's is ignored.  (This is to allow the use of
>    multiple 'encoded-word's to represent long strings of unencoded text,
>    without having to separate 'encoded-word's where spaces occur in the
>    unencoded text.)

Many thanks for pointing this out. I confess that I was a little bit
careless, and didn't check RFC2047 from front to end. And I should
have known, because, as you say, the length restrictions on encoded
words would otherwise make it impossible to encode long words, or
sentences in those languages that don't use whitespace to separate
their words.
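
Just to spell out what this rule implies for a decoder, here is a
rough sketch (Python, purely illustrative; the helper names and the
example header are made up, and it only handles the adjacency rule,
not full RFC 2047 parsing):

    import base64, re

    ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

    def decode_word(match):
        charset, encoding, payload = match.groups()
        if encoding.upper() == "B":
            raw = base64.b64decode(payload)
        else:  # Q encoding: '_' stands for space, =XX is a hex-encoded byte
            text = payload.replace("_", " ").encode("ascii")
            raw = re.sub(b"=([0-9A-Fa-f]{2})",
                         lambda m: bytes([int(m.group(1), 16)]),
                         text)
        return raw.decode(charset)

    def decode_header(value):
        out, pos, prev_was_encoded = [], 0, False
        for m in ENCODED_WORD.finditer(value):
            between = value[pos:m.start()]
            # RFC 2047: white space between two adjacent encoded-words is
            # ignored; other intervening text (and its spaces) is kept as-is.
            if not (prev_was_encoded and between.strip() == ""):
                out.append(between)
            out.append(decode_word(m))
            prev_was_encoded = True
            pos = m.end()
        out.append(value[pos:])
        return "".join(out)

    # Two adjacent encoded-words separated by folding white space; the space
    # between them disappears on decoding.
    header = "=?ISO-8859-1?Q?Langes_deut?= =?ISO-8859-1?Q?sches_Wort?="
    print(decode_header(header))   # -> Langes deutsches Wort

So the intervening folding white space disappears, and an arbitrarily
long word can be carried by a sequence of adjacent encoded-words.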



> I also note in passing that your fundamental misunderstanding of encoded-words
> means either you have never implemented any of this or, if you have, you
> haven't done it properly.

This is correct.


> And I must confess that I am very disappointed by
> this. I had always assumed that you had substantive experience with both
> charset design and implementation of charset support -- experience that far
> exceeded my own, and that our present disagreement arose mostly out of a
> disconnect between the way the IETF does business and what you've seen happen
> in other venues. In fact I have even gone so far as to recommend you as someone
> with a good grasp of these issues.
> 
> I now see that my assessment was wrong. And I hasten to add that any fault --
> if fault is the right word -- is mine and mine alone -- you never
> misrepresented your abilities or experience. I simply assumed too much, and now
> have to revise my opinion.

Many thanks for not blaming me! And you probably don't even have to
blame yourself. It is of course difficult, and somewhat immodest, to
try to judge myself in any way, but I would not want to deny that I
have a certain grasp of some of these issues, and I think you are not
the first to recommend me as you described above.


Where you assumed too much was in inferring, from whatever experience
I seem to possess in the fields of internationalization and
multilingualism, actual experience in implementing RFC 1522/2047.
Because that may (or may not) be your prime point of contact with
these issues, and because you are most probably the single top
expert worldwide on MIME, it's not too difficult to see why that
happened.


To help you avoid such surprises in the future, here is a list of
some of the things I have done (in terms of implementation, i.e.
actual programming, and not including unrelated topics):

- Implemented a general architecture for character encoding conversion,
	to and from Unicode, for an object-oriented application framework,
	including about twenty encodings, and code-guessing for Japanese,
	Korean, etc. This was mainly for pure plain-text files, but the
	underlying input/output architecture with streams and stacked
	filters/buffers would make it not too difficult to use for MIME,
	or for other kinds of in-text code switching such as pre-Word97
	RTF files.

- Implemented a general localization architecture (for the same framework)
	that allows menu languages to be changed on the fly, and separately
	for separate windows of the same application, and that avoids
	requiring the programmer to change internal code (no need
	to include "gettext" calls, etc.). [This is in use in an actual
	product, although there it is currently limited to Latin-1.]

- Implemented a general framework for keyboard input, including input
	for Korean Hangul (with backtracking) and Japanese (SKK) as
	well as many simpler cases.

- Implemented a general framework for multilingual/multiscript text
	display capable of handling things such as Arabic, Tamil,
	and CJK glyph disambiguation, with flexible fallback mechanisms
	in case fonts are missing or incomplete.

(all the above in C++)

The above framework was also used in an actual mail UA, which was
developed in Montreal as a University-Industry collaboration, and
which reached alpha stage and is still available on the net.
For proprietary reasons, I have never seen the source of that mailer.


- Built a database and manipulation software for Japanese Kanji composition/
	decomposition, written in Prolog so that flexible queries can
	easily be made.

In all of the above, there are things I would do the same way again,
things I would vary depending on circumstances, things I would
like to add if I had time, and of course things I would do
somewhat or even completely differently.


Please judge for yourself whether the experience listed above "far
exceeds your own" or not. And as for disagreements, I have to say
that there probably isn't a single expert in the UTC or the ISO
bodies or the IETF or W3C WGs I have been in contact with who
wouldn't disagree with me on one issue or another. And that's
probably the same in every technical field.


> And let me tell you that the
> handling of encoded words containing characters in multibyte charsets is in
> fact quite tricky and difficult to get right.

This is indeed true: as far as I have read, an encoded word is
required to contain only whole characters, so you have to know
where the character boundaries are.
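
As a rough illustration of the problem (Python again, purely
illustrative; the splitting strategy and the choice of UTF-8 as the
example charset are my own assumptions, not anything mandated by
RFC 2047): each encoded-word must stay within the 75-character limit
and still contain only complete characters, so a splitter has to
advance character by character and never cut inside a multibyte
sequence.

    import base64

    MAX_ENCODED_WORD = 75   # RFC 2047 limit on one encoded-word

    def split_into_encoded_words(text, charset="UTF-8"):
        overhead = len("=?" + charset + "?B??=")
        words, chunk = [], ""
        for ch in text:              # walk by character, never by byte
            candidate = chunk + ch
            encoded = base64.b64encode(candidate.encode(charset))
            if len(encoded) + overhead > MAX_ENCODED_WORD and chunk:
                words.append("=?%s?B?%s?=" % (
                    charset,
                    base64.b64encode(chunk.encode(charset)).decode("ascii")))
                chunk = ch
            else:
                chunk = candidate
        if chunk:
            words.append("=?%s?B?%s?=" % (
                charset,
                base64.b64encode(chunk.encode(charset)).decode("ascii")))
        # Adjacent encoded-words are rejoined with folding white space,
        # which the decoder drops again, so the text comes back intact.
        return " ".join(words)

    print(split_into_encoded_words("長い日本語の見出しを複数の語に分割する例"))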


> > The same for the
> > language specification for parameters defined in PVCSC, it is
> > one language per parameter, which is not individual character tagging.
> 
> This is true only because the design space allowed it and the design was
> actually simplified by imposing this restriction. Had the design space not
> allowed it (as it doesn't for encoded-words) or had the design been made overly
> complex by having this restriction it would not be there.

No problem with that. I never said that you have to use larger
granularity at all costs. The only thing I want to say is that
granularity is an issue: think about what it means for each
protocol, in particular if you have design choices that are
otherwise comparable. And because more protocols will hopefully
have language tag support from the beginning, they *will* have
more choices and fewer constraints.
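
For concreteness, here is a tiny illustrative sketch (Python; the
function and the example values are made up) of the one-language-
per-parameter form, roughly the charset'language'percent-encoded-value
syntax used for extended parameters in RFC 2184:

    import urllib.parse

    def extended_parameter(name, value, charset="UTF-8", language="de"):
        # One charset and one language tag for the whole parameter value;
        # the value itself is percent-encoded octets in that charset.
        encoded = urllib.parse.quote(value.encode(charset), safe="")
        return "%s*=%s'%s'%s" % (name, charset, language, encoded)

    print(extended_parameter("filename", "Grüße.txt"))
    # -> filename*=UTF-8'de'Gr%C3%BC%C3%9Fe.txt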


Regards,	Martin.


