Re: Charset policy - Post Munich from Martin J. Dürst on 1997-09-01 (ietf-charsets@w3.org from July to September 1997)

From: Martin J. Dürst <mduerst@ifi.unizh.ch>
Date: Tue, 02 Sep 1997 00:25:36 +0200 (MET DST)
To: Ned Freed <Ned.Freed@INNOSOFT.COM>
Cc: ietf-charsets@INNOSOFT.COM
Message-id: <Pine.SUN.3.96.970901233820.12451I-100000@enoshima>
Hello Ned - Many thanks for your interesting comments on my comments
on Harald's draft. I'm glad to see that we agree on more points than
in some of our past conversations :-).

> > >    3.1.  What charset to use
> > >
> > >    All protocols MUST identify, for all character data, which charset
> > >    is in use.
> > >
> > >    Protocols MUST be able to use the UTF-8 charset, which consists of
> > >    the ISO 10646 coded character set combined with the UTF-8
> > >    character encoding scheme, as defined in [10646] Annex R
> > >    (published in Amendment 2), for all text.
> > >
> > >    They MAY specify how to use other charsets or other character
> > >    encoding schemes for ISO 10646, such as UTF-16, but lack of an
> > >    ability to use UTF-8 needs clear and solid justification in the
> > >    protocol specification document before being entered into or
> > >    advanced upon the standards track.
> 
> > The above two paragraphs contradict each other. You can't have
> > a MUST and then a MAYbe not on the same point. Either make the
> > first a SHOULD, or make a MUST for ISO 10646/Unicode, and then
> > a SHOULD for UTF-8.
> 
> I fail to see a contradiction here. A protocol must be able to handle UTF-8
> if it handles character data. A protocol may elect to handle other charsets as
> well, possibly including one derived from other transformation formats of
> Unicode.
> 
> What I do see here is poor ordering of what is being proposed. I suggest
> that it instead say:
> 
>     Protocols MUST be able to use the UTF-8 charset, which consists of
>     the ISO 10646 coded character set combined with the UTF-8
>     character encoding scheme, as defined in [10646] Annex R
>     (published in Amendment 2), for all text. Any exceptions
>     must be fully justifiable and the justification must be given in the
>     protocol specification. A protocol which neither supports UTF-8 nor
>     justifies its use of some other charset MUST NOT be entered on the
>     standards track.
> 
>     Protocols MAY also specify how to use other charsets or other character
>     encoding schemes for ISO 10646, such as UTF-16. As always, any protocol
>     that elects to support more than one charset MUST provide a field to
>     label which charset is being used.
> 
> In any case, I'm somewhat opposed to weakening the UTF-8 support requirement to
> a SHOULD. It really is a MUST and needs to be stated as such. We can always
> make exceptions to a MUST on a case by case basis if need be. I doubt very much
> that there will be that many of them.

I think basically, we agree. We want to push UTF-8 as much as possible.
If exceptions to a MUST are possible, then there is no contradiction;
probably Harald is in the best position to decide this.

What I think could become more frequent in the long run are protocols
or formats that want to go with UCS-2 instead of UTF-8. This will
probably be a long process, but we shouldn't put too many blocks in
its way.
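
A minimal sketch of what labeling buys when more than one encoding scheme
of ISO 10646 is in play. Assumptions: Python's "utf-16-be" codec stands in
for UCS-2 (the two coincide for text within the Basic Multilingual Plane),
and the `label` helper and its dictionary layout are hypothetical, not
taken from any protocol.

```python
# Illustrative only: the same text under two ISO 10646 encoding schemes.
text = "Dürst"

utf8_bytes = text.encode("utf-8")        # variable length, ASCII-compatible
ucs2_bytes = text.encode("utf-16-be")    # two bytes per character for BMP text


def label(charset: str, payload: bytes) -> dict:
    """Hypothetical labeled payload: the charset name travels with the bytes."""
    return {"charset": charset, "payload": payload}


messages = [label("utf-8", utf8_bytes), label("utf-16-be", ucs2_bytes)]

for m in messages:
    # Decoding relies on the label, never on guessing from the bytes.
    decoded = m["payload"].decode(m["charset"])
    print(m["charset"], len(m["payload"]), decoded)
```

The byte sequences differ in both length and content, so a receiver that
supports both schemes can only decode reliably if the label is present.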


> > >    In most cases, machines cannot deduce the language of a
> > >    transmitted text by themselves;
> 
> > This is not true. There is enough evidence that for any given
> > set of languages, it is possible to devise or generate software
> > that identifies the language with accuracy converging to 100%
> > as the length of the text increases, and as the amount of
> > effort (e.g. table/dictionary size,...) increases. And once
> > this effort is done, the gap between what humans can find out
> > and what machines can find out is small.
> 
> Everything you say may be true, but it doesn't disprove Harald's statement.

It doesn't disprove Harald's statement if you interpret it the way
he intended it to be interpreted. But there are other ways to
interpret it, namely as a categorical statement, in which case it
is false. Therefore, a more precise wording seems desirable,
which you have suggested below.

> Yes, you may be able to build a machine that deduces language with precision
> approaching 100% as the amount of text increases. However, you have not
> demonstrated that:
> 
> (0) Enough text is always going to be available to make this possible.
> (1) That the 100% point is actually reached. (Convergence to 100% is not
>     the same thing, and in some cases 100% is the only acceptable answer.)
> (2) That the set of languages we use is always closed.
> (3) Machines in the real world will universally be retrofitted to have this
>     capability.
> 
> (0), (2), and (3) are in fact demonstrably false.

I agree. But then, there are cases where humans also fail, even
if they know all the languages involved perfectly. For example,
can somebody tell me whether the word "burro" is Italian or Spanish?
The meaning changes completely depending on which language you assume.
In human communication, there is usually enough redundancy
and context to resolve this, or the uncertainty is intended.
I think that it is important that protocol designers understand
that language tags are not something inherently required by the
fact that humans use various languages, but that they may be
required by the fact that computers have different (actual or
theoretical) capabilities and may lack context. This will help
protocol designers to achieve an adequate solution for the
needs of their protocols.
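
The "burro" case can be sketched with a toy dictionary-based identifier.
Everything here is an assumption made for illustration: the word lists
are deliberately tiny and hand-picked, and `candidate_languages` is a
hypothetical helper, not a real language identifier.

```python
# Hypothetical, deliberately tiny word lists; a real identifier would use
# far larger tables or statistical models.
VOCAB = {
    "it": {"il", "burro", "è", "sul", "tavolo"},
    "es": {"el", "burro", "está", "en", "la", "mesa"},
}


def candidate_languages(text: str) -> set[str]:
    """Return every language tied for the most vocabulary hits."""
    words = text.lower().split()
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in VOCAB.items()
    }
    best = max(scores.values())
    if best == 0:
        return set()  # nothing recognised: no basis for a guess
    return {lang for lang, score in scores.items() if score == best}


print(candidate_languages("burro"))                  # both languages remain
print(candidate_languages("il burro è sul tavolo"))  # context settles it
```

With the single word, both languages stay tied; with a full sentence the
surrounding words break the tie. A language tag supplies the same
disambiguation when the context is missing.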


> Moreover, the point here, that machine recognition of the language being used
> cannot be relied upon, is a damned important one that should not be left out. I
> really want to forestall finding domain-->language tag tables in some product
> somewhere. As such, to forestall further argument I suggest that the paragraph
> be reworded to say that at the present time most machines lack the facilities
> to deduce language from content.

Good point.


> >    Please note that language information as such is not needed
> >    for the end user; humans have no problem identifying the
> >    languages they know and separating them from those they
> >    don't know.
> 
> This point, on the other hand, is demonstrably false since I have a specific
> counterexample of my own to offer.

Sorry, but yours is not a counterexample. You are speaking about
identifying languages you don't know, whereas I explicitly speak about
languages that somebody knows. For the above statement to be true,
you would only need to be able to distinguish English from Spanish from
French from everything else (lumped together). I guess you have no
problems with that.


> I routinely deal with customers in over 50
> countries, quite a few of which either use multiple languages or else don't
> have domain names that let me deduce country and hence probable language. And I
> occasionally receive messages from these places written in a language other
> than English, French or Spanish, hence outside my admittedly limited linguistic
> skills and limited dictionary set I keep handy.
> 
> And moreover, I sometimes cannot figure out what language is being used. (A lot
> of the ones I get look like German to me but aren't. Hey, what can I say, my
> education in this regard was terrible.) And this actually matters to me, since
> depending on the language I'll take the message to different people in the
> office or else forward it to various people I know for translation. A language
> tag would certainly help me in these cases, although I regret to say that such
> tags are rarely used in practice.
> 
> The basic problem here is that you're assuming communication is between people
> who know each other. This is usually but not always the case, and when it isn't
> these tags may actually be useful to a human reader.
> 
> I therefore do not support the addition of this text, as it will inevitably
> lead to cases where language tags will be omitted when they could have been
> useful.

This is a good point. You give an example of a situation that I didn't
consider. I guess it is worth mentioning this situation explicitly,
for example as follows:

Note that while humans are easily able to identify those languages
they know and to distinguish them from those they don't know, there
are situations, for example in worldwide customer support, where it
is very useful for people to be able to identify languages they don't know.


> > >    4.2.  Requirement for language tagging
> > >
> > >    Protocols that transfer text MUST provide for carrying information
> > >    about the language of that text.
> 
> > This is most probably too strong.
> 
> > What about:
> 
> > Protocols that transfer text MUST provide for carrying language
> > information to the extent and in the granularity that this is
> > necessary and appropriate for the operations that the text in
> > the protocol is generally intended and used for.
> 
> This, on the other hand, is too wishy-washy. We need these tags and we need
> for them to be used a lot more than they currently are. What we do not 
> need is to have lots of debates about whether or not a given protocol
> needs such a field. It is far better to have fields we end up not using
> than to need fields we do not have.

I agree. But the MLSF debate, among others, has shown that there
are opinions in the direction of "if it's not possible to tag
every single character with a language, it can't be called language
tagging". For many protocols and applications, this is clearly overkill.
My text above was intended to tell protocol designers to seriously
think about the interactions between their protocol and language,
and not to just cover everything with tags in fear the IESG might
otherwise reject their proposal. Maybe you know a different wording
that is more precise but has the same desired effect?



> > >    4.3.  How to identify a language
> > >
> > >    The RFC 1766 language tag is at the moment the most flexible tool
> > >    available for identifying a language; protocols SHOULD use this,
> > >    or provide clear and solid justification for doing otherwise in
> > >    the document.
> > >
> > >    In particular, claiming that a language can be deduced from the
> > >    charset in use is erroneous and will not be accepted.
> 
> > Correct. But isn't this all too obvious, given things like
> > iso-8859-1? I don't think you need this in any way to be able
> > to reject such claims should they ever come up.
> 
> Well, it may be true that everyone knows you cannot deduce language
> from iso-8859-1. But what about iso-2022-jp?
> 
> The point here is that claims of a _limited_ ability to deduce language
> from _some_ charsets have in fact been made, and we need language that
> says such claims are unacceptable no matter what.

Okay. The original sentence seems too obvious because it appears in
the context of a whole protocol. Because I haven't yet seen an IETF
protocol that uses only iso-2022-jp, it sounded too obvious to me.
If it can be reworded so as to be clearly framed as advice
to implementors (please put a language tag in even if you think
the language is obvious from the current charset) and not as
advice to protocol designers (don't come to the IESG with a
protocol that has no language tags and claim that these can
be derived from the charsets), then I'm all for it.



> > >    In some cases, a negotiation where the client proposes a set of
> > >    languages and the server replies with one is appropriate; in other
> > >    cases, supplying information in all available languages is a
> > >    better solution; most sites will either have very few languages
> > >    installed or be willing to pay the overhead of sending error
> > >    messages in many languages at once.
> 
> > I don't agree. There may be only few sites that have many
> > languages available, but those may be contacted by users
> > with special language needs that can't afford the bandwidth
> > (even if the server side providing these many languages has
> > no problem with the bandwidth).
> 
> So what? Harald didn't say that implementations have to provide responses
> in multiple languages, merely that providing responses in multiple languages
> is a viable approach. And it is viable -- indeed, I have customers that
> require it.

It's again a problem of framing, whether it's the whole protocol
or a single transaction. On the protocol level, providing only
for "you always get all languages" is clearly unacceptable for
the reasons I have stated. It just doesn't scale, and may hit
those that can deal least with it. But what Harald says, as
far as I understand it, is on the protocol level, and therefore
should be changed.

If your point is that some customers demand that e.g. messages
be sent in multiple languages AT THE SAME TIME, then this is
an interesting new finding. The ACAP language negotiation
facility, for example, wouldn't cover such a case.
It may be worth mentioning this point, at least so that
protocol designers can consider it.
But again, a protocol that does not have any kind of language
negotiation and just sends all available languages all the time
is not a solution.
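
A minimal sketch of the negotiation case, where the client proposes an
ordered set of languages and the server replies with one. The function
name, the fallback to the primary subtag, and the use of None for "no
match" are all assumptions made for illustration; they are not taken
from ACAP, HTTP, or any other protocol.

```python
from typing import Optional


def negotiate(client_prefs: list[str], server_langs: list[str]) -> Optional[str]:
    """Pick one server language for an ordered list of client preferences.

    Tags are compared case-insensitively. If a full tag does not match,
    its primary subtag (the part before the first "-") is tried.
    Returns None when nothing matches, leaving the fallback to the caller.
    """
    available = {tag.lower(): tag for tag in server_langs}
    for pref in client_prefs:
        wanted = pref.lower()
        if wanted in available:
            return available[wanted]
        primary = wanted.split("-")[0]
        if primary in available:
            return available[primary]
    return None


# A server with many languages installed sends only the one selected.
print(negotiate(["de-CH", "fr"], ["en", "fr", "de"]))
```

Only a single language crosses the wire per transaction, so a client on
a slow link is not charged for every language the server happens to have.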


> > Also, there is an increasing tendency for products to ship
> > with all language versions integrated. For a NS or MS server,
> > you won't buy a specific language version anymore very soon
> > in the future.
> 
> I fail to see the point here.

If the HTTP protocol specified "always send all languages", even
just for warnings, let alone for documents, we may be hit badly.


> > >    Negotiation is useful in the case where one side of the protocol
> > >    exchange is able to present text in multiple languages to the
> > >    other side, and the other side has a preference for one of these;
> > >    the most common example is the text part of error responses, or
> > >    Web pages that are available in multiple languages.
> 
> > The "one side is able" is somewhat dangerous here. A WG may
> > just come and tell you: Our servers all just do English,
> > they are not able to do anything else, so this doesn't apply.
> 
> The reality is that implementations are going to do this whether we
> like it or not. We can require what we like of implementations in
> terms of support of multiple languages and we'll just be ignored.
> 
> In other words, there's a real danger here, but it isn't something we can do
> much of anything about, and as such this clause is almost entirely harmless.

I agree that we can't do too much if a protocol only gets implemented
in one language. We cannot force people to implement other languages.

But it's again a question of framing. What I wanted to say is that
we don't want people to reason from the implementation
level (only English implementations available) to the protocol
level (therefore, according to the policy, we don't need language
negotiation).
Given the requirements for interoperability tests to advance
standards, we get into a real conflict situation. Probably the
right solution in such a conflict is to keep the (untested)
language negotiation facilities in a non-normative appendix,
but not to allow them to just die.



> > Please add something like the following:
> 
> >    6. Documentation
> 
> >    Protocols MUST appropriately document the decisions they have
> >    taken with respect to charsets, language information, and other
> >    aspects related to internationalization and multilinguality.
> >    A format such as that currently used for Security Issues is
> >    (highly) recommended.
> 
> I would add that they must document their rationale as well as the
> decisions.

Good point!


> > Another thing, which should probably go into section 2 or so,
> > and which seems needed as a response to some of the questions
> > in the plenary in Munich, is a clarification of which protocol
> > in a protocol stack is responsible for charset and language
> > information. I'm not sure that I have found the best way
> > to express this, but it could read as follows:
> 
> >    Note that in a protocol stack, it is the responsibility of
> >    the highest layer that uses the text to appropriately label
> >    it. As an example, it is the responsibility of the standard
> >    for mail messages to assure things get correctly labeled in
> >    mail messages, even if those are sent over SMTP. It is the
> >    responsibility of SMTP to correctly label text which is
> >    exchanged as part of the SMTP protocol and is intended for
> >    end-user consumption, even if SMTP is run over TCP/IP.
> >    It would be the responsibility of IP to label text correctly
> >    if it ever would consider using text in its protocol elements
> >    (as opposed to transporting text in its payload).
> 
> I agree that this is an important point. I also think this is as 
> good an attempt as I've seen to describe the requirements in this area.

Thanks. One small correction. It should say "responsibility of the
standard for XXX to assure things *can* be [correctly] labeled".
An implementation can at least try to achieve correctness, while
there is not much for a protocol to do about it.


Regards,	Martin.


Received on Monday, 1 September 1997 15:38:09 UTC