Re: Charset policy - Post Munich

Chris Weider (cweider@microsoft.com)
Mon, 01 Sep 1997 15:05:38 -0700


Date: Mon, 01 Sep 1997 15:05:38 -0700
From: Chris Weider <cweider@microsoft.com>
Subject: RE: Charset policy - Post Munich
To: 'Ned Freed' <Ned.Freed@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM
Message-id: <C17B22BA6992CF11857C00805FD4098201ED166B@RED-94-MSG.dns.microsoft.com>

I think Ned is completely correct here. The workshop report thought long
and hard about requiring language tagging and mandatory UTF-8 and
realized that this is the only way to make things work with the stupid
machines we have now :^)
Chris

> -----Original Message-----
> From:	Ned Freed [SMTP:Ned.Freed@innosoft.com]
> Sent:	Sunday, August 31, 1997 1:07 PM
> To:	ietf-charsets@innosoft.com
> Subject:	Re: Charset policy - Post Munich
> 
> > >    3.1.  What charset to use
> > >
> > >    All protocols MUST identify, for all character data, which charset
> > >    is in use.
> > >
> > >    Protocols MUST be able to use the UTF-8 charset, which consists of
> > >    the ISO 10646 coded character set combined with the UTF-8
> > >    character encoding scheme, as defined in [10646] Annex R
> > >    (published in Amendment 2), for all text.
> > >
> > >    They MAY specify how to use other charsets or other character
> > >    encoding schemes for ISO 10646, such as UTF-16, but lack of an
> > >    ability to use UTF-8 needs clear and solid justification in the
> > >    protocol specification document before being entered into or
> > >    advanced upon the standards track.
> 
> > The above two paragraphs contradict each other. You can't have
> > a MUST and then a MAYbe not on the same point. Either make the
> > first a SHOULD, or make a MUST for ISO 10646/Unicode, and then
> > a SHOULD for UTF-8.
> 
> I fail to see a contradiction here. A protocol must be able to handle
> UTF-8 if it handles character data. A protocol may elect to handle
> other charsets as well, possibly including one derived from other
> transformation formats of Unicode.
> 
> What I do see here is poor ordering of what is being proposed. I suggest
> that it instead say:
> 
>     Protocols MUST be able to use the UTF-8 charset, which consists of
>     the ISO 10646 coded character set combined with the UTF-8
>     character encoding scheme, as defined in [10646] Annex R
>     (published in Amendment 2), for all text. Any exceptions
>     must be fully justifiable and the justification must be given in the
>     protocol specification. A protocol which neither supports UTF-8 nor
>     justifies its use of some other charset MUST NOT be entered on the
>     standards track.
> 
>     Protocols MAY also specify how to use other charsets or other
>     character encoding schemes for ISO 10646, such as UTF-16. As always,
>     any protocol that elects to support more than one charset MUST
>     provide a field to label which charset is being used.
> 
> In any case, I'm somewhat opposed to weakening the UTF-8 support
> requirement to a SHOULD. It really is a MUST and needs to be stated as
> such. We can always make exceptions to a MUST on a case by case basis if
> need be. I doubt very much that there will be that many of them.
> 
> > >    In most cases, machines cannot deduce the language of a
> > >    transmitted text by themselves;
> 
> > This is not true. There is enough evidence that for any given
> > set of languages, it is possible to devise or generate software
> > that identifies the language with accuracy converging to 100%
> > as the length of the text increases, and as the amount of
> > effort (e.g. table/dictionary size,...) increases. And once
> > this effort is done, the gap between what humans can find out
> > and what machines can find out is small.
> 
> Everything you say may be true, but it doesn't disprove Harald's
> statement. Yes, you may be able to build a machine that deduces language
> with precision approaching 100% as the amount of text increases.
> However, you have not demonstrated that:
> 
> (0) Enough text is always going to be available to make this possible.
> (1) The 100% point is actually reached. (Convergence to 100% is not
>     the same thing, and in some cases 100% is the only acceptable answer.)
> (2) The set of languages we use is always closed.
> (3) Machines in the real world will universally be retrofitted to have
>     this capability.
> 
> (0), (2), and (3) are in fact demonstrably false. As such, I claim
> Harald's statement, which you should note didn't say that machine
> recognition isn't possible, but only that most machines aren't capable
> of it right now, is correct.
> 
> Moreover, the point here, that machine recognition of the language being
> used cannot be relied upon, is a damned important one that should not be
> left out. I really want to forestall finding domain-->language tag tables
> in some product somewhere. As such, to forestall further argument I
> suggest that the paragraph be reworded to say that at the present time
> most machines lack the facilities to deduce language from content.
> 
> >    Please note that language information as such is not needed
> >    for the end user; humans have no problem identifying the
> >    languages they know and separating them from those they
> >    don't know.
> 
> This point, on the other hand, is demonstrably false since I have a
> specific counterexample of my own to offer. I routinely deal with
> customers in over 50 countries, quite a few of which either use multiple
> languages or else don't have domain names that let me deduce country and
> hence probable language. And I occasionally receive messages from these
> places written in a language other than English, French or Spanish, hence
> outside my admittedly limited linguistic skills and the limited
> dictionary set I keep handy.
> 
> And moreover, I sometimes cannot figure out what language is being used.
> (A lot of the ones I get look like German to me but aren't. Hey, what can
> I say, my education in this regard was terrible.) And this actually
> matters to me, since depending on the language I'll take the message to
> different people in the office or else forward it to various people I
> know for translation. A language tag would certainly help me in these
> cases, although I regret to say that such tags are rarely used in
> practice.
> 
> The basic problem here is that you're assuming communication is between
> people who know each other. This is usually but not always the case, and
> when it isn't these tags may actually be useful to a human reader.
> 
> I therefore do not support the addition of this text, as it will
> inevitably lead to cases where language tags will be omitted when they
> could have been useful.
> 
> >    Please note that languages are not as clearcut a concept as
> >    character sets. There are mixtures of languages, language
> >    variants, words that move from one language to another,
> >    and text parts that are not in any particular language.
> 
> This is a good point and one that needs to be made.
> 
> > >    4.2.  Requirement for language tagging
> > >
> > >    Protocols that transfer text MUST provide for carrying information
> > >    about the language of that text.
> 
> > This is most probably too strong.
> 
> > What about:
> 
> > Protocols that transfer text MUST provide for carrying language
> > information to the extent and in the granularity that this is
> > necessary and appropriate for the operations that the text in
> > the protocol is generally intended and used for.
> 
> This, on the other hand, is too wishy-washy. We need these tags and we
> need for them to be used a lot more than they currently are. What we do
> not need is to have lots of debates about whether or not a given protocol
> needs such a field. It is far better to have fields we end up not using
> than to need fields we do not have.
> 
> > >    Protocols SHOULD also provide for carrying information about the
> > >    language of names.
> 
> > Do you seriously want to suggest that we devise some kind of
> > language-tag syntax for URLs, Email addresses, host names, and
> > so on?
> 
> Here I agree that the present document goes too far. Name languages
> are _incredibly_ tricky stuff -- if you think words move around a bit,
> you should see, say, the Korean-American phone book for the greater LA
> area!
> 
> I think this needs to be dropped entirely.
> 
> > >    4.3.  How to identify a language
> > >
> > >    The RFC 1766 language tag is at the moment the most flexible tool
> > >    available for identifying a language; protocols SHOULD use this,
> > >    or provide clear and solid justification for doing otherwise in
> > >    the document.
> > >
> > >    In particular, claiming that a language can be deduced from the
> > >    charset in use is erroneous and will not be accepted.
> 
> > Correct. But isn't this all too obvious, given things like
> > iso-8859-1? I don't think you need this in any way to be able
> > to reject such claims should they ever come up.
> 
> Well, it may be true that everyone knows you cannot deduce language
> from iso-8859-1. But what about iso-2022-jp?
> 
> The point here is that claims of a _limited_ ability to deduce language
> from _some_ charsets have in fact been made, and we need language that
> says such claims are unacceptable no matter what.
> 
> > >    4.4.  Considerations for negotiation
> 
> > Please say "language negotiation".
> 
> Agreed.
> 
> > >    Protocols where users have text presented to them in response to
> > >    user actions MUST provide for multiple languages.
> 
> > This is too sweeping. Some people could think that it means that
> > a protocol must provide at least two languages, or that every
> > implementation has to provide multiple languages.
> 
> > Please say something like:
> 
> >    Protocols where users have text presented to them in response
> >    to user actions MUST provide the means by which implementors
> >    can satisfy the language needs of the users.
> 
> I have no problem with this.
> 
> > >    In some cases, a negotiation where the client proposes a set of
> > >    languages and the server replies with one is appropriate; in other
> > >    cases, supplying information in all available languages is a
> > >    better solution; most sites will either have very few languages
> > >    installed or be willing to pay the overhead of sending error
> > >    messages in many languages at once.
> 
> > I don't agree. There may be only few sites that have many
> > languages available, but those may be contacted by users
> > with special language needs that can't afford the bandwidth
> > (even if the server side providing these many languages has
> > no problem with the bandwidth).
> 
> So what? Harald didn't say that implementations have to provide
> responses in multiple languages, merely that providing responses in
> multiple languages is a viable approach. And it is viable -- indeed, I
> have customers that require it.
> 
> > Also, there is an increasing tendency for products to ship
> > with all language versions integrated. For a NS or MS server,
> > you won't buy a specific language version anymore very soon
> > in the future.
> 
> I fail to see the point here.
> 
> > >    Negotiation is useful in the case where one side of the protocol
> > >    exchange is able to present text in multiple languages to the
> > >    other side, and the other side has a preference for one of these;
> > >    the most common example is the text part of error responses, or
> > >    Web pages that are available in multiple languages.
> 
> > The "one side is able" is somewhat dangerous here. A WG may
> > just come and tell you: Our servers all just do English,
> > they are not able to do anything else, so this doesn't apply.
> 
> The reality is that implementations are going to do this whether we
> like it or not. We can require what we like of implementations in
> terms of support of multiple languages and we'll just be ignored.
> 
> In other words, there's a real danger here, but it isn't something we
> can do much of anything about, and as such this clause is almost
> entirely harmless.
> 
> > >    4.5.  Default Language
> 
> > >    When human-readable text must be presented in a context where the
> > >    sender has no knowledge of the recipient's language preferences
> > >    (such as login failures or E-mailed warnings, or prior to language
> > >    negotiation), text SHOULD be presented in Default Language.
> 
> > >    The Default Language is English, since this is the language which
> > >    most people will be able to get adequate help in interpreting when
> > >    working with computers.
> 
> > It may be a good idea to replace "most people" by "the greatest number
> > of people". This is a sensitive spot, and "most people" is saying
> > something about their absolute percentage, whereas we just need to
> > say that it is better than any other language we could pick.
> 
> Agreed.
> 
> > >    Note that negotiating English is NOT the same as Default Language;
> > >    Default Language is an emergency measure in otherwise unmanageable
> > >    situations. It may be appropriate for application designers to
> > >    make sure that messages in Default Language are understandable to
> > >    people with a limited understanding of the English language.
> 
> > The following is implicit here, but has led to prolonged discussions
> > on some lists:
> 
> > What I think the text above says is that it's not permitted to
> > say: "If the client doesn't negotiate language, this defaults to
> > English (or whatever other "default" language)."
> 
> > If this is the case, it would be better to explicitly state:
> 
> >    Protocols MUST NOT define a default language to avoid language
> >    negotiation; language MUST be explicitly negotiated for all
> >    languages.
> 
> > I think it's better to make this clear, if this is what is desired,
> > and something else otherwise, than to have more such discussions.
> 
> Agreed.
> 
> > >    5.  Locale
> 
> > >    In some cases, and especially with text where the user is expected
> > >    to do processing on the text, locale information may be usefully
> > >    attached to the text; this would identify the sender's opinion
> > >    about appropriate rules to follow when processing the document,
> > >    which the recipient may choose to agree with or ignore.
> > >
> > >    This document does not require the communication of locale
> > >    information on all text, but encourages its inclusion when
> > >    appropriate.
> 
> > The above is not very clearcut, but there is probably nothing
> > better in sight.
> 
> Agreed.
> 
> > Please add something like the following:
> 
> >    6. Documentation
> 
> >    Protocols MUST appropriately document the decisions they have
> >    taken with respect to charsets, language information, and other
> >    aspects related to internationalization and multilinguality.
> >    A format such as that currently used for Security Issues is
> >    (highly) recommended.
> 
> I would add that they must document their rationale as well as the
> decisions.
> 
> > Another thing, which should probably go into section 2 or so,
> > and which seems needed as a response to some of the questions
> > in the plenary in Munich, is a clarification of which protocol
> > in a protocol stack is responsible for charset and language
> > information. I'm not sure that I have found the best way
> > to express this, but it could read as follows:
> 
> >    Note that in a protocol stack, it is the responsibility of
> >    the highest layer that uses the text to appropriately label
> >    it. As an example, it is the responsibility of the standard
> >    for mail messages to assure things get correctly labeled in
> >    mail messages, even if those are sent over SMTP. It is the
> >    responsibility of SMTP to correctly label text which is
> >    exchanged as part of the SMTP protocol and is intended for
> >    end-user consumption, even if SMTP is run over TCP/IP.
> >    It would be the responsibility of IP to label text correctly
> >    if it ever would consider using text in its protocol elements
> >    (as opposed to transporting text in its payload).
> 
> I agree that this is an important point. I also think this is as 
> good an attempt as I've seen to describe the requirements in this
> area.
> 
> 				Ned
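
As a rough illustration of the labelling discussed above (charset
identified explicitly, UTF-8 supported, and an RFC 1766 language tag
carried with the text), here is a minimal Python sketch using the
standard library's email package. The function name
build_labelled_message and the sample text are made up for this example
and do not come from the draft; MIME mail is only one possible carrier
for such labels, and other protocols would need their own fields.

```python
from email.message import EmailMessage


def build_labelled_message(body: str, language_tag: str) -> EmailMessage:
    """Build a message whose text carries explicit charset and language labels.

    Hypothetical helper for illustration only.
    """
    msg = EmailMessage()
    # The charset label goes into Content-Type; the body is encoded as UTF-8.
    msg.set_content(body, charset="utf-8")
    # The language label uses an RFC 1766 style tag such as "de" or "en-US".
    msg["Content-Language"] = language_tag
    return msg


msg = build_labelled_message("Grüße aus München", "de")
print(msg.get_content_charset())   # charset label read back from Content-Type
print(msg["Content-Language"])     # language label read back from the header
```

The receiver does not have to guess either property from the bytes or
from the sender's domain; both are stated next to the text they
describe.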
