- From: Chris Weider <cweider@microsoft.com>
- Date: Mon, 01 Sep 1997 15:05:38 -0700
- To: 'Ned Freed' <Ned.Freed@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM
I think Ned is completely correct here. The workshop report thought long
and hard about requiring language tagging and mandatory UTF-8 and
realized that this is the only way to make things work with the stupid
machines we have now :^)

Chris

> -----Original Message-----
> From: Ned Freed [SMTP:Ned.Freed@innosoft.com]
> Sent: Sunday, August 31, 1997 1:07 PM
> To: ietf-charsets@innosoft.com
> Subject: Re: Charset policy - Post Munich
>
> > > 3.1. What charset to use
> > >
> > > All protocols MUST identify, for all character data, which charset
> > > is in use.
> > >
> > > Protocols MUST be able to use the UTF-8 charset, which consists of
> > > the ISO 10646 coded character set combined with the UTF-8
> > > character encoding scheme, as defined in [10646] Annex R
> > > (published in Amendment 2), for all text.
> > >
> > > They MAY specify how to use other charsets or other character
> > > encoding schemes for ISO 10646, such as UTF-16, but lack of an
> > > ability to use UTF-8 needs clear and solid justification in the
> > > protocol specification document before being entered into or
> > > advanced upon the standards track.
>
> > The above two paragraphs contradict each other. You can't have
> > a MUST and then a MAYbe not on the same point. Either make the
> > first a SHOULD, or make a MUST for ISO 10646/Unicode, and then
> > a SHOULD for UTF-8.
>
> I fail to see a contradiction here. A protocol must be able to handle
> UTF-8 if it handles character data. A protocol may elect to handle
> other charsets as well, possibly including one derived from other
> transformation formats of Unicode.
>
> What I do see here is poor ordering of what is being proposed. I
> suggest that it instead say:
>
>    Protocols MUST be able to use the UTF-8 charset, which consists of
>    the ISO 10646 coded character set combined with the UTF-8
>    character encoding scheme, as defined in [10646] Annex R
>    (published in Amendment 2), for all text. Any exceptions must be
>    fully justifiable and the justification must be given in the
>    protocol specification. A protocol which neither supports UTF-8
>    nor justifies its use of some other charset MUST NOT be entered on
>    the standards track.
>
>    Protocols MAY also specify how to use other charsets or other
>    character encoding schemes for ISO 10646, such as UTF-16. As
>    always, any protocol that elects to support more than one charset
>    MUST provide a field to label which charset is being used.
>
> In any case, I'm somewhat opposed to weakening the UTF-8 support
> requirement to a SHOULD. It really is a MUST and needs to be stated as
> such. We can always make exceptions to a MUST on a case by case basis
> if need be. I doubt very much that there will be that many of them.
>
> > > In most cases, machines cannot deduce the language of a
> > > transmitted text by themselves;
>
> > This is not true. There is enough evidence that for any given
> > set of languages, it is possible to devise or generate software
> > that identifies the language with accuracy converging to 100%
> > as the length of the text increases, and as the amount of
> > effort (e.g. table/dictionary size,...) increases. And once
> > this effort is done, the gap between what humans can find out
> > and what machines can find out is small.
>
> Everything you say may be true, but it doesn't disprove Harald's
> statement. Yes, you may be able to build a machine that deduces
> language with precision approaching 100% as the amount of text
> increases. However, you have not demonstrated that:
>
> (0) Enough text is always going to be available to make this possible.
> (1) That the 100% point is actually reached. (Convergence to 100% is
>     not the same thing, and in some cases 100% is the only acceptable
>     answer.)
> (2) That the set of languages we use is always closed.
> (3) Machines in the real world will universally be retrofitted to have
>     this capability.
>
> (0), (2), and (3) are in fact demonstrably false. As such, I claim
> Harald's statement, which you should note didn't say that machine
> recognition isn't possible, but only that most machines aren't capable
> of it right now, is correct.
>
> Moreover, the point here, that machine recognition of the language
> being used cannot be relied upon, is a damned important one that
> should not be left out. I really want to forestall finding
> domain-->language tag tables in some product somewhere. As such, to
> forestall further argument I suggest that the paragraph be reworded to
> say that at the present time most machines lack the facilities to
> deduce language from content.
>
> > Please note that language information as such is not needed
> > for the end user; humans have no problem identifying the
> > languages they know and separating them from those they
> > don't know.
>
> This point, on the other hand, is demonstrably false since I have a
> specific counterexample of my own to offer. I routinely deal with
> customers in over 50 countries, quite a few of which either use
> multiple languages or else don't have domain names that let me deduce
> country and hence probable language. And I occasionally receive
> messages from these places written in a language other than English,
> French or Spanish, hence outside my admittedly limited linguistic
> skills and the limited dictionary set I keep handy.
>
> And moreover, I sometimes cannot figure out what language is being
> used. (A lot of the ones I get look like German to me but aren't. Hey,
> what can I say, my education in this regard was terrible.) And this
> actually matters to me, since depending on the language I'll take the
> message to different people in the office or else forward it to
> various people I know for translation. A language tag would certainly
> help me in these cases, although I regret to say that such tags are
> rarely used in practice.
>
> The basic problem here is that you're assuming communication is
> between people who know each other. This is usually but not always the
> case, and when it isn't these tags may actually be useful to a human
> reader.
>
> I therefore do not support the addition of this text, as it will
> inevitably lead to cases where language tags will be omitted when they
> could have been useful.
>
> > Please note that languages are not as clearcut a concept as
> > character sets. There are mixtures of languages, language
> > variants, words that move from one language to another,
> > and text parts that are not in any particular language.
>
> This is a good point and one that needs to be made.
>
> > > 4.2. Requirement for language tagging
> > >
> > > Protocols that transfer text MUST provide for carrying information
> > > about the language of that text.
>
> > This is most probably too strong.
>
> > What about:
>
> >    Protocols that transfer text MUST provide for carrying language
> >    information to the extent and in the granularity that this is
> >    necessary and appropriate for the operations that the text in
> >    the protocol is generally intended and used for.
>
> This, on the other hand, is too wishy-washy. We need these tags and we
> need for them to be used a lot more than they currently are. What we
> do not need is to have lots of debates about whether or not a given
> protocol needs such a field. It is far better to have fields we end up
> not using than to need fields we do not have.
>
> > > Protocols SHOULD also provide for carrying information about the
> > > language of names.
>
> > Do you seriously want to suggest that we devise some kind of
> > language-tag syntax for URLs, Email addresses, host names, and
> > so on?
>
> Here I agree that the present document goes too far. Name languages
> are _incredibly_ tricky stuff -- if you think words move around a bit,
> you should see, say, the Korean-American phone book for the greater LA
> area!
>
> I think this needs to be dropped entirely.
>
> > > 4.3. How to identify a language
> > >
> > > The RFC 1766 language tag is at the moment the most flexible tool
> > > available for identifying a language; protocols SHOULD use this,
> > > or provide clear and solid justification for doing otherwise in
> > > the document.
> > >
> > > In particular, claiming that a language can be deduced from the
> > > charset in use is erroneous and will not be accepted.
>
> > Correct. But isn't this all too obvious, given things like
> > iso-8859-1? I don't think you need this in any way to be able
> > to reject such claims should they ever come up.
>
> Well, it may be true that everyone knows you cannot deduce language
> from iso-8859-1. But what about iso-2022-jp?
>
> The point here is that claims of a _limited_ ability to deduce
> language from _some_ charsets have in fact been made, and we need
> language that says such claims are unacceptable no matter what.
>
> > > 4.4. Considerations for negotiation
>
> > Please say "language negotiation".
>
> Agreed.
>
> > > Protocols where users have text presented to them in response to
> > > user actions MUST provide for multiple languages.
>
> > This is too sweeping. Some people could think that it means that
> > a protocol must provide at least two languages, or that every
> > implementation has to provide multiple languages.
>
> > Please say something like:
>
> >    Protocols where users have text presented to them in response
> >    to user actions MUST provide the means by which implementors
> >    can satisfy the language needs of the users.
>
> I have no problem with this.
>
> > > In some cases, a negotiation where the client proposes a set of
> > > languages and the server replies with one is appropriate; in other
> > > cases, supplying information in all available languages is a
> > > better solution; most sites will either have very few languages
> > > installed or be willing to pay the overhead of sending error
> > > messages in many languages at once.
>
> > I don't agree. There may be only few sites that have many
> > languages available, but those may be contacted by users
> > with special language needs that can't afford the bandwidth
> > (even if the server side providing these many languages has
> > no problem with the bandwidth).
>
> So what? Harald didn't say that implementations have to provide
> responses in multiple languages, merely that providing responses in
> multiple languages is a viable approach. And it is viable -- indeed, I
> have customers that require it.
>
> > Also, there is an increasing tendency for products to ship
> > with all language versions integrated. For a NS or MS server,
> > you won't buy a specific language version anymore very soon
> > in the future.
>
> I fail to see the point here.
>
> > > Negotiation is useful in the case where one side of the protocol
> > > exchange is able to present text in multiple languages to the
> > > other side, and the other side has a preference for one of these;
> > > the most common example is the text part of error responses, or
> > > Web pages that are available in multiple languages.
>
> > The "one side is able" is somewhat dangerous here. A WG may
> > just come and tell you: Our servers all just do English,
> > they are not able to do anything else, so this doesn't apply.
>
> The reality is that implementations are going to do this whether we
> like it or not. We can require what we like of implementations in
> terms of support of multiple languages and we'll just be ignored.
>
> In other words, there's a real danger here, but it isn't something we
> can do much of anything about, and as such this clause is almost
> entirely harmless.
>
> > > 4.5. Default Language
> > >
> > > When human-readable text must be presented in a context where the
> > > sender has no knowledge of the recipient's language preferences
> > > (such as login failures or E-mailed warnings, or prior to language
> > > negotiation), text SHOULD be presented in Default Language.
> > >
> > > The Default Language is English, since this is the language which
> > > most people will be able to get adequate help in interpreting when
> > > working with computers.
>
> > It may be a good idea to replace "most people" by "the greatest
> > number of people". This is a sensitive spot, and "most people" is
> > saying something about their absolute percentage, whereas we just
> > need to say that it is better than any other language we could pick.
>
> Agreed.
>
> > > Note that negotiating English is NOT the same as Default Language;
> > > Default Language is an emergency measure in otherwise unmanageable
> > > situations. It may be appropriate for application designers to
> > > make sure that messages in Default Language are understandable to
> > > people with a limited understanding of the English language.
>
> > The following is implicit here, but has led to prolonged discussions
> > on some lists:
>
> > What I think the text above says is that it's not permitted to
> > say: "If the client doesn't negotiate language, this defaults to
> > English (or whatever other "default" language)."
>
> > If this is the case, it would be better to explicitly state:
>
> >    Protocols MUST NOT define a default language to avoid language
> >    negotiation; language MUST be explicitly negotiated for all
> >    languages.
>
> > I think it's better to make this clear, if this is what is desired,
> > and something else otherwise, than to have more such discussions.
>
> Agreed.
>
> > > 5. Locale
> > >
> > > In some cases, and especially with text where the user is expected
> > > to do processing on the text, locale information may be usefully
> > > attached to the text; this would identify the sender's opinion
> > > about appropriate rules to follow when processing the document,
> > > which the recipient may choose to agree with or ignore.
> > >
> > > This document does not require the communication of locale
> > > information on all text, but encourages its inclusion when
> > > appropriate.
>
> > The above is not very clearcut, but there is probably nothing
> > better in sight.
>
> Agreed.
>
> > Please add something like the following:
>
> >    6. Documentation
>
> >    Protocols MUST appropriately document the decisions they have
> >    taken with respect to charsets, language information, and other
> >    aspects related to internationalization and multilinguality.
> >    A format such as that currently used for Security Issues is
> >    (highly) recommended.
>
> I would add that they must document their rationale as well as the
> decisions.
>
> > Another thing, which should probably go into section 2 or so,
> > and which seems needed as a response to some of the questions
> > in the plenary in Munich, is a clarification of which protocol
> > in a protocol stack is responsible for charset and language
> > information. I'm not sure that I have found the best way
> > to express this, but it could read as follows:
>
> >    Note that in a protocol stack, it is the responsibility of
> >    the highest layer that uses the text to appropriately label
> >    it. As an example, it is the responsibility of the standard
> >    for mail messages to assure things get correctly labeled in
> >    mail messages, even if those are sent over SMTP. It is the
> >    responsibility of SMTP to correctly label text which is
> >    exchanged as part of the SMTP protocol and is intended for
> >    end-user consumption, even if SMTP is run over TCP/IP.
> >    It would be the responsibility of IP to label text correctly
> >    if it ever would consider using text in its protocol elements
> >    (as opposed to transporting text in its payload).
>
> I agree that this is an important point. I also think this is as
> good an attempt as I've seen to describe the requirements in this
> area.
>
> Ned
Received on Tuesday, 2 September 1997 13:02:58 UTC