Re: Suggested character set policy for the IETF

On Mon, 23 Jun 1997 Harald.T.Alvestrand@uninett.no wrote:

>                IETF Policy on Character Sets and Languages
> 
>                          Harald Tveit Alvestrand
>                                  UNINETT
>                       Harald.T.Alvestrand@uninett.no

This is extremely good and valuable work. A few comments below.


>     1.  Introduction
> 
>     The Internet is international.
> 
>     With the international Internet follows an absolute requirement to
>     interchange data in a multiplicity of languages, which in turn
>     utilize a bewildering number of characters or other character-like
>     representation mechanisms.

What do you mean by "other character-like representation mechanisms"?


>     This document is (INTENDED TO BE) the current policies being
>     applied by the Internet Engineering Steering Group towards the
>     standardization efforts in the Internet Engineering Task Force in
>     order to help Internet protocols fulfil these requirements.


>     2.  Where to do internationalization
> 
>     Internationalization is for humans. This means that protocols are
>     not subject to internationalization; text strings are. Where
>     protocols may masquerade as text strings, such as in many IETF
>     application layer protocols, protocols MUST specify which parts
>     are protocol and which are text. [WR 2.2.1.1]
> 
>     Names are a problem, because people feel strongly about them, many
>     of them are mostly for local usage, and all of them tend to leak
>     out of the local context at times. RFC 1958 [ARCH] recommends US-
>     ASCII for all globally visible names.
> 
>     This document does not mandate a policy on name
>     internationalization, but requires that all protocols describe
>     whether names are internationalized or US-ASCII.

I think that by names, you mean things such as filenames, mail addresses,
and so on. It would be good to make this clear.


>     3.  Character sets
>
>     For a definition of the term "character set", refer to the
>     workshop report. Like MIME, this document uses it to mean the
>     combination of a coded character set and a character encoding
>     scheme.

I think the title should be something like "character encoding".
The MIME terminology is unfortunate. It was created with only
a limited view on character encoding, and is confusing and
inappropriate. In particular, it is strange to say that a
"character set" is the combination of a coded character set
and a character encoding scheme. In the term "coded character set",
the word "coded" expresses the fact that something is added to a
character set, namely a unique identifying number (or bit
combination) assigned to each character in the set. Most people
associate the word "set" more or less closely with its mathematical
meaning, and it is used that way in the term "coded character set",
but it is totally misleading in the term "character set" as used
in MIME.

It is very clear that the parameter name "charset" should not
be changed at all. But there is absolutely no justification for
keeping a misleading term such as "character set"; there is no
requirement for terminology to be backwards compatible with
a particular IETF protocol, and there are other IETF documents
(e.g. HTML i18n, MHML) that use much more understandable and
less misleading terminology. Just using "Charset" instead of
"character set" would already be a great improvement.


>     3.1.  What character set to use
> 
>     All protocols MUST identify, for all character data, which
>     character set is in use.
> 
>     Protocols MUST be able to use the ISO 10646 coded character set,
>     with the UTF-8 character encoding scheme, for all text. (This is
>     called "UTF-8" in the rest of this document)
> 
>     They MAY specify how to use other character sets or other
>     character encoding schemes, such as UTF-16, but lack of an ability
>     to use UTF-8 needs clear and solid justification in the protocol
>     specification document before being entered into or advanced upon
>     the standards track.

Saying that protocols MUST be able to use 10646 with UTF-8, and then
saying that they may lack that ability given solid justification, is
a contradiction. What we probably want to say is:

Protocols MUST be able to use 10646.
Protocols SHOULD use UTF-8 to transport 10646.

>     For existing protocols or protocols that move data from existing
>     datastores, support of other character sets, or even using a
>     default other than UTF-8, may be a requirement. This is
>     acceptable, but UTF-8 support MUST be possible.

In the future, there may be good reasons for some protocols to move
to, e.g., UTF-16 or something else. I think it is a very good idea to
standardize the IETF on UTF-8; it has enormous advantages. On the
other hand, setting this in stone is neither necessary nor
desirable.

The main opposition to UTF-8 may come from users of scripts for
which UTF-8 is not very compact (CJK first, but then in particular
the scripts of India and South East Asia).
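
To make the compactness point concrete, here is a minimal Python
sketch that compares encoded sizes. The sample strings are arbitrary
illustrative choices, not taken from the draft, and the helper name
encoded_sizes is hypothetical. UTF-16 is measured without a byte
order mark.

```python
# Compare encoded sizes of short sample strings in UTF-8 and UTF-16.
# The sample strings are illustrative assumptions, not from the draft.
samples = {
    "English": "text",
    "Japanese": "日本語",
    "Hindi": "हिन्दी",
    "Thai": "ไทย",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {utf8} bytes in UTF-8, {utf16} bytes in UTF-16")
```

Characters in the range U+0800 to U+FFFF take three bytes in UTF-8
but two in UTF-16, which is where the size difference for these
scripts comes from; US-ASCII text goes the other way.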



>     3.2.  How to decide a character set
> 
>     In some cases, like HTTP, there is direct or semi-direct
>     communication between the producer and the consumer of a character
>     set. In this case, it may make sense to negotiate a character set
>     before sending data.

"In this case" seems to refer to HTTP. The text "may make sense"
sounds as if character set negotiation in HTTP doesn't really make
that much sense. It should probably be "For such cases, it may make
sense...".


>     Note that a character set is an absolute; for almost all languages
>     but English and a few other Latin-based scripts, text cannot be
>     rendered comprehensibly without supporting the right character
>     set.

Please remove the "for almost ...". Even English text is not
comprehensible if you don't know whether it is in EBCDIC or ASCII.
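
A small Python sketch of the EBCDIC case, assuming code page 037
(cp037) as the EBCDIC variant; that choice is only an example, since
EBCDIC exists in many code pages.

```python
# The same five bytes read under different charset assumptions.
# cp037 is one common EBCDIC code page; picking it is an assumption.
data = "Hello".encode("cp037")

as_ebcdic = data.decode("cp037")       # correct label: readable text
as_latin1 = data.decode("iso-8859-1")  # wrong label: decodes, but to garbage

print(repr(as_ebcdic))
print(repr(as_latin1))

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # EBCDIC letters sit above 0x7F, so they are not valid US-ASCII at all.
    print("not valid US-ASCII:", exc.reason)
```

Without a label, the receiver cannot tell which of these readings
was intended, even though the text is plain English.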



>     4.  Languages
> 
> 
>     4.1.  The need for language information
> 
>     All human-readable text has a language.
> 
>     Many operations, including high quality formatting, text-to-speech
>     synthesis, searching, sorting, spellchecking and so on need access
>     to information about the language of a piece of text. [WC
>     3.1.1.4].

Can we please eliminate sorting? I don't know why this gets repeated
like a virus. Adding hyphenation may be a good idea. Even if it's
subsumed under "high quality formatting", it's something that is
very easy to understand.

As for spellchecking, I wonder how appropriate it is here. Of course
language has to be considered when spellchecking, but spellchecking
is usually done where the text originates. Or maybe somebody wants
to build an SSCP (simple spell checking protocol)?



>     Humans have some tolerance for foreign languages, but are
>     generally dissatisfied with being presented text in a language
>     they do not understand; this is why negotiation of language is
>     needed.

"generally dissatisfied" is not the most important point. Presenting
text in a language that is not understood is like presenting no
information, i.e. it is useless. The degree of satisfaction this
creates is rather secondary.


>     (Some items, like domain names and other names, may in some cases
>     be very useful without this information.)

This should be strengthened. Domain names and such are not only
very useful without language information; adding language information
to domain names would in fact completely break their functionality.



>     The interaction between language and processing is complex; for
>     instance, if I compare "hosta(lang=en)" to "hosta(lang=no)" I will
>     generally expect a match, while "aasmund" sorts after "attaboy"
>     according to Norwegian rules, but before it using English rules.
>     (the "aa" is sorted together with "latin letter a with ring
>     above", which is at the end of the Norwegian alphabet).

Again here, sorting is a non-starter. Nobody would like to see
something like

aasmund(lang=en)
attaboy
aasmund(lang=no)

If you want examples of complex interaction, I can provide a few,
in particular from the domain of high-quality rendering.


>     4.2.  How to identify a language
> 
>     The RFC 1766 language tag is at the moment the most flexible tool
>     available for identifying a language; protocols SHOULD use this,
>     or provide clear and solid justification for doing otherwise in
>     the document.

I think it would be good to add a little paragraph about hierarchical
tags and the use of these hierarchies in protocols. Something like

RFC 1766 language tags are hierarchical; in some cases, this hierarchy
can be used beneficially. Protocols SHOULD specify how they deal
with language tag hierarchies.
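
One possible way for a protocol to use the hierarchy is prefix
matching, where a requested tag such as "en" also matches more
specific tags such as "en-US". The Python sketch below illustrates
that rule only; the function name tag_matches is hypothetical, and
nothing in RFC 1766 or the draft mandates this particular rule.

```python
def tag_matches(language_range, tag):
    """Return True if tag equals language_range or is more specific.

    Matching is case-insensitive and works on whole subtags, so "en"
    matches "en-US" but not a longer primary tag that merely starts
    with the letters "en".
    """
    language_range = language_range.lower()
    tag = tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")


for wanted, offered in [("en", "en-US"), ("en-US", "en"), ("no", "no")]:
    print(wanted, offered, tag_matches(wanted, offered))
```

Note that the rule is asymmetric: a request for "en-US" is not
satisfied by plain "en" here. Whether that is the right behaviour is
exactly the kind of thing a protocol should specify.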


>     4.3.  Considerations for negotiation
> 
>     Protocols that transfer human-readable text MUST provide for
>     multiple languages.

See Ken's comment on how this should be changed.


>     In some cases, a negotiation where the client proposes a set of
>     languages and the server replies with one is appropriate; in other
>     cases, supplying information in all available languages is a
>     better solution; most sites will either have very few languages
>     installed or be willing to pay the overhead of sending error
>     messages in many languages at once.

I think this is going the wrong way. It is indeed true that many
sites will have very few languages. But it is definitely not true
that they are willing to pay the overhead, and in particular not
true that the recipient is willing to pay the overhead.
Punishing those sites that go to great lengths to add many languages
is the wrong approach; we should make it easy to add languages.
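
For comparison, the negotiated alternative in the quoted paragraph
sends a single language no matter how many the site has installed.
A minimal Python sketch, assuming the client sends an ordered list
of preferred tags, that prefix matching on subtags is acceptable,
and that the fallback is the draft's Default Language; the function
name negotiate and its parameters are hypothetical.

```python
def negotiate(preferred, available, default="en"):
    """Pick one available language tag for an ordered preference list.

    preferred: tags proposed by the client, most wanted first.
    available: tags the server has text for.
    default:   returned when nothing matches (assumed Default Language).
    """
    for wanted in preferred:
        wanted = wanted.lower()
        for tag in available:
            candidate = tag.lower()
            # Exact match, or the available tag is a more specific subtag.
            if candidate == wanted or candidate.startswith(wanted + "-"):
                return tag
    return default


print(negotiate(["no", "de"], ["en", "de-CH", "fr"]))
```

The cost on the wire is one tag list from the client and one message
in one language from the server, so adding a tenth or hundredth
language to a site does not make any single reply larger.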



>     4.4.  Default Language
> 
>     When human-readable text must be presented in a context where the
>     sender has no knowledge of the recipient's language preferences
>     (such as login failures or E-mailed warnings, or prior to language
>     negotiation), text SHOULD be presented in Default Language.
> 
>     The Default Language is English, since this is the language which
>     most people will be able to get adequate help in interpreting when
>     working with computers.
> 
>     Note that negotiating English is NOT the same as Default Language;
>     Default Language is an emergency measure in otherwise unmanageable
>     situations.

Very good! This makes English the default where really needed, but
not more.


>     5.  Locale
> 
>     POSIX defines a concept called a "locale", which includes a lot of
>     information about collating order, date format, currency format
>     and so on.
> 
>     In some cases, and especially with text where the user is expected
>     to do processing on the text, locale information may be usefully
>     attached to the text.
> 
>     This document does not require the communication of locale
>     information on all text, but encourages its inclusion when
>     appropriate.
> 
>     Note that the language and character set will often be present as
>     parts of a locale tag (such as no_NO.iso-8859-1; the language is
>     before the _ and the character set is after the dot); care must be
>     taken to define precisely which specification of character set and
>     language applies to any one text item.
> 
>     The default locale is the POSIX locale.

I am really not sure about this. Please see my recent discussion
with Keld. We should probably work through a few more examples to
get a better understanding.
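
As a starting point for such examples, here is a minimal Python
sketch that splits a locale tag of the form quoted above
(no_NO.iso-8859-1) into its parts. The function name
split_locale_tag is hypothetical, and the sketch assumes the simple
"language_TERRITORY.charset" shape only; it does not handle
modifiers or other vendor-specific extensions.

```python
def split_locale_tag(tag):
    """Split 'language_TERRITORY.charset' into (language, territory, charset).

    Missing parts are returned as None. Only the simple shape shown in
    the draft is handled; this is an assumption, not a full parser.
    """
    base, _, charset = tag.partition(".")
    language, _, territory = base.partition("_")
    return language, territory or None, charset or None


print(split_locale_tag("no_NO.iso-8859-1"))
print(split_locale_tag("POSIX"))
```

The sketch shows where the ambiguity arises: a text item may carry a
charset label and a language tag of its own in addition to the ones
embedded in the locale tag, and the protocol has to say which wins.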



>     6.  Security considerations
> 
>     Apart from the fact that security warnings in a foreign language
>     may cause inappropriate behaviour from the user, and the fact that
>     multilingual systems usually have problems with consistency
>     between language variants, no security considerations relevant
>     have been identified.

This brings me to a very important point:

I think this document should specify that each (application level
or otherwise affected) IETF document should contain a section on
internationalization and multilingual issues, in the same way it
currently contains a section on security considerations.


In conclusion: Great work, hope it can go into production really soon.


Regards,	Martin.



Received on Thursday, 26 June 1997 03:32:00 UTC