Re: Charset policy - Post Munich from Francois Yergeau on 1997-08-31 (ietf-charsets@w3.org from July to September 1997)

From: Francois Yergeau <yergeau@alis.com>
Date: Sun, 31 Aug 1997 15:02:28 -0400
To: ietf-charsets@INNOSOFT.COM
Message-id: <3.0.1.32.19970831150228.00a8a2f0@genstar.alis.ca>
At 13:15 29/08/97 +0200, Harald.T.Alvestrand@uninett.no wrote:
>Please check this for consistency with previous comments and comments
>made in Munich.

I can't speak about Munich, but I do have some comments.

>NOTE: There are two more documents that should be in the same Last Call,
>IMHO:
>
>- Ned's charset registry (draft-freed-charset-reg-02.txt)
>- Francois' updated UTF-8 (draft-yergeau-utf8-rev-00.txt)

Which reminds me that this update is long overdue.  I'll send it in a
separate message.

>    This document is (INTENDED TO BE) the current policies being

Nit: "is the current policies" sounds ungrammatical to me.  What about
"specifies the current policies" or "is the current policy" ?

>    This document does not mandate a policy on name
>    internationalization, but requires that all protocols describe
>    whether names are internationalized or US-ASCII.

My first impression was that there would be endless quarrels over what is a
name and what is not, until I saw a definition in section 3.  May I suggest
moving section 3 ahead of section 2?

>    3.  Definition of Terms
>
>    This document uses the term "charset" to mean a set of rules for
>    mapping from a sequence of octets to a sequence of characters,

Same as MIME, then.  Why not simply refer to MIME?
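
To make the definition concrete, here is a minimal sketch (Python is my choice for illustration; nothing here comes from the draft). It shows the same two octets mapping to different character sequences depending on which charset's rules are applied:

```python
# Illustration only: a charset is a rule for turning octets into characters.
octets = b"\xc3\xa9"

# Under UTF-8 the two octets form one character, U+00E9 (e with acute).
as_utf8 = octets.decode("utf-8")

# Under ISO-8859-1 every octet is one character, so we get U+00C3 U+00A9.
as_latin1 = octets.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))
```

The octets alone do not say which reading is right, which is why the charset has to be identified somewhere.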

>    A "name" is an identifier such as a person's name, a hostname, a
>    domainname, a filename or an E-mail address...

"...as used with some significance in a protocol".  My name and email
address in the sig below are not names as discussed here, I guess, just
part of a mail message body.

>    3.1.  What charset to use
>
>    All protocols MUST identify, for all character data, which charset
>    is in use.
>
>    Protocols MUST be able to use the UTF-8 charset, which consists of
>    the ISO 10646 coded character set combined with the UTF-8
>    character encoding scheme, as defined in [10646] Annex R
>    (published in Amendment 2), for all text.
>
>    They MAY specify how to use other charsets or other character
>    encoding schemes for ISO 10646, such as UTF-16, but lack of an
>    ability to use UTF-8 needs clear and solid justification in the
>    protocol specification document before being entered into or
>    advanced upon the standards track.

Same remark as Martin, this is self-contradictory.  I think 10646 should be
a MUST, while UTF-8 should be a strong SHOULD, needing "clear and solid
justification" to ignore.
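
The distinction between the coded character set (10646) and the encoding scheme (UTF-8, UTF-16) can be sketched as follows. This is an illustration in Python with sample characters I picked arbitrarily, not anything taken from the draft: the same two 10646 characters produce different octets under each scheme and decode back to the same text.

```python
# Illustration only: one coded character set, two character encoding schemes.
text = "\u00e9\u4e2d"  # U+00E9 and U+4E2D, arbitrary sample characters

utf8_octets = text.encode("utf-8")       # variable length: 2 + 3 octets
utf16_octets = text.encode("utf-16-be")  # 2 octets each, big-endian, no BOM

print(utf8_octets.hex(), utf16_octets.hex())

# Either way, the characters recovered are the same 10646 characters.
same = utf8_octets.decode("utf-8") == utf16_octets.decode("utf-16-be")
```

This is why requiring 10646 and strongly recommending UTF-8 are two separate decisions.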

>    3.2.  How to decide a charset

I think some language on default charsets is needed here.  Having seen the
mess created by defaulting to Latin-1 in HTTP, I think a mandated default
of UTF-8 everywhere, both in protocol items and in content, is warranted at
the strong SHOULD level.
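
A rough sketch of why the choice of default matters, assuming a hypothetical helper `decode_text` that falls back to a default charset when none is declared (the helper and the sample string are mine, for illustration only). With a Latin-1 default, unlabeled UTF-8 decodes without complaint into the wrong characters. With a UTF-8 default, the unlabeled Latin-1 sample below is rejected, so the mislabeling is at least detectable:

```python
def decode_text(data, declared=None, default="utf-8"):
    """Hypothetical helper: use the declared charset, else the default."""
    return data.decode(declared or default)

utf8_body = "Montr\u00e9al".encode("utf-8")
latin1_body = "Montr\u00e9al".encode("iso-8859-1")

# Latin-1 default: any octet sequence is valid, so errors pass silently.
wrong = decode_text(utf8_body, default="iso-8859-1")

# UTF-8 default: this Latin-1 sample is not well-formed UTF-8 and is rejected.
try:
    decode_text(latin1_body)
    detected = False
except UnicodeDecodeError:
    detected = True

print(repr(wrong), detected)
```

Not every non-UTF-8 octet sequence fails UTF-8 decoding, but text with accented Latin-1 letters, as here, usually does.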

>    Negotiating a charset may be regarded as an interim mechanism that
>    is to be supported until UTF-8 support is prevalent; however, the
>    timeframe of "interim" may be at least 50 years,

Anyone got a handy crystal ball?  May I suggest "decades" instead of a hard
number like 50?

>    Many operations, including high quality formatting, text-to-speech
>    synthesis, searching, hyphenation, spellchecking and so on need
>    access to information about the language of a piece of text. [WC
>    3.1.1.4].

I agree with Martin that some of the listed operations do not *need*, but
would still benefit from language tagging.

>    In most cases, machines cannot deduce the language of a
>    transmitted text by themselves; the protocol must specify how to
>    transfer the language information if it is to be available at all.

If it is to be available in general.  Language can be guessed, but only if
there is enough text.

>    The interaction between language and processing is complex; for
>    instance, if I compare "name-of-thing(lang=en)" to "name-of-
>    thing(lang=no)" for equality, I will generally expect a match,
>    while the word "ask(no)" is a kind of tree, and is hardly useful
>    as a command verb.

Nit: the use of the first person (I) is quite unusual in RFCs.

>    4.5.  Default Language
>
>    When human-readable text must be presented in a context where the
>    sender has no knowledge of the recipient's language preferences
>    (such as login failures or E-mailed warnings, or prior to language
>    negotiation), text SHOULD be presented in Default Language.
>
>    The Default Language is English, since this is the language which
>    most people will be able to get adequate help in interpreting when
>    working with computers.

I disagree with this for the following reasons:

1) The justification is very weak.  There is no trace of a requirement for
mandating a single Internet-wide Default Language.

2) The spec as written prevents me (for instance) from using some other
language X as default in an Intranet application, if I am bound by contract
to obey Internet protocols on that Intranet; this holds even though I may
know that all users of that Intranet do not understand English but speak X
and/or can get adequate help in X.

3) It asks every Joe User in the world to provide his Web home page in
English, in case some client comes with no language preference settings.
Same for all gopher pages, ftp archives etc., where negotiation is not even
possible.

4) History shows us that the dominant language changes over time; English
is bound to go the way of Greek and Latin some day.

I'd rather see this whole section go away.


Regards
-- 
François Yergeau <yergeau@alis.com>
Alis Technologies inc., Montréal
Tél : +1 (514) 747-2547
Fax : +1 (514) 747-2561

Received on Monday, 1 September 1997 13:07:09 UTC