Warnings, RFC 1522, and ISO-8859-1

Dear HTTP 1.1 specialists,

As a specialist in I18N (coauthor of the HTML I18N spec),
I am extremely busy trying to look at the many Internet
drafts in the works, to help find viable and
long-lasting I18N solutions. As virtually all applications
area drafts affect I18N, this takes a lot of time (besides
my normal work!). It is therefore only lately that I have become
aware of some details in the HTTP 1.1 spec that I have difficulty
understanding and that I would propose to change.
Discussions in private, and on another list, have not given
any serious explanations for why HTTP 1.1 solves these issues
the way it does, and have suggested that I address them directly
to this list.


The areas I am concerned with are the TEXT rule and
its explanation in Section 2.2 on page 16, and the warnings in
Section 14.45 on pages 128/9 of draft-ietf-http-v11-07.txt.

The main concern is the choice of "ISO-8859-1 OR RFC 1522"
for the encoding of TEXT and warnings. I will expand on this
below.

A second point is the question of whether it is okay for
a server/proxy not to send any warning text, but only
the number, (correctly!) assuming that the client is in
a better position to decide on the language and wording of
the text. On the other hand, the draft gives the strong
impression that a warning always comes with text, so client
implementors will just try to display whatever was sent, and
the user may end up without any warning text.

A third point is the question of explicitly specifying English
as a default for warnings. Every HTTP implementor can
be expected to know enough about Internet practice
to conclude that, given no other indication of
language preference, English is the best choice anyway.
So saying "The default Language is English" is an empty
statement. On the other hand, it seems to be offensive
to some people (not me!) who worry about the dominance
of English on the Internet. It can very well be argued
that English as a default should not be written in stone.
Therefore, it might be silently removed if any other changes
to the "Warning" text are necessary.


Now back to the MAIN POINT: Can anybody explain to me why
ISO-8859-1 was chosen as a default for TEXT in headers
and warnings? Given the recommendations of the IAB
charset workshop (draft-weider-iab-char-wrkshop-00.txt),
which repeatedly mention UTF-8, this seems like a
rather antiquated choice. UTF-8, by contrast,
is extremely well suited for the purpose: It covers all the
characters of the world, is reasonably compact, and
works together smoothly with ASCII. It is clear
that 7-bit octets are reserved for ASCII; the 8th bit,
a precious resource, should be used as carefully as
possible. Using it for UTF-8 is definitely better
than using it for ISO-8859-1.

To improve the situation, I propose four variants for a
better solution. The choice of variant may depend on
various factors I am not fully knowledgeable about,
such as the installed base of servers, proxies, and
clients that support ISO-8859-1 and/or RFC 1522.
Whoever knows anything on this topic is welcome
to share his/her knowledge.


Solution 1: UTF-8 only
----------------------

Advantages: Very easy to implement on server. For those
that doubt it, I offer to transcode lists of warnings from
ISO 8859-1 (and quite a few other encodings) to UTF-8.
A file with a list of warnings in various languages
can be edited directly if it is in UTF-8, whereas this is
not possible for an RFC1522-based solution. This applies
also to the other solutions that are based on UTF-8, as the
server can choose to use any of the allowed encodings.
Easy to implement on client (does not need RFC 1522 code).
UTF-8 support will be in the major web clients next year,
and in anything that is serious about Java, anyway.
Display support for exotic scripts such as Tibetan is
not an issue, as RFC 1522 has the same problems.
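As a rough illustration of how small the transcoding step is,
here is a minimal sketch in Python. The function name and the
sample warning text are made up for this example; they are not
taken from the draft.

```python
def latin1_to_utf8(octets: bytes) -> bytes:
    """Transcode ISO-8859-1 octets to UTF-8 octets."""
    # Every octet value is a valid ISO-8859-1 code point in Python's
    # codec, so the decode step cannot fail.
    return octets.decode("iso-8859-1").encode("utf-8")


# Hypothetical warning text, stored in ISO-8859-1 on the server.
warning_latin1 = "Réponse périmée".encode("iso-8859-1")
warning_utf8 = latin1_to_utf8(warning_latin1)
```

Pure ASCII text passes through unchanged; only octets with the
8th bit set grow to two octets each.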


Solution 2: UTF-8 and RFC 1522
------------------------------

The main advantage of this variant is that RFC 1522 allows
a compact native encoding for scripts, in particular Indic
scripts and Georgian, that expand by a factor of 3 when
converted from their native encoding to UTF-8.
Otherwise, there is no good reason for keeping RFC
1522 with UTF-8 except maybe for installed base.
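The expansion can be checked with a few sample characters. This
is only a sketch: the sample names are chosen for illustration,
and it assumes that the native encodings in question use one
octet per character, which the code itself does not verify.

```python
# One sample character per script (names are illustrative only).
samples = {
    "Devanagari KA": "\u0915",
    "Georgian AN": "\u10d0",
    "Latin e-acute": "\u00e9",
}

# Number of octets each character occupies in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
```

The Indic and Georgian samples take three octets each in UTF-8,
while an accented Latin letter takes two.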


Solution 3: UTF-8 and ISO-8859-1
--------------------------------

At first sight, this may seem very dangerous and bad
design, because how should one find out whether something
is ISO-8859-1 or UTF-8? Indeed, "guessing" is needed.
But guessing is tremendously simplified, to the extent
where it is really difficult to speak about guessing,
by the following facts:

ISO-8859-1 8-bit characters can be divided into three
areas: A0-BF, C0-DF, and E0-FF. A0-BF contains all kinds
of symbols, such as 1/4, copyright, superscript 2,...
C0-DF contains upper-case accented characters, E0-FF
contains lower-case accented characters. The range
80-9F is not defined in ISO-8859-1; it is reserved for
control characters (C1), which are not used in an Internet
context. In ISO-8859-1 strings, E0-FF will be relatively
frequent, C0-DF considerably less frequent, and A0-BF even
less. Sequences of two characters with the 8th bit set are
also very rare.

In UTF-8, the range 80-FF is divided into leading
octets (L: C0-FD) and trailing octets (T: 80-BF);
FE and FF are never used.
The following sequences are legal UTF-8:
L1 (C0-DF) T
L2 (E0-EF) T T
L3 (F0-F7) T T T
L4 (F8-FB) T T T T
L5 (FC-FD) T T T T T

So to find an octet sequence that is both legal UTF-8 and
reasonable ISO-8859-1, the best chance is to find a reasonable
combination of an uppercase accented letter followed by a
special sign such as copyright. Can a warning, or any other
TEXT, reasonably be expected to contain such a combination
(and no other 8-bit characters that don't conform to UTF-8)?
Code to test an octet string for UTF-8 compliance is available
on request. The "guessing" solution was also accepted in ftp-wg
to provide a reasonable upgrade path for existing implementations
that use arbitrary unlabeled "charset"s in their filenames.
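To make the "guessing" concrete, here is a minimal sketch in
Python of a structural test based only on the table of leading
and trailing octets above. It is not the code offered on request;
the function names are invented for this example, and a stricter
check (for instance rejecting overlong forms) is deliberately
left out.

```python
def trailing_count(octet: int) -> int:
    """Trailing octets expected after a leading octet, or -1 if the
    octet cannot start a multi-octet sequence."""
    if 0xC0 <= octet <= 0xDF:
        return 1
    if 0xE0 <= octet <= 0xEF:
        return 2
    if 0xF0 <= octet <= 0xF7:
        return 3
    if 0xF8 <= octet <= 0xFB:
        return 4
    if 0xFC <= octet <= 0xFD:
        return 5
    return -1  # 80-BF is a stray trailing octet; FE/FF are unused


def looks_like_utf8(octets: bytes) -> bool:
    """True if every 8-bit octet is part of a well-formed L T... sequence."""
    i, n = 0, len(octets)
    while i < n:
        octet = octets[i]
        i += 1
        if octet < 0x80:
            continue  # plain ASCII
        need = trailing_count(octet)
        if need < 0 or i + need > n:
            return False
        for t in octets[i:i + need]:
            if not 0x80 <= t <= 0xBF:
                return False
        i += need
    return True


def guess_charset(octets: bytes) -> str:
    """Guess between US-ASCII, UTF-8 and ISO-8859-1 for unlabeled TEXT."""
    if all(o < 0x80 for o in octets):
        return "us-ascii"
    return "utf-8" if looks_like_utf8(octets) else "iso-8859-1"
```

With this sketch, typical ISO-8859-1 text such as b"caf\xe9" is
rejected as UTF-8, while the rare combination b"\xc9\xa9"
(uppercase accented E followed by the copyright sign) is the kind
of sequence that passes the test in both readings.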


Solution 4: RFC 1522 only
-------------------------

Not really the best solution, but at least fair to everybody,
and leaving the 8th bit open for the future.
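For readers who have not seen encoded-words, the following Python
sketch shows the round trip. It relies on the standard library's
email.header module, which implements RFC 2047 (the successor of
RFC 1522); the sample text is made up for this example.

```python
from email.header import Header, decode_header, make_header

# Hypothetical warning text containing ISO-8859-1 characters.
text = "Réponse périmée"

# Encode as an encoded-word; the result contains only 7-bit octets.
encoded = Header(text, charset="iso-8859-1").encode()

# Decode it again on the receiving side.
decoded = str(make_header(decode_header(encoded)))
```

The encoded form carries its own charset label, which is what
makes this variant fair to all encodings, at the cost of extra
code on both server and client.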



Some readers may argue that HTTP 1.0 already specifies
ISO-8859-1 as the default for TEXT. This is not exactly
true. HTTP 1.0 says:

   Recipients of header field TEXT containing octets outside the US-
   ASCII character set may assume that they represent ISO-8859-1
   characters.

Very obviously, this is just a suggestion, not a default. It does
not make sense to tighten this into a default in the wrong
direction. It may even be that in some places, relying on this
not-so-tight specification, implementors have used arbitrary
encodings for such fields.
It may also be worth contemplating what happens when a UTF-8
string sent out by a server happens to be displayed on a
client that is assuming that it can be nothing other than
ISO-8859-1. If the UTF-8 string is something other than a
string that could have been represented in ISO-8859-1,
then it would have been impossible to reliably send it with
HTTP 1.0 anyway. Otherwise, occasional accented characters
will appear as two octets with the 8th bit set, and display
as one or two characters in the ISO-8859-1 range. While
this is of course very unfortunate, it does not preclude
readability. It is a phenomenon that most computer users
dealing with such languages are actually only too familiar
with.
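The effect is easy to reproduce. The following Python sketch
uses a made-up sample word and simply decodes UTF-8 octets as if
they were ISO-8859-1:

```python
original = "café"

# What the server puts on the wire.
wire = original.encode("utf-8")

# What a client sees if it assumes ISO-8859-1.
displayed = wire.decode("iso-8859-1")
```

Here the single accented letter shows up as two ISO-8859-1
characters (A-tilde followed by the copyright sign), while the
ASCII part of the word is untouched.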


Some additional comments for people concerned about these
issues:

Previous Discussions
--------------------
To those of you to whom I give the impression of reopening
a point that has already been beaten to death, please note
that this is not true. "charset" issues and defaults for
entity content have been discussed repeatedly and vigorously
on this list, and a reasonable solution, considering all the
backwards compatibility issues, has been found.
However, even after scanning this year's list archives
in great detail, I have not found any serious discussion
of I18N issues in headers and warnings.


Procedural Concerns
-------------------
The current HTTP 1.1 draft is beyond last call, waiting to
become an RFC. I do not know whether last minute changes
can or should be made, but I have to say I don't care.
Whether the issues I mentioned above are solved by a last
minute change, a separate RFC, a mutual understanding in
this group, or whatever, is of minor concern if they are
solved at all. [The reference to RFC 1522 has to be changed
anyway to its superseding RFC 2047.]


Many thanks in advance for your consideration.

Regards,	Martin Du"rst.

----
Dr.sc.  Martin J. Du"rst			    ' , . p y f g c R l / =
Institut fu"r Informatik			     a o e U i D h T n S -
der Universita"t Zu"rich			      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----

Received on Monday, 16 December 1996 10:29:31 UTC