html, http, urls and internationalisation

I have looked at the HTTP 1.1 draft, the i18n draft, and the URL and HTML
RFCs. The internationalisation efforts are going forward, but I still feel
they are not strong and consistent enough in all areas.

Internationalisation is needed for both users and implementors.

For the users: a user wants to use natural language and natural characters
in all places. This means that both URLs and HTML documents should be
able to use un-encoded characters!

In URLs today many characters are encoded, some because they must be, some
because there is no good definition of which character set to use.
For the user it is totally unacceptable to have to write encoded characters!
Do you want to enter: %66%f6%72%62%e4%74%74%72%69%6e%67%61%72%2e%68%74%6d%6c
Am I going to tell my users: type in this string and you will get there?
And URLs are used in many places: in the browser, when writing HTML documents,
and elsewhere. Of course, you can say that the browser and the
HTML editor will hide this from the user, but they do not do that today, and
HTML documents can be edited with a text editor. Also, URLs appear in
printed matter, where no software can hide the ugly encodings for you.
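To illustrate the point, the encoded string above is just an ordinary
ISO 8859-1 filename hidden behind hex escapes. A minimal sketch (in modern
Python, which of course did not exist in this form in 1996) that undoes
the encoding:

```python
from urllib.parse import unquote

# The percent-encoded string from the example above; each %XX pair
# is one ISO 8859-1 byte.
encoded = "%66%f6%72%62%e4%74%74%72%69%6e%67%61%72%2e%68%74%6d%6c"
decoded = unquote(encoded, encoding="iso-8859-1")
print(decoded)  # förbättringar.html ("improvements.html" in Swedish)
```

Nobody should have to type the encoded form by hand when the decoded form
is a perfectly printable filename.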

In HTML there are both URLs and text. The natural thing for a user
to do is to use normal characters everywhere, both in URLs and in text.
Encoded URLs and escape sequences are not for the user.

The direction of going towards UCS (ISO 10646/Unicode) is the right way, but
be more mandatory about it.
Define that URLs are written using UCS character coding, so that characters
outside ASCII need not be encoded. For URLs with non-UCS data you would
still have to use encoding. This would allow most URLs to be written with
printable characters, and they would have a well-defined code value. Of
course, some countries cannot print letters outside ASCII (or some other
subset of UCS) in their papers and books, but why do we who normally use
characters outside ASCII have to be tortured by this? In our countries URLs
can be printed in the easily understandable way, and in English-speaking
areas they can be encoded.
As long as a URL only contains characters from the ISO 8859-1 subset of
UCS, it can be sent as 8-bit characters; otherwise as UCS. This can easily
be handled in HTTP by defining that request lines that begin with the two
bytes 0xFE 0xFF switch to UCS-2, and all others use ISO 8859-1. This allows
all following data (headers etc.) to be in a defined character set without
having to encode it everywhere. It would also allow today's HTTP to be used
in a compatible way.
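The switching rule described above can be sketched as follows (the function
name is mine, and big-endian UCS-2 is my assumption for what the 0xFE 0xFF
signature implies; this is a sketch of the proposal, not any existing spec):

```python
def decode_request_line(raw: bytes) -> str:
    """Decode a request line per the proposed rule: lines beginning with
    the bytes 0xFE 0xFF are UCS-2 (big-endian); all others are ISO 8859-1."""
    if raw[:2] == b"\xfe\xff":
        # UCS-2 covers the Basic Multilingual Plane, so UTF-16BE decodes it.
        return raw[2:].decode("utf-16-be")
    return raw.decode("iso-8859-1")

# An old client sends plain 8-bit data; a new one prefixes the signature.
line = "GET /förbättringar.html HTTP/1.1"
assert decode_request_line(line.encode("iso-8859-1")) == line
assert decode_request_line(b"\xfe\xff" + line.encode("utf-16-be")) == line
```

Old servers and clients that only know 8-bit data keep working unchanged,
which is the compatibility property claimed above.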

For HTML documents, defining UCS as the character set is good: they can
then be transmitted in 8-bit mode using the ISO 8859-1 subset, or in
UCS-2 or UCS-4 mode, allowing compatibility with today.
But it would be better for the implementor if only implementation level 2
were used. This would allow most characters to be defined by a single
character code instead of several. For example, using level 3, which allows
any combination of normal and combining characters, the
letter i could be defined as code 0x69 or by 0x0131 0x0307.
Implementation level 3 is also not compatible with today's browsers
if they send an ISO 8859-1 text using numeric character references and
combining characters.
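The ambiguity can be demonstrated with a letter that has a canonical
precomposed/combining pair in today's Unicode tables (I use é here instead
of the i example above, since é has a defined canonical composition):

```python
import unicodedata

# 'é' as one precomposed code point, and as base letter plus combining mark.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"     # U+0065 + U+0301 COMBINING ACUTE ACCENT
assert precomposed != combined                 # different code sequences...
assert unicodedata.normalize("NFC", combined) == precomposed  # ...same letter
```

Two byte sequences that render as the same letter but compare unequal is
exactly the kind of trap that restricting documents to single-code
characters (level 2) avoids.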


I would recommend that all documents, instead of saying in many places that
implementations "may assume" a character set, should say "should" or "must
assume", so that it is mandatory. And the mandatory character set should be
UCS and its subset ISO 8859-1. This would simplify things for implementors
and give them a very clear definition of what is to be used. I am so tired
of stupid browsers and other programs that take an incoming ISO 8859-1
document, translate it to the Macintosh character set, and send information
from the document back to the server in the Macintosh character set.

   Dan

Received on Sunday, 28 January 1996 03:22:09 UTC