- From: <Dan.Oscarsson@malmo.trab.se>
- Date: Sun, 28 Jan 1996 12:12:02 +0100
- To: html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
I have looked at the HTTP/1.1 draft, the i18n draft, and the URL and HTML RFCs. The internationalisation efforts are moving forward, but I feel they are still not strong enough, nor thorough enough in all areas. Internationalisation is needed for both users and implementors.

Users want to use their natural language and characters in all places. This means that both URLs and HTML documents should be able to use un-encoded characters! In URLs today many characters are encoded, some because they must be, some because there is no good definition of which character set to use. For the user it is totally unacceptable to have to write encoded characters! Do you want to enter:

    %66%f6%72%62%e4%74%74%72%69%6e%67%61%72%2e%68%74%6d%6c

Am I going to tell my users: type in this string and you will get there? (A sketch below shows what that string actually says.) And URLs are used in many places: in the browser, when writing HTML documents, and elsewhere. Of course, you can say that the browser and the HTML editor will hide this from the user, but they do not do that today, and HTML documents can be edited with a plain text editor. URLs are also used in printed matter, where no software can hide the ugly encodings for you. HTML documents contain both URLs and running text; the natural thing for a user is to use normal characters everywhere, in URLs as well as in text. Encoded URLs and escape sequences are not for the user.

The direction towards UCS (ISO 10646/Unicode) is the right one, but be more mandatory about it. Define that URLs are written using the UCS character coding, so that non-ASCII characters need not be encoded; only URLs containing non-UCS data would have to use encoding. This would allow most URLs to be written with printable characters, and those characters would have well-defined code values. Of course, some countries cannot print letters outside ASCII (or some other subset of UCS) in their papers and books, but why should we who normally use characters outside ASCII be tortured by this? In our countries URLs can be printed in the easily understandable way, and in English-speaking areas they can be encoded.

As long as a URL contains only characters from the ISO 8859-1 subset of UCS, it can be sent as 8-bit characters, otherwise as UCS. This is easy to handle in HTTP: define that a request line beginning with the two bytes 0xFE 0xFF switches to UCS-2, and that all others use ISO 8859-1 (see the second sketch below). This allows all following data (headers etc.) to be in a defined character set without having to be encoded everywhere, and it lets today's HTTP be used in a compatible way.

For HTML documents, defining UCS as the character set is good: they can then be transmitted in 8-bit mode using the ISO 8859-1 subset, or in UCS-2 or UCS-4 mode, keeping compatibility with today. But for the implementor it would be better if only implementation level 2 were used. Most characters could then be identified by a single character code, instead of several. Under level 3, which allows any combination of normal and combining characters, the letter "i" can be written either as the single code 0x0069 or as the sequence 0x0131 0x0307 (see the third sketch below). Implementation level 3 is also not compatible with today's browsers if they send ISO 8859-1 text using numeric character references and combining characters.

I would recommend that all these documents, in the many places where they now say what software "may assume" about character sets, instead say "should" or "must assume", so that it is mandatory. And the mandatory character set should be UCS and its subset ISO 8859-1.
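To make that concrete, here is a small Python sketch (illustrative only) that undoes the escaping; the result is a perfectly ordinary Swedish file name:

    import urllib.parse

    encoded = "%66%f6%72%62%e4%74%74%72%69%6e%67%61%72%2e%68%74%6d%6c"
    # The %xx values are ISO 8859-1 codes, so decode them as such.
    decoded = urllib.parse.unquote(encoded, encoding="iso-8859-1")
    print(decoded)   # -> förbättringar.html ("improvements.html")

Every single character, even the plain ASCII ones, has been hidden behind an escape. No user should ever have to see, let alone type, this form.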
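The 0xFE 0xFF rule is equally simple to implement. A minimal sketch, assuming a server that reads the raw request line as bytes (the function name is my own):

    def decode_request_line(raw: bytes) -> str:
        # Proposed rule: a request line beginning with the bytes 0xFE 0xFF
        # is UCS-2; anything else is ISO 8859-1. Existing clients never
        # send these two bytes first, so they keep working unchanged.
        if raw[:2] == b"\xfe\xff":
            # UCS-2, big-endian (for these characters identical to UTF-16-BE)
            return raw[2:].decode("utf-16-be")
        return raw.decode("iso-8859-1")

    # An old-style request and a UCS-2 request both come out right:
    print(decode_request_line(b"GET /index.html HTTP/1.0"))
    line = b"\xfe\xff" + "GET /förbättringar.html HTTP/1.0".encode("utf-16-be")
    print(decode_request_line(line))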
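And the level 2 versus level 3 problem: the two spellings of "i" render alike but are different code sequences, so under level 3 every comparison in a browser or server would need extra matching logic. Another illustrative snippet:

    i_single   = "\u0069"          # LATIN SMALL LETTER I (level 2 spelling)
    i_combined = "\u0131\u0307"    # DOTLESS I + COMBINING DOT ABOVE (level 3)

    # A simple code-by-code comparison, which is all most software does,
    # sees two different strings even though they display the same.
    print(i_single == i_combined)            # False
    print(len(i_single), len(i_combined))    # 1 2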
A mandatory character set like this would simplify things for implementors and give them a very clear definition of what to use. I am so tired of stupid browsers and other programs that take an incoming ISO 8859-1 document, translate it to the Macintosh character set on a Mac, and then send information from the document back to the server in the Macintosh character set.

Dan
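P.S. That Macintosh round trip is easy to reproduce. An illustrative Python sketch (the codec names are those of modern libraries; the bug itself is what matters):

    original = "förbättringar"               # served as ISO 8859-1
    wire_out = original.encode("iso-8859-1")

    # A sloppy Mac client decodes the page correctly for display, but then
    # sends text back in its native character set, without declaring it.
    displayed = wire_out.decode("iso-8859-1")
    wire_back = displayed.encode("mac_roman")

    # The server, still assuming ISO 8859-1, now reads garbage:
    print(repr(wire_back.decode("iso-8859-1")))   # 'f\x9arb\x8attringar'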