Re: html, http, urls and internationalisation

As a followup:

I think the discussion on i18n of URLs has three aspects:

1. URLs themselves

2. use of URLs in HTTP

3. use of URLs in HTML

That's why, contrary to Larry's plea, you see this message here, here,
and here.

1. URLs themselves.

These are at an abstract character level. As Larry and François
correctly point out, you cannot see what the charset is
when you look at a business card or a URL in the newspaper.

I propose that any character be allowed here, except for the
URL syntax characters (things like < / :), in the non-DNS
part of the URL. Remember that these are abstract characters, and
there is no binding to, for example, ISO 10646 in the sense
of a character repertoire, or to any encoding (charset).

2. Use of URLs in HTTP.

Here François proposes UTF-8. In principle I sympathise with
this proposal, and I could agree to this being the default.
The current state is that only a restricted US-ASCII set is allowed,
and for octets with the high bit set (codes 128-255) you can use
the %xx escape to keep the URL in a 7-bit representation. With a label
for the charset (Glenn Adams once proposed a URL-encoding
header), this can also encompass other charsets for the convenience
of the browser. I think we need to be able to specify something
other than UTF-8; for example, Big5 is not covered by ISO 10646.
Allowable charsets should be those allowed for WWW services
in general.
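
The %xx escaping described above can be sketched as follows. This is
only an illustration using Python's urllib.parse; the function name
escape_path, the sample path, and the idea of passing the charset as
a parameter are assumptions made for the example, not part of any
proposal in this thread.

```python
from urllib.parse import quote

def escape_path(path: str, charset: str = "utf-8") -> str:
    """Encode the abstract characters of `path` in `charset`, then
    %xx-escape every octet that is not an unreserved ASCII character.
    The '/' separator is left alone (hypothetical helper)."""
    return quote(path, safe="/", encoding=charset, errors="strict")

# The same abstract characters give different octets per charset,
# which is why the charset has to be agreed on or labelled.
utf8_form = escape_path("/blåbær")                       # default: UTF-8
latin1_form = escape_path("/blåbær", charset="iso-8859-1")

print(utf8_form)    # /bl%C3%A5b%C3%A6r
print(latin1_form)  # /bl%E5b%E6r
```

Both results stay within 7-bit US-ASCII, but a receiver cannot tell
them apart without knowing which charset produced the octets.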

Also, I think the burden should be placed on the server rather than the
client, as it is the server that is specialized and references
a data store with the need, while every client in the world should be
able to reference that specific server's data (e.g. via URLs coming
from other documents). The server is where the intelligence is
needed and can be expected, while the client may stay dumb.
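
One way a server might carry that burden is sketched below, under
stated assumptions: the function resolve_path, the optional charset
label, and the fallback order (UTF-8 first, then ISO-8859-1) are all
hypothetical choices for illustration, not something specified in
this thread.

```python
from urllib.parse import unquote_to_bytes

def resolve_path(escaped_path, labelled_charset=None,
                 fallbacks=("utf-8", "iso-8859-1")):
    """Turn a %xx-escaped path back into abstract characters.

    The server tries the labelled charset first (if the client sent
    one), then its own fallbacks. Hypothetical helper."""
    octets = unquote_to_bytes(escaped_path)   # undo the %xx escaping
    candidates = ([labelled_charset] if labelled_charset else [])
    candidates += list(fallbacks)
    for charset in candidates:
        try:
            return octets.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue                          # wrong or unknown charset
    raise ValueError("cannot interpret path in any known charset")

# A dumb client may send either form; the server sorts it out.
from_utf8 = resolve_path("/bl%C3%A5b%C3%A6r")
from_latin1 = resolve_path("/bl%E5b%E6r")
from_label = resolve_path("/bl%E5b%E6r", labelled_charset="iso-8859-1")
```

Note that guessing is only a heuristic: ISO-8859-1 accepts any octet
sequence, so a real server would lean on the label or on knowledge of
its own data store rather than on the fallback order alone.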

3. Use of URLs in HTML.

Here it should be possible to write an HTML document in a given
charset, and then reference the (abstract) characters in the URL, just
as it is possible to write characters in the rest of the HTML document.
That is, the normal characters of the document charset can be used,
like full ISO-8859-1 in normal HTML docs, and full Unicode in
Unicode docs. The ways of generating out-of-band characters,
like &aring; and &#xxxx;, should also be allowed in HTML URL strings.
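
As an illustration of the point above, the three spellings below all
denote the same abstract characters once the character references are
resolved. This is a sketch using Python's html.unescape; the sample
path is made up for the example.

```python
from html import unescape

# Three ways an author might write the same URL path in HTML source:
literal = "/blåbær"               # characters of the document charset
named = "/bl&aring;b&aelig;r"     # named character references
numeric = "/bl&#229;b&#xE6;r"     # decimal and hexadecimal references

# Resolving the references yields the abstract characters, which is
# what a later conversion to a wire charset would start from.
resolved = [unescape(s) for s in (literal, named, numeric)]
all_same = len(set(resolved)) == 1
```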

4. Result

In this way we have a natural way to write natural URLs in printed
matter etc., capable of serving the whole world (on the World Wide
Web :-).

There is a natural way to write URLs in HTML docs, and these URLs
can then be converted into a charset that is suitable for HTTP
communication with a server (the default being UTF-8). The server then
has the responsibility of converting the charset-encoded URL into
a reference in its data store and fetching the data.

Keld

Received on Wednesday, 31 January 1996 15:11:59 UTC