URLs and i18n

Keld Jørn Simonsen (keld@dkuug.dk)
Wed, 31 Jan 1996 12:37:40 +0100

Message-Id: <199601311137.MAA11637@dkuug.dk>
From: keld@dkuug.dk (Keld Jørn Simonsen)
Date: Wed, 31 Jan 1996 12:37:40 +0100
To: uri@bunyip.com
Subject: URLs and i18n

Dan.Oscarsson@malmo.trab.se writes:

> >I propose that any character here be allowed, except for the 
> >URL syntax characters, (things like < / : ) - in the non-DNS
> >part of the URL. Remember these are abstract characters, and
> >there is no binding to for example ISO 10646 in the sense
> >of a character repertoire, or to any encoding (charset).

> In part this is what I said in my original message. I suggested
> that it could be defined that if characters are not encoded,
> they should be assumed to be coded as 10646 when transmitted
> digitally. Of course one could add a charset tag like:
> http://host.x.y(iso 8859-1)/dir1/file.html
> if the need is to use another coding than 10646.
> As long as the characters used are of the iso 8859-1 subset, a
> URL could be transmitted with 8-bit bytes as of today.

OK, I got your point that iso-8859-1 is the http default.
Anything else should be labelled in some way by http.

> About DNS - I never suggested that we should be allowed to use
> 8-bit characters in DNS (though DNS can handle 8-bit characters).
> It is the part after locationpart that needs i18n.
> The DNS part does not belong to these working groups, though it is
> high time 8-bit characters were allowed in DNS too.

I have explicitly kept DNS out of the discussion, so we agree.

> >2. Use of URLs in HTTP.
> >
> >Here François proposes UTF-8. In principle I sympathise with
> >this proposal - and I could agree to this being the default.
> >The current state is that only a restricted US-ASCII set is allowed,
> >and for octets with the high bit (codes 128-255) you can use
> >the %xx to keep it in 7-bit representation.

> HTTP is defined as 8-bit and there is nothing forbidding 8-bit
> characters to be used in HTTP today. Most servers work fine if
> you send them 8-bit characters, I do it every day.
> UTF-8 would break current usage. Basic character set of HTTP/HTML
> is iso 8859-1, UTF-8 is not iso 8859-1 compatible.

Except that URLs are defined to be 7-bit. I agree that UTF-8 and
iso-8859-1 are conflicting.
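
To make the conflict concrete, here is a small sketch in Python (my choice
of language for illustration; the names escape and misread are made up for
this example). It takes one non-ASCII character, encodes it in both
charsets, and shows the %xx form each would take in a 7-bit URL.

```python
from urllib.parse import quote

ch = "\u00f8"  # LATIN SMALL LETTER O WITH STROKE

latin1_octets = ch.encode("iso-8859-1")  # a single octet
utf8_octets = ch.encode("utf-8")         # two octets


def escape(octets: bytes) -> str:
    """Return the %xx form that keeps the octets within 7 bits."""
    return quote(octets, safe="")


print(escape(latin1_octets))  # %F8
print(escape(utf8_octets))    # %C3%B8

# A server that assumes iso-8859-1 reads the UTF-8 octets as two characters.
misread = utf8_octets.decode("iso-8859-1")
print(len(misread))  # 2
```

The same abstract character gives two different octet sequences, so an
unlabelled URL cannot be interpreted reliably by a server that guesses.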

> >Also I think the burden should be placed on the server rather than the
> >client, as it is the server which is specialized and references
> >a store with the need, while every client in the world should be
> >able to reference that specific server's data (via e.g. URLs coming
> >from other documents.) The server is where the intelligence is
> >needed and can be expected, while the client may stay dumb.

> It sounds good, but one important thing is that the user must
> be able to enter, in a URL location input field, a URL with
> non-ASCII characters and not get it encoded as some idiotic
> MS-DOS character set that is not used by the server!
> This needs to be solved.

Yes, of course. If the user enters a URL with non-ASCII chars,
then it is the client's responsibility to convert it into something that
can be understood by the server with the new i18n URL spec, say
the iso-8859-1 or utf-8 charset.
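
A minimal sketch of what the client side could do, in Python. The function
name encode_path and the sample path are only assumptions for illustration;
the point is that the client picks the charset the spec calls for rather
than whatever its local platform uses.

```python
from urllib.parse import quote


def encode_path(path: str, charset: str) -> str:
    """Encode the path in the given charset and %xx-escape the octets.

    The "/" separators are URL syntax characters and are left alone.
    """
    return quote(path, safe="/", encoding=charset)


# Hypothetical path typed by the user: "blabaer" with a-ring and ae-ligature.
typed = "/dir1/bl\u00e5b\u00e6r.html"

print(encode_path(typed, "iso-8859-1"))  # /dir1/bl%E5b%E6r.html
print(encode_path(typed, "utf-8"))       # /dir1/bl%C3%A5b%C3%A6r.html

# What goes wrong if the client just uses an MS-DOS code page instead:
print(encode_path(typed, "cp850"))       # /dir1/bl%86b%91r.html
```

The last line is the case Dan complains about: the octets are valid, but
they mean something else to a server that expects iso-8859-1.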

> >
> >3. Use of URLs in HTML.
> >
> It is not acceptable to define that during transmission all
> URLs must be encoded and therefore request the www-server to
> translate every document it handles. We cannot place too great a burden
> on the server, the large CPU power lies on the client side.

URLs are always encoded - abstract characters encoded in some character
set. I am not talking about translating whole documents, only
the URLs. The server gets a URL in a request from the client, and then
the server has to translate the URL into its native charset (possibly
the same). Considering all the other work done on a URL before
the server can deliver a document to a client, this conversion
does not have a big impact.
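
The server-side conversion I have in mind could be sketched like this in
Python. The function name to_native and its parameters are assumptions for
the example, not part of any spec: the server undoes the %xx escapes,
interprets the octets in the charset the URL was sent in, and re-encodes
them in whatever charset its own store uses.

```python
from urllib.parse import unquote_to_bytes


def to_native(request_path: str, url_charset: str, native_charset: str) -> bytes:
    """Translate a requested URL path into the server's native charset."""
    octets = unquote_to_bytes(request_path)  # undo the %xx escapes
    text = octets.decode(url_charset)        # octets -> abstract characters
    return text.encode(native_charset)       # characters -> native octets


# UTF-8 on the wire, iso-8859-1 file names on the server.
print(to_native("/dir1/bl%C3%A5b%C3%A6r.html", "utf-8", "iso-8859-1"))

# Same charset on both sides: the octets come out unchanged.
print(to_native("/dir1/bl%E5b%E6r.html", "iso-8859-1", "iso-8859-1"))
```

This touches only the request path, a few dozen octets per request, and
never the document body.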

> >
> UTF-8 is bad as I stated above as it breaks compatibility with
> current usage. Suggest either a UTF-encoding compatible with
> iso 8859-1 or that the HTTP protocol is extended with a UCS-2 or UTF-8
> mode (could be done with prefix characters to the request line).

Point taken. I think we need to label the URL charset in the header
anyway - when it is not iso-8859-1 or ascii - as all other charset
info is labelled in HTTP/HTML. For compatibility reasons I propose that
this be done with a header like url-charset - then servers
conforming to HTTP/1.0 can deal with this new protocol specification
without change, as per the draft HTTP spec. This would not be possible
if we added syntax to the URL specification.
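
A rough sketch of how such a header could work, in Python. Everything here
is hypothetical: the header name URL-Charset, the helper names, and the
choice to omit the header for iso-8859-1 and US-ASCII are my assumptions
for illustration, not anything in the draft HTTP spec.

```python
from urllib.parse import quote, unquote_to_bytes

DEFAULT_CHARSET = "iso-8859-1"  # assumed default when no header is sent


def build_request(path: str, charset: str = DEFAULT_CHARSET) -> str:
    """Build an HTTP/1.0 GET request, labelling the URL charset if needed."""
    target = quote(path, safe="/", encoding=charset)
    lines = ["GET " + target + " HTTP/1.0"]
    if charset.lower() not in ("iso-8859-1", "us-ascii"):
        lines.append("URL-Charset: " + charset)  # hypothetical header
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_request(raw: str) -> str:
    """Return the requested path as characters, honouring URL-Charset."""
    head = raw.split("\r\n\r\n", 1)[0].split("\r\n")
    _method, target, _version = head[0].split(" ")
    headers = {}
    for line in head[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    charset = headers.get("url-charset", DEFAULT_CHARSET)
    return unquote_to_bytes(target).decode(charset)


path = "/dir1/bl\u00e5b\u00e6r.html"  # hypothetical non-ASCII path
print(build_request(path, "utf-8"))
print(parse_request(build_request(path, "utf-8")) == path)  # True
```

A server that does not know the header simply ignores it and treats the
%xx octets as before, which is the compatibility I am after. The request
line itself is unchanged.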