Re: html, http, urls and internationalisation
Wed, 31 Jan 1996 10:26:12 +0100

>That's why, contrary to Larry's plea, you see this message here, here,
>and here.
Yes, that is right. And as URLs are used in HTTP and HTML and are
an integral part of both, it is very important that
they are also considered when designing HTTP and HTML. Think of
what would happen if you designed a house and said that the basement
is designed somewhere else, so you can ignore it. Your house may fall
to pieces because the basement was made of incompatible material.
It is the same here: you cannot ignore a fundamental part of HTTP and HTML.

It is HTTP and HTML that must be specified so that they can handle
URLs in a natural way, even for non-English speakers!

>1. URLs themselves.
>These are at an abstract character level, as Larry and Franc,ois
>correctly point out: you cannot see what the charset is
>when you look at a business card or a URL in the newspaper.
>I propose that any character here be allowed, except for the 
>URL syntax characters, (things like < / : ) - in the non-DNS
>part of the URL. Remember these are abstract characters, and
>there is no binding to for example ISO 10646 in the sense
>of a character repertoire, or to any encoding (charset).
In part this is what I said in my original message. I suggested
that it could be defined that if characters are not encoded,
they should be assumed to be coded as 10646 when transmitted
digitally. Of course one could add a charset tag like:
http://host.x.y(iso 8859-1)/dir1/file.html
if there is a need to use a coding other than 10646.
As long as the characters used are within the iso 8859-1 subset, a
URL could be transmitted with 8-bit bytes, as is done today.
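
To make the charset tag idea concrete, here is a minimal sketch in
Python of how such a tagged URL could be taken apart. The "(charset)"
syntax after the host is only the suggestion made above, not an existing
standard, and the name split_tagged_url is hypothetical.

```python
import re

# Hypothetical syntax from this message: host followed by "(charset)".
TAGGED = re.compile(
    r"^(?P<scheme>[a-z]+)://(?P<host>[^/()]+)"
    r"(?:\((?P<charset>[^)]+)\))?(?P<path>/.*)?$"
)

def split_tagged_url(url):
    """Return (scheme, host, charset or None, path) for a tagged URL."""
    m = TAGGED.match(url)
    if m is None:
        raise ValueError("not a tagged URL: %r" % url)
    charset = m.group("charset")
    if charset is not None:
        # "iso 8859-1" -> "iso-8859-1", a spelling Python's codecs accept
        charset = charset.replace(" ", "-")
    return m.group("scheme"), m.group("host"), charset, m.group("path") or "/"

scheme, host, charset, path = split_tagged_url(
    "http://host.x.y(iso 8859-1)/dir1/file.html")
octets = path.encode(charset)  # the bytes that would go on the wire
```

When no tag is present the function returns None for the charset, and
the caller would apply whatever default is agreed on.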

About DNS: I never suggested that we should be allowed to use
8-bit characters in DNS (though DNS can handle 8-bit characters).
It is the part after the location part that needs i18n.
The DNS part does not belong to these working groups, though it is
high time 8-bit characters were allowed in DNS too.

>2. Use of URLs in HTTP.
>Here Franc,ois proposes UTF-8. In principle I sympathise with
>this proposal - and I could agree to this being the default.
>The current state is that only a restricted US-ASCII set is allowed,
>and for octets with the high bit (codes 128-255) you can use
>the %xx to keep it in 7-bit representation.
HTTP is defined as 8-bit, and nothing forbids the use of 8-bit
characters in HTTP today. Most servers work fine if
you send them 8-bit characters; I do it every day.
UTF-8 would break current usage. The basic character set of HTTP/HTML
is iso 8859-1, and UTF-8 is not compatible with iso 8859-1: the
characters 128-255 take one octet each in iso 8859-1 but two in UTF-8.
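
The incompatibility is easy to see with a single character. The
following is a small Python sketch using only the standard library; the
choice of a-ring as the example character is mine.

```python
from urllib.parse import quote

ch = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE

latin1 = ch.encode("iso-8859-1")  # one octet, as sent by 8-bit clients today
utf8 = ch.encode("utf-8")         # two octets under the UTF-8 proposal

# The %xx forms differ in the same way.
escaped_latin1 = quote(ch, encoding="iso-8859-1")
escaped_utf8 = quote(ch, encoding="utf-8")

# A server that still assumes iso 8859-1 reads the UTF-8 octets
# as two unrelated characters.
misread = utf8.decode("iso-8859-1")

print(latin1, utf8)
print(escaped_latin1, escaped_utf8)
print(misread)
```

So a URL that resolves today when sent as iso 8859-1 octets would no
longer match the same file name once the client switches to UTF-8,
unless the server converts.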

>Also I think the burden should be placed on the server rather than the
>client, as it is the server which is specialized and references
>a store with the need, while every client in the world should be
>able to reference that specific server's data (via eg. URLs coming
>from other documents.) The server is where the intelligence is
>needed and can be expected, while the client may stay dumb.
It sounds good, but one important thing is that the user must
be able to enter a URL with non-ASCII characters in a URL location
input field and not have it encoded in some idiotic
MS-DOS character set that is not used by the server!
This needs to be solved.
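
A short Python sketch of the problem, assuming the client encodes what
the user typed in MS-DOS code page 437 while the server stores its file
names in iso 8859-1. The file name is made up for illustration.

```python
name = "bl\u00e5b\u00e6r.html"     # file name as the user types it

sent = name.encode("cp437")        # octets a DOS code page client would send
seen = sent.decode("iso-8859-1")   # what an iso 8859-1 server reads

expected = name.encode("iso-8859-1")  # octets the server actually stores

print(sent, expected)
print(seen == name)  # False: the request does not match the stored name
```

The two non-ASCII letters arrive as octets 0x86 and 0x91, which in
iso 8859-1 are not letters at all, so the lookup fails.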

>3. Use of URLs in HTML.
>Here it should be possible to write an HTML document in a given
>charset, and then reference the (abstract) characters in the URL, just
>like it is possible to write characters in the rest of the HTML document.
>That is, the normal characters of the document charset can be used,
>like full iso-8859-1 in normal HTML docs, and full Unicode in 
>Unicode docs. Also the way of generating out-of-band characters
>should be allowed in HTML URL strings, like &a-ring and &#xxxx;

This is what many of us want. If I use iso 8859-1 as the character
set of my HTML document, the URLs must be able to use that too,
especially as some of my own HTML documents stored on my system
are named using iso 8859-1 and not using some encoding scheme.
Of course, I can use encoding if I want to, or if I need to address
a URL that does not use iso 8859-1.
It is not acceptable to define that during transmission all
URLs must be encoded, and therefore require the WWW server to
translate every document it handles. We cannot place too great a burden
on the server; the large CPU power lies on the client side.
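
As an illustration of the client doing this work, here is a minimal
Python sketch that resolves character references inside an HREF value
and then produces the iso 8859-1 octets for the request. The HREF value
is invented, and I assume the entity is written in its full form,
&aring; with the semicolon.

```python
from html import unescape

# HREF value as it might appear in an iso 8859-1 HTML document,
# mixing a named entity and a numeric character reference.
href = "/dir1/bl&aring;b&#230;r.html"

path = unescape(href)               # abstract characters of the URL
octets = path.encode("iso-8859-1")  # what the client sends, unescaped

print(path)
print(octets)
```

The server receives the same octets it uses for the file name and has
nothing to translate.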

>4. Result
>In this way we have a natural way to write natural URLs in printed
>matter, etc capable of serving the whole world (on the world wide web).
>There is a natural way to write URLs in HTML docs, and these URLs
>can then be converted into a charset that is suitable for HTTP
>communication with a server (default is UTF-8). The server then
>has the responsibility of converting the charset encoded URL into
>a reference in its data store and fetch the data.
UTF-8 is bad, as I stated above, because it breaks compatibility with
current usage. I suggest either a UTF encoding compatible with
iso 8859-1, or that the HTTP protocol is extended with a UCS-2 or UTF-8
mode (could be done with prefix characters to the request line).
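
A rough Python sketch of what such a mode would mean on the server
side. How the mode is signalled on the request line is left open here;
the mode names and the function decode_path are hypothetical, and
Python's "utf-16-be" codec stands in for UCS-2.

```python
# Hypothetical mode names mapped to Python codecs.
CODECS = {
    "latin1": "iso-8859-1",  # default, matches current usage
    "utf8": "utf-8",
    "ucs2": "utf-16-be",     # stand-in for UCS-2, an assumption
}

def decode_path(raw, mode="latin1"):
    """Decode the octets of a request path according to the request mode."""
    return raw.decode(CODECS[mode])

old_style = decode_path(b"/dir1/fil\xe5.html")              # no mode given
new_style = decode_path(b"/dir1/fil\xc3\xa5.html", "utf8")  # UTF-8 mode
print(old_style == new_style)
```

Requests without a mode keep working exactly as today, which is the
point of making iso 8859-1 the default.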