http charset labelling

Keld Jørn Simonsen (keld@dkuug.dk)
Wed, 31 Jan 1996 17:55:50 +0100


Message-Id: <199601311655.RAA19593@dkuug.dk>
From: keld@dkuug.dk (Keld Jørn Simonsen)
Date: Wed, 31 Jan 1996 17:55:50 +0100
To: uri@bunyip.com
Subject: http charset labelling

Glenn Adams writes:

>     The problem still
>     exist that not all characters of the world is in ISO 10646.
> 
> Yes, I agree.  That is why I proposed a charset tagging syntax to be
> intrinsic to URLs.

I think there is agreement that we need some charset tagging on URLs.

The problem is how to tag it.

1. I am not sure I understand what Glenn means here. Would "intrinsic"
be in the sense that MIME has for its headers, the
=?charset?encoding?text?= thing? Something along these lines: in the
first part of the URL, where you have the protocol specification,
userid and password, port number and the domain, there could be a
field for a charset. This would then be an extension of the URL
syntax. An example could be after the port number:

      http://www.dkuug.dk:80:utf-8/maits/

2. Another place is at the end of the request line of a GET or POST
request in HTTP, for example:
  
      GET http://www.dkuug.dk/maits/ HTTP/1.1 utf-8

3. Yet another place could be in the headers of the GET request:

      GET http://www.dkuug.dk/maits/ HTTP/1.1
      Url-Charset: utf-8
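
As a rough illustration, a receiver could pick the charset out of each
of the three placements above along the following lines. This is only a
sketch in Python: the function names are made up for the example, and
none of the three syntaxes (the extra URL field, the extra request-line
field, the Url-Charset header) is standardised anywhere.

```python
def charset_from_url(url):
    """Option 1: hypothetical 'scheme://host:port:charset/path' form."""
    rest = url.split("://", 1)[1]          # drop the scheme
    authority = rest.split("/", 1)[0]      # host[:port[:charset]]
    fields = authority.split(":")
    # Only a third ':' field is treated as a charset tag.
    return fields[2] if len(fields) == 3 else None


def charset_from_request_line(line):
    """Option 2: hypothetical fourth field after the HTTP version."""
    fields = line.split()
    return fields[3] if len(fields) == 4 else None


def charset_from_headers(header_lines):
    """Option 3: hypothetical 'Url-Charset' request header."""
    for header in header_lines:
        name, _, value = header.partition(":")
        # Header names are compared case-insensitively.
        if name.strip().lower() == "url-charset":
            return value.strip()
    return None


print(charset_from_url("http://www.dkuug.dk:80:utf-8/maits/"))
print(charset_from_request_line(
    "GET http://www.dkuug.dk/maits/ HTTP/1.1 utf-8"))
print(charset_from_headers(["Url-Charset: utf-8"]))
```

In all three cases the tag is the same; the difference is only in which
parser has to learn about it: the URL parser, the request-line parser,
or the header handling.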

Discussion:

1. is general to all URL usage, so there would be no need to update
the protocols themselves. However, a server using HTTP/1.0 would not
understand this notion, and thus it would create havoc (I think).
The other thing is that the URL is not the right place to specify a
charset: it should not be necessary to specify the charsets of URLs
in newspapers and on business cards, as we agreed that URLs were
coding-independent information.

2. is HTTP-specific. It may cause some HTTP/1.0 servers to
goof, as there is a parameter that they do not expect.

3. should be backwards compatible, as servers may ignore
headers they don't understand (as per the HTTP/1.0 spec),
and they have a good chance of understanding the URL that is there,
possibly in semi-official ISO-8859-1 anyway (URLs are 7-bit,
HTTP is 8-bit ISO-8859-1 by default).
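
The compatibility argument for 2. and 3. can be illustrated with a
small Python sketch. It assumes a strict server that expects exactly
three fields on the request line, and a hypothetical newer server that
looks for a Url-Charset header and otherwise falls back to ISO-8859-1.
The names parse_request_line_strict and decode_path are made up for the
example, and the header table is assumed to use lower-cased names.

```python
from urllib.parse import unquote_to_bytes


def parse_request_line_strict(line):
    """Accept exactly 'METHOD target VERSION'; anything else fails."""
    # Tuple unpacking raises ValueError if there are not exactly 3 fields.
    method, target, version = line.split()
    return method, target, version


def decode_path(path, headers, default_charset="iso-8859-1"):
    """Decode %XX escapes using Url-Charset if present (assumed header).

    'headers' is a dict with lower-cased header names. A server that
    does not know the header simply never passes it in, which gives the
    ISO-8859-1 fallback.
    """
    charset = headers.get("url-charset", default_charset)
    return unquote_to_bytes(path).decode(charset)


# Option 2: the extra field breaks a strict request-line parser.
try:
    parse_request_line_strict("GET /maits/ HTTP/1.1 utf-8")
except ValueError:
    print("strict server rejects the request line")

# Option 3: the request line is unchanged, so it still parses.
print(parse_request_line_strict("GET /maits/ HTTP/1.1"))

# With the header the path is read as UTF-8, without it as ISO-8859-1.
print(decode_path("/k%C3%B8b/", {"url-charset": "utf-8"}))
print(decode_path("/k%F8b/", {}))
```

A server that ignores the header gets the right name only when the
escaped bytes happen to be ISO-8859-1, which is the "good chance"
mentioned above and not a guarantee.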

So basically there is not much difference between 2. and 3.:
both are protocol-specific and do not touch the URL syntax.
I dislike 1. as it implies writing the encoding in the URL.

keld