Re: html, http, urls and internationalisation from Larry Masinter on 1996-01-28 (ietf-http-wg@w3.org from January to March 1996)

From: Larry Masinter <masinter@parc.xerox.com>
Date: Sun, 28 Jan 1996 13:09:47 PST
To: keld@dkuug.dk
Cc: Dan.Oscarsson@malmo.trab.se, html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, maits@dkuug.dk
Message-Id: <96Jan28.131001pst.2733@golden.parc.xerox.com>

Sigh, it's really frustrating to have talked this out so many times in
the URI mailing list only to have the same discussion again now in two
other working group mailing lists.

URLs are written with characters, not octets. The characters in a URL
are used to represent octets, not characters. The
characters "h", "t", "t", "p" etc. in 
     http://foo.com/abcdefg

are used to create separate octet strings

      66 6f 6f 2e 63 6f 6d (foo.com)

and 
      2f 61 62 63 64 65 66 67 (/abcdefg)

which are then fed to the http protocol, respectively, as the DNS name
to which the connection is opened and the string sent in the GET.

To summarize:

	URL: sequence of characters  
	URL interpretation:
	  parse URL, extract sequences of octets, send octets
	  to appropriate protocol based on scheme
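
As a rough illustration of the summary above, here is a minimal sketch
in Python. It assumes the standard library's urllib.parse as the URL
parser; the function name url_to_octets is made up for this example,
and no particular HTTP client is claimed to work exactly this way.

```python
from urllib.parse import urlsplit, unquote_to_bytes

def url_to_octets(url):
    """Split a URL (a sequence of characters) into the octet strings
    that would be handed to the protocol named by the scheme."""
    parts = urlsplit(url)
    # The host name characters become the octets used for the DNS lookup.
    host_octets = parts.hostname.encode("ascii")
    # The path characters become octets; %XX escapes stand for one octet each.
    path_octets = unquote_to_bytes(parts.path)
    return parts.scheme, host_octets, path_octets

scheme, host, path = url_to_octets("http://foo.com/abcdefg")
print(scheme)          # http
print(host.hex(" "))   # 66 6f 6f 2e 63 6f 6d
print(path.hex(" "))   # 2f 61 62 63 64 65 66 67
```

Note that nothing in the sketch decides what characters, if any, the
path octets represent; that is left to the protocol or to the server.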

In some protocols, those sequences of octets are then subsequently
interpreted as representations of characters in a given character
encoding. In some cases, the protocol makes no such interpretation,
but some implementations of the protocols do.

> I would propose that URLs be written in the charset of the 
> document that references the url,

This is exactly the situation. URLs are sequences of characters and can
be written in newspapers or on business cards (which, not being
computer encodings, don't have a 'charset'). For those situations
where URLs are embedded in other documents, that embedding should use
the charset of the containing document. The repertoire of characters
allowed within URLs is intentionally restricted to allow such
embedding in almost all contexts.

>				possibly enhanced with
> the extensions that we make to get further characters, 
> for example &a-ring; or &#xxxx; 

This is the part that's impossible. You might imagine doing such a
thing, but it doesn't work if you then try to use URLs for the purposes
they are meant to serve.

Some folks want to deal with the variability in how particular
implementations of HTTP or FTP use sequences of octets to
represent characters, and, in particular, the characters that are shown
to the local user behind the HTTP or FTP server. So, if you have an
FTP or HTTP server that serves files from your file server, and your
file server uses Big5 or Unicode to represent file names,
you have to choose an encoding of Big5 or Unicode as octets in order
to deal with the FTP or HTTP protocols. It would be useful to
standardize that encoding, because there are new HTTP implementations
being delivered all the time, and even new FTP implementations.
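
To make the encoding choice concrete, here is a small sketch in Python.
The file name and the helper name escaped_path are invented for the
example, and UTF-8 is used only as one possible octet encoding of
Unicode; the point is just that the same file name yields different
octets, and so a different URL, depending on what the server picks.

```python
from urllib.parse import quote, unquote_to_bytes

def escaped_path(file_name, charset):
    """Encode a file name to octets in the given charset, then write
    those octets using only characters allowed in a URL."""
    octets = file_name.encode(charset)
    return "/" + quote(octets)

name = "中文"  # hypothetical file name on the server

big5_path = escaped_path(name, "big5")
utf8_path = escaped_path(name, "utf-8")
print(big5_path)
print(utf8_path)

# The URL itself carries no charset label. A client can recover the
# octets, but must already know the server's choice to get characters.
octets = unquote_to_bytes(utf8_path[1:])
print(octets.decode("utf-8") == name)  # True
```

A client handed only the escaped path has no way to tell which of the
two encodings was used, which is why agreeing on one would help.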

This is not an HTML issue, except for HTML forms that use METHOD=GET,
which I already discussed in a previous message.

Received on Sunday, 28 January 1996 13:12:43 UTC