Re: revised "generic syntax" internet draft

Martin J. Duerst (mduerst@ifi.unizh.ch)
Tue, 22 Apr 1997 17:13:42 +0200 (MET DST)


Date: Tue, 22 Apr 1997 17:13:42 +0200 (MET DST)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Keld Jørn Simonsen <keld@dkuug.dk>
Cc: John C Klensin <klensin@mci.net>, Dan Oscarsson <Dan.Oscarsson@trab.se>,
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <199704221106.NAA15049@dkuug.dk>
Message-Id: <Pine.SUN.3.96.970422164228.245X-100000@enoshima>

On Tue, 22 Apr 1997, Keld Jørn Simonsen wrote:

> "Martin J. Duerst" writes:

> > In particular, the "FORM-UTF8: Yes" I proposed is very similar
> > to your proposal. To be able to label arbitrary "charset"s is
> > an extension, but I don't think it is needed at this stage of
> > ISO 10646 and Internet development. The way I put it usually
> > is that currently, we have "chaos". There is no need to proceed
> > to "labeled chaos" when we can proceed to "order" directly.
> > The Universal Character Set really shows off its strength most
> > directly for short and widely used strings such as URLs.
> 
> My "URL-Charset:" header also goes along the "labelled chaos" that
> we already have with HTML,

Yes, it is similar to what we have with HTML. But there are significant
differences in the properties of HTML and URLs that suggest that using
different approaches might be a good idea:

- Length: HTML is much longer than URLs, and tagging is therefore
	less of a burden.

- Length again: HTML can benefit from using different "charset"s
	as a kind of "compression"; this is less of an issue for URLs.

- Round-trip vs. one way: URLs make a round trip from the originator
	and back to it, and they have to arrive there intact. HTML
	flows mostly downstream only, and never needs an exact
	match after many transformations.

- Transcription by paper: URLs are transcribed on paper. Adding
	a charset tag on paper is very clumsy (think about
	http:[us-ascii]//www.ibm.com printed in a newspaper).
	It may look like we don't need that tag, because
	the characters are all identified, yet if we want
	to use current URL software, which compares URLs
	using octet identity, we have to transform the
	characters back into the octets that they originated
	from.
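The octet-identity problem can be made concrete with a small sketch in modern Python (the codec names and the quote() helper are today's tools, used here purely as an illustration of the 1997 situation):

```python
from urllib.parse import quote

# The same character, "o with /" (U+00F8), percent-encoded from
# three different "charset"s, yields three different octet strings;
# software that compares URLs by octet identity sees them as
# three unrelated URLs.
ch = "\u00f8"
print(quote(ch.encode("mac_roman")))  # -> %BF
print(quote(ch.encode("latin-1")))    # -> %F8
print(quote(ch.encode("utf-8")))      # -> %C3%B8
```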

> and then the coding of URLs in
> anchors etc in the HTML markup. The natural thing there is that URLs
> are encoded in the charset of the HTML document. So a request
> for the URL would then have a header with the URL and then the
> "URL-charset" of the HTML document. Straightforward. And we could
> use equivalent mechanisms whether the URL was typed in or came from
> a HTML document.

This is indeed true, and part of our proposal. But this solves
only part of the problem, namely the question: What characters
do the octets you are currently manipulating in your computer
actually represent? So for example if I type an "o" with a "/"
on a Mac, it will be represented as 0xBF internally, and it is
(implicitly and naturally) tagged Mac-Roman. And if I cut-copy-
paste that character into another document, it will keep its
identity, but because it is in a web page editor, it might
change its representation, e.g. to 0xF8, and be (implicitly
and naturally) tagged Latin-1. Printed on paper, it will
still be an "o with /", but it doesn't need any tagging for
representation, because the tagging is automatically and
implicitly added when it is input again.
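The cut-copy-paste example above can be sketched in Python (the codec names are modern Python names, assumed here only for illustration):

```python
# "o with /" keeps its identity as a character, but its octet
# representation changes with the (implicit) charset tag.
ch = "\u00f8"                   # o with /
mac = ch.encode("mac_roman")    # 0xBF inside the Mac document
web = ch.encode("latin-1")      # 0xF8 after pasting into a Latin-1 page
assert mac == b"\xbf" and web == b"\xf8"

# Decoded with the correct implicit tag, both octet strings
# come back as the same character.
assert mac.decode("mac_roman") == web.decode("latin-1") == ch
```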

The problem here starts when this "o with /" is converted
to %HH, and when it is sent to a server. Now I am no
longer interested in the encoding in which I currently
keep the character (which is usually not too difficult
to know), but in the encoding that the server assumes
the character will arrive in. And if I don't take special
measures, I have absolutely no idea what that could
be. Now there are several possibilities:

1) Add another tag, this time explicit, that has to be
	carried around *all the time* and separately from
	the information that might be around implicitly.
	As I said above, this is very ugly, and no current
	software is prepared for it. Also, it introduces
	the problem that the browser (which
	strips the tag and converts) has to know about
	a large number of charsets, more than just
	those of the pages it usually displays and the
	data it otherwise handles. Knowledge that could
	be centralized has to be widely distributed.

2) Send the URL as is, together with "charset" information.
	The server would get URLs in all kinds of
	charsets, and would have to work out on its own
	how to convert them to the charset it is
	using. Also, we can't freely convert to
	%HH, because then we would need to add a tag
	saying which charset we used when converting to %HH.

3) Define a single encoding (this obviously is UTF-8).
	This means that when you see a URL with
	beyond-ASCII characters in it, you will know
	that to convert it to %HH and send it to the
	server, you have to use UTF-8. It's like
	the tag above, except that there is only
	one possibility, which therefore
	doesn't have to be specified.

4) Have a knowledge database about different protocols/
	schemes and the encodings they use (if they each use
	a single one). This makes it very clumsy to write
	general URL software with a nice interface.

5) Have a way to ask the server what charset it accepts.
	Again, this needs a new protocol; the tag, instead
	of making a round trip, is served by the server on
	demand. This gets difficult especially if you have
	various encodings in various areas of the same
	server. Also, the client needs to know about
	lots of encodings.
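Option 3) can be sketched in a few lines of Python (the path is hypothetical, and quote()/unquote() stand in for whatever %HH conversion the client and server actually perform):

```python
from urllib.parse import quote, unquote

# With UTF-8 as the single agreed encoding, the client converts
# to %HH without carrying any extra tag around ...
path = "/caf\u00e9/\u00f8l"          # hypothetical non-ASCII URL path
wire = quote(path.encode("utf-8"))   # -> /caf%C3%A9/%C3%B8l

# ... and the server only ever needs to know UTF-8 plus whatever
# it uses locally; the charset knowledge stays centralized.
assert unquote(wire, encoding="utf-8") == path
```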


> Also the responsibility of handling the character
> encoding incl conversion would be at the server side, which normally
> would be the "offender" allowing strange things like non-ASCII URLs.

Your proposal is probably very close to 2) above. I think it
would be possible to deploy it for HTTP, but it would put a
heavier burden on the server than UTF-8 would (where the server
just has to know UTF-8 and whatever it wants to use locally).
Also, it would still need a tag to be added when we convert to %HH.


Regards,	Martin.