Re: html, http, urls and internationalisation

Keld Jørn Simonsen (keld@dkuug.dk)
Wed, 31 Jan 1996 12:15:39 +0100


Message-Id: <199601311115.MAA11079@dkuug.dk>
From: keld@dkuug.dk (Keld Jørn Simonsen)
Date: Wed, 31 Jan 1996 12:15:39 +0100
In-Reply-To: Larry Masinter <masinter@parc.xerox.com>
To: Larry Masinter <masinter@parc.xerox.com>, borka@e5.ijs.si
Subject: Re: html, http, urls and internationalisation
Cc: yergeau@alis.ca, Dan.Oscarsson@malmo.trab.se, maits@dkuug.dk,

Larry Masinter writes:

> > What Keld said is sound and could be worked further. The major
> > restriction is the DNS part and this should be kept as it is
> > (character < 127). The same applies to the syntax characters.
> 
> No, "what Keld said" isn't "sound" it is just "sounds nice".

Glad you like the sound effects, Larry!

> Keld said, for example,
> 
> > 1. URLs themselves.
> 
> > These are at an abstract character level; as Larry and François
> > correctly point out, you cannot see what the charset is
> > when you look at a business card or a URL in the newspaper.
> 
> > I propose that any character here be allowed, except for the 
> > URL syntax characters, (things like < / : ) - in the non-DNS
> > part of the URL. Remember these are abstract characters, and
> > there is no binding to for example ISO 10646 in the sense
> > of a character repertoire, or to any encoding (charset).
> 
> However, this nice-sounding proposal contained no solution to the
> following questions:
> 
> 1) how do these abstract characters subsequently get turned
>   into octets that are employed in real protocols in general
>   and http and ftp in particular?
>   (The current URL specification gives an algorithm.)

From glyphs on paper to a computer system, e.g. a browser:
by having the human recognise (aka "read") the characters and enter
them, as is normally done.

From an HTML doc into an HTTP request: The HTML doc has a
charset, and the HTTP request URL is represented in a charset.
So the HTML string with the URL is converted into the HTTP
charset, and then the URL is sent with high-bit octets encoded according
to the URL specifications (in %xx notation). I found no way
of specifying a charset in the current RFCs on URLs.
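
A minimal sketch of that conversion, in Python, might look like the
following. The function name url_for_request and both charset
parameters are hypothetical, chosen only for illustration; nothing in
the URL RFCs names them. It also assumes the request charset is a
superset of ASCII, so that the URL syntax characters keep their usual
octet values.

```python
def url_for_request(raw: bytes, doc_charset: str, request_charset: str) -> str:
    """Convert a URL string from the document charset to the request
    charset, then write every octet with the high bit set as %xx.

    Hypothetical helper: assumes request_charset is ASCII-compatible.
    """
    # Octets in the document charset -> abstract characters.
    text = raw.decode(doc_charset)
    # Abstract characters -> octets in the request charset.
    octets = text.encode(request_charset)
    # Octets below 0x80 pass through unchanged; the rest become %xx.
    return "".join(chr(b) if b < 0x80 else "%{:02X}".format(b) for b in octets)


# The same ISO 8859-1 document string, sent in two different request charsets.
latin1_url = b"/sm\xf8rrebr\xf8d.html"
print(url_for_request(latin1_url, "iso-8859-1", "iso-8859-1"))
print(url_for_request(latin1_url, "iso-8859-1", "utf-8"))
```

The two calls give different %xx sequences for the same abstract
characters, which is why the missing charset label matters.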

I did specify the transformations and encodings in an earlier mail.
> 
> 2) how does one translate a URL that uses a large character
>   repertoire so that it might be written in a context with 
>   a small repertoire? E.g., a URL with Chinese characters
>   in an ASCII email message.
>   (The current URL specification manages this by limiting
>   the repertoire.)

That was also described in the previous mailing. About HTML I said:

> >Here it should be possible to write a HTML document in a given
> >charset, and then reference the (abstract) characters in the URL, just
> >like it is possible to write characters in the rest of the HTML document.
> >That is, the normal characters of the document charset can be used,
> >like full iso-8859-1 in normal HTML docs, and full Unicode in 
> >Unicode docs. Also the way of generating out-of-band characters
> >should be allowed in HTML URL strings, like &aring; and &#xxxx;
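
As an illustration only, here is a Python sketch of how a URL written
with character references in an ASCII-only context could resolve to the
same abstract characters as one written directly in ISO 8859-1. The
example path is made up; html.unescape is the standard library routine
for resolving named and numeric references.

```python
import html

# URL string as it might appear in a pure ASCII HTML document,
# using a named reference and a hexadecimal numeric reference.
ascii_form = "/bl&aring;b&#xE6;r.html"

# The same URL written directly in an ISO 8859-1 document.
latin1_form = b"/bl\xe5b\xe6r.html"

# Both resolve to the same sequence of abstract characters.
from_references = html.unescape(ascii_form)
from_charset = latin1_form.decode("iso-8859-1")

print(from_references == from_charset)
```

The point of the sketch is only that the small-repertoire form loses no
information; how the result is then put on the wire is a separate step.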

> I don't think these problems are unsolvable, but I think in the course
> of making a "sound" proposal you'll find that it starts "sounding"
> less and less like something that you'd want to implement.

I think most of the concerns have been addressed in what I wrote,
but there may still be finer details in it that need to be sharpened,
and it needs to be cast in concrete specs.

I think most of the specs are already there and ready to be employed
in an implementation.

> So, I'll ask again, PLEASE stop cross-posting this discussion to three
> separate mailing lists.

OK, taken ad notam.

Keld