Re: html, http, urls and internationalisation

>From: Larry Masinter <masinter@parc.xerox.com>
>Date: Sun, 28 Jan 1996 13:09:47 PST
>
>Sigh, it's really frustrating to have talked this out so many times in
>the URI mailing list only to have the same discussion again now in two
>other working group mailing lists.

Perhaps this is a hint that there is a user requirement that hasn't
been met?  Anyway, the URI-WG is dead.

>URLs are written with characters, not octets. The characters in a URL
>are used to represent octets, not characters.

Technically, yes.  And octets above 127 are stuck with the nice "%XX"
representation, because there is no agreement on what characters
should represent them.  Which is at the heart of the issue at hand:
URLs are used to name and locate Internet resources, and only those
whose language can be represented by ASCII (English and Swahili) can
have meaningful (to them) names.
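
To make the ambiguity concrete, here is a minimal Python sketch (not
from the original discussion; the names `escaped`, `octet` and
`readings` are just illustrative) showing that one escaped octet above
127 reads as a different character under each of three charsets:

```python
from urllib.parse import unquote_to_bytes

# "%E9" in a URL stands for the single octet 0xE9, nothing more.
escaped = "%E9"
octet = unquote_to_bytes(escaped)

# The same octet decodes to a different character in each charset.
readings = {cs: octet.decode(cs) for cs in ("iso8859_1", "koi8_r", "iso8859_5")}

for charset, char in readings.items():
    # Print code points rather than glyphs so the output stays ASCII.
    print(f"{escaped} read as {charset}: U+{ord(char):04X}")
```

The URL itself carries nothing that says which of these readings was
intended.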

> URL: sequence of characters  
> URL interpretation:
>   parse URL, extract sequences of octets, send octets
>   to appropriate protocol based on scheme

Fits nicely with the UTF-8 idea: get a sequence of chars, extract the
corresponding sequence of octets according to UTF-8 transformation of
Unicode character numbers, and transmit that.
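
As a rough sketch of that pipeline, assuming Python's standard
urllib.parse and two hypothetical helper names (`to_url_form`,
`from_url_form`) that are mine and not part of any proposal:

```python
from urllib.parse import quote, unquote

def to_url_form(text):
    """Characters -> UTF-8 octets -> %XX escapes for the non-ASCII octets."""
    octets = text.encode("utf-8")
    # "/" is left unescaped so that path structure survives.
    return quote(octets, safe="/")

def from_url_form(escaped):
    """%XX escapes -> octets -> characters, assuming the octets are UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")

# U+00E9 (e with acute accent) becomes the two octets C3 A9.
print(to_url_form("/caf\u00e9"))
```

Because the mapping is fixed, the round trip does not depend on the
charset of whatever document the URL was copied from.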

>For those situations
>where URLs are embedded in other documents, that embedding should use
>the charset of the containing document.

Doesn't work.  If I pick a URL containing Cyrillic letters from a KOI8
document and retype it in an ISO-8859-5 document, keeping the
*characters* constant, the octets will change and the link won't work
anymore.  That is, unless the charset is identified in what goes to
the server, or there is an agreed-upon mapping from characters to
octets for Cyrillic characters in URLs.
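
A small Python illustration of the problem, taking KOI8-R as the KOI8
variant and an arbitrary three-letter Cyrillic word as the example
(both are my assumptions for the sake of the sketch):

```python
from urllib.parse import quote

# The Cyrillic word "mir" (U+043C U+0438 U+0440), written with escapes
# so that this source text stays ASCII.
word = "\u043c\u0438\u0440"

# Same characters, three different octet sequences.
koi8 = quote(word.encode("koi8_r"))
iso = quote(word.encode("iso8859_5"))
utf8 = quote(word.encode("utf-8"))

print("KOI8-R:    ", koi8)
print("ISO-8859-5:", iso)
print("UTF-8:     ", utf8)
```

A server that stored the resource under the KOI8-R octets will not
recognise the ISO-8859-5 ones, although a reader sees the same letters
in both documents.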

>The repertoire of characters
>allowed within URLs is intentionally restricted to allow such
>embedding in almost all contexts.

Yes, ASCII, which is insufficient and is why this debate comes back
again and again.

Regards,

-- 
Francois Yergeau

Received on Sunday, 28 January 1996 20:27:56 UTC