Re: revised "generic syntax" internet draft from Dan Oscarsson on 1997-04-15 (uri@w3.org from April 1997)

From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Date: Tue, 15 Apr 1997 15:50:11 +0200 (MET DST)
To: uri@bunyip.com, fielding@kiwi.ICS.UCI.EDU
Cc: Harald.T.Alvestrand@uninett.no
Message-Id: <199704151350.PAA20358@valinor.malmo.trab.se>
> PROBLEM 1:  Users in network environments where non-ASCII characters
>             are the norm would prefer to use language-specific characters
>             in their URLs, rather than ASCII translations.
> 
> Proposal 1a: Do not allow such characters, since the URL is an address
>              and not a user-friendly string.  Obviously, this solution
>              causes non-Latin character users to suffer more than people
>              who normally use Latin characters, but is known to interoperate
>              on all Internet systems.
Well, Swedish letters like åäö are normally called Latin, but I assume you
mean ASCII.
This proposal is NOT acceptable. It is very important that URLs are
user-friendly.
It will also make Java impossible to use over the web!
Java allows non-ASCII characters in variable and type names (at least one
language that is international!), and this means that it must be possible
to fetch Java classes with non-ASCII names over the web if Java is going
to work.

> 
> Proposal 1b: Allow such characters, provided that they are encoded using
>              a charset which is a superset of ASCII.  Clients may display
>              such URLs in the same charset of their retrieval context,
>              in the data-entry charset of a user's dialog, as %xx encoded
>              bytes, or in the specific charset defined for a particular
>              URL scheme (if that is the case).  Authors must be aware that
>              their URL will not be widely accessible, and may not be safely
>              transportable via 7-bit protocols, but that is a reasonable
>              trade-off that only the author can decide.
If the URL is %xx encoded, it works over 7-bit transports.
The URL is also accessible from any place where ASCII can be used, since
every URL can be written with the %xx encoding, which gives an ASCII-only
form of the URL.
I dislike this proposal because the non-ASCII characters use an undefined
character set, so a client cannot know how to interpret the characters
and display them correctly in the local character set.
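
To make both points concrete, here is a minimal sketch in Python. The
function name to_ascii_url_path and the example path are made up for
illustration and come from no draft. It shows that %xx encoding always
yields a pure ASCII string, but that the same characters give different
bytes depending on which character set was assumed, which the receiver
cannot tell from the URL itself.

```python
from urllib.parse import quote

def to_ascii_url_path(path: str, charset: str) -> str:
    """Hypothetical helper: %xx-encode a path after converting it to
    bytes in the given charset. '/' is left unescaped."""
    return quote(path.encode(charset), safe="/")

# The same Swedish word, encoded under two different charset assumptions.
latin1_form = to_ascii_url_path("/räksmörgås", "iso-8859-1")
utf8_form = to_ascii_url_path("/räksmörgås", "utf-8")

print(latin1_form)  # /r%E4ksm%F6rg%E5s
print(utf8_form)    # /r%C3%A4ksm%C3%B6rg%C3%A5s
```

Both results survive a 7-bit transport, but a client that receives only
the %xx form has no way of knowing which of the two readings was meant.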

> 
> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
>              Clients may only display such characters if they have a
>              UTF-8 font or a translation table.  Servers are required to
>              filter all generated URLs through a translation table, even
>              when none of their URLs use non-Latin characters.  Browsers
>              are required to translate all FORM-based GET request data
>              to UTF-8, even when the browser is incapable of using UTF-8
>              for data entry. 

> raw bits.  The server would be required to interpret all URL characters
> as characters, rather than the current situation in which the server's
> namespace is distributed amongst its interpreting components, each of which
> may have its own charset (or no charset).  Even if we were to make such
> a change, it would be a disaster since we would have to find a way to
> distinguish between clients that send UTF-8 encoded URLs and all of those
> currently in existence that send the same charset as is used by the HTML
> (or other media type) page in which the FORM was obtained and entered
> by the user.

I think you are missing one important thing. The UTF-8 encoded URL is a
transport format; if a URL is embedded within an ISO 8859-1 encoded
HTML document, the URL is encoded using ISO 8859-1. A URL should only
be encoded using a well-defined character set like UTF-8 when it is
transmitted in a protocol that says that a URL is part of the protocol.
When a URL is embedded in something else, like an HTML document, printed
on paper, or displayed on a screen, the URL should be encoded using the
same character set as the object it is embedded in. As a browser knows
the character set used in an HTML document, it can easily translate the
URL from, for example, ISO 8859-1 to UTF-8 for transmission in the
protocol. When a URL is sent embedded in an HTML document (or in a form),
it should use the same encoding as the document or the form. So I do not
think the problem is as great as you said: most CGI scripts can work as
before, since internal URLs are in the same character set as the
document generated. But some of the libraries used by CGI scripts for
decoding incoming URLs will have to be changed, and many products must
learn to separate the local (native) character set from the transport
character set. Many already partially do that; for example, Netscape
for Mac uses the Macintosh character set for display of HTML documents
even though the transport format of the document is ISO 8859-1.
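
The separation between local and transport character sets can be
sketched as follows in Python. This is only an illustration under my own
assumptions: the function names to_transport and to_local and the
example path are hypothetical, and only the path part of a URL is
handled.

```python
from urllib.parse import quote, unquote_to_bytes

def to_transport(url_bytes: bytes, document_charset: str) -> str:
    """Hypothetical client side: take a URL as found in a document,
    interpret it in the document's charset, and produce the UTF-8
    based, %xx-encoded form sent in the protocol."""
    text = url_bytes.decode(document_charset)
    return quote(text.encode("utf-8"), safe="/")

def to_local(transport_url: str, local_charset: str) -> bytes:
    """Hypothetical receiving side: undo the %xx encoding, read the
    bytes as UTF-8, and convert to the local (native) charset."""
    text = unquote_to_bytes(transport_url).decode("utf-8")
    return text.encode(local_charset)

# A URL path as it appears inside an ISO 8859-1 encoded HTML document.
embedded = "/blåbär".encode("iso-8859-1")

wire = to_transport(embedded, "iso-8859-1")
print(wire)  # /bl%C3%A5b%C3%A4r

# A receiver using the Macintosh character set gets the same characters,
# although its local bytes differ from the ISO 8859-1 ones.
mac_bytes = to_local(wire, "mac_roman")
print(mac_bytes.decode("mac_roman"))  # /blåbär
```

Only the code that touches the protocol needs the two conversions;
everything that works inside one document or one system keeps using its
own character set.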

-
If we cannot find a way to send URLs containing any character in such a
way that the characters can be understood and displayed in a user-friendly
manner, the web and URLs are not the future.

    Dan
    
--
Dan Oscarsson
Telia Engineering AB                       Email: Dan.Oscarsson@trab.se
Box 85
201 20  Malmo, Sweden
Received on Tuesday, 15 April 1997 09:51:25 UTC