Comments on draft-masinter-url-i18n-03 from Dan Oscarsson on 1998-09-04 (uri@w3.org from September 1998)

From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Date: Fri, 4 Sep 1998 11:45:04 +0200 (MET DST)
To: duerst@w3.org, masinter@parc.xerox.com
Cc: uri@Bunyip.Com
Message-Id: <199809040945.LAA07002@valinor.malmo.trab.se>
Hi

Nice to see some discussion on non-ASCII URIs.

A few comments on the discussion and some on the draft.


I agree that matching accented with unaccented characters is unwise.
As was pointed out: the Swedish "ö" cannot be matched to an "o"
because in Swedish they are two different letters. "ö" is not an
accented letter in Swedish even though it looks like one to
English-speaking people. So URIs for Swedish documents should not
be matched using accented/unaccented matching.
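
A minimal sketch of why such matching goes wrong, assuming the matcher
folds accents by Unicode decomposition (one common approach, not
necessarily what the draft or any particular server does). The helper
name strip_accents and the two file names are made up for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two different names, as a Swedish reader sees them.
name_a = "för.html"
name_b = "for.html"

# Compared as written, the names stay distinct.
names_differ = name_a != name_b

# After accent folding they collide, so one document can shadow the other.
names_collide = strip_accents(name_a) == strip_accents(name_b)
```

With folding applied, a request for one name could silently return the
other document, which is the failure described above.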

There has been talk about the encoding of URIs in different contexts.
I can see three very important cases: in a protocol, embedded in a document
and when used by humans.
In a protocol)
Here it is very good if everybody uses the same encoding
of the characters. ISO 10646 encoded as UTF-8 is fine.
But for some time, software must still be able to receive URIs encoded
in the old ways (like EBCDIC, ISO 8859-1) that work today and are sent
by existing software.

Embedded in a document)
Here the URI needs to either use %-encodings or the same character set as
the document. Unless the entire document is in UTF-8, the URIs cannot be
in UTF-8.
For example: at my site all documents are in ISO 8859-1, and URIs are in
ISO 8859-1. The users who edit HTML documents cannot be expected to enter
anything other than their normal characters. It would be very user-unfriendly
to require them to enter either %-encodings or UTF-8 encodings in URIs.
And software handling the document, like a web browser, must extract
the URIs using the character set of the document and internally handle them
in a way that allows it to re-encode the URI correctly so that it can use
UTF-8 when sending it over a protocol.
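
The re-encoding step can be sketched roughly as follows. This is only an
illustration: the function name uri_path_for_wire and the sample path are
assumptions, and it %-encodes the UTF-8 bytes, which is one possible wire
form among those discussed here.

```python
from urllib.parse import quote


def uri_path_for_wire(raw_path: bytes, document_charset: str) -> str:
    """Decode a URI path using the document's charset, then emit
    %-encoded UTF-8 for use in a protocol (hypothetical helper)."""
    # Step 1: interpret the bytes the way the document author typed them.
    text = raw_path.decode(document_charset)
    # Step 2: re-encode as UTF-8 and %-encode every non-ASCII byte.
    return quote(text, safe="/", encoding="utf-8")


# A path as it would sit inside an ISO 8859-1 HTML document.
path_in_document = "/dokument/smörgås.html".encode("iso-8859-1")
wire_path = uri_path_for_wire(path_in_document, "iso-8859-1")
# wire_path is "/dokument/sm%C3%B6rg%C3%A5s.html"
```

The author of the document never sees the %-encoded form; only the
software does.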

Used by humans)
URIs used by humans will be user-friendly if the user can enter, view,
type, or print on paper the URI in their own native language.
This requires that in human interaction %-encodings or UTF-8 encodings
should not be used if the code values represent characters in the
user's alphabet.

-
Some comments to the draft:

I am missing some text about how URIs should be printed on paper.
Just like a URI can be viewed without the encodings being shown, a
URI on paper can also be in that format. An i18n URI need not be
%-encoded on paper!

Part 3.4 Display of URIs
section b)
I recommend that it should say SHOULD instead of MAY. That is, if possible,
a URI SHOULD be displayed to a human in an unencoded way so they can
easily read it. And the human shall be allowed to enter it
as is natural to do; it is the software that (without showing the user)
re-encodes the URI into the encoding used by transport.
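
A rough sketch of that division of labour, assuming the transport form is
%-encoded UTF-8. The names for_display and from_user_input are invented
for this example. When the escapes are not valid UTF-8, the sketch leaves
the URI as it is rather than guessing at a character set.

```python
from urllib.parse import quote, unquote


def for_display(wire_uri: str) -> str:
    """Show the unencoded form when the escapes decode as UTF-8."""
    try:
        return unquote(wire_uri, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8 (for example legacy ISO 8859-1 escapes):
        # keep the encoded form instead of showing wrong characters.
        return wire_uri


def from_user_input(typed_uri: str) -> str:
    """Re-encode what the user typed into the transport form."""
    return quote(typed_uri, safe="/", encoding="utf-8")


shown = for_display("/sm%C3%B6rg%C3%A5s.html")    # readable form
sent = from_user_input(shown)                     # back to the wire form
untouched = for_display("/sm%F6rg%E5s.html")      # legacy escapes stay as-is
```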

Part 3.5 Interpretation of URIs
As said above, it is more robust to allow case-insensitive matching,
but not matching what might look like an accented character with an
unaccented one.

Part 4.2.1 URIs within HTTP
As the HTTP protocol is an 8-bit and not a 7-bit protocol, it should
generally be expected that URIs can be sent using 8-bit bytes.
So software should expect that it does not have to %-encode 8-bit
bytes to work over HTTP. Servers that cannot handle 8-bit bytes
probably cannot handle URIs containing anything but 7-bit code values, so
difficulties should very seldom occur.

Part 4.2.2 URIs within HTML and XML
As I said above, if a non-ASCII URI is used within a document, it
should use the same character set as the rest of the document.
Neither %-encoding nor UTF-8 encoding should be needed.
This is because documents are written by normal humans and they cannot be
expected to encode URIs before typing them in. They must be able to type
the URI into the document the same way they type it into their web
browser.

--
Handling URIs in a web server as given in the draft is not very difficult.
The web server I use handles many of the cases. Incoming URLs can
be handled if they are %-encoded, UTF-8 encoded, ISO 8859-1 encoded and,
for a restricted subset, MacOS encoded. This is to be able
to handle all the encodings that web browsers at my place send when
non-ASCII URLs are used.
The file system has all file names in ISO 8859-1 and the web server
translates all incoming URLs into ISO 8859-1. It also ignores case so that
users can use the case they want when creating their documents and
one can still give the URL over the phone without telling which characters
are in upper and lower case.
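
The server-side handling can be approximated like this. It is a sketch
under stated assumptions, not the actual server code: the function name
to_filesystem_name is hypothetical, the MacOS case is left out, and "try
UTF-8 first, then fall back to ISO 8859-1" is a heuristic that will
misread the rare ISO 8859-1 byte sequence that also happens to be valid
UTF-8.

```python
from urllib.parse import unquote_to_bytes


def to_filesystem_name(request_path: bytes) -> bytes:
    """Map an incoming URL path to a lower-case ISO 8859-1 file name."""
    # Undo any %-encoding; raw 8-bit bytes pass through unchanged.
    raw = unquote_to_bytes(request_path)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy ISO 8859-1 encoding.
        text = raw.decode("iso-8859-1")
    # lower() keeps ISO 8859-1 letters inside ISO 8859-1. The encode
    # step raises UnicodeEncodeError for characters outside that set.
    return text.lower().encode("iso-8859-1")


expected = "/smörgås.html".encode("iso-8859-1")
from_utf8_escapes = to_filesystem_name(b"/Sm%C3%B6rg%C3%A5s.html")
from_latin1_escapes = to_filesystem_name(b"/sm%F6rg%E5s.html")
from_raw_latin1 = to_filesystem_name(b"/sm\xf6rg\xe5s.HTML")
```

All three requests resolve to the same file name, whichever encoding and
case the browser happened to send.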

What I am now missing is web browsers that allow entry and display
of URLs in a user-friendly way. UTF-8 URLs look bad in all browsers.
ISO 8859-1 URLs look fine on Unix and MS Windows, but on the Mac they
display incorrectly and are transmitted using the MacOS character set.

I hope the draft, when it is ready, will get browser companies to fix their
software so it works for non-ASCII URLs.

    Dan
--
Dan Oscarsson
Telia Prosoft AB                       Email: Dan.Oscarsson@trab.se
Box 85
201 20  Malmo, Sweden
Received on Friday, 4 September 1998 05:47:18 UTC