- From: Dan Oscarsson <Dan.Oscarsson@trab.se>
- Date: Fri, 4 Sep 1998 11:45:04 +0200 (MET DST)
- To: duerst@w3.org, masinter@parc.xerox.com
- Cc: uri@Bunyip.Com
Hi,

Nice to see some discussion on non-ASCII URIs. A few comments on the discussion and some on the draft.

I agree that matching accented with unaccented characters is unwise. As was pointed out, the Swedish "ö" cannot be matched to an "o" because in Swedish they are two different letters. "ö" is not an accented letter in Swedish, even though it looks like one to English-speaking people. So URIs for Swedish documents should not be matched using accented/unaccented matching.

There has been talk about encodings of URIs in different contexts. I can see three very important cases: in a protocol, embedded in a document, and when used by humans.

In a protocol) Here it is very good if everybody uses the same encoding of the characters. ISO 10646 encoded as UTF-8 is fine. But for some time software must still be able to receive URIs encoded in the old ways (like EBCDIC or ISO 8859-1) that work today and are sent by existing software.

Embedded in a document) Here the URI needs to use either %-encodings or the same character set as the document. Unless the entire document is in UTF-8, the URIs cannot be in UTF-8. For example, at my site all documents are in ISO 8859-1, so URIs are in ISO 8859-1. The users who edit HTML documents cannot be expected to enter anything other than their normal characters. It would be very user-unfriendly to require them to enter either %-encodings or UTF-8 encodings in URIs. And software handling the document, like a web browser, must extract the URIs using the character set of the document and handle them internally in a way that allows it to re-encode the URI correctly, so that it can use UTF-8 when sending it over a protocol.

Used by humans) URIs used by humans will be user-friendly if the user can enter, view, type, or print on paper the URI in their own native language. This requires that %-encodings or UTF-8 encodings not be used in human interaction when the code values represent characters in the user's alphabet.
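To make the three cases concrete, here is a minimal sketch in Python of how software might move one URI between them. It assumes the document is in ISO 8859-1 and that the protocol form is UTF-8 with %-encoding; the function names and the host www.example.se are invented for illustration and are not taken from the draft.

```python
from urllib.parse import quote, unquote

# Characters left untouched when re-encoding: URI delimiters plus "%",
# so that escapes already present in the document are not encoded twice.
KEEP = "/:?#[]@!$&'()*+,;=%"


def uri_for_protocol(raw: bytes, document_charset: str) -> str:
    """Take URI bytes as found in a document and return the protocol form."""
    text = raw.decode(document_charset)           # document charset -> characters
    return quote(text, safe=KEEP, encoding="utf-8")  # characters -> %-encoded UTF-8


def uri_for_display(wire: str) -> str:
    """Turn the protocol form back into readable characters for a human."""
    return unquote(wire, encoding="utf-8", errors="replace")


# Bytes as they would sit in an ISO 8859-1 HTML file (0xF6 is "ö", 0xE5 is "å").
in_document = b"http://www.example.se/sj\xf6/b\xe5t.html"
on_the_wire = uri_for_protocol(in_document, "iso-8859-1")
print(on_the_wire)
print(uri_for_display(on_the_wire))
```

The user only ever sees and types the readable form; the %-encoded UTF-8 form exists only between programs.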
Some comments on the draft:

I am missing some text about how URIs should be printed on paper. Just as a URI can be viewed without the encodings being shown, a URI on paper can also be in that format. An i18n URI need not be %-encoded on paper!

Part 3.4 Display of URIs, section b)
I recommend that it say SHOULD instead of MAY. That is, if possible, a URI SHOULD be displayed to a human in an unencoded way so that they can easily read it. And the human shall be allowed to enter URIs in whatever way is natural; it is the software that (without showing the user) re-encodes the URI into the encoding used for transport.

Part 3.5 Interpretation of URIs
As said above, it is more robust to allow case-insensitive matching, but not to match what might look like an accented character with an unaccented one.

Part 4.2.1 URIs within HTTP
As HTTP is an 8-bit and not a 7-bit protocol, it should generally be expected that URIs can be sent using 8-bit bytes. So software should not have to %-encode 8-bit bytes to work over HTTP. Servers that cannot handle 8-bit bytes probably cannot handle URIs containing anything but 7-bit code values, so difficulties should occur very seldom.

Part 4.2.2 URIs within HTML and XML
As I said above, if a non-ASCII URI is used within a document, it should use the same character set as the rest of the document. Neither %-encoding nor UTF-8 encoding should be needed. This is because documents are written by normal humans, and they cannot be expected to encode URIs before typing them in. They must be able to type the URI into the document the same way they type it into their web browser.

--

Handling URIs in a web server as given in the draft is not very difficult. The web server I use handles many of the cases. Incoming URLs can be handled if they are %-encoded, UTF-8 encoded, ISO 8859-1 encoded and, for a restricted subset, MacOS encoded.
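As an illustration only (this is not the code of the server I use), a server could accept the first three of these forms roughly as sketched below in Python. The sketch assumes that anything which is not valid UTF-8 is ISO 8859-1, leaves out the MacOS case entirely, and uses invented function names. The second function shows case-insensitive matching that does not fold accents, as suggested for Part 3.5.

```python
from urllib.parse import unquote_to_bytes


def decode_request_path(raw: bytes) -> str:
    """Return the characters of an incoming path, whatever form it arrived in."""
    octets = unquote_to_bytes(raw)        # undo %-encoding, if there is any
    try:
        return octets.decode("utf-8")     # the preferred form on the wire
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid UTF-8 are ISO 8859-1.
        return octets.decode("iso-8859-1")


def paths_match(a: str, b: str) -> bool:
    """Compare ignoring case, but keep "ö" and "o" as different letters."""
    return a.casefold() == b.casefold()


# The same path sent four different ways (0xF6 is "ö" in ISO 8859-1).
for incoming in (b"/sj%C3%B6.html", b"/sj\xc3\xb6.html",
                 b"/sj%F6.html", b"/sj\xf6.html"):
    print(decode_request_path(incoming))
```

The guess can be wrong for an ISO 8859-1 path whose bytes happen to form valid UTF-8, but such sequences are rare in real file names.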
This is needed to handle all the encodings that web browsers at my site send when non-ASCII URLs are used. The file system has all file names in ISO 8859-1, and the web server translates all incoming URLs into ISO 8859-1. It also ignores case, so that users can use whatever case they want when creating their documents and you can still give the URL over the phone without saying which characters are upper and lower case.

What I am now missing is web browsers that allow entry and display of URLs in a user-friendly way. UTF-8 URLs look bad in all browsers; ISO 8859-1 URLs look fine on Unix and MS Windows, but on the Mac they display incorrectly and are transmitted using the MacOS character set.

I hope the draft, when it is ready, will get browser companies to fix their software so it works for non-ASCII URLs.

Dan

--
Dan Oscarsson
Telia Prosoft AB
Email: Dan.Oscarsson@trab.se
Box 85
201 20 Malmo, Sweden
Received on Friday, 4 September 1998 05:47:18 UTC