- From: Peter Svanberg <psv@nada.kth.se>
- Date: Sun, 6 Aug 1995 16:05:30 +0200 (MET DST)
- To: Jon Knight <J.P.Knight@lut.ac.uk>
- Cc: Gavin Nicol <gtn@ebt.com>, masinter@parc.xerox.com, glenn@stonehand.com, html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, Olle Jarnefors <ojarnef@admin.kth.se>, Rickard Schoultz <schoultz@sunet.se>
On Thu, 3 Aug 1995, Jon Knight wrote: > On Thu, 3 Aug 1995, Gavin Nicol wrote: > However, I'd rather see something tacked onto the _end_ of the URL rather > than at the start of the opaque data section. Maybe something like: > > http://www.jacme.co.jp/%B0%F5%BA%FE.html;charset=EUC > > It just seems more inkeeping with other things I've seen suggested in the > past. I can't refrain from reminding you about what we -- three IETF:ers from Sweden -- suggested on this subject already in May 1993 (and which nobody took any notice of...): ------ Date: Tue, 11 May 93 10:36:58 +0200 Message-Id: <9305110836.AA06718@mercutio.admin.kth.se> From: Rickard Schoultz <schoultz@othello.admin.kth.se>, Olle Jarnefors <ojarnef@admin.kth.se>, Peter Svanberg <psv@nada.kth.se> Sender: Olle Jarnefors <ojarnef@admin.kth.se> To: uri@bunyip.com In-Reply-To: <9305080145.AA05811@wilma.cs.utk.edu> (Fri, 07 May 1993 21:45:02 -0400, Keith Moore <moore@cs.utk.edu>) Subject: Re: Wrappers for URLs : : As has already been pointed out in this discussion, non-ASCII characters are used a lot in many countries outside USA. In a language using the Latin script like Swedish almost 1/3 of all words contain non-ASCII letters. In languages such as Greek, Russian, Hindi and Chinese no ASCII letters at all are used. For ordinary users in these countries it will be unacceptable to see their everyday letters represented by %-headed hexadecimal digit sequences. We propose this solution: C) In the human form of URLs non-ASCII characters may be used provided a character set indicator is added to the URL immediately before the closing ">". This indicator shall have the syntax "%:" charset where "charset" is a value registered by IANA for MIME use. The corresponding coded character set defines the mapping of the non-ASCII character to a sequence of octets that can be represented by the %-mechanism in the program form of the URL. Take as an example a file called l<a">s mig in the directory pub at host othello.admin.kth.se. Here <a"> is the character LATIN SMALL LETTER A WITH DIAERESIS, coded by the octet E4 hex according to ISO-8859-1. (The name of the file is Swedish for "read me".) The human form URL for this file preferred in Sweden would be <ftp://othello.admin.kth.se/pub/l<a">s mig%:iso-8859-1> <a"> here would in reality be the non-ASCII character. The program form would be <ftp://othello.admin.kth.se/pub/l%e4s mig> Why is it necessary to include a character set indicator in these extended URLs containing non-ASCII characters? It's because that makes the URL resistent to character-preserving conversion between different coded character sets. Say that the URL in the example is included in a text file coded with ISO-8859-1, so the non-ASCII character is represented by the octet E4. Then this file is transferred to a Mac and therefore converted to the Macintosh character set. For the Mac user it will look exactly as intended, containing a lowecase a with diaeresis (which is necessary to form the Swedish expression for "read me"). In the Mac file, however, this letter will be represented by 8A instead of E4. Thanks to the character set indicator %:iso-8859-1 a client program on the Mac will however be able to feed the right octet E4 to the FTP program to fetch the correct file from the FTP server (where ISO-8859-1 is used in file names). It could be argued that occasional non-ASCII letters is nothing to make a fuss about: European users can be taught to read and input %-sequences instead. But consider a Greek FTP server, where almost the whole path of a file is written with Greek letters using ISO-8859-7. In that case URLs will be almost three times as long and consist of mostly a soup of "%" and hexadecimal digits, interspersed with "/". Such URLs will be unusable for humans, unless some way of using the real non-ASCII letters is provided. Another points in connection with internationalized URLs that we would like to raise: D) The hexadecimal %-headed representation used in the program form is very inefficient. In countries with languages using other scripts than the Latin URLs may be almost three times as long as in English-speaking countries. To reduce this unfairness we could, in addition to the %-representation. include a &-representation: After "&" would follow a sequence of octets encoded by a BASE64-like method into a 33 % (instead of 300 %) longer sequence of the characters A-Z, a-z, 0-9, + and -. This sequence would be ended by a second "&". -- Rickard Schoultz schoultz@admin.kth.se SUNET/KTH +46-8-790 90 88 (voice) S-100 44 Stockholm (SWEDEN) +46-8-10 25 10 (fax) Olle Jarnefors Internet: ojarnef@admin.kth.se Information Management Services UUCP: ...!uunet!mcsun!sunic!kth!ojarnef Royal Institute of Technology (KTH) BITNET: ojarnef@sekth Fax:+46 8 10 25 10 S-100 44 Stockholm, Sweden Phone: +46 8 790 71 26 (time zone +0200) Peter Svanberg, NADA, KTH Email: psv@nada.kth.se Dept of Num An & CS, Royal Inst of Tech Phone: +46 8 790 71 40 S-100 44 Stockholm, SWEDEN Fax: +46 8 790 09 30
Received on Sunday, 6 August 1995 07:06:56 UTC