- From: Larry Masinter <masinter@parc.xerox.com>
- Date: Sun, 28 Jan 1996 13:09:47 PST
- To: keld@dkuug.dk
- Cc: Dan.Oscarsson@malmo.trab.se, html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, maits@dkuug.dk
Sigh, it's really frustrating to have talked this out so many times in the URI mailing list only to have the same discussion again now in two other working group mailing lists.

URLs are written with characters, not octets. The characters in a URL are used to represent octets, not characters. The characters "h", "t", "t", "p" etc. in http://foo.com/abcdefg are used to create separate octet strings 66 6f 6f 2e 63 6f 6d (foo.com) and 2f 61 62 63 64 65 66 67 (/abcdefg), which are then fed respectively to the http protocol as the DNS entry to which the connection is opened and the string in the GET.

To summarize:

URL: sequence of characters
URL interpretation: parse URL, extract sequences of octets, send octets to appropriate protocol based on scheme

In some protocols, those sequences of octets are then subsequently interpreted as representations of characters in a given character encoding. In some cases, the protocol makes no such interpretation, but some implementations of the protocols do.

> I would propose that URLs be written in the charset of the
> document that references the url,

This is exactly the situation. URLs are sequences of characters, and can be written in newspapers or on business cards (which, not being computer encodings, don't have a 'charset'). For those situations where URLs are embedded in other documents, that embedding should use the charset of the containing document. The repertoire of characters allowed within URLs is intentionally restricted to allow such embedding in almost all contexts.

> possibly enhanced with
> the extensions that we make to get further characters,
> for example &a-ring; or &#xxxx;

This is the part that's impossible. You might imagine doing such a thing, but it doesn't work if you then try to use URLs for the purpose for which they are functional.
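The "parse URL, extract sequences of octets" step summarized above can be sketched roughly as follows. This is only an illustration, not how any particular client does it: the helper name `url_octets` is made up for this example, and it assumes Python's standard `urllib.parse` module and a URL whose host is plain ASCII.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def url_octets(url):
    """Split a URL (a sequence of characters) into the octet strings
    that would be handed to the protocol named by the scheme.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # The host characters stand for the octets used in the DNS lookup.
    host = parts.hostname.encode("ascii")
    # The path characters stand for octets; a %XX escape stands for one octet.
    path = unquote_to_bytes(parts.path)
    return parts.scheme, host, path


scheme, host, path = url_octets("http://foo.com/abcdefg")
print(scheme)         # which protocol receives the octets
print(host.hex(" "))  # octets for the DNS name
print(path.hex(" "))  # octets for the string in the GET
```

Note that nothing here decides which characters, if any, the path octets represent; that is left to the protocol or the server implementation.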
Some folks want to deal with the variability of how particular implementations of HTTP or FTP might use sequences of octets to represent characters, and, in particular, the characters that appear before the local user behind the HTTP or FTP server. So, if you have an FTP or HTTP server that serves out files on your file server, and your file server uses Big5 or Unicode for the representation of file names, you have to choose an encoding of Big5 or Unicode as octets in order to deal with the FTP or HTTP protocols. It would be useful to standardize that encoding, because there are new HTTP implementations being delivered all the time, and even new FTP implementations. This is not an HTML issue, except for HTML forms that use Action=GET, which I already discussed in a previous message.
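To see why the server's choice of encoding matters, here is a small sketch in which the same two-character file name becomes different octets, and therefore different URL text, under two encodings. The file name is arbitrary, and UTF-8 is used as one assumed way of turning Unicode into octets; neither choice comes from any standard discussed here.

```python
from urllib.parse import quote, unquote_to_bytes

# An arbitrary two-character file name, written with escapes so this
# source file itself stays in the restricted ASCII repertoire.
name = "\u4e2d\u6587"

# Two possible server-side choices for turning the name into octets.
octets_big5 = name.encode("big5")
octets_utf8 = name.encode("utf-8")  # assumed encoding of Unicode as octets

# Each octet outside the URL repertoire is written as a %XX escape.
url_big5 = "/" + quote(octets_big5)
url_utf8 = "/" + quote(octets_utf8)
print(url_big5)
print(url_utf8)

# A client sees only octets; recovering the characters requires knowing
# which encoding the server chose.
recovered = unquote_to_bytes(url_big5[1:]).decode("big5")
print(recovered == name)
```

Without an agreed encoding, a client that receives either URL has no way to tell from the octets alone which characters the server's local user would see.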
Received on Sunday, 28 January 1996 13:12:43 UTC