- From: Daniel W. Connolly <connolly@beach.w3.org>
- Date: Wed, 02 Aug 1995 18:45:09 -0400
- To: Glenn Adams <glenn@stonehand.com>
- Cc: html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
In message <9508021956.AA23126@trubetzkoy.stonehand.com>, Glenn Adams writes:
>
>It is my current understanding that arbitrary bytes can be encoded in URLs.

Well... that's stretching it. Arbitrary bytes can be encoded in
Morse code too. A URL is a sequence of US-ASCII characters. Check RFC1738:

    2.2. URL Character Encoding Issues

    URLs are sequences of characters, i.e., letters, digits, and special
    characters. A URLs may be represented in a variety of ways: e.g., ink
    on paper, or a sequence of octets in a coded character set. The
    interpretation of a URL depends only on the identity of the
    characters used.

    In most URL schemes, the sequences of characters in different parts
    of a URL are used to represent sequences of octets used in Internet
    protocols. For example, in the ftp scheme, the host name, directory
    name and file names are such sequences of octets, represented by
    parts of the URL. Within those parts, an octet may be represented by
    the chararacter which has that octet as its code within the US-ASCII
    [20] coded character set.

>This provide a means for an HTML UA to formulate a response to a form
>submission for arbitrary character encodings.

Er... well... I suppose so. But that seems like a pretty roundabout
way to go about it. I'd much prefer to see a general-purpose
replacement for the application/x-www-form-urlencoded media type.
Something like text/tab-separated-values might work nicely. Or
something SQLish, or lispish, or Tcl-ish.

text/tab-separated-values would be nice because you could use other
charset= values for other encodings. Of course you'd have the same
nasty interactions with octet 9 for the TAB character as with
octets 13 and 10 for CR/LF.

> However, I do have one important
>question: how does an HTTP server identify the encoding of such bytes (i.e.,
>the CHARSET) and communicate that encoding to the consumer of this data (e.g.,
>a CGI script)?

Well... I gather you're still talking about the
application/x-www-form-urlencoded media type. The only "specification"
for that is in the HTML spec. (hang on... I'd better check the CGI
spec... nope. It just says stuff like "Examples of the command line
usage are much better demonstrated than explained.")

I think the character encoding scheme is US-ASCII, or perhaps
ISO-Latin-1, by convention.

Like I said... I'd much prefer to see x-www-form-urlencoded replaced
than having other character sets shoehorned into that hack.

Dan
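
For illustration only, here is a minimal Python sketch of the point made above: the encoded form is plain US-ASCII, the octets survive the round trip, but nothing in the encoded string says which charset produced them. The helper names encode_field and decode_field are hypothetical, and the charset is passed in by hand precisely because the encoding itself does not carry it.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_field(text: str, charset: str) -> str:
    """Hypothetical helper: turn text into octets, then into %XX escapes."""
    octets = text.encode(charset)
    # safe="" so that every octet outside the unreserved set is escaped
    return quote_from_bytes(octets, safe="")


def decode_field(escaped: str, charset: str) -> str:
    """Hypothetical helper: the receiver must already know the charset."""
    octets = unquote_to_bytes(escaped)
    return octets.decode(charset)


if __name__ == "__main__":
    word = "caf\u00e9"  # "cafe" with an acute accent on the e

    latin1 = encode_field(word, "iso-8859-1")
    utf8 = encode_field(word, "utf-8")

    # Both results are pure US-ASCII, yet they differ, and neither one
    # carries any label naming the charset that was used.
    print(latin1)
    print(utf8)

    # Decoding works only when the receiver guesses the same charset.
    print(decode_field(latin1, "iso-8859-1"))
```

The same text yields two different escaped strings depending on the charset, which is why a consumer such as a CGI script cannot recover the characters without out-of-band information.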
Received on Wednesday, 2 August 1995 15:45:23 UTC