Re: Globalizing URIs

In message <9508021956.AA23126@trubetzkoy.stonehand.com>, Glenn Adams writes:
>
>It is my current understanding that arbitrary bytes can be encoded in URLs.

Well... that's stretching it. Arbitrary bytes can be encoded in
Morse code too. A URL is a sequence of US-ASCII characters. Check
RFC 1738:

2.2. URL Character Encoding Issues

   URLs are sequences of characters, i.e., letters, digits, and special
   characters. A URL may be represented in a variety of ways: e.g., ink
   on paper, or a sequence of octets in a coded character set. The
   interpretation of a URL depends only on the identity of the
   characters used.

   In most URL schemes, the sequences of characters in different parts
   of a URL are used to represent sequences of octets used in Internet
   protocols. For example, in the ftp scheme, the host name, directory
   name and file names are such sequences of octets, represented by
   parts of the URL.  Within those parts, an octet may be represented by
   the character which has that octet as its code within the US-ASCII
   [20] coded character set.
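To make that octet-to-character mapping concrete, here's a sketch in
present-day Python (my illustration, not text from the RFC): each octet
becomes a %XX triple of US-ASCII characters, and decoding gives you the
octets back.

```python
# Sketch (not from the RFC): mapping arbitrary octets to the
# US-ASCII characters a URL is made of, and back again.
from urllib.parse import quote, unquote_to_bytes

octets = bytes([0x00, 0x7F, 0xC3, 0xA9])   # some octets with no printable ASCII form
encoded = quote(octets, safe="")           # each octet becomes a %XX escape
decoded = unquote_to_bytes(encoded)        # round-trips to the same octets
```

Note that the URL itself stays pure US-ASCII throughout; only the
octets it *represents* are arbitrary.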

>This provide a means for an HTML UA to formulate a response to a form
>submission for arbitrary character encodings.

Er... well.. I suppose so. But that seems like a pretty roundabout
way to go about it.

I'd much prefer to see a general purpose replacement for
the application/x-www-form-urlencoded media type.

Something like text/tab-separated-values might work nicely.
Or something SQLish, or lispish, or Tcl-ish. text/tab-separated-values
would be nice because you could use other charset= values for
other encodings. Of course you'd have the same nasty interactions
with octet 9 for the TAB character as with octets 10 and 13 for CR/LF.
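Roughly what I have in mind (a hypothetical sketch, in present-day
Python; the field names and values are made up): the charset rides on
the Content-Type, say "text/tab-separated-values; charset=iso-8859-1",
instead of being baked into %XX escapes.

```python
# Hypothetical sketch of a form submission as tab-separated values.
# A value containing TAB (octet 9), CR, or LF would still need some
# escape convention -- that's the nasty interaction mentioned above.
fields = {"name": "Glenn", "city": "K\xf8benhavn"}  # Latin-1 value

header = "\t".join(fields.keys())
row = "\t".join(fields.values())
body = (header + "\n" + row).encode("iso-8859-1")   # charset from Content-Type
```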

> However, I do have one important
>question:  how does an HTTP server identify the encoding of such bytes (i.e.,
>the CHARSET) and communicate that encoding to the consumer of this data (e.g.,
>a CGI script)?

Well... I gather you're still talking about the
application/x-www-form-urlencoded media type. The only "specification" for
that is in the HTML spec.  (hang on... I'd better check the CGI
spec... nope. It just says stuff like "Examples of the command line
usage are much better demonstrated than explained.")

I think the character encoding scheme is US-ASCII, or perhaps
ISO-Latin-1, by convention.
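To spell out the guesswork a CGI script is left with (a sketch in
present-day Python; the field name and value are invented): it can
decode the %XX escapes to octets, but the charset is pure convention,
since nothing in the request says.

```python
# Sketch: decoding a form value from QUERY_STRING. The %XX escapes
# yield octets; which charset those octets are in is just a guess
# (ISO-8859-1 here, by convention).
from urllib.parse import unquote_to_bytes

query = "name=Andr%E9"                    # hypothetical submission
key, _, value = query.partition("=")
octets = unquote_to_bytes(value)          # the octets: b'Andr\xe9'
text = octets.decode("iso-8859-1")        # readable only by assumption
```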

Like I said... I'd much prefer to see x-www-form-urlencoded replaced
than having other character sets shoehorned into that hack.

Dan

Received on Wednesday, 2 August 1995 15:45:23 UTC