Re: Charsets revisited from Larry Masinter on 1996-01-25 (ietf-http-wg@w3.org from January to March 1996)

From: Larry Masinter <masinter@parc.xerox.com>
Date: Thu, 25 Jan 1996 08:23:02 PST
To: glenn@stonehand.com
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <96Jan25.082314pst.2733@golden.parc.xerox.com>

> I think you are focusing too narrowly.  The problem goes more
> deeply.  In particular, the fundamental problem is how to specify
> the information needed to decode escaped octets representing
> non-ASCII character data which appear in a URI, such as found in an
> HTTP Simple Request. 

> This problem is endemic to the specification of URIs as such and
> needs to be addressed at that level no matter to what use URIs are
> put.


The fundamental problem is that people on earth not only choose to
speak and write different languages, they even choose different coding
sequences to represent the SAME language.  It's hard to argue where
the 'fundamental problem' might be; the real question is: what's the
right place to *solve* the problem?

Practically speaking, I think we have to solve different parts of the
problem in different places. We can solve the problem of 'what is the
character encoding used in data sent from a server to a client' by
charset tagging and negotiation in HTTP GET; we can solve the problem
of 'what is the character encoding used to encode what a client typed
into a form when sent from client to server' by using
multipart/form-data as the wrapper for the response and using charset
tags within the parts that need them.
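
To make the second half of that concrete, here is a rough sketch in
Python of a form response wrapped as multipart/form-data, with a
charset tag on each part. The helper name build_form_data, the field
names and the boundary string are all made up for illustration; this
is one plausible shape for such a body, not a definitive one.

```python
def build_form_data(fields, boundary):
    """Build a multipart/form-data body.

    fields is a list of (name, text, charset) tuples; each part is
    tagged with the charset used to encode its text.
    """
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, text, charset in fields:
        lines.append(delimiter)
        lines.append(b'Content-Disposition: form-data; name="'
                     + name.encode("ascii") + b'"')
        lines.append(b"Content-Type: text/plain; charset="
                     + charset.encode("ascii"))
        lines.append(b"")                    # blank line ends the part headers
        lines.append(text.encode(charset))   # part body in the tagged charset
    lines.append(delimiter + b"--")          # closing delimiter
    lines.append(b"")
    return b"\r\n".join(lines)


# Hypothetical form with one Japanese field and one ASCII field.
# "\u65e5\u672c" is the two-character name used in the GET examples.
body = build_form_data(
    [("title", "\u65e5\u672c", "iso-2022-jp"),
     ("file", "index.htm", "us-ascii")],
    "XyZZy",
)
```

The point is only that the tag travels with the part it describes, so
the server never has to guess.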

As for your example:

> ... all of the following are asking for the same resource:

>  GET /%1B$BF%7CK%5C%1B(B.HTM
>  GET /%93%FA%96%7B.HTM
>  GET /%C6%FC%CB%DC.HTM
>  GET /e%E5g,%00.%00H%00T%00M
>  GET /+ZeVnLA-.HTM

> The problem is, unless the server knows that the characters encoded with
> the URI octet escapement mechanism in these examples use ISO-2022-JP, SHIFT
> JIS, EUC-J, UNICODE-1-1, and UNICODE-1-1-UTF7, respectively, then the
> server has no reliable way of decoding the octets as characters.

You cannot possibly mean that the *same* HTTP server will employ
2022-jp, shift jis, euc-j, unicode-1-1 and unicode-1-1-utf7.
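
For what it's worth, the five request lines quoted above do decode to
the same two-character name once the charset is known. A small Python
sketch follows. The codec names are my mapping of the labels in the
quoted message onto Python's codecs, which is an assumption: in
particular I read UNICODE-1-1 as big-endian 16-bit and use Python's
utf_7 codec for UNICODE-1-1-UTF7. The helper decode_path is
hypothetical.

```python
from urllib.parse import unquote_to_bytes

# (request path, assumed Python codec for the charset named in the message)
REQUESTS = [
    ("/%1B$BF%7CK%5C%1B(B.HTM", "iso2022_jp"),
    ("/%93%FA%96%7B.HTM", "shift_jis"),
    ("/%C6%FC%CB%DC.HTM", "euc_jp"),
    ("/e%E5g,%00.%00H%00T%00M", "utf_16_be"),
    ("/+ZeVnLA-.HTM", "utf_7"),
]


def decode_path(path, charset):
    """Undo the %XX escapes, then read the octets in the given charset."""
    octets = unquote_to_bytes(path.lstrip("/"))
    return octets.decode(charset)


names = [decode_path(path, charset) for path, charset in REQUESTS]

# The same octets read under the wrong charset do not give the same name.
misread = unquote_to_bytes("%93%FA%96%7B.HTM").decode("euc_jp", errors="replace")
```

Every entry of names comes out as the same string, while misread does
not, which is the whole difficulty: nothing in the request line says
which codec to pick.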

My recommendation for the solution to this problem is that we
establish an application profile 'HTTP servers for Japanese' that
recommends that filenames in URLs be encoded as unicode-1-1-utf7 no
matter what the native file system encoding might be.

Then the server would have a reliable way of decoding the URLs. This
solution would require no changes to HTTP, HTML or URLs.
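
As a sketch of what such a profile would ask of a server, assuming
Python's utf_7 codec is an acceptable stand-in for unicode-1-1-utf7
and that the native file system charset is known from configuration
(the helper name url_name_to_native is hypothetical):

```python
from urllib.parse import unquote_to_bytes


def url_name_to_native(path, native_charset):
    """Map a URL path whose name is UTF-7 encoded to native file name octets."""
    # Undo any %XX escapes, then read the octets as UTF-7 (assumed profile).
    name = unquote_to_bytes(path.lstrip("/")).decode("utf_7")
    # Re-encode in whatever charset the local file system uses.
    return name.encode(native_charset)


# The same URL works whether the file system uses Shift JIS or EUC.
sjis_name = url_name_to_native("/+ZeVnLA-.HTM", "shift_jis")
euc_name = url_name_to_native("/+ZeVnLA-.HTM", "euc_jp")
```

The URL stays the same for every client; only the server needs to
know its own native encoding.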

Received on Thursday, 25 January 1996 08:39:10 UTC