Re: Charsets revisited from Glenn Adams on 1996-01-25 (ietf-http-wg@w3.org from January to March 1996)

From: Glenn Adams <glenn@stonehand.com>
Date: Thu, 25 Jan 96 10:11:21 -0500
To: Larry Masinter <masinter@parc.xerox.com>
Cc: frystyk@w3.org, nms@nns.ru, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <9601251511.AA21660@trubetzkoy.stonehand.com>

    From: Larry Masinter <masinter@parc.xerox.com>
    Date: Wed, 24 Jan 1996 15:35:36 PST

    In this particular case, the problem is with section 8.2.1 of RFC
    1866 (HTML):

    This specification calls for the _characters_ of the form results ...

I think you are focusing too narrowly.  The problem goes deeper.
In particular, the fundamental problem is how to specify the information
needed to decode escaped octets that represent non-ASCII character data
in a URI, such as those found in an HTTP Simple Request.  For
example, all of the following are asking for the same resource:

  GET /%1B$BF%7CK%5C%1B(B.HTM
  GET /%93%FA%96%7B.HTM
  GET /%C6%FC%CB%DC.HTM
  GET /e%E5g,%00.%00H%00T%00M
  GET /+ZeVnLA-.HTM

The problem is that unless the server knows that the characters encoded with
the URI octet escape mechanism in these examples use ISO-2022-JP, Shift
JIS, EUC-J, UNICODE-1-1, and UNICODE-1-1-UTF7, respectively, the
server has no reliable way of decoding the octets as characters.
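
To make this concrete, here is a minimal Python sketch of what a server
would have to do. It rests on assumptions that are not in the requests
themselves: that Python's iso2022_jp, shift_jis, euc_jp, utf_16_be, and
utf_7 codecs stand in for the five charsets named above (UNICODE-1-1 is
approximated by big-endian UTF-16), and that the charset is supplied out of
band. The name decode_path is hypothetical.

```python
from urllib.parse import unquote_to_bytes

# The five request paths from the examples above, each paired with the
# Python codec assumed to correspond to the charset named for it.
EXAMPLES = [
    ("/%1B$BF%7CK%5C%1B(B.HTM", "iso2022_jp"),  # ISO-2022-JP
    ("/%93%FA%96%7B.HTM", "shift_jis"),         # Shift JIS
    ("/%C6%FC%CB%DC.HTM", "euc_jp"),            # EUC-J
    ("/e%E5g,%00.%00H%00T%00M", "utf_16_be"),   # UNICODE-1-1 (assumed big-endian)
    ("/+ZeVnLA-.HTM", "utf_7"),                 # UNICODE-1-1-UTF7
]


def decode_path(path, codec, errors="strict"):
    """Undo the %XX escapes, then decode the octets after the leading '/'."""
    octets = unquote_to_bytes(path)[1:]  # drop the single ASCII '/' octet
    return octets.decode(codec, errors)


if __name__ == "__main__":
    # With the right codec supplied, every path yields the same name.
    for path, codec in EXAMPLES:
        print(f"{codec:12} {decode_path(path, codec)!a}")

    # With the wrong codec (Shift JIS octets read as EUC-J), the result is
    # not the intended name; invalid octets are replaced rather than raised.
    print(f"{'wrong guess':12} "
          f"{decode_path('/%93%FA%96%7B.HTM', 'euc_jp', errors='replace')!a}")
```

Under these assumptions all five paths decode to the same name, U+65E5
U+672C followed by ".HTM". The codec argument is exactly the information
that the URI itself does not carry, so the same octets decoded under a
different guess give a different or invalid name.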

This problem is endemic to the specification of URIs as such and needs to
be addressed at that level no matter to what use URIs are put.

Regards,
Glenn Adams

Received on Thursday, 25 January 1996 07:14:58 UTC