Re: Charsets revisited

> I agree with this.  However, it is true that (1) the URI wg no longer exists;
> (2) HTTP is the primary consumer/producer of URIs; and (3) a serious problem
> exists w.r.t. handling non-ASCII character data in URIs.  This problem needs
> to be addressed very quickly, so what forum would be best to address it?


In this particular case, the problem is with section 8.2.1 of RFC
1866 (HTML):

>       1. The form field names and values are escaped: space
>       characters are replaced by `+', and then reserved characters
>       are escaped as per [URL]; that is, non-alphanumeric
>       characters are replaced by `%HH', a percent sign and two
>       hexadecimal digits representing the ASCII code of the
>       character. Line breaks, as in multi-line text field values,
>       are represented as CR LF pairs, i.e. `%0D%0A'.

This specification calls for the _characters_ of the form results to
be encoded in a URL. However, the URL encoding (specified in section
2.2 of RFC 1738 (URL)) is a way of encoding octets, not a way of
encoding characters.
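
The octet-level recipe quoted above is easy to sketch. This is a minimal illustration of RFC 1866's rules, assuming the form value has already been reduced to octets somehow -- which is exactly the step the specification leaves undefined -- and not any browser's actual code:

```python
def form_urlencode(octets: bytes) -> str:
    """Encode octets as a form field value per RFC 1866 section 8.2.1."""
    safe = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    out = []
    for b in octets:
        if b == 0x20:            # space becomes '+'
            out.append("+")
        elif b in safe:          # alphanumerics pass through
            out.append(chr(b))
        else:                    # everything else becomes %HH
            out.append("%%%02X" % b)
    return "".join(out)

# Line breaks in multi-line fields come out as CR LF, i.e. %0D%0A:
print(form_urlencode(b"line one\r\nline two"))  # line+one%0D%0Aline+two
```

Note that the function's input is bytes, not a character string: the encoding itself never says what character set produced those bytes.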

It is this disconnect that leaves the ambiguity we're worried about
here: when a user fills out a form and the values in that form are
transmitted, what character set is used in the transmission?
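
The practical consequence is easy to demonstrate: the same character yields different octets -- and hence different %HH sequences -- depending on which character set the sender happened to use. A sketch, with the two charsets chosen purely for illustration:

```python
from urllib.parse import quote

ch = "é"  # LATIN SMALL LETTER E WITH ACUTE

print(quote(ch.encode("latin-1")))  # %E9
print(quote(ch.encode("utf-8")))    # %C3%A9

# A server that receives '%E9' cannot tell from the URL alone which
# character set -- and therefore which character -- was intended.
```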

As such, I think this issue must be addressed in the HTML working
group as a technical review issue for RFC 1866. As we've discussed in
numerous other venues, there is no easy solution to the problem in
general, although RFC 1867 (file-upload) gives some relief in many
cases where its enhancements are implemented, even if those
enhancements are rarely used.

> The assumtion that current content negotiation makes is that there is
> no prior knowledge. In most cases a user will have selected a specific 
> URL in a previous page they have retrieved.

Except when the main page is in a language or format they cannot read.
That one could, in your view, do without a feature 'in most cases'
does not mean that the feature should be removed from a specification.
In most cases, the web could have gotten by with only FTP.

> This proposal removes large areas that are causing concerns in the current
> review process and merely reflects current practice.

This is (a) nonsense -- the concern will not go away merely because
content negotiation isn't in the draft; (b) wrong -- there is current
practice with content negotiation; and (c) irrelevant -- 'current
practice' cannot completely dictate future development.

> Without having dynamic content retrieval allows micro servers to return
> pre-canned responses. Few CGI scripts return more that one variation etc etc.

All proposals for content negotiation are for it to be optional.
Micro servers need not pay attention to accept headers if they do not
wish to serve varying content.

> No intelligent server mechanism is a intelligent as the dumbest human.

This is nonsense. The issue is not intelligence, it is information:
no human can choose better unless you expect everyone to know the
contents of their mime.types file, the latest state of their system
administrator's configuration files, and the capabilities of the
browser they downloaded and installed.

> Without it, I feel that 1.1 will meet the goals stated in section
>	1.1 Purpose

The 'Purpose' statement of a document is never comprehensive, and
never an adequate description of the goals of the working group that
is producing it.

> If you feel that it is a must then the simplest case is to return a
> precanned entity that in some standard format lists the URL's and the 
> content types each represents and then let the client do all the work.

This *is* the URI proposal in the current HTTP/1.1 draft.
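
That client-side scheme is simple to sketch: the server returns one pre-canned list of variants, and the client does all the work of choosing. The list format and names below are hypothetical, not the draft's actual syntax:

```python
# Pre-canned variant list a server might return: (URI, content type).
VARIANTS = [
    ("/doc.html", "text/html"),
    ("/doc.ps", "application/postscript"),
]

def choose(variants, acceptable):
    """Client-side negotiation: pick the first variant we can handle."""
    for uri, ctype in variants:
        if ctype in acceptable:
            return uri
    return None  # nothing acceptable

print(choose(VARIANTS, {"text/html", "text/plain"}))  # /doc.html
```

The server stays dumb -- it never inspects Accept headers -- and the intelligence the client wants is exactly the information it already has about itself.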

Received on Wednesday, 24 January 1996 19:40:22 UTC