- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 06 May 2002 12:06:18 +0900
- To: www-tag@w3.org
Hello Dan and others,

Some pointers/comments:

At 16:43 02/05/01 -0500, Dan Connolly wrote:
>OK, I've taken a stab at integrating feedback
>received since 15 Feb:
> http://www.w3.org/2001/tag/doc/get7
> Designers of HTML forms that accept non-western characters have been
> challenged by various implementation limitations and gaps in
> specifications. For example:
>
> The content type "application/x-www-form-urlencoded" is inefficient
> for sending large quantities of binary data or text containing
> non-ASCII characters.
>
>
> [11]multipart/form-data in [12]HTML 4.01
>
> [11] http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.2
> [12] http://www.w3.org/TR/html401/
>
> We expect these limitations to be address in future specifications
> (@@e.g. XForms?) and deployed in due course.

Some comments about the non-ASCII character aspect, about inefficiency
and about actual breakdowns:

Inefficiency is due to the octet -> %hh escape conversion, combined
with the fact that many characters need more than one octet to be
encoded. But this, in contrast to the 'large quantities of binary data
or text', is not an obstacle for GET, and shouldn't be presented as
such. [This doesn't mean that it should not or cannot be improved; for
that, see later.]

Breakdowns are much more important, and should clearly be mentioned.
They happen because the mapping between characters and octets is not
clearly specified outside US-ASCII. From
http://www.ietf.org/rfc/rfc2396.txt (to go directly to the relevant
section, use e.g.
http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.1):

>>>>
2.1 URI and non-ASCII characters

The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit byte).

There are two mappings, one from URI characters to octets, and a
second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

...

For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences are
expected to provide some way of identifying the charset used, if there
might be more than one [RFC2277]. However, there is currently no
provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single charset,
define a default charset, or provide a way to indicate the charset
used.
>>>>

This can lead to different cases of breakdowns:

Breakdowns between computers: As an example, (how) is a form able to
submit a GET request with non-ASCII text and have the server
understand what was submitted? From
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset:

>>>>
accept-charset = charset list [CI]
   This attribute specifies the list of character encodings for input
   data that is accepted by the server processing this form. The value
   is a space- and/or comma-delimited list of charset values. The
   client must interpret this list as an exclusive-or list, i.e., the
   server is able to accept any single character encoding per entity
   received.

   The default value for this attribute is the reserved string
   "UNKNOWN". User agents may interpret this value as the character
   encoding that was used to transmit the document containing this
   FORM element.
>>>>

In current practice, the second paragraph above is more relevant: the
browser submits form data in the character encoding of the document
that contained the form. This works in all major browsers of version 4
and later. The 'accept-charset' attribute did not receive much
attention for a long time, but has recently been implemented in a
number of places (see
http://lists.w3.org/Archives/Public/www-international/2002AprJun/0011.html).
XForms will specify that the encoding to be used is always UTF-8.
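
As a minimal sketch of the two effects described above (an illustration only, not taken from any of the cited specifications), the following Python snippet uses the standard `urllib.parse` module to show how the same characters turn into different %hh escapes depending on the charset, and what a receiver sees if it guesses the wrong one. The helper names `escape` and `unescape` and the sample values are hypothetical.

```python
from urllib.parse import quote, unquote


def escape(value: str, charset: str) -> str:
    """Characters -> octets (using `charset`) -> %hh escapes."""
    return quote(value, safe="", encoding=charset)


def unescape(escaped: str, charset: str) -> str:
    """%hh escapes -> octets -> characters (using `charset`).

    Octets that are invalid in `charset` become U+FFFD.
    """
    return unquote(escaped, encoding=charset, errors="replace")


# Breakdown: the escaped form carries no charset label, so sender and
# receiver must agree on the mapping out of band.
sent_utf8 = escape("é", "utf-8")          # two octets: C3 A9
sent_latin1 = escape("é", "iso-8859-1")   # one octet: E9
print(sent_utf8, sent_latin1)

# A receiver that assumes the wrong charset gets different characters.
print(unescape(sent_utf8, "iso-8859-1"))
print(unescape(sent_latin1, "utf-8"))

# Inefficiency: each octet becomes three characters, and many characters
# need more than one octet. Three characters become 27 here.
print(len(escape("日本語", "utf-8")))
```

A fixed encoding such as the UTF-8 rule planned for XForms removes the guessing step on the receiving side, although the size overhead of %hh escaping remains.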
Because Dan cited from http://www.w3.org/TR/html401/, and some of you
might go back to http://www.w3.org/TR/html401/interact/forms.html,
I'll use this occasion to point out that the following two passages,
which say that GET can be used only with data values in US-ASCII, are
definitely wrong. I have copied www-html-editor@w3.org (bcc to reduce
crossposting) to register these as errata:

In http://www.w3.org/TR/html401/interact/forms.html#h-17.13.1, it
currently says:

>>>>
Note. The "get" method restricts form data set values to ASCII
characters. Only the "post" method (with
enctype="multipart/form-data") is specified to cover the entire
[ISO10646] character set.
>>>>

Proposal: Remove the Note.
Rationale: Unnecessary limitation that never applied in practice.

In http://www.w3.org/TR/html401/interact/forms.html#h-17.13.3.4, it
currently says:

>>>>
* If the method is "get" and the action is an HTTP URI, the user
  agent takes the value of action, appends a `?' to it, then appends
  the form data set, encoded using the
  "application/x-www-form-urlencoded" content type. The user agent
  then traverses the link to this URI. In this scenario, form data
  are restricted to ASCII codes.
>>>>

Proposal: Replace the last sentence with "For the encoding of
non-ASCII characters, please see
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset.",
or remove the last sentence of the paragraph.
Rationale: Unnecessary limitation that never applied in practice.

Another kind of breakdown is of course that URIs (including GET
requests) whose text can be represented in the US-ASCII character
repertoire can be quite readable, whereas %hh-escaped text isn't. But
that's another long topic.

Regards, Martin.
Received on Sunday, 5 May 2002 23:16:33 UTC