Re: DRAFT Findings on when to use GET to make resources addressable (whenToUseGet-7) from Martin Duerst on 2002-05-06 (www-html-editor@w3.org from April to June 2002)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 06 May 2002 12:06:18 +0900
To: www-tag@w3.org
Message-Id: <4.2.0.58.J.20020506093221.02ecbef8@localhost>
Hello Dan and others,

Some pointers/comments:

At 16:43 02/05/01 -0500, Dan Connolly wrote:
>OK, I've taken a stab at integrating feedback
>received since 15 Feb:

>   http://www.w3.org/2001/tag/doc/get7

>    Designers of HTML forms that accept non-western characters have been
>    challenged by various implementation limitations and gaps in
>    specifications. For example:
>
>      The content type "application/x-www-form-urlencoded" is inefficient
>      for sending large quantities of binary data or text containing
>      non-ASCII characters.
>
>
>     [11]multipart/form-data in [12]HTML 4.01
>
>      [11] http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.2
>      [12] http://www.w3.org/TR/html401/
>
>    We expect these limitations to be address in future specifications
>    (@@e.g. XForms?) and deployed in due course.

Some comments about the non-ASCII character aspect, about
inefficiency and about actual breakdowns:


Inefficiency is due to the octet -> %hh escape conversion, combined
with the fact that many characters need more than one octet to be encoded.
But this, in contrast to the 'large quantities of binary data or text',
is not an obstacle for GET, and shouldn't be presented as such.
[This doesn't mean that it should not or cannot be improved,
for that see later.]
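
As a rough illustration of the size cost, here is a minimal sketch in
Python. It assumes UTF-8 as the character encoding and uses the standard
urllib.parse module; the helper name escape() and the sample words are
made up for this example:

```python
from urllib.parse import quote


def escape(text: str, charset: str = "utf-8") -> str:
    """Encode text to octets in `charset`, then %hh-escape every octet
    that is not an unreserved ASCII character."""
    return quote(text, safe="", encoding=charset)


ascii_word = "tokyo"
kanji_word = "\u6771\u4eac"  # "Tokyo" written with two kanji (assumed sample)

for word in (ascii_word, kanji_word):
    escaped = escape(word)
    print(len(word), "characters ->", len(escaped), "URI characters:", escaped)
```

Each kanji takes three octets in UTF-8 and each octet becomes three URI
characters, so two characters grow to eighteen. That is wasteful, but
still far from any practical length limit for a typical query.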


Breakdowns are much more important, and should clearly be
mentioned. They happen because the mapping between characters and
octets is not clearly specified outside US-ASCII. From
http://www.ietf.org/rfc/rfc2396.txt (to go directly to the relevant section,
use e.g. http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.1):

 >>>>
2.1 URI and non-ASCII characters

    The relationship between URI and characters has been a source of
    confusion for characters that are not part of US-ASCII. To describe
    the relationship, it is useful to distinguish between a "character"
    (as a distinguishable semantic entity) and an "octet" (an 8-bit
    byte). There are two mappings, one from URI characters to octets, and
    a second from octets to original characters:

    URI character sequence->octet sequence->original character sequence
...
    For original character sequences that contain non-ASCII characters,
    however, the situation is more difficult. Internet protocols that
    transmit octet sequences intended to represent character sequences
    are expected to provide some way of identifying the charset used, if
    there might be more than one [RFC2277].  However, there is currently
    no provision within the generic URI syntax to accomplish this
    identification. An individual URI scheme may require a single
    charset, define a default charset, or provide a way to indicate the
    charset used.
 >>>>
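
The two mappings in the excerpt can be sketched as follows. This is only
an illustration in Python using the standard urllib.parse module; the URI
path and the function names are hypothetical, and the three charsets are
arbitrary choices:

```python
from urllib.parse import unquote_to_bytes


def uri_to_octets(uri: str) -> bytes:
    """First mapping: URI character sequence -> octet sequence."""
    return unquote_to_bytes(uri)


def octets_to_text(octets: bytes, charset: str) -> str:
    """Second mapping: octet sequence -> original character sequence.
    The charset is not carried by the URI, so the receiver has to guess."""
    return octets.decode(charset, errors="replace")


uri_path = "/caf%E9"  # hypothetical URI path with one escaped octet
octets = uri_to_octets(uri_path)

for charset in ("iso-8859-1", "iso-8859-7", "utf-8"):
    # ascii() keeps the output readable on any terminal
    print(charset, ascii(octets_to_text(octets, charset)))
```

The same escaped octet comes out as U+00E9 (e with acute) under
ISO-8859-1, as U+03B9 (Greek iota) under ISO-8859-7, and as the
replacement character U+FFFD under UTF-8, where it is not a valid
sequence.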

This can lead to different cases of breakdowns:

Breakdowns between computers:

As an example, (how) is a form able to submit a GET request with
non-ASCII text and have the server understand what was submitted? From
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset:

 >>>>
accept-charset = charset list [CI]
     This attribute specifies the list of character encodings for input data
     that is accepted by the server processing this form. The value is a
     space- and/or comma-delimited list of charset values. The client must
     interpret this list as an exclusive-or list, i.e., the server is able to
     accept any single character encoding per entity received.

     The default value for this attribute is the reserved string "UNKNOWN".
     User agents may interpret this value as the character encoding that was
     used to transmit the document containing this FORM element.
 >>>>

In current practice, the second paragraph above is more relevant: user agents
submit form data in the encoding of the page that contained the form. This
works in all major browsers of version 4 or later. The 'accept-charset'
attribute did not receive much attention for a long time, but has recently
been implemented in a number of places (see
http://lists.w3.org/Archives/Public/www-international/2002AprJun/0011.html).

XForms will specify that the encoding to be used is always UTF-8.
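
A small sketch of what this means for a server, again in Python with the
standard urllib.parse module. The functions submit() and receive() are
hypothetical stand-ins for a user agent and a server, and the field name
and value are made up:

```python
from urllib.parse import urlencode, parse_qs


def submit(fields: dict, page_charset: str) -> str:
    """Stand-in for a user agent that encodes form data in the charset
    of the page containing the form."""
    return urlencode(fields, encoding=page_charset)


def receive(query: str, assumed_charset: str) -> dict:
    """Stand-in for a server that decodes the query string in the
    charset it assumes the user agent used."""
    return parse_qs(query, encoding=assumed_charset, errors="replace")


fields = {"q": "caf\u00e9"}

latin1_query = submit(fields, "iso-8859-1")
print(latin1_query)                                # q=caf%E9
print(ascii(receive(latin1_query, "iso-8859-1")))  # value recovered intact
print(ascii(receive(latin1_query, "utf-8")))       # value damaged (U+FFFD)

utf8_query = submit(fields, "utf-8")
print(utf8_query)                                  # q=caf%C3%A9
print(ascii(receive(utf8_query, "utf-8")))         # value recovered intact
```

With the page-encoding convention the server has to know which encoding
its own pages were served in. With a fixed UTF-8 rule, as planned for
XForms, there is nothing to guess.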


Because Dan cited from http://www.w3.org/TR/html401/, and some of you
might go back to http://www.w3.org/TR/html401/interact/forms.html,
I'll just use this occasion to point out that the following two
passages, which say that GET can be used only with data values in
US-ASCII, are definitely wrong. I have copied www-html-editor@w3.org
(bcc to reduce crossposting) to register these as errata:

In http://www.w3.org/TR/html401/interact/forms.html#h-17.13.1,
it currently says:

 >>>>
Note. The "get" method restricts form data set values to ASCII characters. Only
       the "post" method (with enctype="multipart/form-data") is specified to
       cover the entire [ISO10646] character set.
 >>>>

Proposal: Remove the Note.
Rationale: Unnecessary limitation that never applied in practice.


In http://www.w3.org/TR/html401/interact/forms.html#h-17.13.3.4,
it currently says:

 >>>>
     * If the method is "get" and the action is an HTTP URI, the user agent
       takes the value of action, appends a `?' to it, then appends the form
       data set, encoded using the "application/x-www-form-urlencoded" content
       type. The user agent then traverses the link to this URI. In this
       scenario, form data are restricted to ASCII codes.
 >>>>

Proposal: Replace the last sentence with "For the encoding of non-ASCII
characters, please see
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset.",
or remove the last sentence of the paragraph.
Rationale: Unnecessary limitation that never applied in practice.
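
To show that the steps in the quoted paragraph work with non-ASCII
values, here is a minimal Python sketch using the standard urllib.parse
module. The action URI, the field names and the function build_get_uri()
are invented for the example, and UTF-8 is an assumed default:

```python
from urllib.parse import urlencode


def build_get_uri(action: str, form_data: dict, charset: str = "utf-8") -> str:
    """Take the value of action, append '?', then append the form data
    set encoded as application/x-www-form-urlencoded."""
    return action + "?" + urlencode(form_data, encoding=charset)


form_data = {"q": "D\u00fcrst", "lang": "de"}  # value contains u-umlaut

uri = build_get_uri("http://example.org/search", form_data)
print(uri)  # http://example.org/search?q=D%C3%BCrst&lang=de

# After escaping, the URI itself contains only US-ASCII characters,
# even though the submitted value does not.
print(all(ord(ch) < 128 for ch in uri))  # True
```

The restriction to ASCII applies to the characters of the resulting URI,
not to the form data values that are escaped into it.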


Another kind of breakdown is of course the fact that URIs
(including GET requests) whose text can be represented in the
US-ASCII character repertoire can be quite readable, whereas
%hh-escaped text isn't. But that's another long topic.


Regards,    Martin.
Received on Sunday, 5 May 2002 23:16:33 UTC