application/www-form-urlencoded issues from Bjoern Hoehrmann on 2006-09-28 (www-archive@w3.org from September 2006)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 28 Sep 2006 19:39:06 +0200
To: www-archive@w3.org
Message-ID: <860oh2lem8ccp1cpu5hdn84k4kakhj9g4s@hive.bjoern.hoehrmann.de>
Hi,

  The following is a list of issues I considered when writing draft-
hoehrmann-urlencoded-00.txt and how I resolved them; the list is meant
to aid review of the document. Let me know if you have any opinion on
these issues. The current draft is available at [1].

  * Why not standardize application/x-www-form-urlencoded instead?

    There is no single such format: specifications and implementations
    vary in how they handle character encodings and escaped characters,
    which characters are used as separators (some technologies allow
    choosing virtually any character), whether the media type can
    have a 'charset' parameter, and how they handle encoded data sets
    that the RFC 1866 algorithm would never produce (e.g., foo&bar,
    foo=bar=baz).
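
    To illustrate how one deployed parser treats such inputs, here is
    a small sketch using Python's standard urllib.parse.parse_qsl. It
    shows one implementation's choices only; other implementations
    may well behave differently, which is the point made above.

```python
from urllib.parse import parse_qsl

# Inputs the RFC 1866 algorithm would never produce.
no_values = "foo&bar"
two_equals = "foo=bar=baz"

# By default this parser silently drops fields that have no value.
dropped = parse_qsl(no_values)

# With keep_blank_values=True the same fields survive with empty values.
kept = parse_qsl(no_values, keep_blank_values=True)

# Only the first "=" separates the name from the value.
split_once = parse_qsl(two_equals)

print(dropped, kept, split_once)
```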

    Further, the media type application/x-www-form-urlencoded cannot
    be registered under the rules of RFC 4288, and updating RFC 4288
    to make an exception for this type would likely be difficult and
    would set a bad precedent. Given these problems, there is not much
    that could reasonably be standardized without contradicting
    deployed infrastructure.

  * Okay, but the new format is not substantially better, so it won't
    be implemented anyway.

    The Introduction of the document lists several key benefits of the
    format. For example, common characters like ":" and "/" are not
    escaped, so you get "url=http://example.org/" instead of the much
    less readable "url=http%3A%2F%2Fexample.org%2F" that legacy
    implementations would produce. Likewise, if most of the data is,
    e.g., Japanese text POSTed to some web service, then each Japanese
    character would be encoded as 3 bytes rather than 9, which reduces,
    e.g., transport packet fragmentation in Ajax applications that
    constantly transfer only small amounts of data.
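
    The difference can be reproduced roughly with Python's
    urllib.parse.quote. This is only a sketch: the "safe" sets below
    stand in for the two escaping policies and are not the draft's
    actual character tables, and the sample character is my own
    choice.

```python
from urllib.parse import quote

url = "http://example.org/"

# Legacy-style escaping: ":" and "/" are percent-encoded.
legacy = "url=" + quote(url, safe="")

# Leaving ":" and "/" unescaped keeps the value readable.
readable = "url=" + quote(url, safe=":/")

# One Japanese character (chosen for illustration).
ch = "日"
raw_size = len(ch.encode("utf-8"))      # bytes when sent as plain UTF-8
escaped_size = len(quote(ch, safe=""))  # bytes when percent-encoded

print(legacy, readable, raw_size, escaped_size)
```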

    Also note that, unlike application/x-www-form-urlencoded, the new
    format supports the notion of undefined values, which allows for
    shorter and more natural representation of certain data sets. The
    draft gives the example of a data set used to control columns in
    product lists: a more conventional way to encode the information
    would be, e.g., "c1=img&c2=avail&c3=name&c4=price", whereas with
    undefined values this could be written as "img;avail;name;price".
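
    A minimal sketch of the idea, assuming None models an undefined
    value. The function encode_sketch is hypothetical, does no
    escaping at all, and is not the draft's encoding algorithm; it
    only reproduces the two strings quoted above.

```python
def encode_sketch(items, sep=";"):
    """Join (name, value) pairs; a value of None means 'undefined'.

    Hypothetical helper for illustration only: no escaping is done.
    """
    return sep.join(
        name if value is None else f"{name}={value}"
        for name, value in items
    )

# Conventional encoding: every item carries a name and a value.
conventional = encode_sketch(
    [("c1", "img"), ("c2", "avail"), ("c3", "name"), ("c4", "price")],
    sep="&",
)

# With undefined values, the names alone carry the information.
compact = encode_sketch(
    [("img", None), ("avail", None), ("name", None), ("price", None)]
)

print(conventional)
print(compact)
```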

  * Should ' be in the set of escaped characters?

    This is relevant to copy and paste operations and, in some
    environments, URL extraction. For example, given
    <a href='...'>...</a> and http://example.org/search.p6?q=Teal'c,
    the resource identifier cannot simply be copied and pasted into
    the attribute value; the ' character has to be escaped. Likewise,
    if the URL occurs in some unstructured text, like an IRC chat, a
    text/plain mail, or similar, some tools might consider the URL to
    end at the '. In these cases it would make sense to escape the
    "'". On the other hand, writing
    http://example.org/search.p6?q=Teal%27c would make the data set
    less readable. The current draft does not escape it; comments on
    this issue are very welcome.
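
    The trade-off can be seen in a short sketch. The regular
    expression below is a hypothetical URL extractor of the kind a
    chat client might use; it is an assumption for illustration, not
    taken from any particular tool.

```python
import re
from urllib.parse import quote

term = "Teal'c"
base = "http://example.org/search.p6?q="

unescaped = base + quote(term, safe="'")  # "'" left as-is
escaped = base + quote(term, safe="")     # "'" becomes %27

# Hypothetical extractor that treats quotes as URL delimiters.
naive_url = re.compile(r"""https?://[^\s'"<>]+""")

found_unescaped = naive_url.search("see " + unescaped + " now").group(0)
found_escaped = naive_url.search("see " + escaped + " now").group(0)

print(found_unescaped)  # truncated at the apostrophe
print(found_escaped)    # complete, but less readable
```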

  * Shouldn't there be different encoding algorithms for query strings
    and normal POSTed data?

    This is relevant e.g. when SPARQL queries are transmitted using
    POSTed application/www-form-urlencoded data sets. For example, the
    query might be

      PREFIX dc: <http://purl.org/dc/elements/1.1/> 
      SELECT ?book ?who 
      WHERE { ?book dc:creator ?who }

    and in the SPARQL protocol this would map to query=...query...
    The decoding algorithm defined in the specification could handle
    this case just fine if it were

      Content-Type: application/www-form-urlencoded

      query=PREFIX dc: <http://purl.org/dc/elements/1.1/> 
            SELECT ?book ?who 
            WHERE { ?book dc:creator ?who }

    but the encoding algorithm would encode it as

      Content-Type: application/www-form-urlencoded

      query=PREFIX+dc:+%3Chttp://purl.org/dc/elements/1.1/%3E+%0A+++
      +++SELECT+?book+?who+%0A++++++WHERE+%7B+?book+dc:creator+?who+%7D

    so implementations would be non-compliant if they produced the
    former. Should the specification allow construction of the former
    when creating stand-alone entities? The current draft does not.
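
    For reference, the encoded form above can be approximated with
    Python's urllib.parse.quote_plus. The safe set ":/?" is my
    assumption, picked to match the output shown; it is not the
    draft's algorithm. Decoding with unquote_plus recovers the
    original query, line breaks included.

```python
from urllib.parse import quote_plus, unquote_plus

# The SPARQL query, with the trailing spaces and six-space
# indentation implied by the encoded form above.
query = (
    "PREFIX dc: <http://purl.org/dc/elements/1.1/> \n"
    "      SELECT ?book ?who \n"
    "      WHERE { ?book dc:creator ?who }"
)

# Spaces become "+"; "<", ">", "{", "}" and newlines are escaped;
# ":", "/" and "?" are left alone (assumed safe set).
encoded = "query=" + quote_plus(query, safe=":/?")

# Decoding the value gives back the original query text.
decoded = unquote_plus(encoded[len("query="):])

print(encoded)
```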

  * Should it be possible to encode (("", undefined))?

    The draft does not allow encoding of a data set with a single item
    where the name is the empty string and the value is undefined. It
    would encode to the empty string, which already represents a data
    set with no value. It would be possible to reserve a character or
    string to represent this data set, which would then have to be
    escaped when it is used as data.
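
    The collision can be shown with a tiny sketch. It assumes None
    models an undefined value and that the empty string otherwise
    stands for a data set without items; encode_sketch is a
    hypothetical helper without escaping, not the draft's algorithm.

```python
def encode_sketch(items, sep=";"):
    """Join (name, value) pairs; a value of None means 'undefined'.

    Hypothetical helper for illustration only: no escaping is done.
    """
    return sep.join(
        name if value is None else f"{name}={value}"
        for name, value in items
    )

# A data set without any items.
no_items = encode_sketch([])

# A single item: empty name, undefined value.
empty_name = encode_sketch([("", None)])

# Both produce the same output, so a decoder cannot tell them apart.
print(repr(no_items), repr(empty_name), no_items == empty_name)
```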
    
  * I'm using XForms, so I can't use this format anyway.

    Instead of Internet media types, XForms uses QNames to specify the
    serialization format, and the XForms specification currently lacks
    such an identifier for application/www-form-urlencoded. It would
    be possible to define XForms extensions to identify this format,
    and XForms implementations are free to do so. The draft does not
    define such a QName, as it is expected that future versions of the
    XForms specification will specify, for example,
    ietf-urlencoded-post and ietf-urlencoded-get, in which case such a
    definition in the application/www-form-urlencoded specification
    would be redundant.

  * The Compatibility considerations are terrible!

    I know! Please propose something better, or suggest how to
    improve the current text.

[1] http://ietfreport.isoc.org/idref/draft-hoehrmann-urlencoded/

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Thursday, 28 September 2006 17:39:28 UTC