Re: Query String format still undefined ? from Jamie Lokier on 2004-07-08 (ietf-http-wg@w3.org from July to September 2004)

From: Jamie Lokier <jamie@shareable.org>
Date: Thu, 8 Jul 2004 02:31:26 +0100
To: Stephan Hesmer <shesmer@apache.org>
Cc: ietf-http-wg@w3.org
Message-ID: <20040708013126.GC17266@mail.shareable.org>

Stephan Hesmer wrote:
> does anybody know where the query string format is defined ? I do not 
> mean which chars are allowed and which ones are reserved. What I mean 
> is, where is the format ?name=value&name=value ... defined e.g. as a 
> BNF? I searched through all related RFCs even through CGI and could not 
> find it. I found a lot of references to query string, but they all say 
> more or less that first it depends on the scheme or even url or second 
> only define the allowed chars.

The general format of a query string can be anything an application
wants, as long as it only uses the allowed characters.  It isn't
restricted to the format ?name=value&name=value.

Therefore it would be wrong for a web server to reject query strings
which didn't conform to that syntax.  (Anyway, ISINDEX queries,
although they aren't used any more, have a different syntax: ?value).

If you are interacting with HTML forms, and if they are submitted
using the HTTP GET method, then the format of the query string is
called "application/x-www-form-urlencoded".  I.e. it's identical
(after the "?") to the string sent with an HTTP POST using that MIME type.
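
As a rough illustration of that equivalence, here is a small Python
sketch using the standard library's urllib.parse.urlencode.  The field
names and the example.com action URI are made up for the example; the
same encoded string serves as the GET query and as the POST body.

```python
from urllib.parse import urlencode

# Hypothetical form fields; urlencode encodes characters as UTF-8 by default.
fields = [("name", "J\u00fcrgen M"), ("q", "a&b=c")]
encoded = urlencode(fields)

# GET: the encoded form data follows the "?" of the (hypothetical) action URI.
get_uri = "http://example.com/search?" + encoded

# POST: the very same string is the request body, sent with
# Content-Type: application/x-www-form-urlencoded.
post_body = encoded.encode("ascii")
```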

The exact set of which characters should be %-escaped has varied due
to the change from RFC 1738 to RFC 2396.  Thus there is a mix of clients
and servers using different sets.  Anyway, they don't follow the rules
strictly: they tend to be conservative with sending and encoding, and
lenient when receiving and decoding.


Standards
---------

RFC 1866 (HTML 2.0), section 8.2.1, "The form-urlencoded Media Type".

REC-html-401 (HTML 4.01), section 17.13.3, "Processing form data", step 4:

    If the method is "get" and the action is an HTTP URI, the user agent
    takes the value of action, appends a `?' to it, then appends the form
    data set, encoded using the "application/x-www-form-urlencoded"
    content type. The user agent then traverses the link to this URI. In
    this scenario, form data are restricted to ASCII codes.

Section 17.13.4, "Form content types", application/x-www-form-urlencoded:

    1. Control names and values are escaped.  Space characters are
       replaced by `+', and then reserved characters are escaped as
       described in [RFC1738], section 2.2: Non-alphanumeric
       characters are replaced by `%HH', a percent sign and two
       hexadecimal digits representing the ASCII code of the
       character. Line breaks are represented as "CR LF" pairs (i.e.,
       `%0D%0A').
 
    2. The control names/values are listed in the order they appear in
       the document. The name is separated from the value by `=' and
       name/value pairs are separated from each other by `&'.

CGI 1.1 and CGI 1.2 drafts, section 3.1, "URL Encoding" have something
similar but less precise.


Generating query strings
------------------------

1. Encode character strings into octet strings.

   If form control names and values consist only of ASCII characters,
   this is trivial.  Otherwise, see the section below on non-ASCII
   characters in query strings, and encode characters to octets
   accordingly.

2. Line breaks for multi-line values should be encoded as CR LF pairs.

3. Replace some octets with %-escaped equivalents.

   Bare essentials: ";", "?", "&", "=", "+" and "%" must be %-escaped.
   These are reserved characters in form-encoding and/or generic URI
   syntax in the form-encoding context.  (Technically ";" and "?"
   aren't, but not escaping these will break some servers).

   Octets outside the ASCII non-control range (32-126) must be %-escaped.

   "<", ">", "#", <">, "{", "}", "|", "\", "^", "[", "]" and "`"
   should be %-escaped, as these are not allowed in URIs.

   "/", ":", "@", "$" and "," should be %-escaped, as these are the other
   "reserved" characters of generic URI syntax, although they aren't
   reserved in this context.  Most (perhaps all) servers accept these
   without %-escaping, but it is sensible to do so.  "/" is significant
   because some old relative URI resolvers don't behave properly if
   this appears in a query string.

   "~" should be %-escaped because it was not permitted by RFC 1738, the
   old URI syntax.  Although that's superseded, you never know, there
   might be an ancient server application which is so strict it refuses
   "~".  Also the CGI 1.2 draft standard still refers to RFC 1738, and
   thus requires "~" to be %-escaped in a query string.

   I prefer to %-escape "!", "*", "'", "(", ")" as this is convenient
   for users who cut and paste URIs on a shell command line -- and
   because they are common delimiters in text messages.

   Altogether that means the only characters which are _not_ %-escaped
   (according to my suggestions) are "-", "_", "." and ASCII alphanumerics.

   (In an experiment, Mozilla 1.2 almost agrees with me: it %-escapes all
   characters except "-", "_", ".", "*" and ASCII alphanumerics).

4. Convert space characters to "+".

5. Join the encoded names and values into "="-separated pairs, as
   "name=value".

6. Join the pairs separated by "&".  When it's known that it will work
   (and only then), ";" can be used.

   If the names and values are from an HTML form, the name-value pairs
   should be joined in the order the controls appear in the original form.

   Although some servers permit name=value pairs to be separated by
   ";", and RFC 1866 (HTML 2.0) encourages that, many servers don't
   treat that as a separator.  It's not a standard requirement.  So
   machine-generated URIs to a server should always join pairs with
   "&" unless it's known that the target server supports the ";" form.

   The ";" form results in more compact and readable HTML (because "&" is
   written as "&amp;" in href and src attributes), so it's ok for servers
   to generate the ";" form in links referring back to the same server,
   if the server does parse that appropriately.
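
The generation steps above could be sketched in Python roughly like
this.  It is only a sketch under assumptions: the names
encode_component and build_query are made up, UTF-8 is assumed for
step 1, and the escape set is the conservative one suggested above
(only "-", "_", "." and ASCII alphanumerics are left bare).

```python
import re

# Octets left unescaped under the conservative suggestion above.
SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_."
)

def encode_component(text, charset="utf-8"):
    """Form-encode one name or value (steps 1 to 4)."""
    # Step 2: represent every line break as a CR LF pair.
    text = re.sub(r"\r\n|\r|\n", "\r\n", text)
    # Step 1: characters to octets (the charset is an assumption).
    octets = text.encode(charset)
    out = []
    for b in octets:
        if b in SAFE:
            out.append(chr(b))
        elif b == 0x20:
            out.append("+")            # step 4: space becomes "+"
        else:
            out.append("%%%02X" % b)   # step 3: %-escape everything else
    return "".join(out)

def build_query(pairs, sep="&"):
    """Steps 5 and 6: join names and values with "=", pairs with sep."""
    return sep.join(
        encode_component(name) + "=" + encode_component(value)
        for name, value in pairs
    )
```

Passing sep=";" gives the ";" form, which should only be used when the
target server is known to split there.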


Parsing query strings
---------------------

The de facto behaviour of much server query processing is:

1. Split the URI at the first occurrence of "?", and take the
   second part as the query string if there is one.  If there is
   more than one "?", split only at the first one.  "?" is permitted
   in query strings now (see RFC 2396), although form-encoding should
   have %-escaped it.

2. Split the query part at "&".  If you like, split at ";" as well.
   Some servers do, some don't.
   (Note that microsoft.com and google.com do _not_ split at ";").

3. For each sub-sequence look for "name=value": i.e. split each
   sub-sequence at the first occurrence of "=".  If there is more than
   one "=", split only at the first one.

4. In each name and value, convert each occurrence of "+" to " " (space).

5. %-unescape each name and value, by mapping %<HEX><HEX> to octets.
   (I'm not sure if +-conversion and %-unescaping of the name parts is
   consistent among different client implementations.  I've never tested,
   and only ever seen ASCII alphanumeric names in use.)

5b. Simultaneous with 5, you may map %u<HEX><HEX><HEX><HEX> to octets
   representing that Unicode character.  This is non-standard, but
   some old popular client software generates this form.  If this is
   done, it should be concurrent with 5, not a separate string scan.
   The uppercase "%U" form is not used.  Note that the "octet
   sequence representing that Unicode character" is often UTF-8, but
   need not be, which complicates matters.  For reference,
   microsoft.com unescapes these but google.com does not.  It may seem
   logical to decode UTF-16 surrogate pairs, although microsoft.com
   doesn't and I don't know if those clients ever generated them.

6. Interpret the resulting octet strings as character strings.

   If the octets are in the range 32-126, it is usually trivially
   ASCII.  Otherwise it's more complicated (see the section below on
   non-ASCII characters in query strings).

7. CR LF sequences are line breaks, so for forms with multi-line inputs
   it may be appropriate to convert these to LF or whatever is used
   in the application.
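
The steps above, without the optional 5b, might look like this in
Python.  This is a minimal sketch, not any particular server's code:
parse_query and unescape are made-up names, UTF-8 is assumed for
step 6, and splitting at ";" is an option that is off by default.

```python
import re

_PCT = re.compile(rb"%([0-9A-Fa-f]{2})")

def unescape(raw):
    """Steps 4 and 5: "+" to space, then %HH to octets in one scan.
    A "%" not followed by two hex digits is left unchanged."""
    raw = raw.replace(b"+", b" ")
    return _PCT.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

def parse_query(uri, split_semicolon=False, charset="utf-8"):
    # Step 1: split at the first "?" only.
    _, found, query = uri.partition("?")
    if not found:
        return []
    # Step 2: split at "&", and at ";" too if asked.
    parts = re.split("[&;]" if split_semicolon else "&", query)
    result = []
    for part in parts:
        if not part:
            continue
        # Step 3: split at the first "=" only.
        name, _, value = part.partition("=")
        decoded = []
        for item in (name, value):
            octets = unescape(item.encode(charset))
            # Step 6: octets to characters (charset is an assumption).
            text = octets.decode(charset, errors="replace")
            # Step 7: CR LF pairs become the application's line break.
            decoded.append(text.replace("\r\n", "\n"))
        result.append(tuple(decoded))
    return result
```

For example, parse_query("/search?q=a+b%26c&x=1") gives
[("q", "a b&c"), ("x", "1")].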

The following is not necessarily the behaviour of most servers, but
rather suggestions based on my studies:

"Reserved" characters are allowed in a URI query string (see
RFC 2396), so servers which check the query string generically should
permit those even unescaped.  This includes "?" and "=", e.g. a query
string like this is technically valid URI syntax: "?foo=bar=hello??".

For perspective, both google.com and microsoft.com accept a string
like that, decoding the name as "foo" and value as "bar=hello??".  It
might not be technically valid form-encoding, but it's accepted.
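
Python's standard library is one more data point: as far as I can
tell, urllib.parse treats that string the same way.  The example.com
URL is just a placeholder.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://example.com/search?foo=bar=hello??"  # placeholder URL

# urlsplit cuts at the first "?" only, so later "?" stay in the query.
query = urlsplit(url).query

# parse_qsl splits each pair at the first "=" only.
pairs = parse_qsl(query, keep_blank_values=True)
```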

A URI query string is not the same as a form-encoded string.  The part
of a server which parses form-encoded syntax could be strict and
reject any characters which should be %-escaped and aren't.  However
this is unwise, because the exact set which should be escaped isn't
very clear, and it changed between RFC 1738 and RFC 2396, and programs
(and people) vary in which URI characters they escape.  The tradition
is to be lenient, perhaps even more so than with octets or %-escapes in the path.

"%" which is not followed by two hex digits is strictly invalid URI
syntax.  However many servers, including microsoft.com and google.com,
accept it without complaint.
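
The same leniency shows up in Python's urllib.parse.unquote_plus,
which passes a malformed "%" through untouched rather than raising an
error.  The sample strings here are made up.

```python
from urllib.parse import unquote_plus

# A trailing "%", a "%" followed by non-hex, a valid escape, a short escape.
samples = ["100%", "a%zzb", "50%25+off", "%4"]
decoded = [unquote_plus(s) for s in samples]
```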

Summary:

Altogether, this means: simply convert "+" to space, %-unescape
%<HEX><HEX> (and maybe %u<HEX><HEX><HEX><HEX> if you want), to make an
octet string.  Leave octets which don't match these patterns
unchanged.  Then interpret the octet string as a character string.
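
A sketch of that summary in Python, with the optional %u form decoded
in the same scan as %HH.  The name decode_component is made up, the
%u code points are assumed to map to UTF-8 octets, and %u escapes in
the surrogate range are simply left as they are.

```python
import re

# One pattern for both escape forms, so a single scan handles them together.
_ESC = re.compile(rb"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def _replace(match):
    if match.group(1) is not None:
        code_point = int(match.group(1), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)      # surrogate: leave the escape alone
        return chr(code_point).encode("utf-8")
    return bytes([int(match.group(2), 16)])

def decode_component(raw, charset="utf-8"):
    """Turn one raw name or value (bytes) into a character string."""
    octets = _ESC.sub(_replace, raw.replace(b"+", b" "))
    return octets.decode(charset, errors="replace")
```

The single scan matters for input like "%u002541": it yields the
literal text "%41", whereas a %u pass followed by a separate %HH pass
would decode it a second time and produce "A".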

For security, it makes sense to reject a zero octet (whether the
result of %-unescaping or not), and if decoding as UTF-8 (see below),
any non-minimal UTF-8 sequence should be rejected as well as code
points in the Unicode UTF-16 surrogate range and the BOMs (U+FEFF and
U+FFFE).  These are all characters likely to be misinterpreted by
application code in ways which are subtle enough to bypass security
checks.  An alternative to rejection would be to map all of these to a
relatively harmless character such as "?" or the Unicode replacement
character U+FFFD, or to leave them as %-sequences.  (The latter is
what I do).  Similar filtering or rejection should apply to path
characters (with the addition of %-escaped "/" and ".").  However,
most servers don't do these things.
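
The rejection variant of those checks can be sketched in Python as
below, assuming the octets are meant to be UTF-8.  The function name
safe_text is made up.  Python's strict UTF-8 decoder already refuses
non-minimal sequences and encoded surrogates, so only the zero octet
and the two BOM code points need explicit tests.

```python
def safe_text(octets):
    """Return the decoded text, or None if the octets should be rejected."""
    # Zero octet, whether it came from %-unescaping or not.
    if b"\x00" in octets:
        return None
    try:
        # Strict decoding fails on non-minimal (overlong) sequences
        # and on code points in the UTF-16 surrogate range.
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # U+FEFF and U+FFFE decode without error, so check for them by hand.
    if "\ufeff" in text or "\ufffe" in text:
        return None
    return text
```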


Non-ASCII characters in query strings
-------------------------------------

For perfect standard compliance the name and value octet strings,
after +-conversion and %-unescaping, are supposed to consist of ASCII
codes only, but there are very many applications where they aren't.

Unfortunately the submission of non-ASCII characters in form-encoded
URIs is not standardised, although it's widely implemented and a lot
of services depend on it.  The de facto behaviour of modern web
browsers is to encode characters to octets in the form document's
character encoding prior to %-escaping for the URI.  Some modern
browsers will use one of the values in the HTML "accept-charset"
attribute instead if that's defined.  Older browsers don't always do
either, and may use the browser's "default charset" or something like that.

Because of possible variation, applications which depend on
international characters in form input are most robust if they include
a hidden form field with some text that, when returned to the server,
reveals the encoding used.  However, that is becoming less needed,
provided you can depend on the people using modern clients.
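
One way the hidden-field trick could work, sketched in Python.  The
probe text, the candidate list and the name detect_charset are all
assumptions for illustration: the form carries a hidden field whose
value is the probe, and the server compares the unescaped octets it
gets back with the probe encoded in each candidate charset.

```python
# Hypothetical probe: euro sign plus e-acute, which encode differently
# in each of the candidate charsets below.
PROBE = "\u20ac\u00e9"
CANDIDATES = ("utf-8", "windows-1252", "iso-8859-15")

def detect_charset(probe_octets):
    """Guess the charset from the hidden field's octets, taken after
    +-conversion and %-unescaping.  Returns None if nothing matches."""
    for name in CANDIDATES:
        try:
            if PROBE.encode(name) == probe_octets:
                return name
        except UnicodeEncodeError:
            continue   # the probe is not representable in this charset
    return None
```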

Web browsers typically represent characters which cannot be encoded as
"?" or HTML-style numeric escapes like "&#1234;".  I think the latter
is gaining popularity.  Note that neither of those is distinguishable
from a form value containing those strings.  If an application depends
on getting exactly the international characters entered by the user,
then it's best if the form page's document is encoded in UTF-8 in the
first place, so it's most likely the browser will encode the form in
UTF-8 and all characters will be encoded.  (Language tagging is still
absent, which is an issue for some CJK users due to identical Unicode
values being used for characters which are drawn differently in
different languages).
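
Python's codec error handlers happen to mimic both fallbacks, which
makes the ambiguity easy to see.  This is only an analogy to what
browsers do; the sample text and the ISO-8859-1 target are made up.

```python
text = "price \u20ac5"   # the euro sign is not in ISO-8859-1

# Fallback 1: the unencodable character becomes "?".
as_question = text.encode("iso-8859-1", errors="replace")

# Fallback 2: it becomes an HTML-style numeric escape.
as_escape = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# A user who literally typed the escape produces identical octets.
typed_literally = "price &#8364;5".encode("iso-8859-1")
```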


Hope this explains enough!
-- Jamie
Received on Wednesday, 7 July 2004 21:31:32 UTC