- From: Jamie Lokier <jamie@shareable.org>
- Date: Thu, 8 Jul 2004 02:31:26 +0100
- To: Stephan Hesmer <shesmer@apache.org>
- Cc: ietf-http-wg@w3.org
Stephan Hesmer wrote:
> does anybody know where the query string format is defined ? I do not
> mean which chars are allowed and which ones are reserved. What I mean
> is, where is the format ?name=value&name=value ... defined e.g. as a
> BNF? I searched through all related RFCs even through CGI and could not
> find it. I found a lot of references to query string, but they all say
> more or less that first it depends on the scheme or even url or second
> only define the allowed chars.

The general format of a query string can be anything an application
wants, as long as it only uses the allowed characters. It isn't
restricted to the format ?name=value&name=value.

Therefore it would be wrong for a web server to reject query strings
which don't conform to that syntax. (Anyway, ISINDEX queries, although
they aren't used any more, have a different syntax: ?value).

If you are interacting with HTML forms, and if they are submitted using
the HTTP GET method, then the format of the query string is called
"application/x-www-form-urlencoded". I.e. it's identical (after the
"?") to the string sent with an HTTP POST using that MIME type.

The exact set of characters which should be %-escaped has varied due to
the change from RFC 1738 to RFC 2396. Thus there is a mix of clients
and servers using different sets. Anyway, they don't follow the rules
strictly: they tend to be conservative when sending and encoding, and
lenient when receiving and decoding.

Standards
---------

RFC 1866 (HTML 2.0), section 8.2.1, "The form-urlencoded Media Type".

REC-html-401 (HTML 4.01), section 17.13.3, "Processing form data",
step 4:

    If the method is "get" and the action is an HTTP URI, the user
    agent takes the value of action, appends a `?' to it, then
    appends the form data set, encoded using the
    "application/x-www-form-urlencoded" content type. The user agent
    then traverses the link to this URI. In this scenario, form data
    are restricted to ASCII codes.
Section 17.13.4, "Form content types",
application/x-www-form-urlencoded:

    1. Control names and values are escaped. Space characters are
       replaced by `+', and then reserved characters are escaped as
       described in [RFC1738], section 2.2: Non-alphanumeric
       characters are replaced by `%HH', a percent sign and two
       hexadecimal digits representing the ASCII code of the
       character. Line breaks are represented as "CR LF" pairs
       (i.e., `%0D%0A').

    2. The control names/values are listed in the order they appear
       in the document. The name is separated from the value by `='
       and name/value pairs are separated from each other by `&'.

CGI 1.1 and CGI 1.2 drafts, section 3.1, "URL Encoding" have something
similar but less precise.

Generating query strings
------------------------

1. Encode character strings into octet strings.

   If form control names and values consist only of ASCII characters,
   this is trivial. Otherwise, see the section on non-ASCII characters
   in query strings below, and encode characters to octets accordingly.

2. Line breaks for multi-line values should be encoded as CR LF pairs.

3. Replace some octets with %-escaped equivalents.

   Bare essentials: ";", "?", "&", "=", "+" and "%" must be %-escaped.
   These are reserved characters in form-encoding and/or generic URI
   syntax in the form-encoding context. (Technically ";" and "?"
   aren't, but not escaping these will break some servers).

   Octets outside the ASCII non-control range (32-126) must be
   %-escaped.

   "<", ">", "#", <">, "{", "}", "|", "\", "^", "[", "]" and "`"
   should be %-escaped, as these are not allowed in URIs.

   "/", ":", "@", "$" and "," should be %-escaped, as these are the
   other "reserved" characters of generic URI syntax, although they
   aren't reserved in this context. Most (perhaps all) servers accept
   these without %-escaping, but it is sensible to escape them anyway.
   "/" is significant because some old relative URI resolvers don't
   behave properly if it appears in a query string.

   "~" should be %-escaped because it was not permitted by RFC 1738,
   the old URI syntax. Although that's superseded, you never know,
   there might be an ancient server application which is so strict it
   refuses "~". Also the CGI 1.2 draft standard still refers to
   RFC 1738, and thus requires "~" to be %-escaped in a query string.

   I prefer to %-escape "!", "*", "'", "(", ")" as this is convenient
   for users who cut and paste URIs on a shell command line -- and
   because they are common delimiters in text messages.

   Altogether that means the only characters which are _not_ %-escaped
   (according to my suggestions) are "-", "_", "." and ASCII
   alphanumerics.

   (In an experiment, Mozilla 1.2 almost agrees with me: it %-escapes
   all characters except "-", "_", ".", "*" and ASCII alphanumerics).

4. Convert space characters to "+".

5. Join the encoded names and values into "="-separated pairs, as
   "name=value".

6. Join the pairs separated by "&". When it's known that it will work
   (and only then), ";" can be used.

   If the names and values are from an HTML form, the name-value pairs
   should be joined in the order the controls appear in the original
   form.

Although some servers permit name=value pairs to be separated by ";",
and RFC 1866 (HTML 2.0) encourages that, many servers don't treat it
as a separator. It's not a standard requirement. So machine-generated
URIs to a server should always join pairs with "&" unless it's known
that the target server supports the ";" form.

The ";" form results in more compact and readable HTML (because "&" is
written as "&amp;" in href and src attributes), so it's ok for servers
to generate the ";" form in links referring back to the same server, if
the server does parse that appropriately.

Parsing query strings
---------------------

The de facto behaviour of much server query processing is:

1. Split the URI at the first occurrence of "?", and take the second
   part as the query string if there is one.

   If there is more than one "?", split only at the first one. "?" is
   permitted in query strings now (see RFC 2396), although
   form-encoding should have %-escaped it.

2. Split the query part at "&". If you like, split at ";" as well.
   Some servers do, some don't. (Note that microsoft.com and
   google.com do _not_ split at ";").

3. For each sub-sequence look for "name=value": i.e. split each
   sub-sequence at the first occurrence of "=".

   If there is more than one "=", split only at the first one.

4. In each name and value, convert each occurrence of "+" to " "
   (space).

5. %-unescape each name and value, by mapping %<HEX><HEX> to octets.

   (I'm not sure if +-conversion and %-unescaping of the name parts is
   consistent among different client implementations. I've never
   tested, and only ever seen ASCII alphanumeric names in use.)

5b. Simultaneous with 5, you may map %u<HEX><HEX><HEX><HEX> to octets
    representing that Unicode character. This is non-standard, but
    some old popular client software generates this form.

    If this is done, it should be done in the same pass as 5, not as a
    separate string scan. The uppercase "%U" form is not used.

    Note that your interpretation of "octets representing that Unicode
    character" will often be UTF-8 but need not be, which complicates
    matters.

    For reference, microsoft.com unescapes these but google.com does
    not. It may seem logical to decode UTF-16 surrogate pairs, although
    microsoft.com doesn't and I don't know if those clients ever
    generated them.

6. Interpret the resulting octet strings as character strings.

   If the octets are all in the range 32-126, they are usually
   trivially ASCII. Otherwise it's more complicated (see the section on
   non-ASCII characters in query strings below).

7. CR LF sequences are line breaks, so for forms with multi-line
   inputs it may be appropriate to convert these to LF or whatever is
   used in the application.
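As a rough illustration, the generating steps 1-6 and the parsing steps
1-5 above might be sketched in Python as follows. The function names are
made up for this example and are not from any standard or library. The
escaping set is the conservative one suggested above, and the
non-standard %u form (5b) is deliberately not handled. Skipping empty
sub-sequences (as in "a=1&&b=2") is an assumption of this sketch, not
something the steps above specify.

```python
import re

# Only "-", "_", "." and ASCII alphanumerics are left unescaped,
# following the conservative suggestion in the text.
UNESCAPED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_."
)

def escape_component(text, charset="utf-8"):
    """Generating steps 1-4 for a single name or value."""
    # Step 2: normalise every line break to CR LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    out = []
    for octet in text.encode(charset):      # step 1: characters to octets
        if octet in UNESCAPED:
            out.append(chr(octet))
        elif octet == 0x20:
            out.append("+")                 # step 4: space becomes "+"
        else:
            out.append("%%%02X" % octet)    # step 3: %-escape the rest
    return "".join(out)

def build_query(pairs, charset="utf-8"):
    """Generating steps 5-6: join as name=value pairs separated by "&"."""
    return "&".join(
        escape_component(name, charset) + "=" + escape_component(value, charset)
        for name, value in pairs
    )

_PERCENT_HEX = re.compile(rb"%([0-9A-Fa-f]{2})")

def unescape_component(raw):
    """Parsing steps 4-5: "+" to space, then %<HEX><HEX> to octets.

    A "%" not followed by two hex digits is left unchanged (lenient).
    """
    raw = raw.replace(b"+", b" ")
    return _PERCENT_HEX.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

def parse_query(uri, split_semicolon=False):
    """Parsing steps 1-5; returns a list of (name, value) octet strings."""
    _, found, query = uri.partition(b"?")   # step 1: first "?" only
    if not found:
        return []
    separator = rb"[&;]" if split_semicolon else rb"&"
    pairs = []
    for part in re.split(separator, query):  # step 2
        if not part:
            continue                         # assumption: skip empty parts
        name, _, value = part.partition(b"=")  # step 3: first "=" only
        pairs.append((unescape_component(name), unescape_component(value)))
    return pairs
```

For example, build_query([("q", "a b&c")]) gives "q=a+b%26c", and
parse_query(b"/s?foo=bar=hello??") gives [(b"foo", b"bar=hello??")].
The parser returns octet strings on purpose, leaving step 6 (choosing
a character encoding) to the caller.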
The following is not necessarily the behaviour of most servers, but
rather suggestions based on my studies:

"Reserved" characters are allowed in a URI query string (see RFC 2396),
so servers which check the query string generically should permit those
even unescaped. This includes "?" and "=", e.g. a query string like
this is technically valid URI syntax: "?foo=bar=hello??".

For perspective, both google.com and microsoft.com accept a string like
that, decoding the name as "foo" and value as "bar=hello??". It might
not be technically valid form-encoding, but it's accepted.

A URI query string is not the same as a form-encoded string. The part
of a server which parses form-encoded syntax could be strict and reject
any characters which should be %-escaped and aren't. However this is
unwise, because the exact set which should be escaped isn't very clear,
it changed between RFC 1738 and RFC 2396, and programs (and people)
vary in which URI characters they escape. The tradition is to be
lenient, perhaps even more so than with octets or %-escapes in the
path.

"%" which is not followed by two hex digits is strictly invalid URI
syntax. However many servers, including microsoft.com and google.com,
accept it without complaint.

Summary: altogether, this means simply convert "+" to space and
%-unescape %<HEX><HEX> (and maybe %u<HEX><HEX><HEX><HEX> if you want),
to make an octet string. Leave octets which don't match these patterns
unchanged. Then interpret the octet string as a character string.

For security, it makes sense to reject a zero octet (whether the result
of %-unescaping or not), and if decoding as UTF-8 (see below), any
non-minimal UTF-8 sequence should be rejected, as well as code points
in the Unicode UTF-16 surrogate range and the BOMs (U+FEFF and U+FFFE).
These are all characters likely to be misinterpreted by application
code in ways which are subtle enough to bypass security checks.
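The rejection checks just described could look something like this in
Python, assuming the octet string has already been +-converted and
%-unescaped and that UTF-8 is the intended encoding. The function name
is invented for the example. It relies on Python's strict UTF-8 decoder,
which refuses non-minimal sequences and encoded surrogates by itself;
the zero octet and the two BOM code points are checked explicitly.

```python
def decode_checked(octets):
    """Decode unescaped octets as UTF-8, or return None to signal rejection."""
    # A zero octet is rejected whether or not it came from %00.
    if b"\x00" in octets:
        return None
    try:
        # Strict decoding raises on non-minimal (overlong) sequences and
        # on code points in the UTF-16 surrogate range.
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # The plain "utf-8" codec keeps U+FEFF and U+FFFE, so test for them.
    if "\ufeff" in text or "\ufffe" in text:
        return None
    return text
```

A caller that prefers not to reject outright can treat the None result
as the cue to substitute a harmless character or to keep the original
%-sequences.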
An alternative to rejection would be to map all of these to a
relatively harmless character such as "?" or the Unicode replacement
character U+FFFD, or to leave them as %-sequences. (The latter is what
I do).

Similar filtering or rejection should apply to path characters (with
the addition of %-escaped "/" and ".").

However, most servers don't do these things.

Non-ASCII characters in query strings
-------------------------------------

For perfect standard compliance the name and value octet strings, after
+-conversion and %-unescaping, are supposed to consist of ASCII codes
only, but there are very many applications where they don't.

Unfortunately the submission of non-ASCII characters in form-encoded
URIs is not standardised, although it's widely implemented and a lot of
services depend on it.

The de facto behaviour of modern web browsers is to encode characters
to octets in the form document's character encoding prior to %-escaping
for the URI.

Some modern browsers will use one of the values in the HTML
"accept-charset" attribute instead if that's defined.

Older browsers don't always do either, and may use the browser's
"default charset" or something like that.

Because of possible variation, applications which depend on
international characters in form input are most robust if they include
a hidden form field with some text that, when returned to the server,
reveals the encoding used. However, that is becoming less necessary,
provided you can depend on people using modern clients.

Web browsers typically represent characters which cannot be encoded in
the chosen charset as "?" or as HTML-style numeric escapes like
"&#1234;". I think the latter is gaining popularity. Note that neither
of those is distinguishable from a form value which really contains
those strings.
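One possible way to use the hidden-field idea mentioned above is
sketched here in Python. Everything specific in it is an assumption
made for illustration: the probe text, the candidate charset list and
the function name are not taken from any standard. The server puts a
known string in a hidden field, then tries each candidate charset on
the unescaped octets that come back until one reproduces that string.

```python
# Hypothetical probe: e-acute plus the euro sign, chosen because the
# candidate charsets below all encode this pair to different octets.
PROBE_TEXT = "\u00e9\u20ac"
CANDIDATE_CHARSETS = ("utf-8", "windows-1252", "iso-8859-15")

def detect_charset(probe_octets, candidates=CANDIDATE_CHARSETS):
    """Return the first charset that decodes the probe correctly, else None."""
    for charset in candidates:
        try:
            if probe_octets.decode(charset) == PROBE_TEXT:
                return charset
        except UnicodeDecodeError:
            continue  # these octets are not valid in this charset
    return None
```

If the browser could not encode the probe at all and sent "?" or a
numeric escape instead, no candidate matches and the result is None, so
the application can fall back to a default or reject the submission.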
If an application depends on getting exactly the international
characters entered by the user, then it's best if the form page's
document is encoded in UTF-8 in the first place, so that the browser
will most likely encode the form data in UTF-8, in which every
character can be encoded. (Language tagging is still absent, which is
an issue for some CJK users due to identical Unicode values being used
for characters which are drawn differently in different languages).

Hope this explains enough!

-- Jamie
Received on Wednesday, 7 July 2004 21:31:32 UTC