- From: Henrik Nordstrom <henrik@henriknordstrom.net>
- Date: Tue, 25 Aug 2009 14:21:05 +0200
- To: Julian Reschke <julian.reschke@gmx.de>
- Cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>
Should probably change topic here, but it's still relevant so keeping the issue topic. Most of this is taking a more generic view of quoted-pair, not isolated to chunk extension values.

On Tue, 2009-08-25 at 09:11 +0200, Julian Reschke wrote:

> quoted-pair is also used in comments. Are we ok with restricting the set
> here as well? And, if yes, shouldn't we then also adjust the allowed set
> for non-quoted characters in comments?

What? Restricting how? I thought we were talking about restricting the use of CTLs?

Now some further rambling on the use of quoted-pair and the difficulties this causes for parsers:

qdtext is for text within a quoted-string, and MUST NOT include '"' or '\'. Those two must be produced as quoted-pair to be used within a quoted-string.

  qdtext = OWS / %x21 / %x23-5B / %x5D-7E / obs-text
         ; OWS / <VCHAR except DQUOTE and "\"> / obs-text

ctext is the same but for comment, and MUST NOT include '(', ')' or '\'. Those three must be produced as quoted-pair to be used within a comment.

  ctext = OWS / %x21-27 / %x2A-5B / %x5D-7E / obs-text
        ; OWS / <VCHAR except "(", ")", and "\"> / obs-text

Neither qdtext nor ctext allows for CTLs, except for HT or obsoleted CRLF folding (from OWS).

The specification (2616) is very strict on where quoted-pair is allowed to be used, but at the same time very subtle about where those areas are, creating a large grey area where parsing is somewhat non-obvious.

It's the same question as has been raised earlier regarding comments. A construct looking like a comment is only a comment if the header in question is defined to allow comments; if not, it's literally part of the header value.
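To make the two character sets concrete, here is a minimal sketch of the qdtext and ctext rules quoted above as per-octet checks. It rests on assumptions that are not in this message: obs-text is taken to be %x80-FF, and the OWS alternative is reduced to a single SP or HT (obsolete line folding is ignored). The function names are made up for illustration.

```python
# Per-octet checks for the qdtext / ctext ABNF quoted above.
# Assumptions: obs-text = %x80-FF; OWS reduced to one SP or HT (no folding).

def is_wsp(b: int) -> bool:
    """SP or HT."""
    return b in (0x09, 0x20)

def is_obs_text(b: int) -> bool:
    """Assumed definition of obs-text: any octet with the high bit set."""
    return 0x80 <= b <= 0xFF

def is_qdtext(b: int) -> bool:
    """VCHAR except DQUOTE (0x22) and backslash (0x5C), plus WSP and obs-text."""
    return (is_wsp(b) or b == 0x21 or 0x23 <= b <= 0x5B
            or 0x5D <= b <= 0x7E or is_obs_text(b))

def is_ctext(b: int) -> bool:
    """VCHAR except "(" (0x28), ")" (0x29) and backslash, plus WSP and obs-text."""
    return (is_wsp(b) or 0x21 <= b <= 0x27 or 0x2A <= b <= 0x5B
            or 0x5D <= b <= 0x7E or is_obs_text(b))
```

Note that '(' is valid qdtext and '"' is valid ctext, so the two sets differ only in which delimiters they exclude; neither accepts a CTL other than HT.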
Quoted-string is also only quoted-string if the header in question is defined to accept quoted-string; if not, it may be a literal part of the header value even if it looks like a quoted-string (for a header defined as taking *TEXT as value; 2616 has no such headers, however).

RFC2616 BNF and relevant comments:

  generic-message = start-line
                    *(message-header CRLF)
                    CRLF
                    [ message-body ]

  message-header = field-name ":" [ field-value ]
  field-name     = token
  field-value    = *( field-content | LWS )
  field-content  = <the OCTETs making up the field-value
                   and consisting of either *TEXT or combinations
                   of token, separators, and quoted-string>

  TEXT           = <any OCTET except CTLs,
                   but including LWS>

  A CRLF is allowed in the definition of TEXT only as part of a header field continuation.

  Comments can be included in some HTTP header fields by surrounding the comment text with parentheses. Comments are only allowed in fields containing "comment" as part of their field value definition. In all other fields, parentheses are considered part of the field value.

  comment        = "(" *( ctext | quoted-pair | comment ) ")"
  ctext          = <any TEXT excluding "(" and ")">

The allowable characters in *TEXT overlap completely with those of token, separators and quoted-string, except that *TEXT does not allow CTLs other than LWS (HT), and within *TEXT the '\' character has no special meaning.

This means that to properly parse '\' quoted constructs one must know in detail every header processed, in order to know whether the '\' is quoting the next character or is just a literal '\'.

Because of this it's important that the overall message parsing is the same regardless of whether quoted-pair is processed or not, only producing slightly different results in the raw header value. Or put in other words, it needs to be possible to completely defer quoting and comment processing until the header value as such is examined in detail, with general message parsing using *TEXT for all header values.
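The deferral described above could look roughly like the following sketch. It is only an illustration under stated assumptions: split_header and unquote are hypothetical names, the header line is assumed to be already unfolded, and the character-set checks on the quoted content are left out. The generic step never looks at '\'; only a caller that knows the header is defined to carry a quoted-string runs the second step.

```python
from typing import Tuple

def split_header(line: str) -> Tuple[str, str]:
    """Generic step: split at the first ':' and keep the value raw (*TEXT).

    No quoting or comment processing happens here.
    """
    name, _, value = line.partition(":")
    return name.strip(), value.strip()

def unquote(raw: str) -> str:
    """Header-specific step: decode one quoted-string, resolving quoted-pair."""
    if len(raw) < 2 or raw[0] != '"' or raw[-1] != '"':
        raise ValueError("not a quoted-string")
    out = []
    chars = iter(raw[1:-1])
    for ch in chars:
        if ch == "\\":
            # quoted-pair: the character after the backslash is taken literally
            ch = next(chars, None)
            if ch is None:
                raise ValueError("dangling backslash")
        elif ch == '"':
            raise ValueError("unescaped DQUOTE inside quoted-string")
        out.append(ch)
    return "".join(out)

# "X-Example" is a made-up header name used only for this illustration.
name, value = split_header('X-Example: "a\\"b"')
```

Here value is still the raw text "a\"b" with its backslash intact. A header defined as plain *TEXT would use it as is, while a header defined to take a quoted-string would call unquote(value) and obtain a"b.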
Likewise for chunk headers: general message parsing can use *TEXT minus folding, only needing to dive into quoting etc. when eventually processing the chunk extension values (if at all).

Regarding the allowable characters, there is imho absolutely no need to allow control characters anywhere in HTTP headers or chunk headers, quoted or not, and it's additionally very likely that many parsers will fail on such constructs, making them quite non-interoperable.

Additionally, if the allowed set of quoted characters is restricted to exclude \x00, NL and CR, as already done in HTTPbis, then it becomes very questionable from a technical point of view (ignoring parsing) to allow the use of other CTLs in quoted form. The use of having CTLs in header values is very limited to begin with, basically only needed to support transmission of (non-UTF8) multibyte character sets or binary non-text data, in which case having those three excluded is already a significant issue for such use.

So imho quoted-pair should be

  quoted-text = %x09 / %x20-7E / obs-text
              ; WSP / VCHAR / obs-text
  quoted-pair = "\" quoted-text

to match the use of *TEXT in 2616, making comments and quoted strings all fit within *TEXT, as those constructs are only used in detailed forms which should be a subset of the more generic *TEXT.

This reasoning is also consistent with the current field-content definition using VCHAR etc.:

  field-value   = *( field-content / OWS )
  field-content = *( WSP / VCHAR / obs-text )

This field-content definition DOES NOT allow for CTLs other than HT. Allowing quoted-pair to include CTLs other than HT is incompatible with the above (from latest p1) definition of field-content.

If you look closely you'll notice the quoted-text and field-content definitions above are equal. Perhaps a common term should be defined for that, similar to the *TEXT element used in 2616. There are probably more places where using said term would make sense.

And sorry, no, I do not have a good suggested BNF name for this construct.
TEXT would be confusing with 2616, and text in lower case is too generic to be used in describing text. general-text?

Regards
Henrik
Received on Tuesday, 25 August 2009 12:21:54 UTC