Re: #173: CR and LF in chunk extension values

Henrik Nordstrom wrote:
> Should probably change topic here, but it's still relevant so keeping
> the issue topic. Most of this is taking a more generic view of
> quoted-pair, not isolated to chunk extension values.
> 
> On Tue 2009-08-25 at 09:11 +0200, Julian Reschke wrote:
> 
>> quoted-pair is also used in comments. Are we ok with restricting the set 
>> here as well? And, if yes, shouldn't we then also adjust the allowed set 
>> for non-quoted characters in comments?
> 
> What? Restricting how? I thought we were talking about restricting the
> use of CTLs?

Yes. I wanted to confirm that we do that for quoted-strings *and* 
comments. Do we?

> Now some further rambling on the use of quoted-pair and the difficulties
> this causes for parsers:
> 
> 
> qdtext is for text within a quoted-string, and MUST NOT include '"' or
> '\'. Those two must be produced as quoted-pair to be used within a
> quoted-string.
> 
>     qdtext         = OWS / %x21 / %x23-5B / %x5D-7E / obs-text
>                    ; OWS / <VCHAR except DQUOTE and "\"> / obs-text 
> 
> ctext is the same but for comment, and MUST NOT include '(', ')' or '\'.
> Those three must be produced as quoted-pair to be used within a comment.
> 
>     ctext          = OWS / %x21-27 / %x2A-5B / %x5D-7E / obs-text 
>                    ; OWS / <VCHAR except "(", ")", and "\"> / obs-text
> 
> Neither qdtext nor ctext allows for CTLs, except for HT or obsoleted
> CRLF folding (from OWS).

Yes. But quoted-string and comment allow quoted-pair which currently 
does allow CTLs.
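
To make the two character sets concrete, here is a minimal Python sketch of
the qdtext and ctext ranges quoted above. The function and constant names are
made up for illustration, and it only classifies single octets, so the
obsolete CRLF folding part of OWS is not handled.

```python
# Octet ranges taken from the ABNF quoted above; names are illustrative only.
OWS_OCTETS = ((0x09, 0x09), (0x20, 0x20))   # HT and SP (obs-fold not handled)
OBS_TEXT = ((0x80, 0xFF),)

QDTEXT = OWS_OCTETS + ((0x21, 0x21), (0x23, 0x5B), (0x5D, 0x7E)) + OBS_TEXT
CTEXT = OWS_OCTETS + ((0x21, 0x27), (0x2A, 0x5B), (0x5D, 0x7E)) + OBS_TEXT


def _in_ranges(octet, ranges):
    """Return True if the octet falls inside any (low, high) range."""
    return any(low <= octet <= high for low, high in ranges)


def is_qdtext(octet):
    """True if the octet may appear unescaped inside a quoted-string."""
    return _in_ranges(octet, QDTEXT)


def is_ctext(octet):
    """True if the octet may appear unescaped inside a comment."""
    return _in_ranges(octet, CTEXT)
```

With these ranges, `"` and `\` fail `is_qdtext`, `(`, `)` and `\` fail
`is_ctext`, and every CTL other than HT fails both.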

> The specification (2616) is very strict about where quoted-pair is allowed
> to be used, but at the same time very subtle about where those areas are,
> creating a large grey area where parsing is somewhat non-obvious.
> 
> It's the same question as has been raised earlier regarding comments. A
> construct looking like a comment is only a comment if the header in
> question is defined to allow comments, if not it's literally part of the
> header value.
> 
> Quoted-string is also only quoted-string if the header in question is
> defined to accept quoted-string, if not it may be a literal part of the
> header value even if it may look like a quoted-string (for a header
> defined as taking *TEXT as value, 2616 has no such headers however)
> 
> RFC2616 BNF and relevant comments:
> 
>       generic-message = start-line
>                         *(message-header CRLF)
>                         CRLF
>                         [ message-body ]
>        message-header = field-name ":" [ field-value ]
>        field-name     = token
>        field-value    = *( field-content | LWS )
>        field-content  = <the OCTETs making up the field-value
>                         and consisting of either *TEXT or combinations
>                         of token, separators, and quoted-string>
> 
>        TEXT           = <any OCTET except CTLs,
>                         but including LWS>
> 
>    A CRLF is allowed in the definition of TEXT only as part of a header
>    field continuation.
> 
>    Comments can be included in some HTTP header fields by surrounding
>    the comment text with parentheses. Comments are only allowed in
>    fields containing "comment" as part of their field value definition.
>    In all other fields, parentheses are considered part of the field
>    value.
> 
>        comment        = "(" *( ctext | quoted-pair | comment ) ")"
>        ctext          = <any TEXT excluding "(" and ")">
> 
> The allowable characters in *TEXT overlap completely with those of token,
> separators and quoted-string, except that *TEXT does not allow CTLs other
> than LWS (HT), and within *TEXT the '\' character has no special meaning.
> 
> Which means that to properly parse '\' quoted constructs one must know
> in detail every header processed in order to know if the '\' is quoting
> the next character or if it's just a literal '\'.

Yes.
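
A small Python sketch of that dependency, assuming a hypothetical per-header
flag that says whether the field grammar includes quoted-string. The helper
names are invented for this example; nothing here comes from an existing
parser.

```python
def unquote(raw: bytes) -> bytes:
    """Decode one quoted-string: drop the DQUOTEs and resolve quoted-pairs."""
    if len(raw) < 2 or raw[:1] != b'"' or raw[-1:] != b'"':
        raise ValueError("not a quoted-string")
    out = bytearray()
    i = 1
    end = len(raw) - 1          # index of the closing DQUOTE
    while i < end:
        octet = raw[i]
        if octet == 0x5C:       # backslash starts a quoted-pair
            i += 1
            if i >= end:
                raise ValueError("backslash escapes the closing DQUOTE")
            out.append(raw[i])
        elif octet == 0x22:     # bare DQUOTE inside the string
            raise ValueError("unescaped DQUOTE inside quoted-string")
        else:
            out.append(octet)
        i += 1
    return bytes(out)


def interpret(raw: bytes, allows_quoted_string: bool) -> bytes:
    """Hypothetical dispatch: only unquote when the header grammar says so."""
    return unquote(raw) if allows_quoted_string else raw
```

The same octets `"a\\b"` come out as `a\b` when the header is defined to take
a quoted-string, and stay untouched, backslashes and quotes included, when it
is defined as plain *TEXT.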

> Because of this it's important that the overall message parsing is the
> same regardless if quoted-pair is processed or not, only producing
> slightly different results in the raw header value. Or put in other
> words, it needs to be possible to completely defer quoting and comment
> processing until the header value as such is examined in detail, with
> general message parsing using *TEXT for all header values. And for chunk
> headers *TEXT minus folding for the general message format, only needing
> to dive into quoting etc when eventually processing the chunk extension
> values (if at all).
> 
> 
> Regarding the allowable characters there imho is absolutely no need to
> allow for control characters anywhere in HTTP headers or chunk headers,
> quoted or not, and it's additionally very very likely many parsers will
> fail on such constructs making them quite non-interoperable.

Agreed.
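
As a rough illustration of the deferred approach, the following Python sketch
does a first pass that splits a header line and rejects CTLs other than HT,
without interpreting quotes, backslashes or parentheses at all. The function
names are hypothetical, field-name is not validated as a token, and line
folding is assumed to have been removed already.

```python
def is_field_content(value: bytes) -> bool:
    """True if every octet is HT, SP, VCHAR or obs-text (no other CTLs)."""
    return all(o == 0x09 or 0x20 <= o <= 0x7E or o >= 0x80 for o in value)


def split_header(line: bytes):
    """First-pass split of 'name: value'; quoting is left for later."""
    name, sep, value = line.partition(b":")
    if not sep or not name:
        raise ValueError("missing field-name or colon")
    value = value.strip(b" \t")
    if not is_field_content(value):
        raise ValueError("control character in field-value")
    # The raw value is returned as-is; a header-specific parser decides
    # later whether '\', '"', '(' and ')' carry any special meaning.
    return name, value
```

A value such as `"a\"b" (note)` passes through unchanged, while any value
carrying NUL, CR, LF or another CTL is rejected before header-specific
parsing ever runs.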

> And additionally, if restricting the allowed set of quoted characters to
> exclude \x00, NL and CR as already done in HTTPbis, then it becomes very
> questionable from a technical point of view (ignoring parsing) to allow
> the use of other CTLs in quoted form. The use of having CTLs in header
> values is very limited to begin with, basically only needed to support
> transmission of (non-UTF8) multibyte character sets or binary non-text
> data, in which case having those three excluded is already a significant
> issue for such use.

Yes.

> So imho quoted-pair should be
> 
>     quoted-text = %x09 / %x20-7E / obs-text
>                 ; WSP / VCHAR / obs-text
>     quoted-pair = "\" qchar	
> 
> to match the use of *TEXT in 2616, making comments and quoted strings
> all fit within *TEXT, as those constructs are only used in detailed forms
> which should be a subset of the more generic *TEXT.

"qchar" being...?

> This reasoning is also consistent with the current field-content
> definition using VTEXT etc..
> 
>     field-value    = *( field-content / OWS )
>     field-content  = *( WSP / VCHAR / obs-text )
> 
> This field-content definition DOES NOT allow for CTLs other than HT.
> Allowing quoted-pair to include CTLs other than HT is incompatible with
> the above (from latest p1) definition of field-content.
> 
> If you look closely you'll notice the quoted-text and field-content
> definitions above are equal. Perhaps a common term should be defined for
> that, similar to the *TEXT element used in 2616. There are probably more
> places where using said term would make sense. And sorry, no, I do not
> have a good suggested BNF name for this construct. TEXT would be
> confusing with 2616, and text in lower case too generic to be used in
> describing text. general-text?
> ...

"characters"?

Anyway, my takeaway from your analysis is: "yes, CTLs need to be 
disallowed both in comments and quoted-text", right?
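
For what it's worth, a check along those lines could look like the Python
sketch below. It assumes the restricted set proposed in the quoted text
(HT / SP / VCHAR / obs-text) and simply reports backslash escapes that fall
outside it; the function names are invented for the example.

```python
def is_allowed_quoted_octet(octet: int) -> bool:
    """Proposed set from the quoted text: HT / SP / VCHAR / obs-text."""
    return octet == 0x09 or 0x20 <= octet <= 0x7E or octet >= 0x80


def disallowed_quoted_pairs(raw: bytes) -> list:
    """Return offsets of backslashes that escape a disallowed octet."""
    offsets = []
    i = 0
    while i < len(raw):
        if raw[i] == 0x5C and i + 1 < len(raw):
            if not is_allowed_quoted_octet(raw[i + 1]):
                offsets.append(i)
            i += 2              # skip the backslash and the escaped octet
        else:
            i += 1
    return offsets
```

The scan is the same for a quoted-string and a comment, since both use the
same quoted-pair production.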

BR, julian

Received on Tuesday, 25 August 2009 12:47:56 UTC