Re: #173: CR and LF in chunk extension values from Henrik Nordstrom on 2009-08-25 (ietf-http-wg@w3.org from July to September 2009)

From: Henrik Nordstrom <henrik@henriknordstrom.net>
Date: Tue, 25 Aug 2009 14:21:05 +0200
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>
Message-Id: <1251202865.31991.146.camel@henriknordstrom.net>
Should probably change topic here, but it's still relevant so keeping
the issue topic. Most of this is taking a more generic view of
quoted-pair, not isolated to chunk extension values.

tis 2009-08-25 klockan 09:11 +0200 skrev Julian Reschke:

> quoted-pair is also used in comments. Are we ok with restricting the set 
> here as well? And, if yes, shouldn't we then also adjust the allowed set 
> for non-quoted characters in comments?

What? Restricting how? I thought we were talking about restricting the
use of CTLs?


Now some further rambling on the use of quoted-pair and the difficulties
this causes for parsers:


qdtext is for text within a quoted-string, and MUST NOT include '"' or
'\'. Those two must be produced as quoted-pair to be used within a
quoted-string.

    qdtext         = OWS / %x21 / %x23-5B / %x5D-7E / obs-text
                   ; OWS / <VCHAR except DQUOTE and "\"> / obs-text 

ctext is the same but for comment, and MUST NOT include '(', ')' or '\'.
Those three must be produced as quoted-pair to be used within a comment.

    ctext          = OWS / %x21-27 / %x2A-5B / %x5D-7E / obs-text 
                   ; OWS / <VCHAR except "(", ")", and "\"> / obs-text

Neither qdtext nor ctext allows for CTLs, except for HT or the obsoleted
CRLF folding (both from OWS).
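
To make the two character sets concrete, here is a minimal sketch of per-octet
checks for the qdtext and ctext rules quoted above. The function names are my
own for illustration, and it assumes OWS is reduced to a single SP or HT octet
(obsolete line folding is taken to be unfolded before these checks run).

```python
def is_obs_text(b: int) -> bool:
    # obs-text: octets 0x80-0xFF
    return 0x80 <= b <= 0xFF


def is_qdtext(b: int) -> bool:
    # SP / HT (from OWS), VCHAR except DQUOTE (0x22) and "\" (0x5C), obs-text
    return (
        b in (0x09, 0x20)
        or b == 0x21
        or 0x23 <= b <= 0x5B
        or 0x5D <= b <= 0x7E
        or is_obs_text(b)
    )


def is_ctext(b: int) -> bool:
    # SP / HT (from OWS), VCHAR except "(" (0x28), ")" (0x29) and "\" (0x5C),
    # obs-text
    return (
        b in (0x09, 0x20)
        or 0x21 <= b <= 0x27
        or 0x2A <= b <= 0x5B
        or 0x5D <= b <= 0x7E
        or is_obs_text(b)
    )
```

Note that DQUOTE is plain ctext and the parentheses are plain qdtext; only '\'
is excluded from both.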

The specification (2616) is very strict about where quoted-pair is allowed
to be used, but at the same time very subtle about where those areas are,
creating a large grey area where parsing is somewhat non-obvious.

It's the same question as has been raised earlier regarding comments. A
construct looking like a comment is only a comment if the header in
question is defined to allow comments; if not, it's literally part of the
header value.

Likewise, a quoted-string is only a quoted-string if the header in
question is defined to accept quoted-string; if not, it may be a literal
part of the header value even though it looks like a quoted-string (for a
header defined as taking *TEXT as value; 2616 has no such headers, however).

RFC2616 BNF and relevant comments:

      generic-message = start-line
                        *(message-header CRLF)
                        CRLF
                        [ message-body ]
       message-header = field-name ":" [ field-value ]
       field-name     = token
       field-value    = *( field-content | LWS )
       field-content  = <the OCTETs making up the field-value
                        and consisting of either *TEXT or combinations
                        of token, separators, and quoted-string>

       TEXT           = <any OCTET except CTLs,
                        but including LWS>

   A CRLF is allowed in the definition of TEXT only as part of a header
   field continuation.

   Comments can be included in some HTTP header fields by surrounding
   the comment text with parentheses. Comments are only allowed in
   fields containing "comment" as part of their field value definition.
   In all other fields, parentheses are considered part of the field
   value.

       comment        = "(" *( ctext | quoted-pair | comment ) ")"
       ctext          = <any TEXT excluding "(" and ")">

The allowable characters in *TEXT overlap completely with those of token,
separators and quoted-string, except that *TEXT does not allow CTLs other
than LWS (HT), and within *TEXT the '\' character has no special meaning.

Which means that to properly parse '\' quoted constructs one must know
in detail every header processed in order to know if the '\' is quoting
the next character or if it's just a literal '\'.
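
As a small illustration of that dependency, the sketch below reads the same raw
value two ways: once as a header defined as *TEXT, and once as a header defined
to take a quoted-string. Both function names are hypothetical, and the
quoted-string reader is deliberately simplified (no character set checks).

```python
def as_text(raw: str) -> str:
    # Header defined as *TEXT: '"' and '\' have no special meaning,
    # so the value is taken literally.
    return raw


def as_quoted_string(raw: str) -> str:
    # Header defined to take a quoted-string: strip the surrounding
    # DQUOTEs and resolve each quoted-pair to the character it quotes.
    if len(raw) < 2 or raw[0] != '"' or raw[-1] != '"':
        raise ValueError("not a quoted-string")
    out = []
    i = 1
    end = len(raw) - 1
    while i < end:
        c = raw[i]
        if c == "\\":
            i += 1
            if i >= end:
                # The backslash quoted the closing DQUOTE: unterminated.
                raise ValueError("dangling backslash")
            out.append(raw[i])
        elif c == '"':
            raise ValueError("unquoted DQUOTE inside quoted-string")
        else:
            out.append(c)
        i += 1
    return "".join(out)


raw = r'"a\"b\\c"'
literal = as_text(raw)            # backslashes and quotes kept as-is
unquoted = as_quoted_string(raw)  # quotes stripped, quoted-pairs resolved
```

The octets on the wire are identical; only the header definition decides which
of the two results is the right one.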

Because of this it's important that the overall message parsing is the
same regardless of whether quoted-pair is processed or not, only producing
slightly different results in the raw header value. Or, put in other
words, it needs to be possible to completely defer quoting and comment
processing until the header value as such is examined in detail, with
general message parsing using *TEXT for all header values. The same goes
for chunk headers: *TEXT minus folding for the general message format,
only needing to dive into quoting etc. when eventually processing the
chunk extension values (if at all).
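
A rough sketch of what such a first, generic pass could look like. It treats
every header value as opaque *TEXT and never looks at '"', '(' or '\'. The
function name is made up for illustration, and the sketch assumes the header
block has already been separated from the start-line. It only works if a
quoted-pair can never hide a CR or LF.

```python
from typing import List, Tuple


def split_headers(block: bytes) -> List[Tuple[str, bytes]]:
    """Generic pass: split a header block into (name, raw value) pairs.

    No quoting or comment processing happens here; that is deferred to
    whoever later examines a specific header in detail.
    """
    headers: List[Tuple[str, bytes]] = []
    for line in block.split(b"\r\n"):
        if not line:
            break  # empty line ends the header block
        if line[:1] in (b" ", b"\t") and headers:
            # Obsolete line folding: append to the previous value.
            name, value = headers[-1]
            headers[-1] = (name, value + b" " + line.strip(b" \t"))
            continue
        name_bytes, sep, value = line.partition(b":")
        if not sep:
            raise ValueError("header line without ':'")
        headers.append((name_bytes.decode("ascii"), value.strip(b" \t")))
    return headers
```

Whether a '\' in one of the returned raw values is a quoted-pair or a literal
is then decided per header, after the message framing is already settled.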


Regarding the allowable characters, there is imho absolutely no need to
allow for control characters anywhere in HTTP headers or chunk headers,
quoted or not, and it's additionally very likely that many parsers will
fail on such constructs, making them quite non-interoperable.

And additionally, if restricting the allowed set of quoted characters to
exclude \x00, NL and CR as already done in HTTPbis, then it becomes very
questionable from a technical point of view (ignoring parsing) to allow
the use of other CTLs in quoted form. The use of having CTLs in header
values is very limited to begin with, basically only needed to support
transmission of (non-UTF8) multibyte character sets or binary non-text
data, in which case having those three excluded is already a significant
issue for such use.

So imho quoted-pair should be

    quoted-text = %x09 / %x20-7E / obs-text
                ; WSP / VCHAR / obs-text
    quoted-pair = "\" quoted-text

to match the use of *TEXT in 2616, making comments and quoted strings
all fit within *TEXT, as those constructs are only used in detailed forms
which should be a subset of the more generic *TEXT.
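
A minimal sketch of how a parser might apply that restricted quoted-pair when
unquoting, assuming the set HT / SP / VCHAR / obs-text after the backslash.
The function names are hypothetical and not taken from any draft.

```python
def is_quotable(b: int) -> bool:
    # HT / SP / VCHAR / obs-text: no CTLs other than HT, and no DEL
    return b == 0x09 or 0x20 <= b <= 0x7E or 0x80 <= b <= 0xFF


def unquote_pairs(data: bytes) -> bytes:
    """Resolve every quoted-pair in data, rejecting quoted CTLs."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x5C:  # backslash starts a quoted-pair
            i += 1
            if i >= len(data):
                raise ValueError("dangling backslash")
            if not is_quotable(data[i]):
                raise ValueError("control character in quoted-pair")
            out.append(data[i])
        else:
            out.append(b)
        i += 1
    return bytes(out)
```

With this set the unquoted result can never contain a CTL that the raw value
did not already carry unquoted.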


This reasoning is also consistent with the current field-content
definition using VCHAR etc.

    field-value    = *( field-content / OWS )
    field-content  = *( WSP / VCHAR / obs-text )

This field-content definition DOES NOT allow for CTLs other than HT.
Allowing quoted-pair to include CTLs other than HT is incompatible with
the above (from latest p1) definition of field-content.
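
To show the incompatibility, a small sketch that checks a raw value against
the field-content rule above (WSP / VCHAR / obs-text). The function name is
made up for illustration.

```python
def is_field_content(value: bytes) -> bool:
    # WSP (HT, SP) / VCHAR (0x21-0x7E) / obs-text (0x80-0xFF)
    return all(b == 0x09 or 0x20 <= b <= 0x7E or b >= 0x80 for b in value)


# A quoted-string whose quoted-pair carries BEL (0x07) ...
with_ctl = b'"a\\\x07b"'
# ... and one whose quoted-pair carries a DQUOTE.
without_ctl = b'"a\\"b"'

ctl_ok = is_field_content(with_ctl)
plain_ok = is_field_content(without_ctl)
```

The first value fails the field-content check even though the CTL is preceded
by a backslash, so a generic parser following field-content would already
reject it before any quoted-pair processing takes place.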

If you look closely you'll notice the quoted-text and field-content
definitions above are equal. Perhaps a common term should be defined for
that, similar to the *TEXT element used in 2616. There are probably more
places where using said term would make sense. And sorry, no, I do not
have a good suggested BNF name for this construct. TEXT would be
confusing with 2616, and text in lower case is too generic to be used in
describing text. general-text?

Regards
Henrik
Received on Tuesday, 25 August 2009 12:21:54 UTC