Re: Processing requirements

On Wed, Dec 30, 2009 at 12:23 AM, Philip Jägenstedt <philipj@opera.com> wrote:
> On Wed, 23 Dec 2009 15:05:55 +0100, Philip Jägenstedt <philipj@opera.com>
> wrote:
>
>> On Wed, 23 Dec 2009 13:43:02 +0100, Silvia Pfeiffer
>> <silviapfeiffer1@gmail.com> wrote:
>>
>>> Hi Philip,
>>>
>>> Thanks for continuing to give implementer/browser developer feedback -
>>> it's really awesome to have this input! And such a shame you cannot
>>> join Davy, Conrad and me at FOMS to discuss this further.
>>>
>>>
>>> On Wed, Dec 23, 2009 at 11:28 PM, Philip Jägenstedt <philipj@opera.com>
>>> wrote:
>>>>
>>>> On Sat, 12 Dec 2009 05:46:53 +0100, Silvia Pfeiffer
>>>> <silviapfeiffer1@gmail.com> wrote:
>>>>>
>>>>> On Thu, Dec 3, 2009 at 10:01 AM, Philip Jägenstedt <philipj@opera.com>
>>>>> wrote:
>>>>>>
>>>>>> On Wed, 02 Dec 2009 21:51:47 +0100, Jack Jansen <Jack.Jansen@cwi.nl>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On 2 dec 2009, at 12:55, Philip Jägenstedt wrote:
>>>>>>>
>>>>>>>> Following up on my previous email and today's IRC-conference (for
>>>>>>>> me).
>>>>>>>>
>>>>>>>> I won't get involved in the editors' stylistic choices between ABNF,
>>>>>>>> equivalent parsing algorithms (only the side effects of which are
>>>>>>>> normative)
>>>>>>>> or any other spec technique, but would request that at least the
>>>>>>>> following
>>>>>>>> are defined:
>>>>>>>>
>>>>>>>> 1. Splitting of name-value pairs
>>>>>>>>
>>>>>>>> The current ABNF only allows joining timesegment / spacesegment /
>>>>>>>> tracksegment by "&", which means that e.g. #t=5& is not allowed
>>>>>>>> because it has a trailing &, which is very easy to get by accident
>>>>>>>> if you write a script like this:
>>>>>>>>
>>>>>>>> urifrag = '#'
>>>>>>>> for d in dimensions:
>>>>>>>>  urifrag += d + '&'
>>>>>>>
>>>>>>> I'm not thrilled by this idea. The web has a long history of features
>>>>>>> where an initial implementation was syntactically forgiving because
>>>>>>> it was deemed to be user-friendly at the time. Many of these have
>>>>>>> been causing endless headaches until today. Think of the ability to
>>>>>>> use filenames (especially Windows filenames) in the URL-bar, or in
>>>>>>> attributes in the HTML code. Think of global variables in JavaScript.
>>>>>>
>>>>>> Let's be clear that validity and processing requirements are separate
>>>>>> things. That the processing for a certain input is well defined does
>>>>>> not mean that said input is valid. The validity definition is useful
>>>>>> for authors to check their syntax against (using a validator) to find
>>>>>> some mistakes, etc. In my opinion, processing requirements should be
>>>>>> as strict as possible (staying close to the valid syntax) while still
>>>>>> being easy to understand (for test suite writers, implementors and
>>>>>> actual authors) and degrading gracefully for forward-compatibility in
>>>>>> the contexts where it is necessary.
>>>>>>
>>>>>> I am not suggesting relaxing e.g. any of the temporal syntaxes because
>>>>>> there is no benefit in doing so -- they are fixed and will not be
>>>>>> changed by future spec revisions.
>>>>>>
>>>>>> The Web platform is full of ugly and broken features, but that is not
>>>>>> because specs had unambiguous but lax processing requirements, it is
>>>>>> because they either did not exist or left processing ambiguous or
>>>>>> undefined. This results in poor interoperability and an inevitable
>>>>>> race towards the most forgiving parsing possible. We absolutely do
>>>>>> not want this to happen yet again with media fragments.
>>>>>
>>>>> I have added two paragraphs to the ABNF specification section, see
>>>>>
>>>>>
>>>>> http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-syntax,
>>>>> which specifies how we look at media fragment URIs. I think this is
>>>>> necessary. I have kept it slightly more generic than just specifying
>>>>> "&" as a separator and also allowed ";" as a separator, since that is
>>>>> being used often by applications as a separator (see
>>>>> http://en.wikipedia.org/wiki/Query_string). I think that's a good
>>>>> compromise to take to address Philip's concern.
>>>>
>>>> I am a bit skeptical of allowing both & and ; as separators, as it adds
>>>> a little bit of complexity without any obvious benefit. Wikipedia links
>>>> back to
>>>> <http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2>,
>>>> but has that advice been followed? Personally I can't remember ever
>>>> reading or writing a query string using ; as a separator. I'd like to
>>>> see some research on actual deployed software to see if not allowing ;
>>>> as a separator would cause problems. Also, if ; stays the ABNF needs to
>>>> be updated.
>>>>
>>>> I appreciate this part: "A conformant server or user agent will need to
>>>> be able to parse a random URI query or fragment string for a media
>>>> resource and identify the relevant parts. E.g. the relevant field-value
>>>> pair out of a media fragment URI like this
>>>> http://www.example.com/video.ogv#&&=&=tom;jerry=&t=34&t=meow:0# is
>>>> t=34."
>>>>
>>>> However, it's vague on what exactly the conformance requirements are.
>>>> I'd like the spec to be explicit about how to split the fragment into
>>>> segments, especially if & and ; are both allowed as separators. Having
>>>> defined that, simply refer to the ABNF and say that any string which is
>>>> not "a valid production of the mediasegment syntax" should be
>>>> discarded. I suppose the error handling section is an appropriate
>>>> place, although it is currently defined in terms of MF concepts and not
>>>> strings. Perhaps the error handling section should be split into two
>>>> parts, one which gets us from an arbitrary string to a list of
>>>> dimensions, and then the existing section that defines which of those
>>>> dimensions actually apply.
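
The two-part handling suggested above could be sketched roughly as
follows in Python. This is only an illustration under assumptions: the
names split_fragment, select_dimensions and KNOWN_DIMENSIONS are
hypothetical, only "&" is treated as a separator, and the check for "t"
accepts plain seconds only, which is much narrower than the timesegment
ABNF.

```python
import re

# Hypothetical, simplified validity check: only plain seconds for "t".
# The real timesegment ABNF allows more forms than this.
KNOWN_DIMENSIONS = {
    "t": re.compile(r"\d+(\.\d+)?"),
}


def split_fragment(fragment):
    """Part 1: turn an arbitrary fragment string into (name, value) pairs."""
    pairs = []
    for segment in fragment.split("&"):
        if not segment:
            continue  # empty segment, e.g. from "&&" or a trailing "&"
        name, _, value = segment.partition("=")  # split on the first "="
        pairs.append((name, value))
    return pairs


def select_dimensions(pairs):
    """Part 2: keep only pairs naming a known dimension with a valid value."""
    selected = []
    for name, value in pairs:
        pattern = KNOWN_DIMENSIONS.get(name)
        if pattern is not None and pattern.fullmatch(value):
            selected.append((name, value))
    return selected


fragment = "&&=&=tom;jerry=&t=34&t=meow:0#"
print(select_dimensions(split_fragment(fragment)))  # [('t', '34')]
```

With this split, a trailing "&" as in #t=5& simply yields an empty
segment that is skipped, so the processing stays well defined even
though the input is not valid according to the ABNF.
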
>>>
>>> It is actually deliberately vague, because we are piggybacking onto a
>>> mechanism that has been developed outside the media fragments working
>>> group and is not part of what we should be specifying: how to compose
>>> a query string. We are already leaning out of the window by also
>>> applying it to URI fragments, but believe that is acceptable.
>>
>> Wikipedia points to
>> <http://www.w3.org/TR/REC-html40/interact/forms.html#form-content-type>
>> which states: "The control names/values are listed in the order they
>> appear in the document. The name is separated from the value by `=' and
>> name/value pairs are separated from each other by `&'." This is much too
>> vague though.
>>
>> HTML5
>> <http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#url-encoded-form-data>
>> is a lot more specific and defines how to encode something which looks like
>> a query string. However, what we need is a spec for how to decode it. If
>> such a spec already exists we should reference it explicitly and make it
>> part of the conformance requirements, the same for query strings and URI
>> fragments.
>
> I have searched high and low and consulted with #whatwg [1] and it seems
> there simply is no specification for how to decode query strings, not even a
> vague one. It looks as though the most relevant spec we could be referencing
> is IRI [2] or IRI bis [3], which would replace our reference to URI (RFC
> 3986).
>
> The IRI spec (both versions) states: "Specifications that define the syntax
> of any of the above components MAY divide them further and define smaller
> parts to be components according to this document. [snip] Even where the
> components are not defined formally, it may be helpful to think about some
> syntax in terms of components and to apply the relevant restrictions.  For
> example, for the usual name/value syntax in query parts, it is convenient to
> treat each name and each value as a component."
>
> Note that IRI DOES NOT define any kind of name/value pair syntax, the
> smallest components defined are iquery and ifragment. Any smaller
> subdivision must be defined by the MF spec, including the processing
> requirements with error handling. However, I believe it is inappropriate and
> unnecessary to have any normative requirements involving the query syntax,
> as explained below.
>
>>> I do not think that the media fragment URI specification is the place
>>> to define how a query string on a media resource has to be parsed.
>>> There could be any number of other query parameters used in a query
>>> string and they could be perfectly valid because the particular client
>>> and the server both support them. So, we cannot actually write an
>>> algorithm that expresses all possible query parameters in a media
>>> fragment URI. We can only hint at it, saying that where "?" or ";" are
>>> being used as separators on a media resource and the particular
>>> parameters that we specify are in use, we can prescribe what they
>>> mean.
>>
>> If MF should not be concerned with composing query strings or URI
>> fragments, a position which I am inclined to support, then we are already
>> violating spec layers in e.g. the axissegment ABNF (defining how to compose
>> a query string with "&") and in timesegment, spacesegment, tracksegment and
>> namesegment (defining how to join name-value pairs with "=").
>>
>> I would support layering MF on top of an abstract list of name-value
>> pairs, ignoring the underlying string encoding. If we can find no
>> specification which defines how to map an arbitrary string to that abstract
>> list, then we could put it in an appendix or publish it as a separate spec.
>> Thoughts?
>
> Please read the parts that I have highlighted in the IRC discussion [1].
> IMHO, it is best if this WG does not say anything about the query component
> at all. Defining smaller parts of iquery is similar in most regards to
> defining smaller parts of ipath, i.e. having the spec say that URIs of the
> form http://example.com/video/id=chapter-1/t=5/ have some special meaning.
> The format and parsing of both the path and query component have always been
> server-specific; it would make our job simpler if we didn't try to change
> that. Otherwise, what does MF say about all the URLs already out there that
> use t=0 in their query strings?
>
> I realize that this WG has spent quite some effort on discussing and writing
> spec text for URI queries, so I am not suggesting just throwing that away.
> See suggestion at bottom.
>
>>> BTW: for the same reasoning, we cannot exclude ";" as a separator - if
>>> for years it has been proposed to be used as separator, then that's
>>> what it should be. I believe, however, that ";" is not a separator
>>> between parameters, but probably rather between parameter values and
>>> we can totally make use of that.
>>
>> But are deployed servers handling it? I honestly don't know. If there is
>> no spec for how to handle ; perhaps we should reverse-engineer current
>> popular server software and do what they do. This is surely the most
>> real-world compatible solution.
>
> I did some testing of the query string parsing of PHP, ASP (VBScript) and
> JSP.
>
> In their default configurations, all of these split on '&' and then split
> the name/value pair on the first occurrence of '=', so that e.g. v==b results
> in a ('v','=b') pair. This is as expected. The only interesting differences
> are in how % encoded entities are decoded. For example, ASP silently
> discards some malformed entities such as '%%' while PHP preserves them. JSP
> is quite aggressive in throwing away any key/value pair with malformed %
> encoding. For duplicate names, PHP uses the last value, while ASP and JSP
> represent the values as a list (in the order they appeared). I haven't run
> it myself, but it looks like Perl CGI splits on either & or ;. I also want
> to test what mod_perl does, but haven't yet.
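
For reference, the duplicate-name behaviour described above can be
sketched in Python. This is an approximation of the observations, not
code taken from PHP, ASP or JSP; the function names are hypothetical,
and percent-decoding (where the three differ) is deliberately left out.

```python
def parse_pairs(query):
    """Split on "&", then split each pair on the first "="."""
    pairs = []
    for part in query.split("&"):
        if part:
            name, _, value = part.partition("=")
            pairs.append((name, value))
    return pairs


def last_value_wins(pairs):
    """Duplicate names: keep only the last value (as observed for PHP)."""
    return dict(pairs)


def values_as_list(pairs):
    """Duplicate names: collect values in order (as observed for ASP and JSP)."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return grouped


pairs = parse_pairs("v==b&t=1&t=2")
print(pairs)                   # [('v', '=b'), ('t', '1'), ('t', '2')]
print(last_value_wins(pairs))  # {'v': '=b', 't': '2'}
print(values_as_list(pairs))   # {'v': ['=b'], 't': ['1', '2']}
```

Whichever of the two duplicate policies is chosen for media fragments,
it decides which of two "t" values applies, so it would need to be
stated explicitly.
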
>
> There is no spec for parsing *query* components and I don't think there
> should be one. Instead, I propose that we define a parsing algorithm for
> splitting a *fragment* component into name/value pairs that is close to what
> PHP, ASP and JSP do for the query component. This is only so that authors
> can use techniques they are familiar with to produce media fragments, e.g.
> using urlencode to escape each part. We need to decide what to do about '+'
> as that is usually used to encode a space. (Consider what would happen if
> someone tried to use #t=clock:2009-12-29T13:58+01:00 without escaping the +
> as %2B.)
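
The '+' problem is easy to reproduce with Python's standard
urllib.parse functions, used here only as a stand-in for whatever
form-style decoder an implementation might apply; the clock value is
the one from the example above.

```python
from urllib.parse import quote, unquote, unquote_plus

value = "clock:2009-12-29T13:58+01:00"

# Form-style decoding treats "+" as an encoded space, corrupting the offset.
print(unquote_plus(value))    # clock:2009-12-29T13:58 01:00

# Plain percent-decoding leaves "+" untouched.
print(unquote(value))         # clock:2009-12-29T13:58+01:00

# Escaping "+" as %2B survives both kinds of decoder.
escaped = quote(value, safe=":")
print(escaped)                # clock:2009-12-29T13:58%2B01:00
print(unquote_plus(escaped))  # clock:2009-12-29T13:58+01:00
```

So if fragments are decoded the way form data is, an unescaped '+'
silently changes the time zone offset into something unparsable.
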
>
> Having defined how the fragment component is parsed into a list of
> name/value pairs, we can define media fragments in terms of that list. Then
> we should make all the text about query components non-normative and note
> that some server software may support evaluating the fragment server-side
> for some resources if the name/value pairs are given to it in a format it
> understands, be that http://example.com/video?t=4&xywh=160,120,320,240 or
> http://example.com/video/fragment/t=4/xywh=160,120,320,240
>
> Is this a course of action the WG would be willing to pursue? The lack of
> processing requirements for the fragment component is a blocking issue for
> implementation. By discussing it up-front I hope we can find a solution
> everyone is happy with, instead of letting the first implementor make
> something up and then having to rely on that behavior forever.
>
> I would very much appreciate it if we could discuss this by email and not
> teleconference, as it is much easier to follow and to refer to afterwards.
> (I'll also be idling in #whatwg most of the time if anyone cares to discuss
> some details on IRC.)
>
> [1] http://krijnhoetmer.nl/irc-logs/whatwg/20091229#l-5
> [2] http://tools.ietf.org/html/rfc3987
> [3] http://tools.ietf.org/html/draft-duerst-iri-bis-07
>
> --
> Philip Jägenstedt
> Core Developer
> Opera Software
>

You've come upon the same facts that we have: the only real
"specification" of subdivision of URI query components is the CGI
specification (see http://hoohoo.ncsa.illinois.edu/cgi/forms.html),
which uses & (and sometimes ;) for subdivision. For URI fragments,
nothing has been specified.

I guess we only differ in the conclusions that we have drawn from these facts.

IIUC, your argument is that for URI fragments, we can use whatever
subdivision we like because we're starting to define it and therefore
we should define the whole thing. Further, we should not make any
assumptions for URI queries, because these depend on what the server
implements.

Now, I'd say that we're probably safe using "&" as a separator for URI
queries, since that has been described in the CGI "standard" and has
continuously been applied, even if never formally specified. It is a
de-facto standard.

As for URI fragments, the idea is to keep them in sync with URI queries
and thus we also use "&" there.

Now, both approaches (URI fragment and query) may conflict with some
already created specifications (as analysed and listed in
http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-reqs/#ExistingSchemes).
This is unavoidable when standardising the use of something that has
been in the wild so far.
http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#processing-overview-standardisation
talks about this problem and makes clear that harmonisation is
necessary and that it is not possible to "prescribe" this format.
Which probably means that media fragments will always be a
recommendation rather than a standard.

We could do one thing though: maybe we should add a link to the CGI
specification to the spec to explain where the formatting comes from.

Philip, note that the specification only defines a syntax for the URI
fragment case, but leaves out the URI query case and just alludes to
the fact that it is done in the same way. I think that is already what
you are suggesting, no?

Cheers,
Silvia.

Received on Tuesday, 29 December 2009 14:04:44 UTC