Re: Processing requirements from Philip Jägenstedt on 2009-12-23 (public-media-fragment@w3.org from December 2009)

From: Philip Jägenstedt <philipj@opera.com>
Date: Wed, 23 Dec 2009 15:05:55 +0100
To: "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>
Cc: "Jack Jansen" <Jack.Jansen@cwi.nl>, "Media Fragment" <public-media-fragment@w3.org>
Message-ID: <op.u5d9f5qeatwj1d@sisko.linkoping.osa>
On Wed, 23 Dec 2009 13:43:02 +0100, Silvia Pfeiffer
<silviapfeiffer1@gmail.com> wrote:

> Hi Philip,
>
> Thanks for continuing to give implementer/browser developer feedback -
> it's really awesome to have this input! And such a shame you cannot
> join Davy, Conrad and me at FOMS to discuss this further.
>
>
> On Wed, Dec 23, 2009 at 11:28 PM, Philip Jägenstedt <philipj@opera.com>  
> wrote:
>> On Sat, 12 Dec 2009 05:46:53 +0100, Silvia Pfeiffer
>> <silviapfeiffer1@gmail.com> wrote:
>>> On Thu, Dec 3, 2009 at 10:01 AM, Philip Jägenstedt <philipj@opera.com>
>>> wrote:
>>>>
>>>> On Wed, 02 Dec 2009 21:51:47 +0100, Jack Jansen <Jack.Jansen@cwi.nl>
>>>> wrote:
>>>>
>>>>>
>>>>> On 2 dec 2009, at 12:55, Philip Jägenstedt wrote:
>>>>>
>>>>>> Following up on my previous email and today's IRC-conference (for
>>>>>> me).
>>>>>>
>>>>>> I won't get involved in the editors' stylistic choices between ABNF,
>>>>>> equivalent parsing algorithms (only the side effects of which are
>>>>>> normative) or any other spec technique, but would request that at
>>>>>> least the following are defined:
>>>>>>
>>>>>> 1. Splitting of name-value pairs
>>>>>>
>>>>>> The current ABNF only allows joining timesegment / spacesegment /
>>>>>> tracksegment by "&", which means that e.g. #t=5& is not allowed
>>>>>> because it has a trailing &, which is very easy to get by accident
>>>>>> if you write a script like this:
>>>>>>
>>>>>> urifrag = '#'
>>>>>> for d in dimensions:
>>>>>>  urifrag += d + '&'
>>>>>
>>>>> I'm not thrilled by this idea. The web has a long history of features
>>>>> where an initial implementation was syntactically forgiving because
>>>>> it was deemed to be user-friendly at the time. Many of these have been
>>>>> causing endless headaches until today. Think of the ability to use
>>>>> filenames (especially Windows filenames) in the URL-bar, or in
>>>>> attributes in the HTML code. Think of global variables in JavaScript.
>>>>
>>>> Let's be clear that validity and processing requirements are separate
>>>> things. That the processing for a certain input is well defined does
>>>> not mean that said input is valid. The validity definition is useful
>>>> for authors to check their syntax against (using a validator) to find
>>>> some mistakes, etc. In my opinion, processing requirements should be
>>>> as strict as possible (staying close to the valid syntax) while still
>>>> being easy to understand (for test suite writers, implementors and
>>>> actual authors) and degrading gracefully for forward-compatibility in
>>>> the contexts where it is necessary.
>>>>
>>>> I am not suggesting relaxing e.g. any of the temporal syntaxes because
>>>> there is no benefit in doing so -- they are fixed and will not be
>>>> changed by future spec revisions.
>>>>
>>>> The Web platform is full of ugly and broken features, but that is not
>>>> because specs had unambiguous but lax processing requirements, it is
>>>> because they either did not exist or left processing ambiguous or
>>>> undefined. This results in poor interoperability and an inevitable
>>>> race towards the most forgiving parsing possible. We absolutely do not
>>>> want this to happen yet again with media fragments.
>>>
>>> I have added two paragraphs to the ABNF specification section, see
>>>
>>> http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-syntax,
>>> which specifies how we look at media fragment URIs. I think this is
>>> necessary. I have kept it slightly more generic than just specifying
>>> "&" as a separator and also allowed ";" as a separator, since that is
>>> being used often by applications as a separator (see
>>> http://en.wikipedia.org/wiki/Query_string). I think that's a good
>>> compromise to take to address Philip's concern.
>>
>> I am a bit skeptical of allowing both & and ; as separators, as it adds
>> a little bit of complexity without any obvious benefit. Wikipedia links
>> back to
>> <http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2>,
>> but has that advice been followed? Personally I can't remember ever
>> reading or writing a query string using ; as a separator. I'd like to
>> see some research on actual deployed software to see if not allowing ;
>> as a separator would cause problems. Also, if ; stays the ABNF needs to
>> be updated.
>>
>> I appreciate this part: "A conformant server or user agent will need to
>> be able to parse a random URI query or fragment string for a media
>> resource and identify the relevant parts. E.g. the relevant field-value
>> pair out of a media fragment URI like this
>> http://www.example.com/video.ogv#&&=&=tom;jerry=&t=34&t=meow:0# is
>> t=34."
>>
>> However, it's vague on what exactly the conformance requirements are.
>> I'd like the spec to be explicit about how to split the fragment into
>> segments, especially if & and ; are both allowed as separators. Having
>> defined that, simply refer to the ABNF and say that any string which is
>> not "a valid production of the mediasegment syntax" should be
>> discarded. I suppose the error handling section is an appropriate
>> place, although it is currently defined in terms of MF concepts and not
>> strings. Perhaps the error handling section should be split into two
>> parts, one which gets us from an arbitrary string to a list of
>> dimensions, and then the existing section that defines which of those
>> dimensions actually apply.
>
> It is actually deliberately vague, because we are piggybacking onto a
> mechanism that has been developed outside the media fragments working
> group and is not part of what we should be specifying: how to compose
> a query string. We are already leaning out of the window by also
> applying it to URI fragments, but believe that is acceptable.

Wikipedia points to
<http://www.w3.org/TR/REC-html40/interact/forms.html#form-content-type>
which states: "The control names/values are listed in the order they
appear in the document. The name is separated from the value by `=' and
name/value pairs are separated from each other by `&'." This is much too
vague though.

HTML5  
<http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#url-encoded-form-data>  
is a lot more specific and defines how to encode something which looks  
like a query string. However, what we need is a spec for how to decode it.  
If such a spec already exists we should reference it explicitly and make  
it part of the conformance requirements, the same for query strings and  
URI fragments.

> I do not think that the media fragment URI specification is the place
> to define how a query string on a media resource has to be parsed.
> There could be any number of other query parameters used in a query
> string and they could be perfectly valid because the particular client
> and the server both support them. So, we cannot actually write an
> algorithm that expresses all possible query parameters in a media
> fragment URI. We can only hint at it saying that where "?" or ";" are
> being used as separators on a media resource and the particular
> parameters that we specify are in use, we can prescribe what they
> mean.

If MF should not be concerned with composing query strings or URI  
fragments, a position which I am inclined to support, then we are already  
violating spec layers in e.g. the axissegment ABNF (defining how to  
compose a query string with "&") and in timesegment, spacesegment,  
tracksegment and namesegment (defining how to join name-value pairs with  
"=").

I would support layering MF on top of an abstract list of name-value  
pairs, ignoring the underlying string encoding. If we can find no  
specification which defines how to map an arbitrary string to that  
abstract list, then we could put it in an appendix or publish it as a  
separate spec. Thoughts?
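
As a rough illustration of the kind of mapping I mean, here is a minimal
Python sketch. The function name split_pairs and its separators parameter
are made up for this example, percent-decoding is left out, and nothing
here is taken from an existing spec:

```python
def split_pairs(fragment, separators="&"):
    """Map a fragment or query string (without the leading '#' or '?')
    to a list of (name, value) pairs.

    `separators` is a hypothetical knob: "&" alone, or "&;" to also
    split on ";".
    """
    # Normalise every extra separator to the first one, then split once.
    for sep in separators[1:]:
        fragment = fragment.replace(sep, separators[0])
    pairs = []
    for segment in fragment.split(separators[0]):
        name, eq, value = segment.partition("=")
        # Segments with no "=" or with an empty name produce no pair,
        # so empty segments and a trailing "&" are simply skipped.
        if eq and name:
            pairs.append((name, value))
    return pairs
```

With "&" as the only separator, a string such as t=5& yields a single
pair named "t", and the example URI from the draft yields two. Whether a
value such as "meow:0#" is then discarded would be left to the ABNF check.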

> BTW: for the same reasoning, we cannot exclude ";" as a separator - if
> for years it has been proposed to be used as separator, then that's
> what it should be. I believe, however, that ";" is not a separator
> between parameters, but probably rather between parameter values and
> we can totally make use of that.

But are deployed servers handling it? I honestly don't know. If there is  
no spec for how to handle ; perhaps we should reverse-engineer current  
popular server software and do what they do. This is surely the most  
real-world compatible solution.

>>>> On #t=5&t=10, I'll note that the spec currently *allows*
>>>> overspecification.
>>>> However, I agree with you that it should be invalid, so that
>>>> validators can warn authors about their mistake. The processing
>>>> rules should however tolerate it because a parser which rejects it
>>>> is much more complex for no real gain, resulting in more work and
>>>> more bugs.
>>>
>>> Should it be invalid instead of using the last occurrence? I prefer to
>>> do something that makes sense rather than putting the specification
>>> screws on too tightly for programs and users.
>>
>> It should be invalid syntax (so that validators will warn authors), but
>> the error handling section should tell implementations to use the last
>> occurrence.
>
> I'd be very happy to make it an invalid syntax. But if we allow
> browsers/servers to deal with it, it becomes legal very quickly. So,
> we should probably always return an error, such that authors will not
> start creating faulty URLs and URL parsers.

I'm not clear on what "legal" means and who is to return an error. It will  
certainly not become valid unless we change the spec. It will however work  
in UAs (which will at most log it to the error console). Validators should  
show an error, certainly.
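
To make the split between validity and processing concrete, a minimal
Python sketch follows. The name resolve_dimensions is hypothetical, and
the input is assumed to be a list of (name, value) pairs whose values have
already been checked against the ABNF:

```python
def resolve_dimensions(pairs):
    """Return (dimensions, warnings) for a list of (name, value) pairs.

    Processing never fails on a repeated name: the last occurrence
    wins. The repetition is still recorded so that a validator (or an
    error console) can report it.
    """
    dimensions = {}
    warnings = []
    for name, value in pairs:
        if name in dimensions:
            # Invalid for authors, but processing continues.
            warnings.append(
                "duplicate dimension %r: using last occurrence" % name
            )
        dimensions[name] = value
    return dimensions, warnings
```

For #t=5&t=10 this would yield t=10 plus one warning, which a UA could log
to its error console and a validator could report as an error.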

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Wednesday, 23 December 2009 14:40:20 UTC