Re: Processing requirements from Philip Jägenstedt on 2009-12-23 (public-media-fragment@w3.org from December 2009)

From: Philip Jägenstedt <philipj@opera.com>
Date: Wed, 23 Dec 2009 13:28:28 +0100
To: "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>
Cc: "Jack Jansen" <Jack.Jansen@cwi.nl>, "Media Fragment" <public-media-fragment@w3.org>
Message-ID: <op.u5d4xqy2atwj1d@sisko.linkoping.osa>

Thanks for working on this, Silvia! Comments inline.

On Sat, 12 Dec 2009 05:46:53 +0100, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> Hi all,
>
> While in the process of editing the specification, I have made some
> adjustments that go a fair way towards solving these issues. So, let
> me contribute these here.
>
> On Thu, Dec 3, 2009 at 10:01 AM, Philip Jägenstedt <philipj@opera.com>  
> wrote:
>> On Wed, 02 Dec 2009 21:51:47 +0100, Jack Jansen <Jack.Jansen@cwi.nl>  
>> wrote:
>>
>>>
>>> On 2 dec 2009, at 12:55, Philip Jägenstedt wrote:
>>>
>>>> Following up on my previous email and today's IRC-conference (for me).
>>>>
>>>> I won't get involved in the editors' stylistic choices between ABNF,
>>>> equivalent parsing algorithms (only the side effects of which are  
>>>> normative)
>>>> or any other spec technique, but would request that at least the  
>>>> following
>>>> are defined:
>>>>
>>>> 1. Splitting of name-value pairs
>>>>
>>>> The current ABNF only allows joining timesegment / spacesegment /
>>>> tracksegment by "&", which means that e.g. #t=5& is not allowed  
>>>> because it
>>>> has a trailing &, which is very easy to get by accident if you write a
>>>> script like this:
>>>>
>>>> urifrag = '#'
>>>> for d in dimensions:
>>>>   urifrag += d + '&'
>>>
>>> I'm not thrilled by this idea. The web has a long history of features
>>> where an initial implementation was syntactically forgiving because it  
>>> was
>>> deemed to be user-friendly at the time. Many of these have been causing
>>> endless headaches until today. Think of the ability to use filenames
>>> (especially Windows filenames) in the URL-bar, or in attributes in the  
>>> HTML
>>> code. Think of global variables in JavaScript.
>>
>> Let's be clear that validity and processing requirements are separate
>> things. That the processing for a certain input is well defined does not
>> mean that said input is valid. The validity definition is useful for  
>> authors
>> to check their syntax against (using a validator) to find some mistakes,
>> etc. In my opinion, processing requirements should be as strict as  
>> possible
>> (staying close to the valid syntax) while still being easy to understand
>> (for test suite writers, implementors and actual authors) and degrading
>> gracefully for forward-compatibility in the contexts where it is  
>> necessary.
>>
>> I am not suggesting relaxing e.g. any of the temporal syntaxes because  
>> there
>> is no benefit in doing so -- they are fixed and will not be changed by
>> future spec revisions.
>>
>> The Web platform is full of ugly and broken features, but that is not
>> because specs had unambiguous but lax processing requirements, it is  
>> because
>> they either did not exist or left processing ambiguous or undefined.  
>> This
>> results in poor interoperability and an inevitable race towards the most
>> forgiving parsing possible. We absolutely do not want this to happen yet
>> again with media fragments.
>
> I have added two paragraphs to the ABNF specification section, see
> http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-syntax,
> which specifies how we look at media fragment URIs. I think this is
> necessary. I have kept it slightly more generic than just specifying
> "&" as a separator and also allowed ";" as a separator, since that is
> being used often by applications as a separator (see
> http://en.wikipedia.org/wiki/Query_string). I think that's a good
> compromise to take to address Philip's concern.

I am a bit skeptical of allowing both & and ; as separators, as it adds a  
little bit of complexity without any obvious benefit. Wikipedia links back  
to  
<http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2>,  
but has that advice been followed? Personally I can't remember ever  
reading or writing a query string using ; as a separator. I'd like to see  
some research on actual deployed software to see if not allowing ; as a  
separator would cause problems. Also, if ; stays, the ABNF needs to be  
updated.

I appreciate this part: "A conformant server or user agent will need to be  
able to parse a random URI query or fragment string for a media resource  
and identify the relevant parts. E.g. the relevant field-value pair out of  
a media fragment URI like this  
http://www.example.com/video.ogv#&&=&=tom;jerry=&t=34&t=meow:0# is t=34."

However, it's vague on what exactly the conformance requirements are. I'd  
like the spec to be explicit about how to split the fragment into  
segments, especially if & and ; are both allowed as separators. Having  
defined that, simply refer to the ABNF and say that any string which is  
not "a valid production of the mediasegment syntax" should be discarded. I  
suppose the error handling section is an appropriate place, although it is  
currently defined in terms of MF concepts and not strings. Perhaps the  
error handling section should be split into two parts, one which gets us  
from an arbitrary string to a list of dimensions, and then the existing  
section that defines which of those dimensions actually apply.
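
To make the two-part split concrete, here is a minimal Python sketch. It is
only an illustration under stated assumptions: the function names are
hypothetical, and TIME_VALUE is a heavily simplified stand-in for the spec's
ABNF (plain seconds, optionally prefixed with "npt:"), not the real
mediasegment syntax.

```python
import re

# Assumption: simplified stand-in for the ABNF; accepts "34" or "npt:34".
TIME_VALUE = re.compile(r"(?:npt:)?\d+(?:\.\d+)?")


def split_fragment(fragment, separators="&"):
    """Part 1: arbitrary string -> list of (name, value) pairs."""
    if fragment.startswith("#"):
        fragment = fragment[1:]
    pieces = re.split("[" + re.escape(separators) + "]", fragment)
    pairs = []
    for piece in pieces:
        name, sep, value = piece.partition("=")
        # Drop empty pieces, pieces without "=", and pieces with an empty name.
        if sep and name:
            pairs.append((name, value))
    return pairs


def recognized_dimensions(pairs):
    """Part 2: discard pairs that are not valid under the (simplified) syntax."""
    return [(n, v) for n, v in pairs if n == "t" and TIME_VALUE.fullmatch(v)]


frag = "#&&=&=tom;jerry=&t=34&t=meow:0#"
print(recognized_dimensions(split_fragment(frag)))        # "&" only
print(recognized_dimensions(split_fragment(frag, "&;")))  # "&" and ";"
```

With this sketch both separator choices leave t=34 as the only recognized
pair for the example string; they differ only in the intermediate list that
part 1 produces.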

>> By the way, is anyone developing a MF validator? One could surely be  
>> written
>> in JavaScript quickly.
>
> No, not yet, but please go ahead and do so! It would be awesome to
> have that. I have, of course, as part of my demo at
> http://www.annodex.net/~silvia/itext/mediafrag.html implemented a
> quick and dirty parser, but it's in no way shape or form complete.
>
>
>>>> This specific case *can* be fixed in the ABNF, but leads into the next
>>>> issue:
>>>>
>>>> 2. Handling of unrecognized syntax
>>>>
>>>> This means that #u=12&t=5 can still proceed to getting the time  
>>>> offset 5.
>>>> Not allowing this makes it impossible to extend MF in the future as  
>>>> any new
>>>> syntax is invalid per the current spec.
>>>>
>>>> As a necessary (but unsightly) side-effect, anything between & that  
>>>> isn't
>>>> recognized should be ignored, including the empty string. Thus a  
>>>> conforming
>>>> UA should be able to handle this extreme:
>>>>
>>>> #&&=&=tom&jerry=&t=34&t=meow:0# (time offset 34 seconds)
>>>
>>> This is a very difficult issue, we already touched on it in the last
>>> teleconf. The problem is that there are two types of future  
>>> extensions, and
>>> they need opposite solutions. Some future attributes should preferably  
>>> be
>>> ignored by older implementations, think of a hypothetical
>>> "preferred-languages=english-french-german" attribute. Other future
>>> attributes should lead to an error if the older implementation doesn't
>>> understand the attribute, think of "rating=pg" (which would return only
>>> tracks with a rating of G or PG, supposedly).
>>>
>>> But: I have an idea that may be a solution to this, loosely based on  
>>> the
>>> SMIL skip-content attribute
>>> (http://www.w3.org/TR/2008/REC-SMIL3-20081201/smil-content.html#adef-skip-content).
>>> If we add an attribute that tells older implementations what to do  
>>> (ignore
>>> unknown attributes, or raise an error) we could have our cake and eat  
>>> it.
>>> The first example would then usually be coded as
>>> "....&preferred-languages=english-french-german&unknown=ignore", the  
>>> second
>>> as "....&rating=pg&unknown=error". The only remaining question is now:  
>>> what
>>> is the default value for the unknown attribute.
>>>
>>> What do y'all think? Would this fly?
>>
>> Adding processing instructions on the same level as the actual syntax
>> strikes me as very odd, but is technically possible.
>>
>> Defaulting to unknown=error would be a bad idea. When an author tests  
>> their
>> syntax in a UA that does understand "rating=pg", unknown=ignore has no
>> effect so they will not use it (and validators won't complain because  
>> the
>> new syntax is valid per the new spec). A good portion of authors write  
>> by
>> trial and error, so at this point they think they are done. However,  
>> all old
>> UAs are now required to fail. They get angry bug reports from their  
>> users,
>> while users of UAs which ignored the spec are still happy.
>>
>> Defaulting to unknown=ignore and honoring unknown=error would be  
>> possible,
>> but is still a worse behavior than if the UA can use all of the  
>> components
>> it *does* understand. The rare case of mandatory failure must,  
>> logically, be
>> handled outside of MF because UAs which don't understand MF at all  
>> (e.g. all
>> web browsers ever shipped to date) would otherwise bypass it.
>
> I agree with Philip and would not really want to add processing
> instructions into the URI fragment or query string.
>
> I think what I described above already addresses the issues that
> Philip brought up. But I may have missed something, so please check
> and let me know.

Thanks again for working on this, but I don't think the error handling for  
invalid (per ABNF) syntax is sufficiently defined yet, as outlined above.

>>>> 3. Processing order
>>>>
>>>> As an example, what is the result of processing #t=5&t=10 ? I think  
>>>> the
>>>> result should be 10, because it is what you would usually implement by
>>>> mistake if not making a conscious choice.
>>>>
>>>> The other option is that duplicating any dimension should cause the
>>>> entire fragment to be ignored, which I do not support.
>>>
>>> This is somewhat similar to the first case, but much more serious.
>>> Personally, I am heavily opposed to letting over-specified fragments do
>>> anything but
>>> return a hard error. If the URL was generated by a program this means  
>>> the
>>> program is buggy, if it was done by a human, similarly, the person  
>>> should be
>>> taught to mend their ways. Guessing that "the last one is probably  
>>> what was
>>> meant" is a random choice. Actually, I would argue that if it was a  
>>> human
>>> who created this specific URL the "right thing" to do is probably to  
>>> start
>>> at second 15. (I send you a fragment starting at second 5. You don't  
>>> like
>>> the first 10 seconds of that clip, so before you forward it to another
>>> friend you tack a "&t=10" to the end).
>>
>> The more important case to consider is #t=npt:5&t=foo:12
>
> According to the syntax that we are standardising, the second
> field-name parameter is invalid, so t=npt:5 dominates IMO.

Thanks.
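
As a rough Python sketch of that behavior (assumptions: the function name is
hypothetical, the UA is an imagined MF 1.0 implementation that knows only
plain seconds and "npt:" seconds, and "foo" stands for any future scheme):

```python
import re

# Assumption: the only time syntax this hypothetical UA recognizes.
KNOWN_TIME = re.compile(r"(?:npt:)?(\d+(?:\.\d+)?)")


def start_time(fragment):
    """Return the start offset in seconds from the last recognized t= value."""
    if fragment.startswith("#"):
        fragment = fragment[1:]
    result = None
    for piece in fragment.split("&"):
        name, _, value = piece.partition("=")
        match = KNOWN_TIME.fullmatch(value) if name == "t" else None
        if match:
            # A later recognized value replaces an earlier one;
            # unrecognized values are skipped without any effect.
            result = float(match.group(1))
    return result


print(start_time("#t=npt:5&t=foo:12"))  # 5.0
print(start_time("#t=5&t=10"))          # 10.0
```

Here t=foo:12 is simply ignored, so the older UA falls back to t=npt:5.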

>> When new temporal syntax foo arrives in MF 2.0, there will be both UAs
>> supporting MF 1.0 and those supporting MF 2.0 in existence for a very  
>> long
>> time. In that very long time, it should be possible to use both  
>> syntaxes and
>> have MF 1.0 UAs simply fall back to the one they understand which
>> approximates the new foo. Degrade gracefully! This is best achieved by
>> having the UA use the last fragment it recognizes, which is also very  
>> simple
>> for authors to understand and work with.
>
> Yes, I think this makes sense and that's also what I have added into
> the specification. Check out the newly created section
> http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#processing-overview-errors
> - it contains a start at the list of errors that a system may
> encounter and proposes what to do in those cases. I have specified
> that for over-specified dimensions the last occurrence is being used.
> This is indeed the opposite of what we previously proposed, but I do
> agree with Philip here.

Thanks.

>> On #t=5&t=10, I'll note that the spec currently *allows* overspecification.
>> However, I agree with you that it should be invalid, so that validators can
>> warn authors about their mistake. The processing rules should, however,
>> tolerate it because a parser which rejects it is much more complex for no
>> real gain, resulting in more work and more bugs.
>
> Should it be invalid instead of using the last occurrence? I prefer to
> do something that makes sense rather than putting the specification
> screws on too tightly for programs and users.

It should be invalid syntax (so that validators will warn authors), but  
the error handling section should tell implementations to use the last  
occurrence.
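
The distinction between validity and processing can be sketched in a few
lines of Python. This is an illustration only: both function names are
hypothetical, and the input is assumed to be a list of already-split,
already-recognized (name, value) pairs.

```python
from collections import Counter


def duplicate_warnings(pairs):
    """Validator side: report every dimension that occurs more than once."""
    counts = Counter(name for name, _ in pairs)
    return [
        f"dimension '{name}' specified {n} times"
        for name, n in counts.items()
        if n > 1
    ]


def effective_dimensions(pairs):
    """Processor side: a later occurrence silently replaces an earlier one."""
    return dict(pairs)


pairs = [("t", "5"), ("t", "10")]
print(duplicate_warnings(pairs))    # ["dimension 't' specified 2 times"]
print(effective_dimensions(pairs))  # {'t': '10'}
```

The validator complains about #t=5&t=10, while the processor needs no extra
code path to reject it and ends up with t=10.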

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Wednesday, 23 December 2009 12:29:05 UTC