Re: Media Fragments URI parsing: pseudo algorithm code from Philip Jägenstedt on 2010-06-30 (public-media-fragment@w3.org from June 2010)

From: Philip Jägenstedt <philipj@opera.com>
Date: Wed, 30 Jun 2010 15:49:27 +0200
To: "Yves Lafon" <ylafon@w3.org>
Cc: "Jack Jansen" <Jack.Jansen@cwi.nl>, "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>, Raphaël Troncy <raphael.troncy@eurecom.fr>, "Media Fragment" <public-media-fragment@w3.org>
Message-ID: <op.ve38opbjatwj1d@philip-pc.linkoping.osa>

On Wed, 30 Jun 2010 14:13:57 +0200, Yves Lafon <ylafon@w3.org> wrote:

> On Wed, 30 Jun 2010, Philip Jägenstedt wrote:
>
>> On Wed, 30 Jun 2010 10:39:29 +0200, Yves Lafon <ylafon@w3.org> wrote:
>>
>>> On Tue, 29 Jun 2010, Jack Jansen wrote:
>>>
>>>>  On 29 jun 2010, at 22:30, Yves Lafon wrote:
>>>>
>>>>> The ABNF describes the whole syntax, and then the different parts.  
>>>>> There is no need for a multi-step parsing scheme that requires  
>>>>> re-reading the same bytes multiple times.
>>>>> To me "%74=%6ept%3A%310" is not a media fragment. %-escaped values  
>>>>> are allowed only where they are allowed (see grammar).
>>
>> No, the ABNF doesn't define the whole syntax. If I am mistaken, please  
>> point to the production which in some way includes "&" and "=" to  
>> separate name-value pairs. That production is segment, but is  
>> non-normative. Since it is also wrong, the solution is not to make it  
>> normative.
>
> mediasegment     = namesegment / axissegment
> axissegment      = ( timesegment / spacesegment / tracksegment )
>                 *( "&" ( timesegment / spacesegment / tracksegment ) )
> timesegment      = timeprefix "=" timeparam
> ...
> It should be normative, if it's not it is a mistake.

You cannot write a robust MF parser based on this grammar, because  
t=1&foo=bar is not a valid production, meaning that any future extension  
foo of MF will cause that parser to fail completely. Either the grammar  
itself must be relaxed, or the parsing must be defined normatively and  
handle some things which are not valid productions of the grammar.
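
To illustrate the forward-compatibility point, here is a minimal sketch in  
Python of a parser that keeps the name-value pairs it understands and  
silently skips the rest. The function name parse_fragment and the set  
KNOWN_DIMENSIONS are assumptions made for this example, not something  
defined by the spec.

```python
# Hypothetical set of dimension names this parser understands.
KNOWN_DIMENSIONS = {"t", "xywh", "track", "id"}


def parse_fragment(fragment):
    """Return (name, value) pairs for known dimensions, skipping the rest."""
    pairs = []
    for component in fragment.split("&"):
        name, sep, value = component.partition("=")
        if not sep:
            continue  # no "=": not a name-value pair, ignore it
        if name in KNOWN_DIMENSIONS:
            pairs.append((name, value))
        # Unknown names such as "foo" are dropped instead of failing the parse.
    return pairs


print(parse_fragment("t=1&foo=bar"))  # [('t', '1')]
```

A parser that accepts only valid productions of the grammar would instead  
have to reject the whole input t=1&foo=bar.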

>>>> Interesting...
>>>>  Unlike Yves, I think the sketched example _is_ a media fragment, but  
>>>> unlike Philip I don't think we need to specify it in our ABNF.
>>> The URI RFC makes it quite clear where percent encoding is allowed  
>>> and where it is not. For example, h%74%54p://www.example.com/ is _not_  
>>> htTp://www.example.com/
>>
>> Of course, but simply knowing where it is allowed isn't enough. I don't  
>> think this is disputed, but for the record we cannot completely  
>> delegate the issue of percent encoding to URI, because:
>>
>> 1. URI doesn't define the syntax of name-value pairs delimited by "&"  
>> and "=", so MF must.
>
> http://www.ietf.org/rfc/rfc3986.txt section 2.4
> So you parse your uri in components (that are identified by our  
> grammar), then you percent-decode what is needed.
> With the current grammar, it is allowed only in track and id productions.
> So it is perfectly compatible with the processing defined in rfc3986 and  
> perfectly allows #track=A%20%26%20B&t=10

No disagreement that we need to define it, thankfully. The disagreement is  
only where to decode percent-encoding.
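
The effect of ordering can be shown with a short Python sketch that uses  
the #track=A%20%26%20B&t=10 example from this thread. Both function names  
are made up for illustration, and the sketch decodes names as well as  
values, which is one of the two positions under discussion and not what the  
current grammar allows.

```python
from urllib.parse import unquote


def split_then_decode(fragment):
    """Split on "&" and "=" first, then percent-decode each name and value."""
    pairs = []
    for component in fragment.split("&"):
        name, sep, value = component.partition("=")
        if sep:
            pairs.append((unquote(name), unquote(value)))
    return pairs


def decode_then_split(fragment):
    """Wrong order, shown for contrast: a decoded "&" becomes a separator."""
    decoded = unquote(fragment)
    return [tuple(c.partition("=")[::2]) for c in decoded.split("&")]


print(split_then_decode("track=A%20%26%20B&t=10"))
# [('track', 'A & B'), ('t', '10')]
print(decode_then_split("track=A%20%26%20B&t=10"))
# [('track', 'A '), (' B', ''), ('t', '10')]
```

Decoding before splitting breaks the track name apart at the decoded "&",  
so the track name "A & B" cannot be recovered.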

>> 2. If we want to allow & in track names and ids, then percent-decoding  
>> must happen *after* splitting the name-value pairs. For example, in  
>> #track=A%20%26%20B&t=10 the track name is "A & B".
>>
>> If we agree, then the question is where to perform percent-decoding.
>>
>> Only performing percent-decoding for track and id is certainly  
>> possible, but something I object to because:
>>
>> 1. It is more complicated than simply always performing  
>> percent-decoding.
>
> Not if you have a parser based on the grammar, but it is not mandatory  
> to build an efficient parser, see below.
>
>> 2. Deployed server software doesn't parse query strings like this, so  
>> it wouldn't be possible to use those existing tools to build  
>> server-side Media Fragment parsers.
>
> If you (or we) define a parsing algorithm that matches what is in the  
> grammar, I am all for it (in fact we put the algorithm-based definition  
> in the appendix for that reason). Implementers will use different  
> trade-offs in parsing and that's perfectly ok.

If the algorithm matched exactly the ABNF it would have little value in my  
opinion. It is only because it *does not* match the ABNF that I wrote it  
in the first place.

Given how long-running this issue is, do we have a bug tracker or other  
formal way of tracking it?

<issue>

MF parsing must be defined normatively in the MF spec itself, meeting  
these conditions:

1. should handle all valid productions of the ABNF syntax correctly and,  
where necessary, input which is not valid per the syntax.

2. must be forward-compatible, so that future extensions to MF do not  
break existing MF parsers. (Compare to how new HTML elements and  
attributes or CSS properties degrade in implementations that don't  
understand them.)

3. should match as closely as possible how query components of the form  
a=1&b=2 are parsed by existing server-side software (e.g. ASP, PHP, JSP,  
Perl CGI).

</issue>

An implementation that conforms exactly to the (current, non-normative)  
ABNF fails condition 2 (e.g. t=1&foo=bar) and is not an option.
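
As a rough illustration of condition 3, the sketch below feeds fragment  
strings from this thread to an existing query-string parser. Python's  
urllib.parse.parse_qsl is used here only as a stand-in for the server-side  
tools listed above; that choice is an assumption of this example.

```python
from urllib.parse import parse_qsl

# Split on "&" and "=", then percent-decode every name and value.
print(parse_qsl("track=A%20%26%20B&t=10&foo=bar"))
# [('track', 'A & B'), ('t', '10'), ('foo', 'bar')]

# Percent-encoded names are decoded as well.
print(parse_qsl("%74=%6ept%3A%310"))
# [('t', 'npt:10')]
```

Note that parse_qsl also turns "+" into a space, a form-encoding convention  
that an MF parser may not want to copy.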

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Wednesday, 30 June 2010 13:50:10 UTC