Re: Media Fragments URI parsing: pseudo algorithm code

On Wed, 07 Jul 2010 11:10:27 +0200, Yves Lafon <ylafon@w3.org> wrote:

> On Tue, 6 Jul 2010, Philip Jägenstedt wrote:
>
>> On Wed, 30 Jun 2010 22:23:51 +0200, Bjoern Hoehrmann  
>> <derhoermi@gmx.net> wrote:
>>
>>> * Philip Jägenstedt wrote:
>>>>>>> With the current grammar, it is allowed only in track and id
>>>>>>> productions.
>>>>>>> So it is perfectly compatible with the processing defined in  
>>>>>>> rfc3986
>>>>>>> and perfectly allows #track=A%20%26%20B&t=10
>>>>>>  No disagreement that we need to define it, thankfully. The  
>>>>>> disagreement
>>>>>> is only where to decode percent-encoding.
>>>>>  RFC3986 gives the answer, after the URI components are parsed (and  
>>>>> we
>>>>> define here how to split out in components).
>>>>  The disagreement here is only for which components to decode
>>>> percent-encoding, RFC3986 will not help us.
>>>  RFC 3986 requires implementations when processing fragment  
>>> identifiers
>>> to treat %74 and "t" the same regardless of where either occurs, as "t"
>>> is not a reserved character and URIs that differ only in the escaping  
>>> of
>>> unreserved characters are defined to be equivalent. So the answer here
>>> is "all components". You can only have special requirements for  
>>> reserved
>>> characters when they occur unescaped.
>>
>> If I understand this correctly, this means that percent-decoding must  
>> be performed on all names and values, which I welcome.
>>
>> However, given this situation, how is it possible to express parsing in  
>> a single layer of ABNF? When the ABNF says "t", it really means "t" or  
>> "%74", if these are indeed supposed to be equivalent. How do other  
>> specs layered on top of URI handle this?
>>
>> (I think it would be cleaner to split the syntax into two levels -- one  
>> that identifies arbitrary name-value pairs, and one that is defined in  
>> terms of the Unicode strings that those names/values represent.)
>
> Ok, so you want to mandate a first normalization step, doing at least  
> percent-decoding of characters that are not delims, sub-delims, and  
> other non-safe characters.
>
> On identifying arbitrary name-value pairs, I am not keen on doing that.
> http://www.example.com/foo.mov#foo=bar is _not_ a media fragment.
>
> Also, the fact that we are doing processing before receiving the content  
> (and hence know the mime type), means that we are doing speculation  
> based on the syntax, which is ok in most cases, especially as we  
> designed what we could do in a way that would be harmless if the  
> assumption is wrong, but assuming that every name-value pair is part of  
> a media fragment seems just wrong.
>

For those who were not on the phoneconf:

No, I don't want a normalization step, as that would require performing  
percent-decoding twice. I want percent-decoding to happen only once: after  
splitting name-value pairs.
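
To make the ordering concrete, here is a minimal Python sketch, not a
proposed algorithm for the spec. The function names are made up for
illustration, and it assumes "&" and "=" are the only separators and that
percent-decoding yields UTF-8. The second function is there only for
contrast: decoding before splitting turns an escaped "&" into a separator.

```python
from urllib.parse import unquote


def split_then_decode(fragment):
    """Split into name-value pairs first, then percent-decode each part once."""
    pairs = []
    for piece in fragment.split("&"):
        if not piece:
            continue  # skip empty segments such as a trailing "&"
        name, _, value = piece.partition("=")
        pairs.append((unquote(name), unquote(value)))
    return pairs


def decode_then_split(fragment):
    """For contrast only: decode the whole fragment, then split it."""
    decoded = unquote(fragment)
    return [tuple(piece.partition("=")[::2]) for piece in decoded.split("&") if piece]


print(split_then_decode("track=A%20%26%20B&t=10"))
# [('track', 'A & B'), ('t', '10')]
print(decode_then_split("track=A%20%26%20B&t=10"))
# [('track', 'A '), (' B', ''), ('t', '10')]
```

With the first ordering, %74=5 and t=5 also come out as the same pair,
which is what RFC 3986 equivalence asks for.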

I would be absolutely fine with saying that e.g. #t=5&foo=bar is not a  
valid media fragment, but not with ignoring the entire fragment in such a  
case.

Validity and processing are two different matters. A comparison with CSS  
isn't perfect, but serves to illustrate my point:

body {
   background-color: red;
   foo: bar;
}

This is not valid CSS. However, how to process it is perfectly well  
defined. I don't know the ins and outs of CSS parsing, but the net result  
is that the unknown declaration (foo: bar) is ignored and the rest of the  
rule still applies.

So, for #t=5&foo=bar, I would expect an MF validator to warn that foo is  
not a known MF dimension, but for implementations to happily ignore it.  
This allows implementations to not break completely when faced with future  
extensions. It is assumed that future extensions will not change the  
meaning of existing dimensions. Even if there were such a spec, a web  
browser could not implement it because it would break existing pages that  
depended on the old behavior.
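
The split between validity and processing could look roughly like this in
Python. It is only a sketch: the set of known dimensions and the function
name are assumptions made for illustration, not taken from the spec text,
and the values are left unparsed.

```python
from urllib.parse import unquote

# Assumed set of dimension names, for illustration only.
KNOWN_DIMENSIONS = {"t", "xywh", "track", "id"}


def process_fragment(fragment):
    """Return (recognized pairs, warnings); unknown names never abort processing."""
    recognized = []
    warnings = []
    for piece in fragment.split("&"):
        if not piece:
            continue  # skip empty segments
        name, _, value = piece.partition("=")
        # Decode once, after splitting into name and value.
        name, value = unquote(name), unquote(value)
        if name in KNOWN_DIMENSIONS:
            recognized.append((name, value))
        else:
            # A validator would report this; an implementation just moves on.
            warnings.append("unknown dimension ignored: " + name)
    return recognized, warnings


print(process_fragment("t=5&foo=bar"))
# ([('t', '5')], ['unknown dimension ignored: foo'])
```

A validator would surface the warnings list, while a browser would use only
the recognized pairs and carry on.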

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Wednesday, 7 July 2010 10:54:51 UTC