Re: Media Fragments URI parsing: pseudo algorithm code from Yves Lafon on 2010-07-07 (public-media-fragment@w3.org from July 2010)

From: Yves Lafon <ylafon@w3.org>
Date: Wed, 7 Jul 2010 05:10:27 -0400 (EDT)
To: Philip Jägenstedt <philipj@opera.com>
cc: Bjoern Hoehrmann <derhoermi@gmx.net>, public-media-fragment@w3.org
Message-ID: <alpine.DEB.1.10.1007070506510.30410@wnl.j3.bet>

On Tue, 6 Jul 2010, Philip Jägenstedt wrote:

> On Wed, 30 Jun 2010 22:23:51 +0200, Bjoern Hoehrmann <derhoermi@gmx.net> 
> wrote:
>
>> * Philip Jägenstedt wrote:
>>>>>> With the current grammar, it is allowed only in track and id
>>>>>> productions.
>>>>>> So it is perfectly compatible with the processing defined in rfc3986
>>>>>> and perfectly allows #track=A%20%26%20B&t=10
>>>>> 
>>>>> No disagreement that we need to define it, thankfully. The disagreement
>>>>> is only where to decode percent-encoding.
>>>> 
>>>> RFC3986 gives the answer, after the URI components are parsed (and we
>>>> define here how to split out in components).
>>> 
>>> The disagreement here is only for which components to decode
>>> percent-encoding, RFC3986 will not help us.
>> 
>> RFC 3986 requires implementations when processing fragment identifiers
>> to treat %74 and "t" the same regardless of where either occurs, as "t"
>> is not a reserved character and URIs that differ only in the escaping of
>> unreserved characters are defined to be equivalent. So the answer here
>> is "all components". You can only have special requirements for reserved
>> characters when they occur unescaped.
>
> If I understand this correctly, this means that percent-decoding must be 
> performed on all names and values, which I welcome.
>
> However, given this situation, how is it possible to express parsing in a 
> single layer of ABNF? When the ABNF says "t", it really means "t" or "%74", 
> if these are indeed supposed to be equivalent. How do other specs layered on 
> top of URI handle this?
>
> (I think it would be cleaner to split the syntax into two levels -- one that 
> identifies arbitrary name-value pairs, and one that is defined in terms of 
> the Unicode strings that those names/values represent.)

Ok, so you want to mandate a first normalization step, doing at least 
percent-decoding of characters that are not delims, sub-delims, and other 
non-safe characters.
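
As a rough illustration of such a normalization step, here is a minimal 
sketch in Python. It assumes the step decodes only percent-escapes of 
RFC 3986 "unreserved" characters and leaves every other escape in place 
(uppercasing its hex digits). The name normalize_fragment is hypothetical 
and not taken from any draft.

```python
import re

# Unreserved characters as listed in RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_fragment(fragment: str) -> str:
    """Hypothetical helper: decode escapes of unreserved characters only."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            # "%74" and "t" are equivalent, so use the literal form.
            return char
        # Delimiters and other characters stay escaped; only the hex
        # digits are uppercased.
        return "%" + match.group(1).upper()

    return _PCT_ESCAPE.sub(replace, fragment)
```

With this sketch, "%74=10" becomes "t=10", while the escapes in 
"track=A%20%26%20B&t=10" are left untouched, so the "&" inside the track 
name cannot be confused with the pair separator.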

On identifying arbitrary name-value pairs, I am not keen on doing that.
http://www.example.com/foo.mov#foo=bar is _not_ a media fragment.

Also, because we are doing the processing before receiving the content 
(and hence before knowing the MIME type), we are speculating based on the 
syntax alone. That is ok in most cases, especially as we designed what we 
could do in a way that would be harmless if the assumption is wrong, but 
assuming that every name-value pair is part of a media fragment seems 
just wrong.
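
To make the distinction concrete, here is a minimal Python sketch, not 
text from the draft. It splits the fragment on unescaped "&" and "=", 
percent-decodes names and values, and keeps only pairs whose name is a 
recognized dimension. The set KNOWN_DIMENSIONS (t, xywh, track, id) and 
the function name media_fragment_pairs are assumptions made for 
illustration.

```python
from urllib.parse import unquote

# Assumption for illustration: the dimension names treated as media
# fragment syntax. Anything else is not considered a media fragment.
KNOWN_DIMENSIONS = {"t", "xywh", "track", "id"}


def media_fragment_pairs(fragment: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return decoded pairs for known dimensions only."""
    pairs = []
    # Split on the unescaped separators first, then decode each component.
    for piece in fragment.split("&"):
        name, sep, value = piece.partition("=")
        if not sep:
            continue  # no "=", so not a name-value pair
        name, value = unquote(name), unquote(value)
        if name in KNOWN_DIMENSIONS:
            pairs.append((name, value))
    return pairs
```

Under these assumptions "foo=bar" yields an empty list, so nothing is 
treated as a media fragment, while "track=A%20%26%20B&t=10" yields the 
track "A & B" and t "10".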


-- 
Baroula que barouleras, au tiéu toujou t'entourneras.

         ~~Yves

Received on Wednesday, 7 July 2010 09:10:30 UTC