Re: Percent encoding from Silvia Pfeiffer on 2010-03-02 (public-media-fragment@w3.org from March 2010)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Wed, 3 Mar 2010 09:45:53 +1100
To: Philip Jägenstedt <philipj@opera.com>
Cc: Raphaël Troncy <raphael.troncy@cwi.nl>, raphael.troncy@eurecom.fr, Media Fragment <public-media-fragment@w3.org>, Jack Jansen <Jack.Jansen@cwi.nl>
Message-ID: <2c0e02831003021445r6c0075f4gde3b9627e9345b9d@mail.gmail.com>
Hi, all,

I'm gonna try and find some URLs that reflect on Philip's statements,
out of curiosity and because we might want to link to them in the
spec. Let's see how far I can get... (see below)...


On Tue, Mar 2, 2010 at 10:41 PM, Philip Jägenstedt <philipj@opera.com> wrote:
> On Tue, 02 Mar 2010 18:39:32 +0800, Raphaël Troncy <raphael.troncy@cwi.nl>
> wrote:
>
>> Dear Philip,
>>
>>> Perhaps YouTube decodes first and splits last, or perhaps they just use
>>> a regexp to find v=XXXXX anywhere. Whatever is the case with YouTube, I
>>> assume we want to match as closely as possible how query strings work
>>> in e.g. ASP, PHP, JSP and Perl CGI, or there is no benefit in using
>>> something that resembles query strings.
>>>
>>> We can never be 100% compatible, for reasons listed in a note after
>>>
>>> http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#decode-a-percent-encoded-string
>>
>> Thanks, the note is indeed really useful. For all the following
>> statements, do you think it is possible to indicate a suitable reference?
>
> Do you mean like a reference in the spec? The results were derived simply
> from testing the various server environments so for most of them there's
> nothing like another spec or a document to reference, but here's the
> environments/languages that each point is about:
>
>>     *  "&" is the only primary separator for name-value pairs, but some
>> server-side languages also treat ";" as a separator.
>
> CGI Perl also accepts ";" as a separator.

Ha! I just found a specification for the name-value pair separation in
forms for HTML4:
http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4 for
application/x-www-form-urlencoded mime type. It doesn't include ";",
but it does specify what needs encoding.

OTOH http://en.wikipedia.org/wiki/Query_string talks about both & and
; as separators in query strings.
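
To make the difference concrete, here is a minimal Python sketch (my
own illustration, not taken from any of the pages above; the function
name and the sample query string are made up) of splitting at "&" only
versus splitting at both "&" and ";":

```python
import re

def split_pairs(query, separators="&"):
    """Split a query string into (name, value) tuples at any of the
    given separator characters. No percent-decoding is done here."""
    pattern = "[" + re.escape(separators) + "]"
    pairs = []
    for part in re.split(pattern, query):
        if not part:
            continue  # skip empty pieces, e.g. from "a=1&&b=2"
        name, _, value = part.partition("=")
        pairs.append((name, value))
    return pairs

# Hypothetical query string that mixes both separators.
query = "t=10;track=audio&id=chapter1"
print(split_pairs(query))        # "&" only: ";" stays inside the value
print(split_pairs(query, "&;"))  # "&" and ";" both separate pairs
```

With "&" only, the first pair comes out as name "t" with value
"10;track=audio"; with both separators it becomes two pairs.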


PERL:
This tells me that Perl uses "&" as the default separator for parsing
queries in CGI->param():
http://perldoc.perl.org/CGI.html#FETCHING-THE-NAMES-OF-ALL-THE-PARAMETERS-PASSED-TO-YOUR-SCRIPT:
and
http://perldoc.perl.org/CGI.html#PRAGMAS
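describes the -newstyle_urls pragma. As far as I can tell, it only
controls whether generated query strings (self_url(), query_string())
use ";": the text says semicolon-delimited query strings are always
accepted when parsing, and that newstyle_urls has been the default
since 2.64.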


JavaScript:
http://www.irt.org/articles/js157/index.htm - Creating 'Encoded' Name
& Value Pairs (JavaScript)
http://prettycode.org/2009/04/21/javascript-query-string/ -
introducing window.location.querystring in JavaScript

http://blog.stevenlevithan.com/archives/parseuri has a good discussion
in the comments on why some people prefer ";" as separator: "there is
no need to escape it when used in XML/XHTML documents". It also says
that fewer than 0.1% of developers use anything other than "&" as the
separator.
A modern jQuery plugin for query string parsing also supports
semicolons: http://plugins.jquery.com/project/query-object .
This jQuery plugin
(http://allmarkedup.com/journal/2009/10/jquery-url-toolbox-beta/) only
supports '&' as separator, but also applies that parsing to URL
fragments (yay!).
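
Applying the same name-value parsing to the fragment instead of the
query could look roughly like this in Python. This is only a sketch:
the function name, the example URL and the "t"/"track" names are my
assumptions for illustration, and no percent-decoding is done.

```python
from urllib.parse import urlsplit

def fragment_pairs(url):
    """Return the (name, value) pairs found in the fragment of a URL,
    splitting at "&" exactly as one would for a query string."""
    fragment = urlsplit(url).fragment
    pairs = []
    for part in fragment.split("&"):
        if not part:
            continue  # no fragment, or an empty piece between two "&"
        name, _, value = part.partition("=")
        pairs.append((name, value))
    return pairs

# Hypothetical URL, for illustration only.
print(fragment_pairs("http://example.com/video.ogv#t=10,20&track=audio"))
```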

Prototype also has a function that uses '&' as the default separator,
but allows replacing it with a different separator:
http://prototypejs.org/api/string/toQueryParams.


ASP:
For ASP.NET there is a QueryString property in the HttpRequest class, see
http://msdn.microsoft.com/en-us/library/system.web.httprequest.querystring.aspx
. It likewise parses the query string into a collection of name-value
pairs.


PHP:
In PHP, only '&'-separated name-value pairs are parsed:
http://php.net/manual/en/function.parse-str.php .


Ruby:
CGI::parse(query) parses at '&'
http://www.tutorialspoint.com/ruby/ruby_cgi_methods.htm



>>     * name-value pairs with invalid percent-encoding should be ignored,
>> but some server-side languages silently mask such errors.
>
> Of the tested, only JSP outright rejected invalid input like our spec does.
> Of ASP, PHP and Perl CGI, some removed the invalid part and some simply left
> them intact, but I can't remember exactly which did which.


Most of the links I looked at above require URL encoding for the names
and values. The '&' character has to stay as it is so that it can
identify the separator.
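
For comparison, here is a rough Python sketch of the strict behaviour,
where a name-value pair with invalid percent-encoding is ignored as a
whole. The function names and the sample input are mine, and treating
invalid UTF-8 as an error too is my assumption, not something I
checked against the spec text:

```python
import re
from urllib.parse import unquote

# A "%" is only valid when followed by exactly two hex digits.
_INVALID_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def strict_decode(text):
    """Return the percent-decoded text, or None if the encoding is
    invalid (bad "%" sequence or, by assumption, invalid UTF-8)."""
    if _INVALID_PERCENT.search(text):
        return None
    try:
        return unquote(text, errors="strict")
    except UnicodeDecodeError:
        return None

def parse_strict(query):
    """Split at "&" and drop every pair whose name or value fails to
    decode."""
    pairs = []
    for part in query.split("&"):
        if not part:
            continue
        name, _, value = part.partition("=")
        name, value = strict_decode(name), strict_decode(value)
        if name is None or value is None:
            continue  # ignore the whole pair
        pairs.append((name, value))
    return pairs

# Hypothetical input: the middle pair ends in a stray "%".
print(parse_strict("id=a%20b&bad=100%&t=5"))
```

A lenient parser would instead keep the "bad" pair, either with the
stray "%" left intact (which is what Python's plain unquote() does) or
with it removed.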


>>     * The "+" character should not be treated specially, but some
>> server-side languages replace it with a space (" ") character.
>
> All of the tested languages do this, it's just how query fragments are used.
> We could duplicate this behavior, but since we actually have syntax that
> requires using "+" (timezones) it would be quite annoying.

A lot of the examples above treated '+' as an encoding for " " (space),
while others required percent-encoding spaces. It would only be
relevant for us where we allow arbitrary text, i.e. in the named media
fragment URLs.
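
The two behaviours are easy to see side by side in Python, where
urllib.parse has both variants. The timestamp below is just a made-up
value with a time zone offset, to show why the "+" matters to us:

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical value containing a time zone offset with a literal "+".
raw = "2010-03-03T09:45:53+11:00"

print(unquote(raw))       # percent-decoding only: the "+" is kept
print(unquote_plus(raw))  # form-style decoding: the "+" becomes a space

# Percent-encoding the plus sign survives both kinds of decoding.
print(unquote_plus("2010-03-03T09:45:53%2B11:00"))
```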


>>     * Multiple occurrences of the same name must be preserved, but some
>> server-side languages only preserve the last occurrence.
>
> Here again I can't remember the exact behavior of each language. I'm pretty
> sure PHP and CGI Perl only preserve the last occurrence, while JSP has a
> list. I'm not sure about ASP and ASP.NET.

Just about all of them regarded multiple occurrences of the same name
as an array of values. In particular PHP (at least with the "name[]"
syntax) and Ruby work like that, but also the JavaScript libraries.
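
The two policies differ only in how the parsed pairs are stored. A
small Python sketch, with function names and sample input of my own
choosing, and no percent-decoding:

```python
def parse_all(query):
    """Keep every occurrence of a name, in order, as a list of values."""
    result = {}
    for part in query.split("&"):
        if not part:
            continue
        name, _, value = part.partition("=")
        result.setdefault(name, []).append(value)
    return result

def parse_last(query):
    """Keep only the last occurrence of each name."""
    return {name: values[-1] for name, values in parse_all(query).items()}

# Hypothetical query string with a repeated name.
query = "track=audio&track=video&t=10"
print(parse_all(query))   # "track" maps to both values
print(parse_last(query))  # "track" maps to "video" only
```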

Cheers,
Silvia.
Received on Tuesday, 2 March 2010 22:46:48 UTC