[whatwg] URL parsing and same-document references [was: Re: Citing multiple <blockquote> elements in HTML5] from Calogero Alex Baldacchino on 2008-12-12 (public-whatwg-archive@w3.org from December 2008)

From: Calogero Alex Baldacchino <alex.baldacchino@email.it>
Date: Fri, 12 Dec 2008 20:36:32 +0100
Message-ID: <4942BD40.3070303@email.it>
Calogero Alex Baldacchino wrote:
> Maybe the above needs further clarification. Let me start from the URL
> parsing (and resolving) rules: after the URL is validated, it is
> divided into its components, but nothing is stated about normalization
> and/or %-encoded characters. I think that applying some normalization
> would help to parse equivalent URLs in a consistent manner, would help
> when dealing with the interfaces for URL manipulation described in
> section 2.5.5 and, last but not least, would improve the matching of
> relative references (especially same-document references).
> A minimum requirement, for the sake of standardization, might be
> decoding any %-encoded characters in the <fragment> production that
> are part of the <unreserved> production as defined in RFC 3986, with
> the changes defined in the HTML 5 specification for URL parsing, and
> restricted to the Unicode ranges representing valid characters for an
> attribute value (those which are prohibited neither as 'text' nor
> as 'character references'). This way, a character-for-character
> comparison between a fragment identifier and an id attribute value
> that are equivalent, but would not match without the normalization,
> should succeed most of the time: as a consequence of the changes the
> current HTML 5 specification applies to the <unreserved> production,
> such characters may or may not be %-encoded in a valid URL, while an
> id value is likely to contain them unencoded.
>
> After the above <fragment> normalization, a character-for-character
> comparison would still fail if the id value itself contained a
> %-encoded triplet, as in "foo%20bar". Such a value is awkward to deal
> with, since it can be the %-encoded form of "foo bar" but also the
> decoded form of "foo%2520bar". In other words, if we apply the same
> normalization to two complete URLs and then compare them, the result
> is quite reliable; but if we start from a component (such as a
> fragment identifier stored in an id attribute value), it is not easy
> to tell whether any normalization has been applied, and which one, so
> false positives or false negatives are always possible. According to
> RFC 3986, section "4.4. Same-Document Reference", the correct
> interpretation of a URI as a same-document reference cannot be taken
> as guaranteed; hence a mismatch between, for instance, the decoded
> fragment identifier "foo bar" and the id attribute value "foo%20bar"
> may be acceptable, given what I think would be a wide majority of
> good matches. Still, a kind of double check might be considered,
> such as:
>
> - comparing the %-unescaped fragment identifier with the ID of each 
> element in the DOM;
> - upon failure, applying a %-unescape algorithm to the ID, then 
> comparing again with the fragment identifier and, if matching, marking 
> the element as a 'possible choice';
> - upon a perfect (exact) match, without unescaping the evaluated 
> element ID, choosing such element as the referenced document part 
> (actually defined as "the indicated part of the document" in the spec) 
> and stopping;
> - without any perfect match in the whole document, choosing the first 
> 'possible choice', if any;
> - without any match at all, the search for the referenced document 
> part fails.
>
> Compared with a "single check" for an exact match, this adds at most
> one extra unescape-and-compare per element, so the overall
> computational time should still grow linearly with the number of
> elements and should not be a performance issue.
>
> Best regards, Alex.
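
The 'double check' listed in the quoted message can be sketched roughly
as follows. This is only an illustration in Python, not text from any
specification: the function name find_indicated_part is hypothetical,
the document is reduced to a plain list of id values in document order,
and urllib.parse.unquote stands in for the unspecified "%-unescape
algorithm".

```python
from urllib.parse import unquote


def find_indicated_part(fragment, ids):
    """Return the index of the id matching `fragment`, or None.

    `ids` is the list of id attribute values in document order
    (an assumption made for this sketch; a real UA would walk the DOM).
    """
    # Step 1: %-unescape the fragment identifier once.
    target = unquote(fragment)
    possible_choice = None
    for index, element_id in enumerate(ids):
        # An exact match on the untouched id wins immediately.
        if element_id == target:
            return index
        # Otherwise unescape the id too; remember only the first
        # element that matches this way as a 'possible choice'.
        if possible_choice is None and unquote(element_id) == target:
            possible_choice = index
    # No exact match in the whole document: fall back to the first
    # 'possible choice', which is None when nothing matched at all.
    return possible_choice
```

With ids ["foo%20bar", "foo bar"], the fragment "foo%20bar" selects the
second element (exact match after unescaping the fragment), while with
only ["foo%20bar"] present the first element is returned as the
'possible choice'. Each id is compared at most twice, so the scan stays
linear in the number of elements.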

The behaviour described in the quoted message (apart from the 'double
check' I was suggesting) is roughly how Firefox (2.x and 3.0.4) behaves:
both href="#foo%20bar" and, in a different page,
href="./example.html#foo%20bar" match id="foo bar". IE7 and Opera 9.x
instead perform an exact comparison and show in the address bar a URL
that may contain blank spaces; they thus apply the relaxation allowed by
the URL parsing rules, but the resulting complete URI string does not
conform to RFC 3986. Different browsers seem to implement (more or less)
different normalization/resolution algorithms, leading to different
matches, so specifying a uniform behaviour (whichever one) would be
reasonable and useful.

The current resolving algorithm explicitly asks for %-encoding in the
path component and for conformance with RFC 3986 in general, but says
nothing about fragment identifiers. The algorithm it refers to for
relative resolution (section 5.2 of RFC 3986), AIUI, need not produce a
complete URI string; it could instead return an object holding a
separate string for each URI part, which would not necessarily require
%-encoding and could leave UAs a certain degree of freedom.

Furthermore, about URL decomposition attributes the spec says,
'On setting, the new value must first be mutated as described by the
"setter preprocessor" column, then mutated by %-escaping any characters
in the new value that are not valid in the relevant component as given
by the "component" column.' This seems to refer to the stricter RFC 3986
requirements (which in turn might be relaxed, since any part of a
decomposed URL may contain unescaped characters); however, the
"component" column points, for each component, to the corresponding
definition given for a parsed-URL component, which the current parsing
rules do not strictly require to have its characters escaped.

I'd suggest reconsidering the whole mechanism to avoid any free
interpretation and to make each phase/operation (parsing, resolving,
attribute setting) more consistent, both with the others and across
browsers, if possible. I'd also consider one or more DOM methods to make
it easy to compare URL strings and/or component attributes.
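
To give an idea of what such a comparison helper might do, here is a
minimal sketch in Python rather than as a DOM method. The names
normalized_parts and urls_equivalent are hypothetical, and the choice of
what to normalize (lower-casing the scheme and authority, %-unescaping
only the fragment) is an assumption made for illustration, not something
the spec or RFC 3986 mandates in this form.

```python
from urllib.parse import urlsplit, unquote


def normalized_parts(url):
    """Split a URL and normalize the parts used for comparison."""
    parts = urlsplit(url)
    return (
        parts.scheme.lower(),
        # Simplification: lower-cases the whole authority, including
        # any userinfo, which is strictly speaking case-sensitive.
        parts.netloc.lower(),
        parts.path,    # left untouched in this sketch
        parts.query,   # left untouched in this sketch
        # Unescape the fragment once, so "foo%20bar" and "foo bar"
        # compare equal while "foo%2520bar" stays distinct.
        unquote(parts.fragment),
    )


def urls_equivalent(a, b):
    """Compare two URL strings component by component."""
    return normalized_parts(a) == normalized_parts(b)
```

Under these assumptions, "http://example.com/example.html#foo%20bar" and
"http://example.com/example.html#foo bar" compare as equivalent, while a
fragment of "foo%2520bar" does not match either of them.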

Best regards,
Alex.
 
 
Received on Friday, 12 December 2008 11:36:32 UTC