- From: Calogero Alex Baldacchino <alex.baldacchino@email.it>
- Date: Fri, 05 Dec 2008 01:57:55 +0100
Calogero Alex Baldacchino wrote:
>
> Maybe the first is wrong, and I'm still unsure of the second. My concern is that a character-by-character comparison between an id value and a fragment identifier may fail in several ways. What about href="#foo bar " and id="foo bar "? The current rules would strip the trailing space only for the href, so the match would fail (but we might survive broken links). Escaping both and then comparing would succeed, as would first escaping and then unescaping the href value before comparing (should it be pointed out somewhere that a fragment identifier must be unescaped before being compared to an id or a name? Is it, and I've missed it? Having space characters in the unreserved production means they don't need to be escaped, but does it also mean they must be decoded from their pct-encoded form, after parsing and for resolving?).
>
> Likewise, stripping the trailing spaces in both cases would succeed, but would fail when comparing id="foo bar " with href="#foo bar%20" (which is a valid URL according to the current parsing rules), even with escaping rules (in this case the id value's trailing space must stay there). And what about id="foo%20bar" in http://foo.example.org/foo.html and href="#foo bar" on the same page, or on a page having the same base URL, or a base element with href="http://foo.example.org/foo.html"?
>
> My point is that, since comparisons for matching purposes happen after URL parsing and resolution, and the id value is not involved in those steps, character-by-character comparisons may fail without a prior normalization of both the fragment identifier and the id value (or one of them). However, if the above is already solved by the parsing and resolving rules and I've misunderstood the spec, I withdraw all of this and apologize. Or, perhaps, must a valid URL with a valid fragment, which is equivalent to but does not exactly match an id value, be considered a broken link?
Maybe the above needs further clarification. Let me start from the URL parsing (and resolving) rules: after the URL is validated, it is divided into its components, but nothing is stated about normalization and/or %-encoded characters. I think that applying some normalization would be useful for parsing equivalent URLs in a consistent manner, helpful when dealing with the interfaces for URL manipulation described in section 2.5.5, and, last but not least, an improvement in matching relative references (especially same-document references).

A minimum requirement, for the sake of standardization, might consist of decoding any %-encoded characters in the <fragment> production that are part of the <unreserved> production, as defined in RFC 3986 with the changes made by the HTML 5 specification for URL parsing, restricted to the Unicode ranges representing valid characters for an attribute value (those prohibited neither as 'text' nor as 'character references'). This way, a character-for-character comparison between a fragment identifier and an id attribute value that would have been equivalent but not matching without the normalization should succeed most of the time. The reason is that, as a consequence of the changes the current HTML 5 specification makes to the <unreserved> production, such characters may or may not be %-encoded in a valid URL, while an id value is likely to contain them unencoded.

After the above <fragment> normalization, a character-for-character comparison would still fail if the id value contained a %-encoded triplet representing a decoded character, such as "foo%20bar". Such a value is a weird thing to deal with anyway, since it can be the %-encoded form of "foo bar", but also the decoded form of "foo%2520bar".
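To make the proposed normalization and its remaining ambiguity concrete, here is a minimal Python sketch. It is only an illustration, not what the spec requires: the helper name `normalize_fragment` is hypothetical, and `urllib.parse.unquote` decodes every %-encoded triplet rather than only those in the <unreserved> production, which is a simplification of the restriction described above.

```python
from urllib.parse import unquote


def normalize_fragment(fragment: str) -> str:
    """Percent-decode a fragment identifier before comparing it with an id.

    Simplification (an assumption of this sketch): all %-encoded triplets
    are decoded, not only those in the <unreserved> production.
    """
    return unquote(fragment)


# Fragments that differ only in their encoding now compare equal to the id.
assert normalize_fragment("foo%20bar") == "foo bar"
assert normalize_fragment("foo bar") == "foo bar"

# A trailing %20 survives as a trailing space, so it can match id="foo bar ".
assert normalize_fragment("foo bar%20") == "foo bar "

# The ambiguity: the literal id value "foo%20bar" is also what the
# fragment "foo%2520bar" decodes to after a single decoding pass.
assert normalize_fragment("foo%2520bar") == "foo%20bar"
```

Decoding only once matters here: a second pass would turn "foo%2520bar" into "foo bar" and make the two cases indistinguishable.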
In other words, if we apply the same normalization to two complete URLs and then compare them, the result is quite reliable; but if we start from a single component (such as a fragment identifier stored in an id attribute value), it is not easy to tell whether any normalization has been applied, and which one, so there is always a chance of false positives or false negatives. According to RFC 3986, section "4.4. Same-Document Reference", the correct interpretation of a URI as a same-document reference cannot be taken as guaranteed. A mismatch between, for instance, the decoded fragment identifier "foo bar" and the id attribute value "foo%20bar" can therefore be reasonable, set against what I think would be a wide majority of good matches.

Anyway, a kind of double check might be considered, such as:

- comparing the %-unescaped fragment identifier with the ID of each element in the DOM;
- upon failure, applying a %-unescape algorithm to the ID, then comparing again with the fragment identifier and, if they match, marking the element as a 'possible choice';
- upon a perfect (exact) match, without unescaping the evaluated element's ID, choosing that element as the referenced document part (currently defined as "the indicated part of the document" in the spec) and stopping;
- without any perfect match in the whole document, choosing the first 'possible choice', if any;
- without any match at all, the search for the referenced document part fails.

Compared with a "single check" for an exact match, the overall computational time should increase only linearly, so this should not be a performance issue.

Best regards, Alex.
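The double check listed in this message can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a definitive algorithm: the function name `find_indicated_part` is hypothetical, the document is modelled as a plain sequence of id strings in document order instead of a real DOM, and `urllib.parse.unquote` stands in for the unspecified "%-unescape algorithm".

```python
from typing import Iterable, Optional
from urllib.parse import unquote


def find_indicated_part(fragment: str, ids: Iterable[str]) -> Optional[str]:
    """Return the id chosen for `fragment`, or None if nothing matches.

    `ids` is assumed to list element ids in document order.
    """
    target = unquote(fragment)   # %-unescape the fragment identifier once
    possible = None              # first 'possible choice' seen so far

    for element_id in ids:
        # Perfect match against the id as written: choose it and stop.
        if element_id == target:
            return element_id
        # Otherwise unescape the id and compare again; remember only the
        # first such element as a 'possible choice'.
        if possible is None and unquote(element_id) == target:
            possible = element_id

    # No perfect match in the whole document: fall back to the first
    # 'possible choice', which is None when there was no match at all.
    return possible


# An exact match wins even when a 'possible choice' comes earlier.
assert find_indicated_part("foo%20bar", ["foo%20bar", "foo bar"]) == "foo bar"
# With no exact match, the first 'possible choice' is used.
assert find_indicated_part("foo bar", ["x", "foo%20bar"]) == "foo%20bar"
# No match at all: the search fails.
assert find_indicated_part("missing", ["x", "y"]) is None
```

Each id is visited once and unescaped at most once, so the extra cost over a single exact-match scan stays linear in the number of elements.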
Received on Thursday, 4 December 2008 16:57:55 UTC