Re: URL comparison

Dear Anne,

On Thu, Apr 25, 2013 at 12:34 PM, Anne van Kesteren <annevk@annevk.nl> wrote:
> Background reading: http://dev.w3.org/csswg/selectors/#local-pseudo
> and http://url.spec.whatwg.org/

The :local-link() pseudoselector, as presently specified, seems to be
tightly coupled to the absolute URI layout of the domain from which the
resource is served. This seems quite brittle and will cause surprising
behavior when resources are inevitably rearranged. Consider the case
of a compressed archive containing markup documents and CSS files.

I believe the web already has several syntaxes for dealing with this
and related problems. I propose standards harmonization below.

> :local-link() seems like a special case API for doing URL comparison
> within the context of selectors. It seems like a great feature, but
> I'd like it if we could agree on common comparison rules so that when
> we eventually introduce the JavaScript equivalent they're not wildly
> divergent.

I agree that URI comparison is quite important and powerful. I am very
excited to see a consistent and simple comparison specification
emerge.

> Requests I've heard before I looked at :local-link():
>
> * Simple equality
> * Ignore fragment
> * Ignore fragment and query
> * Compare query, but ignore order (e.g. ?x&y will be identical to
> ?y&x, which is normally not the case)
> * Origin equality (ignores username/password/path/query/fragment)

These are all types of pattern-matching.
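
To make the requested comparisons concrete, here is a minimal Python
sketch using only the standard library's urllib.parse. The function
names and the DEFAULT_PORTS table are hypothetical illustrations, not
part of any proposed API, and no normalization beyond what urlsplit
itself performs is attempted.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumption: only these two default ports matter for the illustration.
DEFAULT_PORTS = {"http": 80, "https": 443}


def same_ignoring_fragment(a: str, b: str) -> bool:
    """Equal once the fragment is dropped."""
    return urlsplit(a)._replace(fragment="") == urlsplit(b)._replace(fragment="")


def same_ignoring_query_and_fragment(a: str, b: str) -> bool:
    """Equal once both the query and the fragment are dropped."""
    strip = lambda u: urlsplit(u)._replace(query="", fragment="")
    return strip(a) == strip(b)


def same_with_unordered_query(a: str, b: str) -> bool:
    """Equal when the query parameters match regardless of their order."""
    ua, ub = urlsplit(a), urlsplit(b)
    if ua._replace(query="") != ub._replace(query=""):
        return False
    pairs = lambda q: sorted(parse_qsl(q, keep_blank_values=True))
    return pairs(ua.query) == pairs(ub.query)


def origin(url: str) -> tuple:
    """(scheme, host, port); username, password, path, query, fragment ignored."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)
```

Each function is a predicate over two otherwise opaque strings, which is
why I describe them as pattern-matching rather than as distinct features.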

> * Further normalization (browsers don't normalize as much as they
> could during parsing, but maybe this should be an operation to modify
> the URL object rather than a comparison option)

This is a function specification.

> :local-link() seems to ask for: Ignore fragment and query and only
> look at a subset of path segments. However, :local-link() also ignores
> port/scheme which is not typical. We try to keep everything
> origin-scoped (ignoring username/password probably makes sense).
> Furthermore, :local-link() ignores a final empty path segment, which
> seems to mimic some popular server architectures (although those
> ignore most empty path segments, not just the final), but does not
> match URL architecture.

Fundamentally, comparison is about structural pattern-matching. As it
happens, the WWW already has an incredibly widespread syntax which
internally performs pattern-matching: relative URI references. (aside:
mathematically, relative URI refs are hylomorphic function
specifications on otherwise opaque identifier strings)

To that end, I propose a factorization of the tightly coupled parsing
specification in WHATWG URL resulting in 3 separate functions:

1. parsing
2. normalization
3. relative reference resolution

Once this specification is properly factored, discussion,
specification, and implementation of the pattern-matching semantics of
:local-link() and equivalent JavaScript functionality become much
easier. This is elucidated in my proposal below.
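
As a rough illustration of the factoring I have in mind, the three
steps can be kept as independent functions. This is a Python sketch
built on urllib.parse, not a rendering of the WHATWG algorithm; the
names parse, normalize, and resolve are hypothetical, and the
normalization shown (lowercasing the host) is only one of many
possible choices.

```python
from urllib.parse import SplitResult, urljoin, urlsplit, urlunsplit


def parse(ref: str) -> SplitResult:
    """Step 1: split a URI reference into components; nothing is resolved.

    Note: urlsplit already lowercases the scheme as a side effect.
    """
    return urlsplit(ref)


def normalize(parts: SplitResult) -> SplitResult:
    """Step 2: an optional normalization; here, lowercase the host only."""
    userinfo, at, hostport = parts.netloc.rpartition("@")
    return parts._replace(netloc=userinfo + at + hostport.lower())


def resolve(base: SplitResult, ref: SplitResult) -> SplitResult:
    """Step 3: resolve a (possibly relative) reference against a base."""
    return urlsplit(urljoin(urlunsplit(base), urlunsplit(ref)))


# A relative reference can be parsed and inspected without being resolved.
rel = parse("../c?x=1")
base = normalize(parse("HTTP://Example.COM/a/b/"))
absolute = resolve(base, rel)
```

The point of the separation is that rel remains a relative reference
until the caller explicitly asks for resolution, and the caller decides
whether and how much to normalize.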

> For JavaScript I think the basic API will have to be something like:
>
> url.equals(url2, {query:"ignore-order"})
> url.equals(url2, {query:"ignore-order", upto:"fragment"}) // ignores fragment
> url.equals(url2, {upto:"path"}) // compares everything before path,
> including username/password
> url.origin == url2.origin // ignores username/password
> url.equals(url2, {pathSegments:2}) // implies ignoring query/fragment

PROPOSAL

I believe the primary objective of this work is to define a syntax for
URI patterns. There are presently 2 different but compatible standard
URI pattern syntaxes:

1. RFC 3986 Relative URI references
2. RFC 6570 <http://tools.ietf.org/html/rfc6570> URI templates

Through re-use of these pattern syntaxes, the :local-link()
pseudoselector gains incredible flexibility, expressivity,
consistency, and (to my eye) simplicity.

I will use the notation [path] | [pattern] for pattern matching.

I haven't worked out *all* the details yet, but here are some possible
patterns to get us started:

"" = own-document links
"." = this document's path ("/foo/bar/baz" | "." = "/foo/bar/" and
"/foo/bar/" | "." = "/foo/bar/" | "")
"./" = this document's path or deeper (includes path-sibling resources and self)
".." = this document's parent path ("/foo/bar/" | ".." = "/foo/")
"../" = this document's parent path or deeper (includes aunts/uncles and self)
"../." = ".."
"/" = this domain (':local-link(0)')
"https:///" = HTTPS resources on this domain

Now, at this point, you may say "But, David, this syntax can't even
express the current :local-link() examples!" and you would be right.
However, this syntax can easily express links to resources *relative*
to the present one which is a crucial feature for any URI pattern
matching system.
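
A minimal Python sketch of how such patterns might be evaluated
follows. It rests on two assumptions of mine that are not yet settled
above: the pattern is resolved against the document's URI like any
relative reference, and a trailing "/" on the pattern means "this path
or deeper" while anything else means an exact path match. The function
name matches is hypothetical, and query strings and fragments are not
examined here.

```python
from urllib.parse import urljoin, urlsplit


def matches(document: str, pattern: str, link: str) -> bool:
    """Does `link` (as found in `document`) match the relative `pattern`?"""
    target = urlsplit(urljoin(document, pattern))
    candidate = urlsplit(urljoin(document, link))
    if (candidate.scheme, candidate.netloc) != (target.scheme, target.netloc):
        return False
    if pattern.endswith("/"):
        # Assumption: trailing slash means prefix match ("or deeper").
        return candidate.path.startswith(target.path)
    return candidate.path == target.path


doc = "http://example.com/foo/bar/baz"
own_document = matches(doc, "", doc + "#section")            # "" pattern
directory = matches(doc, ".", "http://example.com/foo/bar/")  # "." pattern
sibling_or_deeper = matches(doc, "./", "qux/deep")            # "./" pattern
parent = matches(doc, "..", "/foo/")                          # ".." pattern
same_domain = matches(doc, "/", "/anything/at/all")           # "/" pattern
```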

Let's now consider extending this syntax with the syntax of RFC
6570. In particular, the construct we appear to be missing from the
URI reference syntax is *binding*. RFC 6570 concerns URI construction,
but we can easily envision using the same syntax for the inverse of
construction: destruction. Because we do not need to bind
structural elements into a local environment, I propose the adoption
of "{}" as the self-match syntax (not yet allowed under RFC 6570, but
it could easily be a constructor no-op) and "{_}" as the wildcard match
(though "{comment}" could be used for commentary or when porting
patterns from systems which *do* bind into environments):

"{}" = "" = self
"{_}" = "./{_}" = siblings or self (but not deeper; "/foo/bar/" |
"{_}" = "/foo/bar/{any single path segment}")
"{}/" = any descendant of this document ("/foo/bar" | "{}/" =
"/foo/bar/{anything}" and "/foo/bar/" | "{}/" =
"/foo/bar//{anything}")
"{_}/" = anything with a deeper same-prefix path as this document
("/foo/bar/" | "{_}/" = "/foo/bar/{anything}")
"{}/{_}" = any nieces/nephews/children of this document ("/foo/bar/" |
"{}/{_}" is "/foo/bar/{any}/{other}")
"/{}" = the resource with the identity of the first path segment
equivalent to this document
"/{_}" = any first-level resource
"/{}/" = same first path segment (':local-link(1)')
"/{_}/" = any resource at least 1 level deep
"/~{_}/" = any resource with first path segment beginning with "~"

All of these patterns ignore the fragment (it indicates a delegated
resource subordinate to the primary resource) and require the same
username/password. None of these patterns match URIs with query
strings (their semantics are totally dependent on the server).
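
To show that the absolute-path patterns above are mechanically
tractable, here is a small Python sketch that compiles them to regular
expressions over the path component. It covers only patterns beginning
with "/" and makes several assumptions of mine: "{}" in the Nth segment
stands for the document's own Nth segment, "{_}" matches any run of
non-slash characters, and a trailing "/" means "or deeper". The name
compile_path_pattern is hypothetical, and fragment and query handling
would have to be layered on top.

```python
import re
from urllib.parse import urlsplit

# Splits a segment into literal text and the two template tokens.
TOKEN = re.compile(r"(\{_?\})")


def compile_path_pattern(document: str, pattern: str) -> re.Pattern:
    """Compile an absolute-path pattern relative to `document`."""
    if not pattern.startswith("/"):
        raise ValueError("this sketch only handles absolute-path patterns")
    doc_segments = [s for s in urlsplit(document).path.split("/") if s]
    regex = ""
    for index, segment in enumerate(s for s in pattern.split("/") if s):
        regex += "/"
        for token in TOKEN.split(segment):
            if token == "{}":
                # Raises IndexError if the document is shallower than the pattern.
                regex += re.escape(doc_segments[index])
            elif token == "{_}":
                regex += "[^/]*"
            else:
                regex += re.escape(token)
    if pattern.endswith("/"):
        regex += "/.*"  # assumption: trailing slash means "or deeper"
    return re.compile(regex)


doc = "http://example.com/foo/bar/baz"
same_first_segment = compile_path_pattern(doc, "/{}/")
first_level = compile_path_pattern(doc, "/{_}")
tilde_prefixed = compile_path_pattern(doc, "/~{_}/")
```

Candidate paths are then tested with fullmatch, for example
same_first_segment.fullmatch("/foo/other").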

If you wish to match URIs with query strings, the syntax is simple:

"{?}" = self with same query string (incl. none)
"{?_}" = self with any query string (incl. none)
"{}/{?}" = any descendant with same query string (incl. none)
"{}/{?_}" = any descendant with any query string (incl. none)

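A minimal Python sketch of the self-matching query forms follows, under
my reading that "{}" rejects any link carrying a query string, "{?}"
requires a byte-for-byte identical query, and "{?_}" accepts any query.
The function name matches_self is hypothetical, and query parameter
reordering is deliberately not handled.

```python
from urllib.parse import urlsplit


def matches_self(document: str, link: str, pattern: str) -> bool:
    """Match `link` against the self patterns "{}", "{?}" and "{?_}"."""
    doc, cand = urlsplit(document), urlsplit(link)
    # The fragment is ignored throughout; everything before the query must agree.
    if (doc.scheme, doc.netloc, doc.path) != (cand.scheme, cand.netloc, cand.path):
        return False
    if pattern == "{?_}":
        return True                      # any query string, including none
    if pattern == "{?}":
        return cand.query == doc.query   # same query string, including none
    if pattern == "{}":
        return cand.query == ""          # assumption: no query string allowed
    raise ValueError("unsupported pattern: " + pattern)


doc = "http://example.com/search?q=uri"
```

For instance, matches_self(doc, "http://example.com/search?q=url", "{?_}")
accepts the link, while the same link is rejected under "{?}".
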
Perhaps you wish to style only certain same-document references:

"#defn-{_}" = any same-document reference to fragments beginning with "defn-"

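As a sketch of what the fragment pattern would test, the following
Python function checks that a link resolves to the current document and
that its fragment starts with a given prefix. The function name is
hypothetical, and the literal prefix stands in for the text before
"{_}" in the pattern.

```python
from urllib.parse import urldefrag, urljoin


def is_same_document_fragment(document: str, href: str, prefix: str) -> bool:
    """True when `href` points into `document` at a fragment starting with `prefix`."""
    target, fragment = urldefrag(urljoin(document, href))
    return target == urldefrag(document).url and fragment.startswith(prefix)


doc = "http://example.com/spec.html"
definition_link = is_same_document_fragment(doc, "#defn-uri", "defn-")
```
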
I understand these semantics are not trivial to implement and I have
begun a prototype implementation in my URI library.

The benefits of reusing the syntax and semantics of relative URI
references and URI templates are manifest. Structural pattern matching
(or, in this case, relative URI reference predicates) is incredibly
powerful as I hope I have demonstrated. Additionally, this design
leverages knowledge and important design constraints found in these
other specifications and their users.

I have not yet devised algebraic solutions for query string
permutation or URI normalization.

Disjunction of patterns is possible through normal CSS selector disjunction.

A major barrier to widespread deployment of a system of this kind is
the WHATWG URL specification. By unnecessarily coupling parsing,
normalization, and relative reference resolution, implementations
conforming to only the WHATWG URL specification cannot offer
developers control over the level and type of normalization nor the
ability to manipulate relative URIs without resolving them. Humanity
deserves a better foundation on which to construct algebras over its
global namespace.

As for speedy deployment, I would rather start on the path toward
correct, consistent, and powerful pattern matching than see something
rushed into standards due to feature anxiety. Three or six more months to
get this language right is a constant factor on a potentially
unbounded technology lifetime.

I hope you've found this design proposal stimulating and I warmly
welcome any and all constructive (or destructive) response.

Happy Holidays,

David Sheets

Received on Sunday, 28 April 2013 00:06:33 UTC