- From: Maciej Stachowiak <mjs@apple.com>
- Date: Mon, 26 Jul 2010 21:12:41 -0700
On Jul 25, 2010, at 11:16 PM, Adam Barth wrote:

> 2010/7/26 Maciej Stachowiak <mjs at apple.com>:
>> On Jul 25, 2010, at 5:57 AM, Adam Barth wrote:
>>> 2010/7/24 Maciej Stachowiak <mjs at apple.com>:
>>>> On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
>>>>> 2010/7/23 Ian Fette <ifette at google.com>:
>>>>>> http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization lists
>>>>>> some interesting cases we've come across on the anti-phishing team in
>>>>>> Google. To the extent you're concerned with / interested in
>>>>>> canonicalization, it may be worth taking a look at (not to suggest you
>>>>>> follow that in determining how to parse/canonicalize URLs, but rather to
>>>>>> make sure that you have some "correct" way of handling the listed URLs).
>>>>>
>>>>> Thanks. That's helpful.
>>>>>
>>>>>> BTW, are you covering canonicalization?
>>>>>
>>>>> Yes. The three main things I'm hoping to cover are parsing,
>>>>> canonicalization, and resolving relative URLs.
>>>>
>>>> Is there any place in the Web platform where "canonicalize" is exposed by itself in a Web-facing way? I think resolve against a base and parse into components are the only algorithms whose effects can be observed directly. I think we only need to spec "canonicalize" if it turns out to be a useful subroutine.
>>>
>>> As far as I know, you can only see f(x) =
>>> canonicalize(parse(resolve(x))) and also some breakdown components of
>>> f(x) in HTMLAnchorElement and window.location.hash (and friends).
>>>
>>> Conceptually, it's a bit easier to think about them as three separate
>>> functions. The main difference between parse and canonicalize is that
>>> parse segments the input and canonicalize takes the segments, mutates
>>> them, and assembles them into a new string.
>>>
>>> I haven't studied resolve in as much detail yet, so I'm less clear how
>>> that fits into the puzzle.
>>
>> I would consider canonicalize() to be part of resolve(). Every time you retrieve a "cooked" URL (as opposed to original source text), you both resolve it against a possible base and canonicalize it as a single step. The two are not exposed separately. It's not clear to me that making this operation into three separate steps with a parse in the middle is helpful, or even representative of a good implementation strategy. I would think of parse() as something that happens after canonicalization in the cases where single components of the URL are exposed.
>
> That's an interesting way to think about what's going on. Different
> parts of the URL get different canonicalization transformations
> applied to them. For example, the range of characters that make sense
> in a host name are different than those that make sense in a port or
> query, so, in some sense, the canonicalization algorithm needs to
> understand something about how the URL parses, or at least how to
> distinguish host names from, e.g., ports and queries.

Yes, but the relative resolution algorithm needs to find URL part boundaries as well. I guess part of the issue here is that we have two different senses of "parse":

(1) Find the URL component boundaries in a source string, to be used by other algorithms for reference purposes. In that sense, you may need to do it to both the base URL and the possibly-relative reference before resolve(). However, this step isn't really exposed directly to the Web.

(2) Extract URL components of a resolved canonicalized URL, with the appropriate post-processing to expose them via APIs like Location and HTMLAnchorElement.

I've been thinking of parse() in sense #2, since that is the version actually exposed as API. You can think of this as taking a resolved canonicalized URL as input, and having a tuple of strings representing the components as output.
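
To make sense #2 concrete, here is a rough Python sketch of "resolved canonicalized URL in, tuple of component strings out". The names `UrlComponents` and `parse_components` are hypothetical, the field names are only loosely modelled on Location attributes, and `urllib.parse.urlsplit` does generic RFC-style splitting rather than what browsers actually do, so treat this as an illustration of the shape of the operation and not as a proposed algorithm.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class UrlComponents(NamedTuple):
    """Hypothetical tuple of strings, loosely mirroring Location attributes."""
    protocol: str
    hostname: str
    port: str
    pathname: str
    search: str
    hash: str


def parse_components(resolved_url: str) -> UrlComponents:
    """Split an already resolved, already canonicalized URL into components.

    Assumption: the input is absolute and canonical, so no base URL and no
    further normalization is needed at this stage.
    """
    parts = urlsplit(resolved_url)
    return UrlComponents(
        protocol=parts.scheme + ":",
        hostname=parts.hostname or "",
        port="" if parts.port is None else str(parts.port),
        pathname=parts.path,
        # Post-processing: put back the delimiter only when the part is non-empty.
        search="?" + parts.query if parts.query else "",
        hash="#" + parts.fragment if parts.fragment else "",
    )


components = parse_components("http://example.com:8080/a/b?x=1#frag")
print(components.hostname, components.search)
```

The point of the sketch is only that this step takes no base URL and no encoding; everything that needs those has already happened by the time components are extracted.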
The only other public operation is resolve+canonicalize, which conceptually takes a base URL, a possibly relative URL reference, and an optional document encoding as input, and which produces the resolved canonicalized URL as output.

While there are other ways to factor these operations, using a different approach will make it less obvious how to glue them to the relevant other specs.

Regards,
Maciej
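
P.S. For illustration, the combined resolve+canonicalize operation described above could be sketched in Python roughly as follows. The function name `resolve_and_canonicalize` and the specific canonicalization steps (lowercasing the host, percent-encoding the path as UTF-8 and the query in the document encoding, defaulting an empty path to "/") are assumptions chosen for the example, not a description of what any browser does; the sketch also drops userinfo and leaves the fragment untouched.

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit

# Characters left unescaped in the sketch; "%" is kept so that existing
# escapes are not double-encoded. This set is an assumption for the example.
_SAFE_PATH = "/%:@!$&'()*+,;="
_SAFE_QUERY = _SAFE_PATH + "?"


def resolve_and_canonicalize(base: str, reference: str,
                             encoding: str = "utf-8") -> str:
    """Resolve `reference` against `base` and canonicalize, as one step."""
    # Resolve: generic RFC-style relative resolution.
    parts = urlsplit(urljoin(base, reference))

    # Canonicalize, component by component (hypothetical rules).
    host = parts.hostname or ""  # .hostname is already lowercased
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = quote(parts.path, safe=_SAFE_PATH) or "/"
    # The query is the component where the document encoding matters here.
    query = quote(parts.query, safe=_SAFE_QUERY, encoding=encoding)

    return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))


print(resolve_and_canonicalize("HTTP://Example.COM/a/b/c", "../d?q=\u00e9",
                               encoding="latin-1"))
```

Note that the caller never sees an intermediate "resolved but not yet canonicalized" string, which is the factoring argued for above.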
Received on Monday, 26 July 2010 21:12:41 UTC