Re: [SelectorsAPI] Thoughts on querySelectorAll from Boris Zbarsky on 2008-05-01 (public-webapi@w3.org from May 2008)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Thu, 01 May 2008 12:51:31 -0500
To: public-webapi@w3.org
Message-ID: <481A0323.5070209@mit.edu>

John Resig wrote:
>> So say the original |selector| is ":not(:link)" and the UA doesn't 
>> support :link.  This will presumably split the string into ":not("
>> and ":link)", neither of which is that useful.
> 
> Absolutely - however that's a semi-trivial check for the library
> (just look for an open :FOO(... before beginning to parse). What's
> important about the position is that we can determine,
> programmatically, WHAT selector is failing.

The problem is defining this concept of "what selector is failing".

For example, consider the following selector:

   :not(a:link)

In Gecko, the character where we fail is the second ':', I think.  Which 
selector is failing exactly?  And while this case can be handled with the 
"back up to the preceding '(' and then to the preceding ':'" suggestion, 
what about:

   :not|test|

? Here the failing character is the first '|' but there is just no way 
to extract a valid selector out of the whole thing.  Worse yet, what about:

   :note

in one of the UAs supporting :not?  Is the failing character the ':' or 
the 'e'?  I realize that for any particular case like this it's clear 
what should be happening.  What I question is that we can easily create 
a rigorous definition of where the error character pointer should point 
that works across all the various existing selectors and does reasonable 
things for currently-invalid selectors.  That doesn't even include 
future-proofing issues, more on which below.
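
To make the "back up" suggestion concrete, here is a rough sketch.  The 
function name and the errorIndex argument are made up for illustration; 
no UA reports such an index today, and the index values in the comments 
are just the guesses from the examples above.

```javascript
// Hypothetical helper: given a selector and the index of the character
// the UA choked on, back up to the preceding '(' and then to the
// preceding ':'.  Returns -1 when there is nothing to back up to.
// Note: this does not check that the '(' found is actually still open.
function enclosingPseudoStart(selector, errorIndex) {
  var paren = selector.lastIndexOf("(", errorIndex);
  if (paren === -1) {
    return -1;
  }
  return selector.lastIndexOf(":", paren);
}

// ":not(a:link)" failing at the second ':' (index 6) backs up to index 0.
var notLink = enclosingPseudoStart(":not(a:link)", 6);

// ":not|test|" failing at the first '|' (index 4) has no '(' to back up to.
var notPipe = enclosingPseudoStart(":not|test|", 4);
```

The second case is the point: the heuristic has nothing to anchor on, 
and the index alone does not say which part of the string to throw away.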

> The critical issue, right now, is that there is no way to do
> defensive, unobtrusive, testing of a browser's querySelectorAll
> implementation - it's a complete black box.

Well..  The thing is, selectors are _very_ context-sensitive.  That is, 
whether a selector string is valid can easily change if characters are 
added to the beginning, end, or anywhere in the middle.  It's easy to 
make a valid string into an invalid one by removing characters at 
beginning, end, or middle.  So really, the only question that can 
consistently be asked and answered is "is this a valid selector from the 
point of view of this UA?"  At least as far as I can tell...
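
That one question is easy to ask with what the API already gives us.  A 
minimal sketch, assuming a made-up helper name and assuming that any 
exception from querySelectorAll() means "not valid here":

```javascript
// Hypothetical helper: "is this a valid selector from the point of
// view of this UA?"  `root` is anything with a querySelectorAll
// method, typically the document.
function isSelectorSupported(root, selector) {
  try {
    root.querySelectorAll(selector);
    return true;
  } catch (e) {
    // Assumption: the UA throws only because it rejected the selector.
    return false;
  }
}

// In a page one would call, for example:
//   isSelectorSupported(document, ":not(a:link)")
```

Note that the answer applies to the whole string only; it says nothing 
about any prefix, suffix, or other piece of it.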

> It's also not sufficient to provide the sequence of characters that
> are valid - since that still leaves us with a "black box" problem. If
> all you say is that "something inside the :not() isn't valid" then
> that doesn't help - we're back to where we started.

See the example above:  :not(a:link).  In CSS3 Selectors, what's invalid 
is the concept of putting multiple simple selectors inside :not.  All 
the parts are valid on their own; it's the way of putting them together 
that's invalid.

But maybe I'm misunderstanding your real concern here. Above you say 
that :not would be handled by backing out of it anyway, which makes 
sense to me.  At that point, why do you care what inside the :not is 
invalid, exactly?  You're going to have to do the entire :not match 
yourself anyway....

> That's not really an issue - there isn't a single, publicly
> available, selector engine that queries in that manner. They all work
> from left-to-right (finding divs, then finding spans).

I assume you mean the JS library ones, right?  That makes sense, and 
does make the combinator thing less of a problem.

> That's not really an issue, either, look at the following:
> 
> div, :bad, span
> 
> Most JavaScript libraries, when they see the ',' interpret it to mean
> something like "take what we already have and push it on a stack for
> later retrieval" - that way when we hit an exception with :bad it'll
> really just be like handling any other selector.

   div, :future(a, b, c), :bad, span

How is this handled in a UA that supports :future?  How is it handled in 
a UA that does not?  Just splitting on ',' is not quite right....  For 
that matter, how would you handle:

   div, :not( , span

?  Just splitting on the ',' gives very different results from what 
querySelectorAll() would return (which is an exception).
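
For comparison, here is roughly what a splitter would have to do just to 
cope with these two examples.  This is only a sketch with a made-up 
function name; it tracks parentheses and nothing else, so it ignores 
strings, escapes, and attribute values containing ',' or '('.

```javascript
// Hypothetical helper: split a selector group on top-level commas only.
// Returns null when parentheses are unbalanced, since in that case
// querySelectorAll() would throw for the whole group.
function splitSelectorGroup(group) {
  var parts = [];
  var depth = 0;
  var current = "";
  for (var i = 0; i < group.length; ++i) {
    var ch = group.charAt(i);
    if (ch === "(") {
      ++depth;
    } else if (ch === ")") {
      if (depth === 0) {
        return null; // stray ')'
      }
      --depth;
    }
    if (ch === "," && depth === 0) {
      // A comma outside all parentheses ends the current selector.
      parts.push(current.trim());
      current = "";
    } else {
      current += ch;
    }
  }
  if (depth !== 0) {
    return null; // unclosed '('
  }
  parts.push(current.trim());
  return parts;
}

// Keeps ":future(a, b, c)" in one piece:
var futureParts = splitSelectorGroup("div, :future(a, b, c), :bad, span");

// Gives up on the whole group instead of producing three pieces:
var brokenParts = splitSelectorGroup("div, :not( , span");
```

Even this much is more than "split on ','", and it still has to guess 
at the grammar of whatever :future turns out to be.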

> This really must be done *now* before implementations get too baked.

Maybe I'm missing something.  Adding more error-reporting to the thrown 
exception is a backwards-compatible change to an implementation.  If, 
say, Firefox 4 ships querySelectorAll() without such error reporting, 
that would not preclude Firefox 5 adding it.

> The fact that there's no way to determine what a useragent is capable
> of supporting (only through the crude "try it and see if it fails"
> technique) means that a new querySelectorAll will have to be
> performed on *every single selector call* just to see if it works or
> not. There is no way to say "Oh, hey, Mozilla doesn't like :hidden we
> should save the overhead of calling that every time."

I'm not sure I follow.  You could cache the fact that the selector 
":hidden" is not supported.  That wouldn't help you if someone then used 
":hidden" as part of another selector, but unless you pre-parse the 
selectors all the time, you wouldn't detect that anyway.  If you do 
pre-parse them, what's the overhead of that?
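
The caching I have in mind is nothing fancier than the following sketch. 
The names are made up, `fallback` stands in for whatever the library 
does today, and the cache is keyed on the exact selector string:

```javascript
// Hypothetical cache of selector strings this UA has already rejected.
var unsupported = {};

// Try the native implementation once per distinct selector string;
// after a failure, go straight to the library's own engine.
function queryWithCache(root, selector, fallback) {
  if (!Object.prototype.hasOwnProperty.call(unsupported, selector)) {
    try {
      return root.querySelectorAll(selector);
    } catch (e) {
      // Assumption: any exception means the selector is unsupported.
      unsupported[selector] = true;
    }
  }
  return fallback(root, selector);
}

// In a page, with a hypothetical library function slowQuery:
//   queryWithCache(document, ":hidden", slowQuery);
```

As noted, a cached ":hidden" does nothing for "div:hidden"; that string 
costs one more failed querySelectorAll() call before it is cached too.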

Maybe I'm just misunderstanding what information you want and what you 
want to do with it...

More importantly, how does the overhead we're trying to save compare to 
the resources the operation we're performing consumes?  I just did a 
quick test, and on a 2-year-old MacBook (not Pro), a 
querySelectorAll(":hidden") in Gecko inside a try/catch with a counter 
increment in the catch takes about 46 microseconds.   Here's the code:

   const kMaxCount = 10000;

   function func1(sel) {
     var start = new Date();
     var j = 0;
     for (var i = 0; i < kMaxCount; ++i) {
       try {
         var list = document.querySelectorAll(sel);
       } catch (e) {
         ++j;
       }
     }
     var end = new Date();
     alert((end - start) + "ms to try/catch " + j + " times");
   }

   func1(":hidden");

20% of that time is taken up by CSS error reporting, which we should 
consider disabling for the querySelectorAll case.  How long does it 
usually take :hidden to actually match in jQuery in similar circumstances?

> Not to mention the fact that libraries are going to need to try and
> use querySelectorAll for as many queries as possible (or for as much
> of the query, as possible, if there's something bad in it).

Agreed on the former.  I'm just saying that it might be that the latter 
is hard to do for the libraries, a pain for the UAs, and not much of a 
performance win.  I could be wrong, and then we need some very careful 
spec text that would allow one to actually determine how much "as much 
of the query as possible" is.  That'll be a lot of work (== time) for 
the spec author and the question is whether the spec should be held for 
that or whether it can be added in a later revision.  Right now we have 
two more-or-less interoperable implementations of the specification as 
written (one in beta, one released), with at least one more on the way 
soon.  Other implementors may be waiting for the spec to go to CR before 
implementing.  I certainly would have if it had seemed like major 
changes were going to still happen to the specification.  So the real 
question I have is whether it's better to have querySelectorAll without 
the extra error reporting in UAs 6 months from now and then add error 
reporting another 12 months after that or whether it's better to have 
querySelectorAll with extra error reporting in UAs but not until 12 
months from now (and nothing before that).

It seems to me that the former is better, but note that the numbers 
involved are pretty fuzzy at best.

> div span > a[href]:hidden
> 
> With the extra index all of the leading "div span > a[href]" could be
> lopped off and re-run without a hitch - and then the extra :hidden
> could be handled by the library. However, as it stands now, the only
> thing that we can do is say "Oh well, I'm not sure what went wrong -
> I guess we'll do it the slow way."

Right.  I agree that your proposal works in a number of cases.  The 
problem is making sure it works in all cases, and possibly to make sure 
it continues to work as selectors are added.

> Why this is of such great concern to me is that "invalid selectors"
> (ones provided by libraries) are actually used very frequently. For
> example, here's a break-down of the most common selectors used by
> some popular jQuery sites: http://ejohn.org/files/selectors.html

Excellent.  Data!

Looking at this data, the top four red selectors would be quite fast to 
redo by hand, since they would just involve getElementById calls.  Note 
that such calls are actually _faster_ than querySelectorAll on the #id 
selector, since the latter has to walk the whole DOM every time.  All 
the character index would save you in this case is finding the start of 
the pseudo-class.

The next three are a bare red pseudo-class.  They would be done 
completely by hand even with the indexing proposal.

The next two are "tag:red-pseudo-class".  Doing it by hand means a 
getElementsByTagName() (which is about as fast as, if not faster than, a 
querySelectorAll() on the tagname, I suspect), followed by a filter on 
the red pseudo-class.  This is very similar to the #id case above.
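
A sketch of the by-hand version for the "tag:red-pseudo-class" case, 
with made-up names; `filter` is whatever test the library already uses 
for its own pseudo-class:

```javascript
// Hypothetical fallback for "tag:pseudo" when the UA rejects the
// pseudo-class: collect by tag name, then filter in script.
function queryTagWithFilter(root, tagName, filter) {
  var candidates = root.getElementsByTagName(tagName);
  var matches = [];
  for (var i = 0; i < candidates.length; ++i) {
    if (filter(candidates[i])) {
      matches.push(candidates[i]);
    }
  }
  return matches;
}

// In a page, with a hypothetical library predicate isHidden:
//   queryTagWithFilter(document, "div", isHidden);
```

No error index is needed for this beyond knowing where the tag name 
ends, which the library can find on its own.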

After that you start getting into some selectors where you might get 
more savings (I'm looking at ".class .class tag:gt(2)"), but these are 
not that common on an absolute basis compared to the more-common things 
discussed above, and there are lots of red things with similar 
frequencies to ".class .class tag:gt(2)" that fall into one of the 
categories above.

Again, I agree there are cases where the index would help.  I'm just 
wondering whether they're common enough, and whether the help is 
noticeable in those cases in terms of performance, and whether it's easy 
to tell those cases apart from the cases when the index just doesn't 
help much.

> Libraries would, again, be completely stuck for a way to work around those
> issues.

You've said this repeatedly, but in the end the worst-case scenario for 
a library is that it falls back on exactly what it does now after trying 
querySelectorAll and seeing that it fails.  That's not the same as 
"completely stuck", by a long shot....

-Boris
Received on Thursday, 1 May 2008 17:52:44 UTC