Re: CSS regexes from Tab Atkins Jr. on 2011-08-04 (www-style@w3.org from August 2011)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Thu, 4 Aug 2011 12:18:33 -0700
To: jwl@worldmusic.de
Cc: www-style@w3.org
Message-ID: <CAAWBYDDkhRyY8-SBos7kZzOTSPHmCgWP5W6vCAUuo3mz2Yahxw@mail.gmail.com>
On Thu, Jul 21, 2011 at 3:59 PM, Joergen W. Lang <joergen_lang@gmx.de> wrote:
> On 18.07.11 22:58, Tab Atkins Jr. wrote:
>> Unfortunately, this sort of thing has several problems that make it
>> hard to implement.
>
> So does "hard" mean "too hard to even bother" or "hard but not impossible"?
> Unfortunately I am not an implementer. I have a strong perlish background
> and have done a fair bit of web programming. Currently I am more on the
> authoring side (web sites and books, that is).

Probably the former.  If it was shown to be sufficiently valuable, or
we decided to also do another thing with similar problems (there are
several), it could be worth solving the issues.  But hey, hope springs
eternal!  These are problems I'd like to solve for other features.


>> 1. Does it match across element boundaries?  If so, it'll be a lot
>> slower.  If not, it's much less useful.
>
> If I imagine using :regex() it would certainly be limited by whatever
> selector it is attached to. If I use
>
>  p:regex(/position|top|left|bottom|right/) {
>    color: red;
>  }

Sorry, let me clarify my question.  Take this situation:

<p>This is <em>so</em> cool!</p>
<style>
p::regex(/so cool/) { color: red; }
p::regex(/is so cool/) { color: green; }
p::regex(/s s/) { color: blue; }
</style>

Do any of these match?  If you're operating just on the total text of
the p, they would, but there's an element in the middle now.
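
To make the question concrete, here is a rough Python sketch (not browser
code; the TextNodeCollector class is a made-up helper) that splits the
example paragraph into its text runs and tries the three patterns both
per text run and against the concatenated text of the p:

```python
import re
from html.parser import HTMLParser

class TextNodeCollector(HTMLParser):
    """Hypothetical helper: records each run of character data separately,
    roughly the way a DOM would hold separate text nodes."""
    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)

collector = TextNodeCollector()
collector.feed("<p>This is <em>so</em> cool!</p>")
collector.close()

nodes = collector.nodes          # the separate text runs
total_text = "".join(nodes)      # the "total text of the p"

patterns = ["so cool", "is so cool", "s s"]
# Does the pattern match inside any single text run?
per_node = {p: any(re.search(p, n) for n in nodes) for p in patterns}
# Does the pattern match the concatenated text?
on_total = {p: re.search(p, total_text) is not None for p in patterns}

print(per_node)
print(on_total)
```

In this sketch every pattern matches the concatenated text and none
matches any single text run, which is exactly the ambiguity in question.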


> This leads to more questions:
>
> * Just how much slower is 'slower' actually?
> * Would it be acceptable under certain circumstances?
> * Is there any way to benchmark these things?

I'm not sure how much slower it is, but I know it's probably at least
a decent bit.  In the internal data structures that every browser
uses, the text is stored in at least three pieces - "This is " and "
cool!" are in separate text nodes, and "so" is in another textnode
inside the <em> node.  Trying to match a regex across those probably
involves walking through and constructing an intermediate string that
concatenates them all.  This is potentially slow and expensive;
imagine, for example, you were doing a "body:regex(...)" or, even
worse, "*:regex(...)".
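
As a minimal sketch of the extra work described above (an assumption
about how it might be done, not how any browser does it), the function
below builds the intermediate string, runs the regex once, and then maps
the match back onto the individual text runs.  The name
match_across_nodes is made up for illustration.

```python
import re
from itertools import accumulate

def match_across_nodes(nodes, pattern):
    """Return (node_index, local_start, local_end) for each text run
    touched by the first match of `pattern` in the concatenated text."""
    total = "".join(nodes)                  # the intermediate string
    m = re.search(pattern, total)
    if m is None:
        return []
    # starts[i] is the offset of nodes[i] within the concatenated text.
    starts = [0] + list(accumulate(len(n) for n in nodes))
    pieces = []
    for i in range(len(nodes)):
        lo = max(m.start(), starts[i])
        hi = min(m.end(), starts[i + 1])
        if lo < hi:                         # the match overlaps this run
            pieces.append((i, lo - starts[i], hi - starts[i]))
    return pieces

pieces = match_across_nodes(["This is ", "so", " cool!"], "is so cool")
print(pieces)
```

Here a single match ends up as three pieces, one per text run, and the
concatenation has to be rebuilt for every element the selector applies to.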


>> 2. Does it match across textnodes?  Even when the page *looks* like
>> it's just continuous text, the text may actually be broken across
>> separate textnodes.  This has the same implications as the previous,
>> except it's more confusing because you can only tell when a run of
>> text is broken into multiple textnodes by examining it from script.
>
> Very likely yes. Yet, limited by the selector to which :regex() was
> attached.

Right.  I'd certainly agree with this; the fact that contiguous text
separates into separate textnodes is normally irrelevant.  Browsers
can (and do, in some browsers?) merge them together.
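
For illustration only, Python's xml.dom.minidom (standing in for a
browser DOM here, which is an assumption of this sketch) shows the
effect: two adjacent text nodes hide a match from a per-node search
until normalize() merges them.

```python
import re
from xml.dom.minidom import Document

doc = Document()
p = doc.createElement("p")
doc.appendChild(p)
# Two adjacent text nodes that would render as the continuous text "foobar".
p.appendChild(doc.createTextNode("foo"))
p.appendChild(doc.createTextNode("bar"))

def text_nodes(element):
    """Data of the element's direct text-node children."""
    return [c.data for c in element.childNodes if c.nodeType == c.TEXT_NODE]

before = text_nodes(p)
matches_before = any(re.search("oba", t) for t in before)

p.normalize()                    # merge adjacent text nodes

after = text_nodes(p)
matches_after = any(re.search("oba", t) for t in after)

print(before, matches_before)
print(after, matches_after)
```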


>> 3. What happens if two regexes (or two applications of the same regex)
>> overlap?  CSS always works with trees, so you'd need some way to
>> determine which one gets broken apart, which one is innermost, etc.
>
> Sorry, not sure what you mean by 'overlap'.
> 'Trying to style the same content'?
>
> Also not sure what you mean by 'broken apart' or 'innermost'.
> Could you please explain? Regexes trying to match nested elements? Nested
> regexes?

Take this example:

<p>foobarbaz</p>
<style>
::regex(/foobar/) { color: red; }
::regex(/barbaz/) { color: green; }
</style>

The first regex "overlaps" the second.  If you treated this naively,
you'd have a non-tree structure, where "bar" has two parent boxes.
CSS doesn't allow that; a lot of CSS is based on the page structure
being a tree.  This can already happen in current CSS, with things
like ::first-line, and we fix it (in a somewhat undefined way) by
"breaking up" the ::first-line pseudo to preserve a tree structure.

Here, you'd have to either do some similar fixup (and specify it), or
perhaps just disallow overlapping by making one of them not match.
Whatever you do, though, it'll end up being somewhat confusing.
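
One possible fixup, sketched in Python purely as an illustration (nothing
here is specified anywhere, and the rule names are made up): cut the text
at every match boundary so that each resulting segment is either fully
inside or fully outside each match.

```python
import re

text = "foobarbaz"
# Hypothetical rules: a label for the declaration block, and its pattern.
rules = {"red": "foobar", "green": "barbaz"}

# Find the span each rule would cover.
spans = {}
for label, pattern in rules.items():
    m = re.search(pattern, text)
    if m is not None:
        spans[label] = m.span()

# Cut the text at every span edge.
cuts = sorted({0, len(text), *(edge for span in spans.values() for edge in span)})

# For each segment, list the rules whose span fully contains it.
segments = []
for lo, hi in zip(cuts, cuts[1:]):
    active = [label for label, (s, e) in spans.items() if s <= lo and hi <= e]
    segments.append((text[lo:hi], active))

print(spans)
print(segments)
```

This gives three flat segments, with "bar" claimed by both rules.  It
avoids the two-parent problem but still leaves open which rule wins on
"bar" and how the pieces would nest as boxes.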


> Generally, I would expect race conditions to be treated by the rules of
> cascade and specificity as much as possible.
>
> If two instances of :regex() try to style the same content, the one attached
> to the selector with the highest specificity wins. Regexes should not
> overlap.

The problem occurs at a lower level than specificity.  You first have
to figure out just what the structure of the CSS box-tree is - the
::regex pseudo-elements have to be specified to nest in a particular
way to make a tree.


>> 4. There are performance implications with running regex across the
>> entire DOM like that, as you have to rerun all of them any time the
>> text content of anything in the page changes.
>
> Why should a regex run against the entire DOM? I do not see that unless
> someone really wanted to do something like
>
>  :root:regex(/[something crazily complex]/) { ... }

People will do that more often than you'd think.  ^_^  Or even just
things like "section::regex(...)" or "::regex(...)" by itself can end
up dealing with a lot of text.
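
A rough way to see how the text adds up, under the naive assumption that
a bare "::regex(...)" scans the concatenated text of every element (real
engines may well be smarter): each character is then scanned once per
ancestor element.  The function names below are made up for this sketch.

```python
from xml.dom.minidom import parseString

def text_content(node):
    """Concatenated text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_content(child) for child in node.childNodes)

def chars_scanned(element):
    """Characters a naive per-element regex pass would read in this subtree."""
    total = len(text_content(element))
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            total += chars_scanned(child)
    return total

doc = parseString(
    "<body><section><p>This is <em>so</em> cool!</p></section></body>"
)
root = doc.documentElement

page_chars = len(text_content(root))
scanned = chars_scanned(root)
print(page_chars, scanned)
```

Under this model, a page with 16 characters of text already costs 50
characters of scanning per regex, and deeper nesting makes it worse.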


> Thinking of performance issues, these things could maybe help to speed up
> things:
>
> * allow :regex() only on a subset of selectors
> * allow only a basic subset of operators in the regex to
>  cover the most common use cases
>  (do we need backreferences, lookahead, case sensitivity?)
> * re-use an already implemented regex engine
>  (could this actually be done?
>   Even if JS is deactivated by the user?)
> * allow only a certain number of regexes in one style sheet
>  (including all types of style sheets)
> * limit the amount of properties that could be applied
>  via a regex

I suspect that most of the performance problems are inherent to the
idea, rather than to the details about the regex, unfortunately.
There's no easy fix.


>> So, while I think it's a pretty cool idea that would be useful in
>> several ways (I wanna style all my ampersands with a pretty font!), I
>> don't think it's something that can actually be done.  :/
>
> Hmmm that sounds much more pessimistic than the previous 'hard to
> implement'. If it was only for ampersands I would probably be happy to
> continue inserting <span>s via search and replace in BBEdit or whatever.
>
> I was thinking mostly of [syntax] highlighting in
>
> * code examples
> * books (Koran, Bible, Talmud, Veda, Tao-te-ching, ...)
> * legal texts
> * technical documentation
> * ...

Yup, syntax highlighting unfortunately hits right on all the hard issues.

~TJ
Received on Thursday, 4 August 2011 19:19:20 UTC