W3C home > Mailing lists > Public > public-shex-dev@w3.org > March 2016

tree results from repeated capture groups

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sun, 6 Mar 2016 06:47:05 -0500
To: rsc@swtch.com
Cc: public-shex-dev@w3.org, Iovka Boneva <iovka.boneva@univ-lille1.fr>, contact@regex101.com
Message-ID: <20160306114702.GA17971@w3.org>
You seem to be a bit of a regexp geek so I'm contacting you out of the
blue with a crackpot question.

Given a pattern with repeated capture groups, e.g. /(a(b)*(c*))*/,
why don't any tools (that I've seen) return a tree of solutions?

Using javascript notation because it's in every browser console:
  "abbcccacc".match(/(a(b)*(c*))*/)
returns (the matched text and) three capture groups:
  ["abbcccacc", "acc", undefined, "cc"]
Presumably, these are the matches from the last nested captures. Adding
distinct markers confirms this:
  "a1b2b3c4c5c6a7c8c9".match(/(a.(b.)*((?:c.)*))*/)
  ["a1b2b3c4c5c6a7c8c9", "a7c8c9", undefined, "c8c9"]

Why not something a bit more informative like a tree of the matched
captures?:

  ["a1b2b3c4c5c6a7c8c9",
    ["a1b2b3c4c5c6", ["b1", "b2", "b3"], ["c4c5c6"]],
    ["a7c8c9", [], ["c8c9"]]
  ]

Note that https://regex101.com/ does the one level of this where:
  /(a.(b.)*((?:c.)*))/g
gives:
  MATCH 1
  1.	[0-12]	`a1b2b3c4c5c6`
  2.	[4-6]	`b3`
  3.	[6-12]	`c4c5c6`
  MATCH 2
  1.	[12-18]	`a7c8c9`
  3.	[14-18]	`c8c9`
There's no nested capture of b1 and b2 but there is:
[[
  Note: A repeated capturing group will only capture the last
  iteration. Put a capturing group around the repeated group to
  capture all iterations or use a non-capturing group instead if
  you're not interested in the data
]]

If the automata coding the beginning of a capture group essentially
null out all of the nested capture groups (how we get an undefined for
the empty capture of (b.)* above), it seems like they could just as
easily push a new empty capture group. We can mechanically synthesize
this tree by iterating through the repeated groups, providing ranges
of 1:
                                                 ↓
    "a1b2b3c4c5c6a7c8c9".match(/(a.(b.)*((?:c.)*)){1}/)
    ["a1b2b3c4c5c6", "a1b2b3c4c5c6", "b3", "c4c5c6"]

and recursively visiting each capture group

    "a1b2b3c4c5c6".match(/a.(b.){1}((?:c.)*)/) -> "b2"
    "a1b2b3c4c5c6".match(/a.(b.){2}((?:c.)*)/) -> "b3"
    "a1b2b3c4c5c6".match(/a.(b.){3}((?:c.)*)/) -> null; next group
    "a1b2b3c4c5c6".match(/a.(b.)*((?:c.)*)/) -> "c4c5c6" done
    note that "(?:c.)*" is not captured so we don't iterate there.
  "a1b2b3c4c5c6a7c8c9".match(/(a.(b.)*((?:c.)*)){2}/)
  ["a1b2b3c4c5c6a7c8c9", "a7c8c9", undefined, "c8c9"] recurse nested...

So my real question is, if this is so easy, why don't implementations
do it? (Am I the first person on earth to think this would be useful?)

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
Received on Sunday, 6 March 2016 11:47:19 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 4 June 2019 11:00:16 UTC