- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Sun, 6 Mar 2016 06:47:05 -0500
- To: rsc@swtch.com
- Cc: public-shex-dev@w3.org, Iovka Boneva <iovka.boneva@univ-lille1.fr>, contact@regex101.com
You seem to be a bit of a regexp geek so I'm contacting you out of the blue with a crackpot question. Given a pattern with repeated capture groups, e.g. /(a(b)*(c*))*/, why don't any tools (that I've seen) return a tree of solutions? Using javascript notation because it's in every browser console: "abbcccacc".match(/(a(b)*(c*))*/) returns (the matched text and) three capture groups: ["abbcccacc", "acc", undefined, "cc"] Presumably, these are the matches from the last nested captures. Adding distinct markers confirms this: "a1b2b3c4c5c6a7c8c9".match(/(a.(b.)*((?:c.)*))*/) ["a1b2b3c4c5c6a7c8c9", "a7c8c9", undefined, "c8c9"] Why not something a bit more informative like a tree of the matched captures?: ["a1b2b3c4c5c6a7c8c9", ["a1b2b3c4c5c6", ["b1", "b2", "b3"], ["c4c5c6"]], ["a7c8c9", [], ["c8c9"]] ] Note that https://regex101.com/ does the one level of this where: /(a.(b.)*((?:c.)*))/g gives: MATCH 1 1. [0-12] `a1b2b3c4c5c6` 2. [4-6] `b3` 3. [6-12] `c4c5c6` MATCH 2 1. [12-18] `a7c8c9` 3. [14-18] `c8c9` There's no nested capture of b1 and b2 but there is: [[ Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data ]] If the automata coding the beginning of a capture group essentially null out all of the nested capture groups (how we get an undefined for the empty capture of (b.)* above), it seems like they could just as easily push a new empty capture group. We can mechanically synthesize this tree by iterating through the repeated groups, providing ranges of 1: ↓ "a1b2b3c4c5c6a7c8c9".match(/(a.(b.)*((?:c.)*)){1}/) ["a1b2b3c4c5c6", "a1b2b3c4c5c6", "b3", "c4c5c6"] and recursively visiting each capture group "a1b2b3c4c5c6".match(/a.(b.){1}((?:c.)*)/) -> "b2" "a1b2b3c4c5c6".match(/a.(b.){2}((?:c.)*)/) -> "b3" "a1b2b3c4c5c6".match(/a.(b.){3}((?:c.)*)/) -> null; next group "a1b2b3c4c5c6".match(/a.(b.)*((?:c.)*)/) -> "c4c5c6" done note that "(?:c.)*" is not captured so we don't iterate there. "a1b2b3c4c5c6a7c8c9".match(/(a.(b.)*((?:c.)*)){2}/) ["a1b2b3c4c5c6a7c8c9", "a7c8c9", undefined, "c8c9"] recurse nested... So my real question is, if this is so easy, why don't implementations do it? (Am I the first person on earth to think this would be useful?) -- -ericP office: +1.617.599.3509 mobile: +33.6.80.80.35.59 (eric@w3.org) Feel free to forward this message to any list for any purpose other than email address distribution. There are subtle nuances encoded in font variation and clever layout which can only be seen by printing this message on high-clay paper.
Received on Sunday, 6 March 2016 11:47:19 UTC