- From: Jeni Tennison <jeni@jenitennison.com>
- Date: Sun, 6 Jan 2002 17:19:31 +0000
- To: www-xml-query-comments@w3.org
- CC: David Carlisle <davidc@nag.co.uk>, xsl-list@lists.mulberrytech.com
It's great to see regular expression support in the F&O WD :)

David Carlisle wrote:
> xf:match
> This seems to be underspecified in cases that the matching regions
> overlap. if the regexp is aa and the string is aaa do you just get
> (1) or (1 2) (this also applies to xf:replace)

Most regular expression languages don't find overlapping matches, do
they? It seems to add a lot of extra complexity if they do.

> Slightly worried that, since xpath sequences do not nest, this
> semantic will prevent any future extension to allow sed/emacs/perl
> style numbered subexpressions. Also it forces the system always to
> match the entire string, which may be rather long, rather than
> stopping once a match is found.
>
> If instead it just returned the position of the first match a
> plausible extension would be that if the regexp was
>   \(aa\)xx\(bb\)
> then what was returned was a sequence consisting of the position of
> the entire match followed by the positions of each of the
> subexpressions.
> a future extension to xf:replace could then use (something
> equivalent to &1 or $1 or \1 in current regexp languages) to access
> the matched subexpressions in the replacement text.

In the description of xf:replace() it says:

  The value of $repval may use the standard regular expression syntax
  of "$N" (where N is some integer) to represent the N-th part of the
  matched pattern indicated by parentheses in the value of $regexp.

So it seems that the intention is that you can pull out specific
subexpressions, as illustrated in the example:

  replace("aFOOa aBARa", "a(.*)a", "b$1b") => "bFOOb bBARb"

Part of the reason, I think, that the xf:match() and xf:replace()
functions are so under-specified is that the regular expression syntax
in XML Schema Datatypes is just not designed for this kind of use - it
is purely designed for testing whether an entire string matches a
particular regular expression.
Thus there is no support in XML Schema regular expressions for things
that are common in other regular expressions and the functions that
use them:

  - matches covering the entirety of the string, or a portion of the
    string
  - meta-characters matching the start and end of the string
  - single vs. global matches
  - greedy vs. parsimonious matches
  - non-capturing matches
  - backreferences within regular expressions

I think that the difference between a match on a portion of the string
and on the string as a whole could be managed by introducing
meta-characters matching the start and end of the string - if you want
the regular expression to match the entire string, then you can always
add these characters at the start and end of the regular expression.
Taking the usual ^ and $ to match the start and end of the string, for
example, given the string "aFOOa aBARa":

  "a(.*)a" matches "aFOOa"
                   "aBARa"
  [assuming parsimonious and non-overlapping matches]

whereas:

  "^a(.*)a$" matches "aFOOa aBARa"

I think that these would be useful generally, to support a
regular-expression starts-with()-type function.

Looking at the individual vs. global match, with an individual match
on the string "aFOOa aBARa", the regular expression "a(.*)a" would
match "aFOOa". With a global match it would match the two strings
"aFOOa" and "aBARa" (assuming non-overlapping matches).

One option would be to always do a global match, with the xf:match()
function returning the sequence of all the matched strings, thus:

  xf:match("aFOOa aBARa", "a(.*)a") => ("aFOOa", "aBARa")

The user could then take the first of these results to get the same
result as with a single match. However, as David pointed out, this
would lead to problems if you had a long string or if you had
something like:

  xf:match("aFOOa aBARa", ".")

since "." could match any character within the match string. Which
leads on to the question of parsimonious or greedy matches.
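To make the single/global and anchored cases above concrete, here is a
minimal sketch using Python's `re` module. This is an assumption for
illustration only: Python's syntax is not the XML Schema regular
expression syntax, and `*?` is Python's spelling of a parsimonious
quantifier, which the two-match result above presupposes. Python, like
most engines, returns non-overlapping matches.

```python
import re

s = "aFOOa aBARa"

# Single match: stop at the first hit (parsimonious quantifier).
single = re.search(r"a(.*?)a", s).group(0)

# Global match: every non-overlapping hit, scanning left to right.
global_matches = [m.group(0) for m in re.finditer(r"a(.*?)a", s)]

# Anchored match: ^ and $ force the pattern to cover the whole string.
anchored = re.search(r"^a(.*)a$", s).group(0)

# A global match on "." yields one result per character of the string.
per_char = re.findall(r".", s)

# David's overlap case: "aa" against "aaa", reported as 1-based positions.
overlap_positions = [m.start() + 1 for m in re.finditer(r"aa", "aaa")]

print(single)             # aFOOa
print(global_matches)     # ['aFOOa', 'aBARa']
print(anchored)           # aFOOa aBARa
print(len(per_char))      # 11
print(overlap_positions)  # [1]
```

With non-overlapping matching, the "aa" against "aaa" case gives only
position 1, never (1 2).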
From what I gather, regular expressions elsewhere usually do greedy
matches, where, given the string "aFOOa aBARa", the regular expression
"a(.*)a" would match the entire string, since it starts and ends with
an 'a'. Given this, the example:

  replace("aFOOa aBARa", "a(.*)a", "b$1b")

in the F&O WD should actually return "bFOOa aBARb" rather than
"bFOOb bBARb" as given in the F&O WD.

Given that greedy matches are the norm in other languages, I think
they should be the norm here. Supporting parsimonious matches would
involve supplementing the XML Schema regular expression syntax, for
example by allowing a "?" after quantifiers. To match "aFOOa" rather
than "aFOOa aBARa" you would use the regular expression "a(.*?)a".

On what's returned by xf:match(), I think that getting the index of
the start of the match is insufficient - you also need to know the
length of the matched string in order to do anything useful with the
results of the match.

(Of course in some situations you might not be interested in the
results of the match, just in whether or not the string matches - for
this reason, I think an xf:test() function with the signature:

  xf:test(string? $srcval, string? $regexp) => boolean

would be useful, returning true if the string matches the regular
expression at all. Alternatively, you could have regular expression
versions of the current string manipulation functions contains(),
starts-with() and ends-with(), and possibly even of substring-before()
and substring-after().)

There are a couple of possibilities about the result of xf:match() -
you could have a sequence of pairs of integers, each giving the start
index and length of the matched string. Or you could return a sequence
of the matched strings themselves (which would only be a single string
if the match was not global).

I don't think that the xf:match() function needs to return the
positions of the subexpressions, or the subexpressions themselves,
because that functionality could be achieved via xf:replace().
For example, to find out what string was matched by the first
subexpression you could just use "$1" as the replace value.

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/
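As a rough illustration of the greedy vs. parsimonious point in the
message above, here is a sketch in Python's `re` module. It is an
assumption, not the F&O semantics: Python writes backreferences in the
replacement as `\1` rather than "$1", and `regex_test` is a
hypothetical stand-in for the proposed xf:test().

```python
import re

s = "aFOOa aBARa"

# Greedy: .* runs to the last "a", so one match spans the whole string.
greedy = re.sub(r"a(.*)a", r"b\1b", s)

# Parsimonious: .*? stops at the nearest "a", so there are two matches.
lazy = re.sub(r"a(.*?)a", r"b\1b", s)

# Pairs of (1-based start index, length) for each parsimonious match.
spans = [(m.start() + 1, len(m.group(0))) for m in re.finditer(r"a(.*?)a", s)]

# Using only the first subexpression as the replacement value.
subexpressions = re.sub(r"a(.*?)a", r"\1", s)


def regex_test(srcval: str, regexp: str) -> bool:
    """Hypothetical analogue of xf:test(): true if regexp matches anywhere."""
    return re.search(regexp, srcval) is not None


print(greedy)                    # bFOOa aBARb
print(lazy)                      # bFOOb bBARb
print(spans)                     # [(1, 5), (7, 5)]
print(subexpressions)            # FOO BAR
print(regex_test(s, r"a(.*)a"))  # True
```

Under greedy matching the result is "bFOOa aBARb", and the WD's
"bFOOb bBARb" only appears with the parsimonious quantifier. Note that
text outside the matches (the space) survives the replacement.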
Received on Sunday, 6 January 2002 12:19:34 UTC