- From: Ashok Malhotra <ashokma@microsoft.com>
- Date: Mon, 18 Aug 2003 14:57:11 -0700
- To: "Tobias Reif" <tobiasreif@pinkjuice.com>, <public-qt-comments@w3.org>
- Cc: "Jeni Tennison" <jeni@jenitennison.com>
Yes, but the spec says that if reluctant quantifiers are used, i.e. those with ?, then the regex "matches the shortest possible substring consistent with the match as a whole succeeding." All the best, Ashok > -----Original Message----- > From: public-qt-comments-request@w3.org [mailto:public-qt-comments- > request@w3.org] On Behalf Of Tobias Reif > Sent: Monday, August 18, 2003 2:50 PM > To: public-qt-comments@w3.org > Cc: Jeni Tennison > Subject: Re: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a") > > > Hi Jeni > > > The definition of the fn:tokenize() function says: > > > > "If the supplied $pattern matches a zero-length string, the > > fn:tokenize() function breaks the string into its component > > characters. The nth character in the $input string becomes the nth > > string in the result sequence; each string in the result sequence > > has a string length of one." > > Exactly. > > > In the example above, the pattern ".?" is a pattern that matches a > > zero-length string; > > But it matches more than a zero-length string AFAICS. It is not > explicitly specified what happens when the pattern matches more than the > zero-length string. IMHO returning an empty sequence is the only > consistent behaviour; anything else is hard to specify unambiguously. > > > therefore, the fn:tokenize() function breaks the > > string into its component characters. You would get the same behaviour > > from any pattern that matches a zero-length string, such as "" or > > ".*". > > But of the regexen you list only "" matches nothing but the zero-length > string. > > http://www.w3.org/TR/xmlschema-2/#regexs > says > > (empty string) > => > the set containing just the empty string > > S? > => > the empty string, and all strings in L(S). > > S* > => > All strings in L(S?) and all strings st with s in L(S*) and t in L(S). > ( all concatenations of zero or more strings from L(S) ) > > > I agree with you that when fed patterns matching the zero-length string > eg > fn:tokenize("abba", "") > or > fn:tokenize("abba", ".{0}") > tokenize should return ("a", "b", "b", "a"). > > This is consistent with commonly expected regex behaviour: > > ruby -e "p('abba'.split(//))" > ["a", "b", "b", "a"] > ruby -e "p('abba'.split(/.{0}/))" > ["a", "b", "b", "a"] > > But with patterns that match more than the zero length string it can > only return the empty sequence IMHO > > ruby -e "p('abba'.split(/.*/))" > [] > ruby -e "p('abba'.split(/.?/))" > [] > ruby -e "p('abba'.split(/./))" > [] > > Here's what the patterns match in Ruby: > ruby -e "p('abba'.scan(/.*/))" > ["abba", ""] > ruby -e "p('abba'.scan(/.?/))" > ["a", "b", "b", "a", ""] > ruby -e "p('abba'.scan(/./))" > ["a", "b", "b", "a"] > > > Does that explain the example sufficiently? > > As I described it is not specific enough, leaving room for different > interpretations and thus differing implementations. > > It is not clear why > 'fn:tokenize("abba", ".?") should return ("a", "b", "b", "a")' > since the regex matches more than just the zero-lenght string. > > I suggest to change the spec in this regard and clarify the wording, for > example by changing it to: > > "If the supplied $pattern matches nothing but a zero-length string, > [...]" > > 'fn:tokenize("abba", "") returns ("a", "b", "b", "a")' > 'fn:tokenize("abba", ".?") returns ()' > 'fn:tokenize("abba", ".") returns ()' > > Tobi > > -- > http://www.pinkjuice.com/ >
Received on Monday, 18 August 2003 17:57:46 UTC