- From: Tobias Reif <tobiasreif@pinkjuice.com>
- Date: Mon, 18 Aug 2003 23:50:17 +0200
- To: public-qt-comments@w3.org
- CC: Jeni Tennison <jeni@jenitennison.com>
Hi Jeni > The definition of the fn:tokenize() function says: > > "If the supplied $pattern matches a zero-length string, the > fn:tokenize() function breaks the string into its component > characters. The nth character in the $input string becomes the nth > string in the result sequence; each string in the result sequence > has a string length of one." Exactly. > In the example above, the pattern ".?" is a pattern that matches a > zero-length string; But it matches more than a zero-length string AFAICS. It is not explicitly specified what happens when the pattern matches more than the zero-length string. IMHO returning an empty sequence is the only consistent behaviour; anything else is hard to specify unambiguously. > therefore, the fn:tokenize() function breaks the > string into its component characters. You would get the same behaviour > from any pattern that matches a zero-length string, such as "" or > ".*". But of the regexen you list only "" matches nothing but the zero-length string. http://www.w3.org/TR/xmlschema-2/#regexs says (empty string) => the set containing just the empty string S? => the empty string, and all strings in L(S). S* => All strings in L(S?) and all strings st with s in L(S*) and t in L(S). ( all concatenations of zero or more strings from L(S) ) I agree with you that when fed patterns matching the zero-length string eg fn:tokenize("abba", "") or fn:tokenize("abba", ".{0}") tokenize should return ("a", "b", "b", "a"). This is consistent with commonly expected regex behaviour: ruby -e "p('abba'.split(//))" ["a", "b", "b", "a"] ruby -e "p('abba'.split(/.{0}/))" ["a", "b", "b", "a"] But with patterns that match more than the zero length string it can only return the empty sequence IMHO ruby -e "p('abba'.split(/.*/))" [] ruby -e "p('abba'.split(/.?/))" [] ruby -e "p('abba'.split(/./))" [] Here's what the patterns match in Ruby: ruby -e "p('abba'.scan(/.*/))" ["abba", ""] ruby -e "p('abba'.scan(/.?/))" ["a", "b", "b", "a", ""] ruby -e "p('abba'.scan(/./))" ["a", "b", "b", "a"] > Does that explain the example sufficiently? As I described it is not specific enough, leaving room for different interpretations and thus differing implementations. It is not clear why 'fn:tokenize("abba", ".?") should return ("a", "b", "b", "a")' since the regex matches more than just the zero-lenght string. I suggest to change the spec in this regard and clarify the wording, for example by changing it to: "If the supplied $pattern matches nothing but a zero-length string, [...]" 'fn:tokenize("abba", "") returns ("a", "b", "b", "a")' 'fn:tokenize("abba", ".?") returns ()' 'fn:tokenize("abba", ".") returns ()' Tobi -- http://www.pinkjuice.com/
Received on Monday, 18 August 2003 17:51:54 UTC