- From: Tobias Reif <tobiasreif@pinkjuice.com>
- Date: Mon, 18 Aug 2003 23:50:17 +0200
- To: public-qt-comments@w3.org
- CC: Jeni Tennison <jeni@jenitennison.com>
Hi Jeni
> The definition of the fn:tokenize() function says:
>
> "If the supplied $pattern matches a zero-length string, the
> fn:tokenize() function breaks the string into its component
> characters. The nth character in the $input string becomes the nth
> string in the result sequence; each string in the result sequence
> has a string length of one."
Exactly.
> In the example above, the pattern ".?" is a pattern that matches a
> zero-length string;
But it matches more than just the zero-length string, AFAICS. The spec does
not explicitly say what happens when the pattern matches the zero-length
string and other strings as well. IMHO returning an empty sequence is the only
consistent behaviour; anything else is hard to specify unambiguously.
> therefore, the fn:tokenize() function breaks the
> string into its component characters. You would get the same behaviour
> from any pattern that matches a zero-length string, such as "" or
> ".*".
But of the regexen you list only "" matches nothing but the zero-length
string.
http://www.w3.org/TR/xmlschema-2/#regexs
says
(empty string)
=>
the set containing just the empty string
S?
=>
the empty string, and all strings in L(S).
S*
=>
All strings in L(S?) and all strings st with s in L(S*) and t in L(S).
( all concatenations of zero or more strings from L(S) )
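As a rough illustration of those definitions, here is a small Ruby check of which of the patterns under discussion accept the empty string and which also accept a non-empty one. This is only a sketch: it uses Ruby's regex engine rather than an XML Schema one, and the helper name accepts? is made up for this mail.

```ruby
# Hypothetical helper, not part of any spec or library: true if the whole
# string is in the language of the pattern. XML Schema regexes are implicitly
# anchored, so the pattern is wrapped in \A(?:...)\z here.
def accepts?(pattern, string)
  Regexp.new("\\A(?:#{pattern})\\z").match?(string)
end

["", ".{0}", ".?", ".*", "."].each do |pattern|
  puts "#{pattern.inspect}: empty string #{accepts?(pattern, '')}, \"a\" #{accepts?(pattern, 'a')}"
end
```

Under these assumptions only "" and ".{0}" accept the empty string and nothing else from the two samples; ".?" and ".*" accept both samples, and "." accepts only "a".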
I agree with you that when fed patterns matching only the zero-length string,
e.g.
fn:tokenize("abba", "")
or
fn:tokenize("abba", ".{0}")
tokenize should return ("a", "b", "b", "a").
This is consistent with commonly expected regex behaviour:
ruby -e "p('abba'.split(//))"
["a", "b", "b", "a"]
ruby -e "p('abba'.split(/.{0}/))"
["a", "b", "b", "a"]
But with patterns that match more than the zero-length string, it can
only return the empty sequence, IMHO:
ruby -e "p('abba'.split(/.*/))"
[]
ruby -e "p('abba'.split(/.?/))"
[]
ruby -e "p('abba'.split(/./))"
[]
Here's what the patterns match in Ruby:
ruby -e "p('abba'.scan(/.*/))"
["abba", ""]
ruby -e "p('abba'.scan(/.?/))"
["a", "b", "b", "a", ""]
ruby -e "p('abba'.scan(/./))"
["a", "b", "b", "a"]
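For what it's worth, the [] results from split above depend on Ruby dropping trailing empty strings when no limit is given; a negative limit keeps them. A small sketch of the difference, using nothing beyond String#split:

```ruby
input = 'abba'

# Default: trailing empty strings are dropped, so nothing is left.
default_split = input.split(/./)

# Negative limit: trailing empty strings are kept, one per gap around the
# four matched characters.
unlimited_split = input.split(/./, -1)

# The empty regex splits between the characters.
char_split = input.split(//)

p default_split
p unlimited_split
p char_split
```

So with /./ every character is consumed as a separator, and what remains are only empty strings, which the default call then discards.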
> Does that explain the example sufficiently?
As I described, the wording is not specific enough; it leaves room for
different interpretations and thus differing implementations.
It is not clear why
'fn:tokenize("abba", ".?") should return ("a", "b", "b", "a")'
since the regex matches more than just the zero-length string.
I suggest changing the spec in this regard and clarifying the wording, for
example by changing it to:
"If the supplied $pattern matches nothing but a zero-length string,
[...]"
'fn:tokenize("abba", "") returns ("a", "b", "b", "a")'
'fn:tokenize("abba", ".?") returns ()'
'fn:tokenize("abba", ".") returns ()'
Tobi
--
http://www.pinkjuice.com/
Received on Monday, 18 August 2003 17:51:54 UTC