
Re: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")

From: Tobias Reif <tobiasreif@pinkjuice.com>
Date: Mon, 18 Aug 2003 23:50:17 +0200
Message-ID: <3F414A19.7000200@pinkjuice.com>
To: public-qt-comments@w3.org
CC: Jeni Tennison <jeni@jenitennison.com>

Hi Jeni

 > The definition of the fn:tokenize() function says:
 >
 >  "If the supplied $pattern matches a zero-length string, the
 >   fn:tokenize() function breaks the string into its component
 >   characters. The nth character in the $input string becomes the nth
 >   string in the result sequence; each string in the result sequence
 >   has a string length of one."

Exactly.

 > In the example above, the pattern ".?" is a pattern that matches a
 > zero-length string;

But it matches more than a zero-length string AFAICS. It is not 
explicitly specified what happens when the pattern matches more than the 
zero-length string. IMHO returning an empty sequence is the only 
consistent behaviour; anything else is hard to specify unambiguously.

 > therefore, the fn:tokenize() function breaks the
 > string into its component characters. You would get the same behaviour
 > from any pattern that matches a zero-length string, such as "" or
 > ".*".

But of the regexen you list only "" matches nothing but the zero-length 
string.

http://www.w3.org/TR/xmlschema-2/#regexs
says

(empty string)
=>
the set containing just the empty string

S?
=>
the empty string, and all strings in L(S).

S*
=>
All strings in L(S?) and all strings st with s in L(S*) and t in L(S)
(all concatenations of zero or more strings from L(S))
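
To make the distinction concrete, here is a rough sketch in Ruby. Note the
assumptions: it uses Ruby's regex dialect rather than the XML Schema one,
and the helper name whole_match? is made up for illustration. It checks,
for each pattern discussed in this thread, whether the pattern accepts the
empty string and whether it also accepts the non-empty string "a":

```ruby
# Hypothetical helper (not part of any spec or library): does the pattern
# match the WHOLE string? XML Schema regexes are implicitly anchored, so
# we anchor explicitly with \A and \z here.
def whole_match?(pattern, string)
  !(/\A(?:#{pattern})\z/ =~ string).nil?
end

patterns = ["", ".{0}", ".?", ".*", "."]

# Map each pattern to [accepts "", accepts "a"].
table = patterns.map do |pat|
  [pat, [whole_match?(pat, ""), whole_match?(pat, "a")]]
end.to_h

table.each do |pat, (empty, one_char)|
  puts "#{pat.inspect}: matches \"\" = #{empty}, matches \"a\" = #{one_char}"
end
```

Under these assumptions only "" and ".{0}" accept the empty string and
nothing else; ".?" and ".*" accept it along with non-empty strings, and "."
does not accept it at all.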


I agree with you that when fed patterns matching only the zero-length
string, e.g.
fn:tokenize("abba", "")
or
fn:tokenize("abba", ".{0}")
tokenize should return ("a", "b", "b", "a").

This is consistent with commonly expected regex behaviour:

ruby -e "p('abba'.split(//))"
["a", "b", "b", "a"]
ruby -e "p('abba'.split(/.{0}/))"
["a", "b", "b", "a"]

But with patterns that match more than the zero-length string it can 
only return the empty sequence, IMHO:

ruby -e "p('abba'.split(/.*/))"
[]
ruby -e "p('abba'.split(/.?/))"
[]
ruby -e "p('abba'.split(/./))"
[]

Here's what the patterns match in Ruby:
ruby -e "p('abba'.scan(/.*/))"
["abba", ""]
ruby -e "p('abba'.scan(/.?/))"
["a", "b", "b", "a", ""]
ruby -e "p('abba'.scan(/./))"
["a", "b", "b", "a"]
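
A side note on the empty arrays that split returns above, offered as a
sketch rather than as a statement about what the spec should do: Ruby's
String#split drops trailing empty fields by default and keeps them only
when given a negative limit. For the pattern /./ the difference looks like
this:

```ruby
# Default behaviour: every field between the matched characters is empty,
# and trailing empty fields are suppressed, so nothing is left.
default_fields = 'abba'.split(/./)

# A negative limit tells split to keep trailing empty fields.
all_fields = 'abba'.split(/./, -1)

p default_fields  # []
p all_fields      # ["", "", "", "", ""]
```

So the empty result here comes from Ruby discarding empty fields, not from
the pattern failing to match.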

 > Does that explain the example sufficiently?

As I described, the wording is not specific enough; it leaves room for 
different interpretations and thus differing implementations.

It is not clear why
'fn:tokenize("abba", ".?") should return ("a", "b", "b", "a")'
since the regex matches more than just the zero-length string.

I suggest changing the spec in this regard and clarifying the wording, 
for example by changing it to:

  "If the supplied $pattern matches nothing but a zero-length string,
   [...]"

  'fn:tokenize("abba", "") returns ("a", "b", "b", "a")'
  'fn:tokenize("abba", ".?") returns ()'
  'fn:tokenize("abba", ".") returns ()'

Tobi

-- 
http://www.pinkjuice.com/
Received on Monday, 18 August 2003 17:51:54 UTC
