[xslt2 func/op] tokenizing "abba" to ("a","b","b","a") from Tobias Reif on 2003-08-18 (public-qt-comments@w3.org from August 2003)

From: Tobias Reif <tobiasreif@pinkjuice.com>
Date: Mon, 18 Aug 2003 18:33:05 +0200
To: public-qt-comments@w3.org
Message-ID: <3F40FFC1.9020303@pinkjuice.com>

Hi

I need to get a sequence of the characters in a string.

The current draft
  http://www.w3.org/TR/xquery-operators/#func-tokenize
says
  'fn:tokenize("abba", ".?") returns ("a", "b", "b", "a")'

I don't understand why it should.

After days of confusion (caused by various factors), I really would
appreciate a friendly, helpful, and clear explanation.

Here's what I would expect:
(and what Ruby does)

ruby -e "p('abba'.split(/.?/))"
[]
ruby -e "p('abba'.split(/./))"
[]
ruby -e "p('abba'.split(//))"
["a", "b", "b", "a"]

Ruby's [] is an empty array,
   and is equivalent to XSLT2's (), the empty sequence.
Ruby's ["a", "b", "b", "a"] is an array of all the characters in the 
string (and nothing else),
   and is equivalent to XSLT's ("a", "b", "b", "a").

The spec says
"This function breaks the $input string into a sequence of strings,
treating any substring that matches $pattern as a separator."

.? matches the characters, thus the separators split the string into
the empty strings between the characters.
Either a sequence of empty strings should be returned or, perhaps most
sensibly, the empty sequence.
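
To make the two candidate results concrete, here is a minimal Ruby
sketch. It relies on String#split dropping trailing empty strings unless
a negative limit is passed. The helper name split_keeping_empties is
made up for this illustration and is not part of Ruby or of the spec.

```ruby
# Hypothetical helper (name invented for this example): split while
# keeping trailing empty fields by passing a negative limit.
def split_keeping_empties(str, pattern)
  str.split(pattern, -1)
end

# Each of the four characters acts as a separator, so only the empty
# strings around the separators remain.
p split_keeping_empties('abba', /./)  # prints ["", "", "", "", ""]

# Without the negative limit, Ruby strips the trailing empty strings,
# which leaves the empty array.
p 'abba'.split(/./)                   # prints []
```

This only shows what Ruby does; it says nothing about what the draft
intends for fn:tokenize.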

The empty regex matches all zero-length strings, thus the separators
split the string into its characters.
A sequence containing each character of the string (and nothing else)
should be returned.
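
As a sketch of that reading in Ruby: splitting on the empty regex gives
the same result as String#chars. The method name characters_via_split
is invented for this example.

```ruby
# Hypothetical helper (name invented for this example): obtain the
# characters of a string by splitting on the empty regex.
def characters_via_split(str)
  str.split(//)
end

p characters_via_split('abba')                  # prints ["a", "b", "b", "a"]

# The result matches Ruby's built-in String#chars.
p characters_via_split('abba') == 'abba'.chars  # prints true
```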

Unless I get an explanation convincing me that the current version
of the example is correct, I suggest changing
  'fn:tokenize("abba", ".?") returns ("a", "b", "b", "a")'
to
  'fn:tokenize("abba", "") returns ("a", "b", "b", "a")'

... and perhaps add
  'fn:tokenize("abba", ".?") returns ()'
and
  'fn:tokenize("abba", ".") returns ()'

In any case I need to be able to rely on an unambiguous spec.

If the example in the spec is correct from your POV, please add a clear 
and unambiguous explanatory specification of the behaviour.

If it is incorrect, please consider changing it to
  'fn:tokenize("abba", "") returns ("a", "b", "b", "a")'

If you say or decide that your spec describes both examples (the one in 
the spec and the above) as being correct, or if you think that 
tokenize("abba", "") should not return the sequence of characters, then 
please consider adding the above example (tokenize("abba", "")) plus a 
clear explanatory specification of the behaviour in each case.

Tobi

-- 
http://www.pinkjuice.com/

Received on Monday, 18 August 2003 12:34:42 UTC