RE: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")

Yes, but the spec says that if reluctant quantifiers are used, i.e.
those with ?, then the regex "matches the shortest possible substring
consistent with the match as a whole succeeding."  

All the best, Ashok

> -----Original Message-----
> From: public-qt-comments-request@w3.org [mailto:public-qt-comments-
> request@w3.org] On Behalf Of Tobias Reif
> Sent: Monday, August 18, 2003 2:50 PM
> To: public-qt-comments@w3.org
> Cc: Jeni Tennison
> Subject: Re: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")
> 
> 
> Hi Jeni
> 
>  > The definition of the fn:tokenize() function says:
>  >
>  >  "If the supplied $pattern matches a zero-length string, the
>  >   fn:tokenize() function breaks the string into its component
>  >   characters. The nth character in the $input string becomes the
nth
>  >   string in the result sequence; each string in the result sequence
>  >   has a string length of one."
> 
> Exactly.
> 
>  > In the example above, the pattern ".?" is a pattern that matches a
>  > zero-length string;
> 
> But it matches more than a zero-length string AFAICS. It is not
> explicitly specified what happens when the pattern matches more than
the
> zero-length string. IMHO returning an empty sequence is the only
> consistent behaviour; anything else is hard to specify unambiguously.
> 
>  > therefore, the fn:tokenize() function breaks the
>  > string into its component characters. You would get the same
behaviour
>  > from any pattern that matches a zero-length string, such as "" or
>  > ".*".
> 
> But of the regexen you list only "" matches nothing but the
zero-length
> string.
> 
> http://www.w3.org/TR/xmlschema-2/#regexs
> says
> 
> (empty string)
> =>
> the set containing just the empty string
> 
> S?
> =>
> the empty string, and all strings in L(S).
> 
> S*
> =>
> All strings in L(S?) and all strings st  with s in L(S*)  and t in
L(S).
> ( all concatenations of zero or more strings from L(S) )
> 
> 
> I agree with you that when fed patterns matching the zero-length
string
> eg
> fn:tokenize("abba", "")
> or
> fn:tokenize("abba", ".{0}")
> tokenize should return ("a", "b", "b", "a").
> 
> This is consistent with commonly expected regex behaviour:
> 
> ruby -e "p('abba'.split(//))"
> ["a", "b", "b", "a"]
> ruby -e "p('abba'.split(/.{0}/))"
> ["a", "b", "b", "a"]
> 
> But with patterns that match more than the zero length string it can
> only return the empty sequence IMHO
> 
> ruby -e "p('abba'.split(/.*/))"
> []
> ruby -e "p('abba'.split(/.?/))"
> []
> ruby -e "p('abba'.split(/./))"
> []
> 
> Here's what the patterns match in Ruby:
> ruby -e "p('abba'.scan(/.*/))"
> ["abba", ""]
> ruby -e "p('abba'.scan(/.?/))"
> ["a", "b", "b", "a", ""]
> ruby -e "p('abba'.scan(/./))"
> ["a", "b", "b", "a"]
> 
>  > Does that explain the example sufficiently?
> 
> As I described it is not specific enough, leaving room for different
> interpretations and thus differing implementations.
> 
> It is not clear why
> 'fn:tokenize("abba", ".?") should return ("a", "b", "b", "a")'
> since the regex matches more than just the zero-lenght string.
> 
> I suggest to change the spec in this regard and clarify the wording,
for
> example by changing it to:
> 
>   "If the supplied $pattern matches nothing but a zero-length string,
>    [...]"
> 
>   'fn:tokenize("abba", "") returns ("a", "b", "b", "a")'
>   'fn:tokenize("abba", ".?") returns ()'
>   'fn:tokenize("abba", ".") returns ()'
> 
> Tobi
> 
> --
> http://www.pinkjuice.com/
> 

Received on Monday, 18 August 2003 17:57:46 UTC