RE: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a") from Ashok Malhotra on 2003-08-18 (public-qt-comments@w3.org from August 2003)

From: Ashok Malhotra <ashokma@microsoft.com>
Date: Mon, 18 Aug 2003 14:57:11 -0700
To: "Tobias Reif" <tobiasreif@pinkjuice.com>, <public-qt-comments@w3.org>
Cc: "Jeni Tennison" <jeni@jenitennison.com>
Message-ID: <E5B814702B65CB4DA51644580E4853FB0A4C144B@red-msg-12.redmond.corp.microsoft.com>

Yes, but the spec says that if reluctant quantifiers are used, i.e.
those with ?, then the regex "matches the shortest possible substring
consistent with the match as a whole succeeding."  
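
As a rough illustration of the greedy versus reluctant distinction, here is a small sketch in Ruby, the language of the one-liners quoted below. It only shows how Ruby's regex engine behaves and is not a statement of what fn:tokenize() is required to return. The helper name split_and_scan is made up for this example.

```ruby
# Hypothetical helper (not from any spec): reports what a pattern matches
# in the input and how String#split tokenizes the input with it.
def split_and_scan(input, pattern)
  { scan: input.scan(pattern), split: input.split(pattern) }
end

# Greedy optional: consumes one character whenever one is available.
greedy = split_and_scan('abba', /.?/)
# Reluctant optional: prefers the zero-length match at every position.
reluctant = split_and_scan('abba', /.??/)

p greedy[:scan]     # the four characters plus a final empty match
p greedy[:split]    # [] (every token is empty and gets dropped)
p reluctant[:scan]  # five zero-length matches
p reluctant[:split] # the four component characters
```

In Ruby, at least, only the reluctant form behaves like an empty pattern and splits the string into its characters.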

All the best, Ashok

> -----Original Message-----
> From: public-qt-comments-request@w3.org
> [mailto:public-qt-comments-request@w3.org] On Behalf Of Tobias Reif
> Sent: Monday, August 18, 2003 2:50 PM
> To: public-qt-comments@w3.org
> Cc: Jeni Tennison
> Subject: Re: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")
> 
> 
> Hi Jeni
> 
>  > The definition of the fn:tokenize() function says:
>  >
>  >  "If the supplied $pattern matches a zero-length string, the
>  >   fn:tokenize() function breaks the string into its component
>  >   characters. The nth character in the $input string becomes the
>  >   nth string in the result sequence; each string in the result
>  >   sequence has a string length of one."
> 
> Exactly.
> 
>  > In the example above, the pattern ".?" is a pattern that matches a
>  > zero-length string;
> 
> But it matches more than a zero-length string AFAICS. It is not
> explicitly specified what happens when the pattern matches more than
> the zero-length string. IMHO returning an empty sequence is the only
> consistent behaviour; anything else is hard to specify unambiguously.
> 
>  > therefore, the fn:tokenize() function breaks the
>  > string into its component characters. You would get the same
>  > behaviour from any pattern that matches a zero-length string, such
>  > as "" or ".*".
> 
> But of the regexen you list only "" matches nothing but the
> zero-length string.
> 
> http://www.w3.org/TR/xmlschema-2/#regexs
> says
> 
> (empty string)
> =>
> the set containing just the empty string
> 
> S?
> =>
> the empty string, and all strings in L(S).
> 
> S*
> =>
> All strings in L(S?) and all strings st with s in L(S*) and t in L(S).
> ( all concatenations of zero or more strings from L(S) )
> 
> 
> I agree with you that when fed patterns matching only the zero-length
> string, e.g.
> fn:tokenize("abba", "")
> or
> fn:tokenize("abba", ".{0}")
> tokenize should return ("a", "b", "b", "a").
> 
> This is consistent with commonly expected regex behaviour:
> 
> ruby -e "p('abba'.split(//))"
> ["a", "b", "b", "a"]
> ruby -e "p('abba'.split(/.{0}/))"
> ["a", "b", "b", "a"]
> 
> But with patterns that match more than the zero-length string it can
> only return the empty sequence IMHO:
> 
> ruby -e "p('abba'.split(/.*/))"
> []
> ruby -e "p('abba'.split(/.?/))"
> []
> ruby -e "p('abba'.split(/./))"
> []
> 
> Here's what the patterns match in Ruby:
> ruby -e "p('abba'.scan(/.*/))"
> ["abba", ""]
> ruby -e "p('abba'.scan(/.?/))"
> ["a", "b", "b", "a", ""]
> ruby -e "p('abba'.scan(/./))"
> ["a", "b", "b", "a"]
> 
>  > Does that explain the example sufficiently?
> 
> As I described, the spec is not specific enough, leaving room for
> different interpretations and thus differing implementations.
> 
> It is not clear why
> 'fn:tokenize("abba", ".?") should return ("a", "b", "b", "a")'
> since the regex matches more than just the zero-length string.
> 
> I suggest changing the spec in this regard and clarifying the wording,
> for example by changing it to:
> 
>   "If the supplied $pattern matches nothing but a zero-length string,
>    [...]"
> 
>   'fn:tokenize("abba", "") returns ("a", "b", "b", "a")'
>   'fn:tokenize("abba", ".?") returns ()'
>   'fn:tokenize("abba", ".") returns ()'
> 
> Tobi
> 
> --
> http://www.pinkjuice.com/
>
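
The quoted message distinguishes a pattern that can match a zero-length string from one that matches nothing but the zero-length string. The sketch below probes that difference in Ruby for the patterns discussed in the thread. The helper matches_whole? is hypothetical, and testing against a single one-character string is only a spot check, not a proof that a pattern's language contains just the empty string.

```ruby
# Hypothetical helper: true if the pattern, anchored at both ends,
# matches the entire string.
def matches_whole?(source, str)
  Regexp.new("\\A(?:#{source})\\z").match?(str)
end

# Patterns taken from the thread.
patterns = ['', '.{0}', '.?', '.*', '.']

patterns.each do |src|
  can_be_empty = matches_whole?(src, '')   # does it accept the empty string?
  accepts_char = matches_whole?(src, 'a')  # does it also accept a one-character string?
  puts format('%-6s empty: %-5s "a": %s', src.inspect, can_be_empty, accepts_char)
end
```

Under this spot check, "" and ".{0}" accept the empty string and reject "a", ".?" and ".*" accept both, and "." accepts only "a".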

Received on Monday, 18 August 2003 17:57:46 UTC