RE: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")

 

-----Original Message-----
From: Michael Brundage [mailto:xquery@comcast.net] 
Sent: 20 August 2003 17:24
To: 'Kay, Michael'; 'Tobias Reif'; public-qt-comments@w3.org
Cc: 'Jeni Tennison'
Subject: RE: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")


But Michael, .? doesn't match the zero-length string except when applied to
the empty string.  When applied to a non-empty string, it matches a single
character.  Therefore, 
 

Well, as I said, I was reading "the pattern matches the zero-length string"
as meaning [fn:matches("", $pattern)=true()]. In fact, it never occurred to
me it could be read any other way. But clearly we're going to have to look
at this again.
 
Michael Kay

fn:tokenize("abba", ".?") should break the string into "", "a", "", "b", "",
"b", "", "a", "" and then return the non-separators, resulting in ("", "",
"", "", "")
and
fn:tokenize("abba", ".??") or fn:tokenize("abba", "") should break the
string into "a", "", "b", "", "b", "", "a" and then return the
non-separators, resulting in ("a", "b", "b", "a")
 
If you want fn:tokenize("abba", ".?") to return "a", "b", "b", "a", then you
need to modify the spec's wording to say something like
"If the supplied $pattern can match a zero-length string (independent of the
input string) ..."
 
The current wording could be clarified:
"If the supplied $pattern matches a zero-length string (when applied to the
input string) ..."
but as I read it, it already clearly has this meaning.
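
To make the two readings concrete, here is a rough sketch in Python. It assumes Python's `re` module as a stand-in for the XPath regex dialect (the two are not identical), and both helper names are made up for illustration; neither comes from the spec.

```python
import re

def matches_empty_input(pattern):
    """Reading 1: fn:matches("", $pattern), i.e. the pattern can match ""."""
    return re.search(pattern, "") is not None

def first_match_is_empty(pattern, text):
    """Reading 2: applied to this input, the leftmost match is zero-length."""
    m = re.search(pattern, text)
    return m is not None and m.group(0) == ""

# Compare the two readings for the patterns discussed in this thread.
for pattern in (".?", ".??", ""):
    print(repr(pattern),
          matches_empty_input(pattern),
          first_match_is_empty(pattern, "abba"))
```

Under these assumptions all three patterns satisfy the first reading, but ".?" fails the second, because on "abba" its leftmost match is the character "a" rather than a zero-length string.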
 
 
Cheers,
Michael Brundage
xquery@comcast.net

Writing as
Author, "XQuery: The XML Query Language" (Addison-Wesley, to appear 2003)
Co-author, "Professional XML Databases" (Wrox Press, 2000)

not as
Technical Lead
Common Query Runtime/XML Query Processing
WebData XML Team
Microsoft


-----Original Message-----
From: public-qt-comments-request@w3.org
[mailto:public-qt-comments-request@w3.org] On Behalf Of Kay, Michael
Sent: Tuesday, August 19, 2003 3:25 AM
To: Tobias Reif; public-qt-comments@w3.org
Cc: Jeni Tennison
Subject: RE: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")



>
>  > The definition of the fn:tokenize() function says:
>  >
>  >  "If the supplied $pattern matches a zero-length string, the
>  >   fn:tokenize() function breaks the string into its component
>  >   characters. The nth character in the $input string becomes the nth
>  >   string in the result sequence; each string in the result sequence
>  >   has a string length of one."
>
> Exactly.
>
>  > In the example above, the pattern ".?" is a pattern that matches a
>  > zero-length string;
>
> But it matches more than a zero-length string AFAICS. It is not
> explicitly specified what happens when the pattern matches more than the
> zero-length string. IMHO returning an empty sequence is the only
> consistent behaviour; anything else is hard to specify unambiguously.

We say what happens when it matches a zero-length string. This pattern
matches a zero-length string. I can't see any ambiguity here. You can ask
why we decided to specify it this way, but I don't think you can claim that
the spec is ambiguous. The sentence "If the supplied $pattern matches a
zero-length string, the fn:tokenize() function breaks the string into its
component characters." seems about as clear as you can get.
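
Read that way, the rule could be sketched roughly as follows. This is an illustration in Python, not the spec's algorithm: `tokenize_sketch` is a hypothetical name, Python's `re` stands in for the XPath regex dialect, and "matches a zero-length string" is assumed to mean that the pattern matches the empty string "".

```python
import re

def tokenize_sketch(text, pattern):
    # Assumed reading of the rule: the pattern matches "" itself,
    # so the input is broken into its component characters.
    if re.search(pattern, "") is not None:
        return list(text)
    # Otherwise treat every match as a separator and return the text between.
    # (In Python, patterns with capturing groups would also return the
    # captured text; that case is ignored in this sketch.)
    return re.split(pattern, text)
```

With these assumptions, tokenize_sketch("abba", ".?") and tokenize_sketch("abba", "") both give ["a", "b", "b", "a"], while a pattern that cannot match the empty string, such as "b+", splits in the usual way.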

You seem to be arguing for a different spec based on what Ruby does. That
would be a valid argument if all existing languages were consistent. But
they aren't.

The Java split() method, for example, produces the sequence 

("", "a", "b", "b", "a", "") when the pattern is "" 

and the sequence 

("", "", "", "", "", "") when the pattern is ".?" 

We have to make some kind of decision about what to do when the pattern
matches a zero-length string. Any decision is going to be arbitrary, as
the Ruby and Java examples illustrate. In my view, the rule that the string
is split into its individual characters is a usable specification and is
clearly explained. We could have defined it differently, but you haven't
convinced me that a different specification would be better.

Michael Kay 

Received on Saturday, 23 August 2003 07:42:14 UTC