W3C home > Mailing lists > Public > public-qt-comments@w3.org > August 2003

Re: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a")

From: Jim Melton <jim.melton@acm.org>
Date: Mon, 18 Aug 2003 17:59:53 -0600
Message-ID: <3F416879.8040608@acm.org>
To: Priscilla Walmsley <priscilla@walmsley.com>
CC: "'Tobias Reif'" <tobiasreif@pinkjuice.com>, public-qt-comments@w3.org, "'Jeni Tennison'" <jeni@jenitennison.com>

Priscilla Walmsley wrote:

>Just to pick nits, 
Pick away! That's the only way we'll all understand things the same (and 
the proper) way!

>Jim Melton wrote:
>>I disagree. We state early in the F&O specification that the 
>>rules are 
>>to be applied *in the order in which they are written*. If 
>>you do that, 
>>and read the rule in question properly (that is, without adding the 
>>incorrect "...and nothing else" in your mind), then the spec is 
>>unambiguous (in this respect, that is). 
>The rule about matching a zero-length string appears *after* the
>"This function breaks the $input string into a sequence of strings, 
>treating any substring that matches $pattern as a separator. The 
>separators themselves are not returned."
>Perhaps this sentence is not an official "rule", just a general
>description of the function.  
That is my interpretation...that that first sentence was meant as a 
high-level summary of what the function is supposed to do, not a "rule" 
that gives the detailed semantics.

>However, the sentence is false in the case
>we are talking about.  The letters "a", and "b" _do_ match the pattern
>.* , and therefore should be treated as separators according to this
>particular sentence.  Maybe the sentence should start with "If $pattern
>does not match a zero-length string, ..." 
>Or did I misunderstand what you mean by applying the rules in the order
>in which they are written?
I see your point. My interpretation --- that the sentence you quoted is 
not a "rule", but a "summary" --- would not dictate the sort of addition 
you propose. Other interpretations (that the sentence is a normative 
"rule") would require such an addition/clarification.

>Anyway, back to the real issue, I think this behavior is particularly
>confusing in the case of:
>fn:tokenize("abba", "b?")
>which apparently would also return ("a", "b", "b", "a"), since the
>pattern matches a zero-length string. I think the user would expect "b"
>to be treated like a separator in this case.  I know they could just use
>the pattern "b" if that's what they want, but it still seems like the
>function violates the principle of least surprise.  Particularly since
>the ? in this case should be greedy.
Now you're talking about something substantive! I think you said that 
the intended semantics are too confusing and that a pattern like "b?" 
should match "b" in preference to the zero-length string. (That is, if 
there is one or more instances of "b" in the input string, then match 
it/them and match the zero-length string only if there are no instances 
of "b".) Is that a proper interpretation of your statement?

I *personally* have no preference about this. I have a very strong 
preference of not violating the principle of least astonishment, though. 
But I must yield to the educated opinions of others who are familiar 
with various regular expression-using languages. The issue is certainly 
on the table as a result of the comment and I'm positive that the F&O 
Task Force will discuss it in some depth.

Many thanks,
Received on Monday, 18 August 2003 19:56:52 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 16:56:49 UTC