Re: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a") from Jim Melton on 2003-08-18 (public-qt-comments@w3.org from August 2003)

From: Jim Melton <jim.melton@acm.org>
Date: Mon, 18 Aug 2003 17:59:53 -0600
To: Priscilla Walmsley <priscilla@walmsley.com>
CC: "'Tobias Reif'" <tobiasreif@pinkjuice.com>, public-qt-comments@w3.org, "'Jeni Tennison'" <jeni@jenitennison.com>
Message-ID: <3F416879.8040608@acm.org>

Priscilla,

Priscilla Walmsley wrote:

>Hi,
>
>Just to pick nits, 
>
Pick away! That's the only way we'll all understand things the same (and 
the proper) way!

>
>Jim Melton wrote:
>
>>I disagree. We state early in the F&O specification that the 
>>rules are 
>>to be applied *in the order in which they are written*. If 
>>you do that, 
>>and read the rule in question properly (that is, without adding the 
>>incorrect "...and nothing else" in your mind), then the spec is 
>>unambiguous (in this respect, that is). 
>>
>
>The rule about matching a zero-length string appears *after* the
>sentence:
>
>"This function breaks the $input string into a sequence of strings, 
>treating any substring that matches $pattern as a separator. The 
>separators themselves are not returned."
>
>Perhaps this sentence is not an official "rule", just a general
>description of the function.  
>
That is my interpretation: the first sentence was meant as a 
high-level summary of what the function is supposed to do, not a "rule" 
that gives the detailed semantics.

>However, the sentence is false in the case
>we are talking about.  The letters "a" and "b" _do_ match the pattern
>.* , and therefore should be treated as separators according to this
>particular sentence.  Maybe the sentence should start with "If $pattern
>does not match a zero-length string, ..." 
>
>Or did I misunderstand what you mean by applying the rules in the order
>in which they are written?
>
I see your point. My interpretation (that the sentence you quoted is 
not a "rule", but a "summary") would not dictate the sort of addition 
you propose. Other interpretations (that the sentence is a normative 
"rule") would require such an addition/clarification.

>
>Anyway, back to the real issue, I think this behavior is particularly
>confusing in the case of:
>
>fn:tokenize("abba", "b?")
>
>which apparently would also return ("a", "b", "b", "a"), since the
>pattern matches a zero-length string. I think the user would expect "b"
>to be treated like a separator in this case.  I know they could just use
>the pattern "b" if that's what they want, but it still seems like the
>function violates the principle of least surprise.  Particularly since
>the ? in this case should be greedy.
>
Now you're talking about something substantive! I think you said that 
the intended semantics are too confusing and that a pattern like "b?" 
should match "b" in preference to the zero-length string. (That is, if 
the input string contains one or more instances of "b", then match 
it/them, and match the zero-length string only if there are no instances 
of "b".) Is that a proper interpretation of your statement?
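
To make the two readings concrete, here is a minimal Python sketch. It is 
only an illustration of the behaviors described in this thread, not the 
F&O specification text: the function names are hypothetical, and Python's 
"re" module stands in for the regular expression dialect that F&O uses, 
which differs in its details.

```python
import re


def tokenize_draft_reading(text, pattern):
    """Reading discussed in this thread (an assumption, not spec text):
    a pattern that can match the zero-length string breaks the input
    into its individual characters."""
    if re.fullmatch(pattern, "") is not None:
        return list(text)
    return re.split(pattern, text)


def tokenize_prefer_nonempty(text, pattern):
    """Alternative reading: only non-empty matches act as separators;
    zero-length matches are ignored."""
    tokens = []
    start = 0
    for match in re.finditer(pattern, text):
        if match.end() > match.start():  # skip zero-length matches
            tokens.append(text[start:match.start()])
            start = match.end()
    tokens.append(text[start:])
    return tokens


print(tokenize_draft_reading("abba", "b?"))    # ['a', 'b', 'b', 'a']
print(tokenize_draft_reading("abba", ".*"))    # ['a', 'b', 'b', 'a']
print(tokenize_prefer_nonempty("abba", "b?"))  # ['a', '', 'a']
print(tokenize_draft_reading("abba", "b"))     # ['a', '', 'a']
```

Under the second reading, "b?" gives the same tokens as the plain pattern 
"b", which is the outcome the comment suggests a user would expect.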

I *personally* have no preference about this. I have a very strong 
preference for not violating the principle of least astonishment, though. 
But I must yield to the educated opinions of others who are familiar 
with various regular expression-using languages. The issue is certainly 
on the table as a result of the comment and I'm positive that the F&O 
Task Force will discuss it in some depth.

Many thanks,
   Jim

Received on Monday, 18 August 2003 19:56:52 UTC