RE: Higher-Order Functions in XPath 2.0

Thanks for submitting this, Dimitre. I'm personally very keen on getting
higher-order functions into the spec, because in the long run I think it
would save us work, but I can't say I'm confident about it getting in at
this round - mainly because someone is going to have to work out the
ramifications on the type system.

Syntactically, I think it can be done quite neatly.

To declare that an argument or result of a function is itself a function, we
need to extend the syntax for types to describe a function. The syntax might
be something like

function ( xs:string, xs:string* ) returning xs:integer

And one can then declare a higher-order function as

function (function(xsl:integer)returning xs:integer) returning integer

We then need a way to refer to a function as an object, without invoking it:
let's say we can refer to it as ?f, so a higher-order function call might
look like

distinct($nodes, ?my:func)

For usability, we probably also need some way of defining inline function
definitions (let's call them "anonymous functions", because "lambda
expressions" will put people off). I actually suspect we can get by with the
only anonymous functions being functions that take the context item as an
implicit argument: this would account for 90% of usage, and people could use
named functions in the other cases. We could write such functions as
?(expr). (An alternative I've previously proposed is [expr]). This would
then allow:

distinct(//employee, ?(@ssn))

to select employees with distinct social security numbers.

(One issue that has to be resolved is the binding of any variable references
within an anonymous function. No doubt the functional programming languages
have plenty of answers to this problem).

I think that defining some of the higher-order functionality we need (in
areas such as distinct, group-by, sorting, regular expressions) using a
common mechanism of higher-order function calls would be a much sounder
approach than defining custom syntax for each one, quite apart from the
benefits obtained by allowing users to define their own higher-order
functions. It would also make it easier for us to define the formal
semantics of those constructs where we do want to provide custom syntax,
such as "for", "some", and "every".

There's also a link into the requirement for dynamic evaluation: the
evaluate() function can be naturally modelled as a function that takes a
string as input and returns an anonymous function as its result.

Mike Kay 

> -----Original Message-----
> From: Dimitre Novatchev [mailto:dnovatchev@yahoo.com]
> Sent: 16 January 2002 17:49
> To: www-xpath-comments@w3.org
> Subject: Higher-Order Functions in XPath 2.0
> 
> 
> Hi,
> 
> Following is a proposal for implementing higher-order functions in
> XPath 2.0. for your consideration. Let me know if anything is unclear.
> 
> It is based on the draft, published in the xsl-list, with minor
> corrections and an additional reference to a text-processing problem.
> 
> Cheers,
> Dimitre Novatchev.
> -------------------------------------------------------------------
> 
> Higher-Order Functions in XPath 2.0
> 
> Contents
> --------
> Part I. Problems with XPath 2.0
> 
>        1. Using the "for" expression for incremental processing
>        2. Difficulties in returning aggregate (nodes) value 
>        3. Combining two sequences in producing a sequence
>        4. Text processing beyond the limits of regular expressions 
>        5. XPath language complexity has grown considerably 
>        6. Inflexibility, where equality and comparison-returning 
>           functions are needed 
>        7. Little or no reusability is possible for "for" expressions
> 
> Part II. Higher-Order Functions Solutions
> 
> III. Conclusion
> 
> IV. Recommendation
> 
> 
> 
> 
> Part I. Problems with XPath 2.0
> -------------------------------
> 
> A general problem with XPath 2.0, as specified by the current working
> draft, is that some tasks exist, which cannot be solved within the
> language, and another group of tasks can be accomplished in XPath 2.0,
> but in a rather inefficient way. In all of these cases programmers can
> solve the task by using recursive XSLT templates and this is 
> a powerful
> method, but at the same time writing and testing different recursive
> templates for every particular case is a time-consuming and 
> error-prone
> process and often the result has low reusability.
> 
> Listed bellow are some examples of such problems:
> 
> 1. Using the "for" expression for incremental processing
>    The "for" expression in XPath 2.0 cannot be used when the results
> of processing every item of a sequence depend on the processing of the
> previous items. Or in some cases it can be used, but in an obviously
> inefficient way:
> 
>    - Find the product of all numbers in a sequence.
>    - Reverse a sequence or a string.
>    - Concat all string items of a sequence.
> 
> A typical use-case, for which the XPath 2.0 solution is difficult and
> inefficient, is the following:
> 
> Given a sequence of "book" nodes, calculate and display the amount of
> money received from the sales of each book (price * sales), but also
> obtain and display a running total, as each book node from 
> the sequence
> is processed. To achieve this in XPath, one would write:
> 
> for $i in (1 to count($items)) 
>    return ($items[$i],
>            sum( for $j in (sublist($items, 1, $i)) 
>                    return (@price * @sales) ) 
> 
> In the above expression, if N is the number of items in $items, 
> N * (N + 1) / 2 additions and N * (N + 1) / 2 multiplications will be
> performed.
> 
> While the above may seem to be just a textbook example (and really a
> similar example can be found in Mike Kay's book), there are real-world
> examples, where a running total must be calculated and even several
> results must be accumulated in parallel. I am deeply obliged to Mark
> Nahabedian (naha@ai.mit.edu ) for allowing me to quote his work which
> has to deal with exactly this problem -- a complete example can be
> found at: 
> 
> http://www.ai.mit.edu/people/naha/itrack/about.html
> 
> 
> 2. Difficulties in returning aggregate (nodes) value from a sequence,
> especially when returning those nodes depends in a non-trivial way on
> the other nodes of the sequence:
> 
>    - Obtain all nodes with "maximum value" from a sequence, especially
> in the case when the node "value" is computed by a very complex
> expression.
>    - Obtain the nodes with "distinct values" from a sequence,
> especially in the case when the node "value" is computed in a very
> complex way.
>    - There's no general way to "filter" elements of a 
> sequence based on
> a predicate.
> 
> The reason is that a predicate (function) cannot be passed as a
> parameter to a general "filter" function. As a result programmers will
> write multitude of similar filtering expressions, without 
> being able to
> re-use them. Such repetitions are time-consuming, error-prone, and
> generally result in un-maintainable and non-reusable code.
> 
> Examples of problems in this group:
> 
>    - Return the sum of squares of the numbers in a sequence.
>    - Return all items in a sequence, for which f(item) has minimal
> value
>    - For some function f() test whether all the values of f(item) on a
> sequence  are equal (> 0, etc.)
>    - For some function f() test whether all values f(item) on a
> sequence are in increasing order.
> 
> Although a solution can be found in XPath, it will be difficult and
> inefficient.
> 
> Also, for every different function f() another version of the same
> solution will have to be produced, because functions cannot be passed
> as parameters.
> 
> 3. Combining two sequences in producing a sequence:
> Given (a1, a2, ..., aN) and (b1, b2, ..., bN) compute:
> 
>      (a1 + b1, a2 + b2, ..., aN + bN)
> 
>      (a1 * b1, a2 * b2, ..., aN * bN)
> 
>      (a1 and b1, a2 and b2, ..., aN and bN)
> 
>      (a1 or b1, a2 or b2, ..., aN or bN)
> etc.
> 
> 
> 4. Text processing beyond the limits of regular expressions is not
> possible.
> 
> A real world problem was pointed out by David Carlisle -- he needs in
> his work to match strings, surrounded by  (unknown in advance number
> of) balanced parenthesis.
> 
> For any such problem, it would be nice to have a general, table-driven
> parser() function. However this is not possible, because the parser()
> function will need to be passed as parameter a lex() function that it
> must call for obtaining the terminal symbols from the input text.
> 
> Another example of awkward to perform text-processing task, pointed by
> Jeni Tennison, can be found at:
> 
> http://lists.xml.org/archives/xml-dev/200201/msg00817.html  and
> http://www.kuro5hin.org/story/2002/1/15/1562/95011 
> 
> 
> 5. XPath language complexity has grown considerably and the language
> cannot continue to expand indefinitely:
> 
> Already there are hardly any spare characters left for operators.
> Often there are (more than one) different ways of performing similar
> tasks.
> 
> In contrast, a language that supports higher-order functions can be
> kept simple, small and elegant, while at the same time providing
> powerful means to produce any necessary new functionality. 
> 
> Thus the "standard" language features (e.g. operators and functions)
> can be kept to a minimum, while the language makes possible 
> desired new
> functionality to be easily produced and accumulated into a library of
> general and reusable useful functions.
> 
> Without support for higher-order functions such libraries are very
> limited in scope and usefulness.
> 
> 6. Inflexibility, where equality and comparison-returning 
> functions are
> needed to be passed to:
>    - sort
>    - distinct-values
>    - grouping, etc.
> 
> 7. Little or no reusability is possible for "for" expressions
> 
> In the expression bellow:
> 
>  for $i in (1 to count($items)) 
>          return expression
> 
> "expression" cannot be reused (simply copied and pasted) and will have
> to be modified every time it is used with differently named range
> variable so that references to the range variable are renamed.
> 
> In contrast, with higher-order functions support one can have a map()
> function, so that in 
> 
> map(f, $sequence) 
> 
> the code of f() will never have to be modified.
> 
> 
> Part II. Higher-Order Functions Solutions
> -----------------------------------------
> 
> Provided higher-order functions were available, the problems listed
> above have easy and natural solutions.
> 
> To demonstrate the compactness and high degree of 
> readability, the code
> bellow is written in Haskell. Haskell is used only for convenience, in
> no way should it be inferred that the same syntax is recommended for
> XPath 2.0. Some basic conventions from this language:
> 
>  f x y = x * y
> 
> This defines a function f(x,y) = x * y
> 
>  [1, 2, 3] 
> 
> This is a list of elements 1, 2, and 3. The same (for our purposes) as
> (1, 2, 3) in XPath 2.0.
> 
>  []
> 
> This denotes the empty list -- the same as () in XPath 2.0.
> 
>  x   -- denotes the name of a single element/function.
> 
>  xs -- any name ending in 's' denotes a list of elements. 
> 
> Any operator can be used also as a function, when put in brackets.
> Thus:
> 
> (+) 1 2 = 1 + 2 = 3
> 
> The (:) operator is used to prepend an element to the start of a list:
> 
>  x : xs 
> 
> defines a list with head x and tail xs.
> 
> The flip() function takes as argument any function with two arguments,
> and produces as result a the same function, which takes these two
> arguments in reverse order.
> 
>  flip f x y = f y x 
> 
> Primitive recursion over a list can be defined as follows:
> 
>  foldl f z []      = z
>  foldl f z (x:xs)  = foldl f (f z x) xs
> 
> The function "foldl" takes 3 arguments -- a function f(), 
> which takes 2
> arguments, a value z, and a list.
> 
> This is one of the most general functions over lists. It traverses the
> list from left to write, applying f() on each element and the 
> currently
> accumulated result.
> 
> There is a dual function (foldr), which behaves in a similar way, but 
> traverses the list from right to left:
> 
>  foldr f z []      = z
>  foldr f z (x:xs)  = f x (foldr f z xs)
> 
> 
> As can be easily seen:
> 
>  foldl (+) 0 xs
> 
> is the sum of all elements in a list xs. Therefore we could write:
> 
>  sum xs = foldl (+) 0 xs
> 
> Analogously:
> 
>  product xs = foldl (*) 1 xs
> 
> And this one liner is the solution to one of the problems in Part I,
> section 1.
> 
> We can ommit the last operand(s) from an equation, in case it is the
> same and we still get a valid function definition. Therefore, 
> the above
> function definitions could be simplified even further and written as:
> 
>  sum   = foldl (+) 0 
> 
>  product  = foldl (*) 1 
> 
> 
> Reversing the elements of a list (solution of another problem in Part
> I, section 1.) is simply defined as:
> 
>  reverse    = foldl (flip (:)) []
> 
> 
> Concatenating all elements of a list (solution of the next problem) is
> simply:
> 
>  concat     = foldr (++) []
> 
> where (++) is the concatenation operator for lists.
> 
> 
> Combining two lists with equal length into one can be performed using
> the zipWith() function:
> 
>  zipWith f (a:as) (b:bs)   = f a b : zipWith f as bs
>  zipWith _ _      _        = []
> 
> The function f() is applied on every pair of elements at position N
> from the two lists, and the result forms the element at position N in
> the result list.
> 
>  zipWith() solves directly all the problems from Part I, section 3:
> 
>    - (a1 + b1, a2 + b2, ..., aN + bN) is just:
> 
>       zipWith (+) as bs
> 
>    - (a1 * b1, a2 * b2, ..., aN * bN) is just:
> 
>       zipWith (*) as bs
> 
>    - (a1 and b1, a2 and b2, ..., aN and bN) is just:
> 
>       zipWith  and as bs
> 
>    - (a1 or b1, a2 or b2, ..., aN or bN) is just:
> 
>       zipWith or as bs
> 
> 
> 
> A very useful function is scanl:
> 
>  scanl f q xs   = q : (case xs of
> 		           []   -> []
> 			     x:xs -> scanl f (f q x) xs)
> 
> It is similar to foldl, but creates a list, every element of which
> contains the intermediate accumulated result. The first element of the
> result-list is q.
> 
> In case the list is guaranteed to be non-empty, then the following
> function can be defined:
> 
>  scanl1 f (x:xs)   = scanl f x xs
> 
> It behaves like scanl(), but doesn't use a "zero" argument.
> 
> As can be easily seen, 
> 
>  scanl1 `op` xs 
> 
> produces a list of the intermediate accumulated results of performing 
> op() on the list xs. For example:
> 
>  scanl1 (+) [1, 2, 3] = [1, 3, 6] = [1, 1+2, 3 + 3]
> 
> 
> scanl1() combined with zipWith() solves directly the problem of
> calculating the running total from Part I, section 1:
> 
>  scanl1 (+) (zipWith (*) [1,2,3] [2,2,2])
> 
> returns:
>         [2, 6, 12]
> 
> A direct solution to the filtering problems defined in Part I, section
> 2, is provided by the function filter():
> 
>  filter p xs       = [ x | x <- xs, p x ]
> 
> it takes a function p() defined on the type of elements of its second
> argument - a list xs, and returns a list of those elements of xs, for
> which p(x) = true.
> 
> Using it, we can write:
> 
>  - Return all items in a sequence, for which f(item) has minimal value
> 
>     minvals  f  xs     =   filter (= fmin)  ys
> 
>                               where ys   =  map f  xs,
>                                     fmin =  minimum  ys
> 
> 
>  - For some function f() test whether all the values of f(item) on a
> sequence  are equal (> 0, etc.)
> 
>    allFPositive  f  xs  =  foldl   and   ys
> 
>                              where  ys    =  map  (( > 0) . f)  xs
> 
>  - For some function f() test whether all values f(item) on a sequence
> are in increasing order.
> 
>   allFIncreasing  f  xs   =  foldl   and   ys
> 
>              where   ys   =  zipWith  (<)  (init  xs)  (tail  xs)
> 
> 
> In the last solution we used the init() function, which is the dual of
> tail() - from a list xs it produces another, containing all 
> elements of
> the first list , but the last one:
> 
>  init (x:xs)       = x : init xs
> 
> 
> 
> Finally, here's an example how to keep a language small and simple:
> 
> Whenever a programmer needs a "mapping operator", she could produce it
> immediately herself, without having to ask a working group for
> including it in the "standard language", as follows:
> 
> 1. She defines the function map():
> 
>  map  f   xs  =  foldl  ( (:) . f  )   [ ]  xs
> 
> 2. Because she needs to apply the "mapping operator" repeatedly, for
> convenience she defines a multiMap() function:
> 
>  multiMap   xs   fs    =    foldr   map   xs   fs
> 
>  multiMap   [1, 2, 3]   [(1+), (2*), (5+)] 
> 
> produces:
> 
>  [13,15,17]  , that is  [(1 + 5)*2 + 1, (2 + 5)*2 + 1, (3 + 5)*2 + 1]
> 
> 3. The multiMap() function as defined above applies the functions
> starting from the last one in the list. The programmer wants them
> applied starting from the first function in the list. She re-defines
> multiMap  by changing "foldr" to "foldl" as follows:
> 
>   multiMap   xs   fs    =    foldl   (flip  map)   xs   fs
> 
> 
> Now 
>     multiMap   [1, 2, 3]   [(1+),  (2*),  (5+)] 
> 
> produces:
> 
>     [9, 11, 13]  
> 
>  that is:  [(1 + 1)*2 + 5, (2 + 1)*2 + 5, (3 + 1)*2 + 5]
> 
> 4. She is ready to use the new function. For example, she specifies a
> series of SVG coordinate transformations (this is not in Haskell, but
> uses XPath 2.0 expressions as lambda expressions):
> 
>   multiMap  ($coordinates,
>                     ((. *2),
>                      (if (position() mod 2) then . + 50 else .)
>                       .....................
>                     )
>              )
> 
> 
> 
> III. Conclusion
> ---------------
>  
>  XPath 2.0 as specified in the current working draft has the
> problems described in Part I. A language with support for higher-order
> function is free of these problems.
> 
> IV. Recommendation
> ------------------
> 
>  Based on the above conclusion, I recommend that higher-order
> functions support be implemented in Xpath 2.0.
> 
> 
> 
> 
> __________________________________________________
> Do You Yahoo!?
> Send FREE video emails in Yahoo! Mail!
> http://promo.yahoo.com/videomail/
> 

Received on Thursday, 17 January 2002 07:08:18 UTC