Higher-Order Functions in XPath 2.0

Hi,

The following is a proposal for implementing higher-order functions in
XPath 2.0, for your consideration. Let me know if anything is unclear.

It is based on the draft published on the xsl-list, with minor
corrections and an additional reference to a text-processing problem.

Cheers,
Dimitre Novatchev.
-------------------------------------------------------------------

Higher-Order Functions in XPath 2.0

Contents
--------
Part I. Problems with XPath 2.0

       1. Using the "for" expression for incremental processing
       2. Difficulties in returning aggregate (nodes) value 
       3. Combining two sequences in producing a sequence
       4. Text processing beyond the limits of regular expressions 
       5. XPath language complexity has grown considerably 
       6. Inflexibility, where equality and comparison-returning 
          functions are needed 
       7. Little or no reusability is possible for "for" expressions

Part II. Higher-Order Functions Solutions

III. Conclusion

IV. Recommendation




Part I. Problems with XPath 2.0
-------------------------------

A general problem with XPath 2.0, as specified by the current working
draft, is that some tasks cannot be solved within the language at all,
while others can be accomplished only in a rather inefficient way. In
all of these cases programmers can solve the task with recursive XSLT
templates. This is a powerful method, but writing and testing a
different recursive template for every particular case is
time-consuming and error-prone, and the result often has low
reusability.

Listed below are some examples of such problems:

1. Using the "for" expression for incremental processing
   The "for" expression in XPath 2.0 cannot be used when the results
of processing every item of a sequence depend on the processing of the
previous items. Or in some cases it can be used, but in an obviously
inefficient way:

   - Find the product of all numbers in a sequence.
   - Reverse a sequence or a string.
   - Concatenate all string items of a sequence.

A typical use-case, for which the XPath 2.0 solution is difficult and
inefficient, is the following:

Given a sequence of "book" nodes, calculate and display the amount of
money received from the sales of each book (price * sales), but also
obtain and display a running total, as each book node from the sequence
is processed. To achieve this in XPath, one would write:

for $i in (1 to count($items)) 
   return ($items[$i],
           sum( for $j in (sublist($items, 1, $i)) 
                   return ($j/@price * $j/@sales) ) )

In the above expression, if N is the number of items in $items, 
N * (N + 1) / 2 additions and N * (N + 1) / 2 multiplications will be
performed.

While the above may seem to be just a textbook example (and indeed a
similar example can be found in Mike Kay’s book), there are real-world
examples, where a running total must be calculated and even several
results must be accumulated in parallel. I am deeply obliged to Mark
Nahabedian (naha@ai.mit.edu) for allowing me to quote his work, which
has to deal with exactly this problem -- a complete example can be
found at: 

http://www.ai.mit.edu/people/naha/itrack/about.html


2. Difficulties in returning aggregate (nodes) value from a sequence,
especially when returning those nodes depends in a non-trivial way on
the other nodes of the sequence:

   - Obtain all nodes with "maximum value" from a sequence, especially
in the case when the node "value" is computed by a very complex
expression.
   - Obtain the nodes with "distinct values" from a sequence,
especially in the case when the node "value" is computed in a very
complex way.
   - There's no general way to "filter" elements of a sequence based on
a predicate.

The reason is that a predicate (function) cannot be passed as a
parameter to a general "filter" function. As a result programmers will
write a multitude of similar filtering expressions, without being able
to re-use them. Such repetitions are time-consuming, error-prone, and
generally result in unmaintainable and non-reusable code.

Examples of problems in this group:

   - Return the sum of squares of the numbers in a sequence.
   - Return all items in a sequence, for which f(item) has minimal
value.
   - For some function f() test whether all the values of f(item) on a
sequence are equal (> 0, etc.).
   - For some function f() test whether all values f(item) on a
sequence are in increasing order.

Although a solution can be found in XPath, it will be difficult and
inefficient.

Also, for every different function f() another version of the same
solution will have to be produced, because functions cannot be passed
as parameters.

3. Combining two sequences in producing a sequence:
Given (a1, a2, ..., aN) and (b1, b2, ..., bN) compute:

     (a1 + b1, a2 + b2, ..., aN + bN)

     (a1 * b1, a2 * b2, ..., aN * bN)

     (a1 and b1, a2 and b2, ..., aN and bN)

     (a1 or b1, a2 or b2, ..., aN or bN)
etc.


4. Text processing beyond the limits of regular expressions is not
possible.

A real-world problem was pointed out by David Carlisle -- in his work
he needs to match strings surrounded by a number of balanced
parentheses that is not known in advance.

For any such problem, it would be nice to have a general, table-driven
parser() function. However, this is not possible, because parser()
would need to be passed, as a parameter, a lex() function that it
calls to obtain the terminal symbols from the input text.
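
To illustrate what passing a lex() function as a parameter could look
like, here is a minimal sketch in Haskell (the notation introduced in
Part II). It is not a table-driven parser, only a nesting-depth check,
and the names Token, Lexer, charLexer and balancedWith are hypothetical,
introduced purely for this illustration:

```haskell
-- Hypothetical token type for this sketch.
data Token = Open | Close | Other Char
  deriving (Eq, Show)

-- A lexer is any function that splits the next token off the input,
-- or returns Nothing when the input is exhausted.
type Lexer = String -> Maybe (Token, String)

-- Hypothetical lexer: every character is one token.
charLexer :: Lexer
charLexer []         = Nothing
charLexer ('(' : cs) = Just (Open, cs)
charLexer (')' : cs) = Just (Close, cs)
charLexer (c : cs)   = Just (Other c, cs)

-- The checker receives the lexer as a parameter and tracks the
-- current nesting depth while consuming tokens.
balancedWith :: Lexer -> String -> Bool
balancedWith lexer = go 0
  where
    go :: Int -> String -> Bool
    go depth input = case lexer input of
      Nothing            -> depth == 0
      Just (Open, rest)  -> go (depth + 1) rest
      Just (Close, rest) -> depth > 0 && go (depth - 1) rest
      Just (_, rest)     -> go depth rest
```

With these definitions, balancedWith charLexer "(a(b)c)" is True, while
"(()" and ")(" are rejected. A different lexer can be supplied without
touching balancedWith.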

Another example of a text-processing task that is awkward to perform,
pointed out by Jeni Tennison, can be found at:

http://lists.xml.org/archives/xml-dev/200201/msg00817.html  and
http://www.kuro5hin.org/story/2002/1/15/1562/95011 


5. XPath language complexity has grown considerably and the language
cannot continue to expand indefinitely:

Already there are hardly any spare characters left for operators.
Often there is more than one way of performing similar tasks.

In contrast, a language that supports higher-order functions can be
kept simple, small and elegant, while at the same time providing
powerful means to produce any necessary new functionality. 

Thus the "standard" language features (e.g. operators and functions)
can be kept to a minimum, while desired new functionality can easily be
produced and accumulated into a library of general, reusable functions.

Without support for higher-order functions such libraries are very
limited in scope and usefulness.

6. Inflexibility, where equality and comparison-returning functions
need to be passed to:
   - sort
   - distinct-values
   - grouping, etc.
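
For comparison, this is roughly how such parameters look in a language
that has them. The sketch below uses sortBy, nubBy and groupBy from
Haskell's Data.List; the three wrapper functions and the case-insensitive
criteria are hypothetical examples, not part of any proposal:

```haskell
import Data.Char (toLower)
import Data.Function (on)
import Data.List (groupBy, nubBy, sortBy)
import Data.Ord (comparing)

-- Sort with a caller-supplied comparison: here, ignoring case.
sortCaseInsensitive :: [String] -> [String]
sortCaseInsensitive = sortBy (comparing (map toLower))

-- Distinct values with a caller-supplied equality: here, ignoring case.
-- The first occurrence of each value is kept.
distinctCaseInsensitive :: [String] -> [String]
distinctCaseInsensitive = nubBy ((==) `on` map toLower)

-- Group adjacent items with a caller-supplied equality: here, items
-- that start with the same letter.
groupByFirstLetter :: [String] -> [[String]]
groupByFirstLetter = groupBy ((==) `on` take 1)
```

For example, distinctCaseInsensitive ["a", "A", "b"] gives ["a", "b"],
and groupByFirstLetter ["apple", "avocado", "banana"] gives
[["apple", "avocado"], ["banana"]].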

7. Little or no reusability is possible for "for" expressions

In the expression below:

 for $i in (1 to count($items)) 
         return expression

"expression" cannot be reused (simply copied and pasted) and will have
to be modified every time it is used with a differently named range
variable, so that references to the range variable are renamed.

In contrast, with higher-order functions support one can have a map()
function, so that in 

map(f, $sequence) 

the code of f() will never have to be modified.


Part II. Higher-Order Functions Solutions
-----------------------------------------

Provided higher-order functions were available, the problems listed
above would have easy and natural solutions.

To demonstrate the compactness and high degree of readability, the code
below is written in Haskell. Haskell is used only for convenience; in
no way should it be inferred that the same syntax is recommended for
XPath 2.0. Some basic conventions from this language:

 f x y = x * y

This defines a function f(x,y) = x * y

 [1, 2, 3] 

This is a list of elements 1, 2, and 3. The same (for our purposes) as
(1, 2, 3) in XPath 2.0.

 []

This denotes the empty list -- the same as () in XPath 2.0.

 x   -- denotes the name of a single element/function.

 xs -- any name ending in 's' denotes a list of elements. 

Any operator can be used also as a function, when put in brackets.
Thus:

(+) 1 2 = 1 + 2 = 3

The (:) operator is used to prepend an element to the start of a list:

 x : xs 

defines a list with head x and tail xs.

The flip() function takes as argument any function with two arguments,
and produces as result the same function, except that it takes these
two arguments in reverse order.

 flip f x y = f y x 

Primitive recursion over a list can be defined as follows:

 foldl f z []      = z
 foldl f z (x:xs)  = foldl f (f z x) xs

The function "foldl" takes 3 arguments -- a function f(), which takes 2
arguments, a value z, and a list.

This is one of the most general functions over lists. It traverses the
list from left to right, applying f() on each element and the currently
accumulated result.

There is a dual function (foldr), which behaves in a similar way, but 
traverses the list from right to left:

 foldr f z []      = z
 foldr f z (x:xs)  = f x (foldr f z xs)
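
The difference between the two folds only shows with an operator that is
not associative. A small sketch, with the two definitions above copied
under the hypothetical names myFoldl and myFoldr (so that they do not
clash with the Prelude), and subtraction as the operator:

```haskell
-- Left fold: combines the accumulated result with each element,
-- going from left to right.
myFoldl :: (b -> a -> b) -> b -> [a] -> b
myFoldl _ z []       = z
myFoldl f z (x : xs) = myFoldl f (f z x) xs

-- Right fold: combines each element with the result of folding
-- the rest of the list.
myFoldr :: (a -> b -> b) -> b -> [a] -> b
myFoldr _ z []       = z
myFoldr f z (x : xs) = f x (myFoldr f z xs)

-- ((0 - 1) - 2) - 3 = -6
leftResult :: Integer
leftResult = myFoldl (-) 0 [1, 2, 3]

-- 1 - (2 - (3 - 0)) = 2
rightResult :: Integer
rightResult = myFoldr (-) 0 [1, 2, 3]
```

For associative operators with a neutral element, such as (+) with 0,
both folds give the same result.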


As can be easily seen:

 foldl (+) 0 xs

is the sum of all elements in a list xs. Therefore we could write:

 sum xs = foldl (+) 0 xs

Analogously:

 product xs = foldl (*) 1 xs

And this one-liner is the solution to one of the problems in Part I,
section 1.

We can omit the last operand(s) from both sides of an equation when
they are the same, and still get a valid function definition. Therefore,
the above function definitions could be simplified even further and
written as:

 sum   = foldl (+) 0 

 product  = foldl (*) 1 


Reversing the elements of a list (solution of another problem in Part
I, section 1.) is simply defined as:

 reverse    = foldl (flip (:)) []


Concatenating all elements of a list (solution of the next problem) is
simply:

 concat     = foldr (++) []

where (++) is the concatenation operator for lists.
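
Collected into one module, the one-liners above can be loaded and tried
out as they stand. This is only a sketch: the Prelude versions of the
same names are hidden, and the type signatures (with Integer chosen for
the numeric functions) are my assumption, added so that the point-free
definitions compile:

```haskell
import Prelude hiding (concat, product, reverse, sum)

-- Sum of all elements: fold with (+), starting from 0.
sum :: [Integer] -> Integer
sum = foldl (+) 0

-- Product of all elements: fold with (*), starting from 1.
product :: [Integer] -> Integer
product = foldl (*) 1

-- Reverse: prepend each element to the accumulated result.
reverse :: [a] -> [a]
reverse = foldl (flip (:)) []

-- Concatenate a list of lists (a string is a list of characters).
concat :: [[a]] -> [a]
concat = foldr (++) []
```

For example, product [1, 2, 3, 4] is 24, reverse "abc" is "cba", and
concat ["ab", "c", "de"] is "abcde".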


Combining two lists with equal length into one can be performed using
the zipWith() function:

 zipWith f (a:as) (b:bs)   = f a b : zipWith f as bs
 zipWith _ _      _        = []

The function f() is applied on every pair of elements at position N
from the two lists, and the result forms the element at position N in
the result list.

 zipWith() solves directly all the problems from Part I, section 3:

   - (a1 + b1, a2 + b2, ..., aN + bN) is just:

      zipWith (+) as bs

   - (a1 * b1, a2 * b2, ..., aN * bN) is just:

      zipWith (*) as bs

   - (a1 and b1, a2 and b2, ..., aN and bN) is just:

      zipWith (&&) as bs

   - (a1 or b1, a2 or b2, ..., aN or bN) is just:

      zipWith (||) as bs
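
A few concrete instances, using the Prelude's zipWith and Haskell's
binary Boolean operators (&&) and (||). The sample lists and the names
given to the results are made up for this illustration:

```haskell
-- Element-wise sums of two lists of equal length.
pairSums :: [Integer]
pairSums = zipWith (+) [1, 2, 3] [10, 20, 30]

-- Element-wise products.
pairProducts :: [Integer]
pairProducts = zipWith (*) [1, 2, 3] [10, 20, 30]

-- Element-wise logical "and".
pairAnds :: [Bool]
pairAnds = zipWith (&&) [True, False, True] [True, True, False]

-- Element-wise logical "or".
pairOrs :: [Bool]
pairOrs = zipWith (||) [True, False, True] [False, False, False]

-- With lists of different length the result is as long as the
-- shorter of the two.
truncated :: [Integer]
truncated = zipWith (+) [1, 2, 3] [10, 20]
```

Here pairSums is [11, 22, 33], pairAnds is [True, False, False], and
truncated is [11, 22].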



A very useful function is scanl:

 scanl f q xs   = q : (case xs of
                         []   -> []
                         x:xs -> scanl f (f q x) xs)

It is similar to foldl, but creates a list, every element of which
contains the intermediate accumulated result. The first element of the
result-list is q.

In case the list is guaranteed to be non-empty, then the following
function can be defined:

 scanl1 f (x:xs)   = scanl f x xs

It behaves like scanl(), but doesn’t use a “zero” argument.

As can be easily seen, 

 scanl1 op xs 

produces a list of the intermediate accumulated results of performing 
op() on the list xs. For example:

 scanl1 (+) [1, 2, 3] = [1, 3, 6] = [1, 1+2, 3 + 3]


scanl1() combined with zipWith() solves directly the problem of
calculating the running total from Part I, section 1:

 scanl1 (+) (zipWith (*) [1,2,3] [2,2,2])

returns:
        [2, 6, 12]
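
Applied to the "book" use-case from Part I, the same combination could
look as follows. This is a sketch under assumptions: the Book record,
its field names and the sample data are hypothetical stand-ins for the
"book" nodes, and prices are kept in cents so that the arithmetic stays
exact:

```haskell
-- Hypothetical record standing in for a "book" node.
data Book = Book
  { title      :: String
  , priceCents :: Integer
  , copiesSold :: Integer
  }

-- Amount received from the sales of one book.
revenue :: Book -> Integer
revenue b = priceCents b * copiesSold b

-- For every book: its title, its revenue, and the running total so far.
-- For N books this takes N multiplications and N - 1 additions.
salesReport :: [Book] -> [(String, Integer, Integer)]
salesReport books = zip3 (map title books) revenues (scanl1 (+) revenues)
  where
    revenues = map revenue books

-- Made-up sample data.
sampleBooks :: [Book]
sampleBooks = [Book "A" 1000 3, Book "B" 2500 1, Book "C" 500 4]
```

salesReport sampleBooks gives
[("A",3000,3000), ("B",2500,5500), ("C",2000,7500)].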

A direct solution to the filtering problems defined in Part I, section
2, is provided by the function filter():

 filter p xs       = [ x | x <- xs, p x ]

It takes a function p() defined on the type of elements of its second
argument - a list xs, and returns a list of those elements of xs, for
which p(x) = true.

Using it, we can write:

 - Return all items in a sequence, for which f(item) has minimal value

    minvals  f  xs     =   filter ((== fmin) . f)  xs

                              where ys   =  map f  xs
                                    fmin =  minimum  ys


 - For some function f() test whether all the values of f(item) on a
sequence  are equal (> 0, etc.)

   allFPositive  f  xs  =  foldl  (&&)  True  ys

                             where  ys    =  map  (( > 0) . f)  xs

 - For some function f() test whether all values f(item) on a sequence
are in increasing order.

  allFIncreasing  f  xs   =  foldl  (&&)  True  ys

             where   fs   =  map  f  xs
                     ys   =  zipWith  (<)  (init  fs)  (tail  fs)


In the last solution we used the init() function, which is the dual of
tail(): from a list xs it produces another list, containing all elements
of xs except the last one:

 init [x]          = []
 init (x:xs)       = x : init xs
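
For readers who want to try the three solutions, here is one possible
runnable form. It is a sketch, not the only way to write them: the type
signatures are my addition, the Prelude functions all and and replace
the explicit fold, and drop 1 is used in place of init and tail so that
an empty list is handled as well:

```haskell
-- All items for which f(item) is minimal; the items themselves are
-- returned, not the values of f.
minvals :: Ord b => (a -> b) -> [a] -> [a]
minvals _ [] = []
minvals f xs = filter ((== fmin) . f) xs
  where
    fmin = minimum (map f xs)

-- True when f(item) > 0 for every item.
allFPositive :: (Ord b, Num b) => (a -> b) -> [a] -> Bool
allFPositive f = all ((> 0) . f)

-- True when the values f(item) are in strictly increasing order:
-- every value is compared with its successor.
allFIncreasing :: Ord b => (a -> b) -> [a] -> Bool
allFIncreasing f xs = and (zipWith (<) ys (drop 1 ys))
  where
    ys = map f xs
```

For example, minvals abs [-1, 3, 1, 2] gives [-1, 1], and
allFIncreasing length ["a", "bb", "ccc"] is True.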



Finally, here’s an example of how to keep a language small and simple:

Whenever a programmer needs a “mapping operator”, she could produce it
immediately herself, without having to ask a working group to include
it in the “standard language”, as follows:

1. She defines the function map():

 map  f   xs  =  foldr  ( (:) . f  )   [ ]  xs

2. Because she needs to apply the “mapping operator” repeatedly, for
convenience she defines a multiMap() function:

 multiMap   xs   fs    =    foldr   map   xs   fs

 multiMap   [1, 2, 3]   [(1+), (2*), (5+)] 

produces:

 [13,15,17]  , that is  [(1 + 5)*2 + 1, (2 + 5)*2 + 1, (3 + 5)*2 + 1]

3. The multiMap() function as defined above applies the functions
starting from the last one in the list. The programmer wants them
applied starting from the first function in the list. She re-defines
multiMap by changing “foldr” to “foldl” and flipping the arguments of
map, as follows:

  multiMap   xs   fs    =    foldl   (flip  map)   xs   fs


Now 
    multiMap   [1, 2, 3]   [(1+),  (2*),  (5+)] 

produces:

    [9, 11, 13]  

 that is:  [(1 + 1)*2 + 5, (2 + 1)*2 + 5, (3 + 1)*2 + 5]

4. She is ready to use the new function. For example, she specifies a
series of SVG coordinate transformations (this is not in Haskell, but
uses XPath 2.0 expressions as lambda expressions):

  multiMap  ($coordinates,
                    ((. *2),
                     (if (position() mod 2) then . + 50 else .)
                      .....................
                    )
             )
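
The Haskell definitions from steps 1 to 3 can be checked with the sketch
below. The names map', multiMapLastFirst and multiMapFirstFirst are
hypothetical, chosen only so that both variants of multiMap can live in
one module without clashing with each other or with the Prelude:

```haskell
-- map expressed as a right fold, which preserves the element order.
map' :: (a -> b) -> [a] -> [b]
map' f = foldr ((:) . f) []

-- Applies the functions starting from the last one in the list.
multiMapLastFirst :: [a] -> [a -> a] -> [a]
multiMapLastFirst xs fs = foldr map' xs fs

-- Applies the functions starting from the first one in the list.
multiMapFirstFirst :: [a] -> [a -> a] -> [a]
multiMapFirstFirst xs fs = foldl (flip map') xs fs
```

With xs = [1, 2, 3] and fs = [(1+), (2*), (5+)], the first variant gives
[13, 15, 17] and the second gives [9, 11, 13], matching the results
shown in steps 2 and 3.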



III. Conclusion
---------------
 
 XPath 2.0 as specified in the current working draft has the
problems described in Part I. A language with support for higher-order
functions is free of these problems.

IV. Recommendation
------------------

 Based on the above conclusion, I recommend that higher-order
functions support be implemented in XPath 2.0.





Received on Wednesday, 16 January 2002 15:16:00 UTC