- From: Dimitre Novatchev <dnovatchev@yahoo.com>
- Date: Wed, 16 Jan 2002 12:49:03 -0500 (EST)
- To: www-xpath-comments@w3.org
Hi, Following is a proposal for implementing higher-order functions in XPath 2.0. for your consideration. Let me know if anything is unclear. It is based on the draft, published in the xsl-list, with minor corrections and an additional reference to a text-processing problem. Cheers, Dimitre Novatchev. ------------------------------------------------------------------- Higher-Order Functions in XPath 2.0 Contents -------- Part I. Problems with XPath 2.0 1. Using the "for" expression for incremental processing 2. Difficulties in returning aggregate (nodes) value 3. Combining two sequences in producing a sequence 4. Text processing beyond the limits of regular expressions 5. XPath language complexity has grown considerably 6. Inflexibility, where equality and comparison-returning functions are needed 7. Little or no reusability is possible for "for" expressions Part II. Higher-Order Functions Solutions III. Conclusion IV. Recommendation Part I. Problems with XPath 2.0 ------------------------------- A general problem with XPath 2.0, as specified by the current working draft, is that some tasks exist, which cannot be solved within the language, and another group of tasks can be accomplished in XPath 2.0, but in a rather inefficient way. In all of these cases programmers can solve the task by using recursive XSLT templates and this is a powerful method, but at the same time writing and testing different recursive templates for every particular case is a time-consuming and error-prone process and often the result has low reusability. Listed bellow are some examples of such problems: 1. Using the "for" expression for incremental processing The "for" expression in XPath 2.0 cannot be used when the results of processing every item of a sequence depend on the processing of the previous items. Or in some cases it can be used, but in an obviously inefficient way: - Find the product of all numbers in a sequence. - Reverse a sequence or a string. - Concat all string items of a sequence. A typical use-case, for which the XPath 2.0 solution is difficult and inefficient, is the following: Given a sequence of "book" nodes, calculate and display the amount of money received from the sales of each book (price * sales), but also obtain and display a running total, as each book node from the sequence is processed. To achieve this in XPath, one would write: for $i in (1 to count($items)) return ($items[$i], sum( for $j in (sublist($items, 1, $i)) return (@price * @sales) ) In the above expression, if N is the number of items in $items, N * (N + 1) / 2 additions and N * (N + 1) / 2 multiplications will be performed. While the above may seem to be just a textbook example (and really a similar example can be found in Mike Kay’s book), there are real-world examples, where a running total must be calculated and even several results must be accumulated in parallel. I am deeply obliged to Mark Nahabedian (naha@ai.mit.edu ) for allowing me to quote his work which has to deal with exactly this problem -- a complete example can be found at: http://www.ai.mit.edu/people/naha/itrack/about.html 2. Difficulties in returning aggregate (nodes) value from a sequence, especially when returning those nodes depends in a non-trivial way on the other nodes of the sequence: - Obtain all nodes with "maximum value" from a sequence, especially in the case when the node "value" is computed by a very complex expression. - Obtain the nodes with "distinct values" from a sequence, especially in the case when the node "value" is computed in a very complex way. - There's no general way to "filter" elements of a sequence based on a predicate. The reason is that a predicate (function) cannot be passed as a parameter to a general "filter" function. As a result programmers will write multitude of similar filtering expressions, without being able to re-use them. Such repetitions are time-consuming, error-prone, and generally result in un-maintainable and non-reusable code. Examples of problems in this group: - Return the sum of squares of the numbers in a sequence. - Return all items in a sequence, for which f(item) has minimal value - For some function f() test whether all the values of f(item) on a sequence are equal (> 0, etc.) - For some function f() test whether all values f(item) on a sequence are in increasing order. Although a solution can be found in XPath, it will be difficult and inefficient. Also, for every different function f() another version of the same solution will have to be produced, because functions cannot be passed as parameters. 3. Combining two sequences in producing a sequence: Given (a1, a2, ..., aN) and (b1, b2, ..., bN) compute: (a1 + b1, a2 + b2, ..., aN + bN) (a1 * b1, a2 * b2, ..., aN * bN) (a1 and b1, a2 and b2, ..., aN and bN) (a1 or b1, a2 or b2, ..., aN or bN) etc. 4. Text processing beyond the limits of regular expressions is not possible. A real world problem was pointed out by David Carlisle -- he needs in his work to match strings, surrounded by (unknown in advance number of) balanced parenthesis. For any such problem, it would be nice to have a general, table-driven parser() function. However this is not possible, because the parser() function will need to be passed as parameter a lex() function that it must call for obtaining the terminal symbols from the input text. Another example of awkward to perform text-processing task, pointed by Jeni Tennison, can be found at: http://lists.xml.org/archives/xml-dev/200201/msg00817.html and http://www.kuro5hin.org/story/2002/1/15/1562/95011 5. XPath language complexity has grown considerably and the language cannot continue to expand indefinitely: Already there are hardly any spare characters left for operators. Often there are (more than one) different ways of performing similar tasks. In contrast, a language that supports higher-order functions can be kept simple, small and elegant, while at the same time providing powerful means to produce any necessary new functionality. Thus the "standard" language features (e.g. operators and functions) can be kept to a minimum, while the language makes possible desired new functionality to be easily produced and accumulated into a library of general and reusable useful functions. Without support for higher-order functions such libraries are very limited in scope and usefulness. 6. Inflexibility, where equality and comparison-returning functions are needed to be passed to: - sort - distinct-values - grouping, etc. 7. Little or no reusability is possible for "for" expressions In the expression bellow: for $i in (1 to count($items)) return expression "expression" cannot be reused (simply copied and pasted) and will have to be modified every time it is used with differently named range variable so that references to the range variable are renamed. In contrast, with higher-order functions support one can have a map() function, so that in map(f, $sequence) the code of f() will never have to be modified. Part II. Higher-Order Functions Solutions ----------------------------------------- Provided higher-order functions were available, the problems listed above have easy and natural solutions. To demonstrate the compactness and high degree of readability, the code bellow is written in Haskell. Haskell is used only for convenience, in no way should it be inferred that the same syntax is recommended for XPath 2.0. Some basic conventions from this language: f x y = x * y This defines a function f(x,y) = x * y [1, 2, 3] This is a list of elements 1, 2, and 3. The same (for our purposes) as (1, 2, 3) in XPath 2.0. [] This denotes the empty list -- the same as () in XPath 2.0. x -- denotes the name of a single element/function. xs -- any name ending in 's' denotes a list of elements. Any operator can be used also as a function, when put in brackets. Thus: (+) 1 2 = 1 + 2 = 3 The (:) operator is used to prepend an element to the start of a list: x : xs defines a list with head x and tail xs. The flip() function takes as argument any function with two arguments, and produces as result a the same function, which takes these two arguments in reverse order. flip f x y = f y x Primitive recursion over a list can be defined as follows: foldl f z [] = z foldl f z (x:xs) = foldl f (f z x) xs The function "foldl" takes 3 arguments -- a function f(), which takes 2 arguments, a value z, and a list. This is one of the most general functions over lists. It traverses the list from left to write, applying f() on each element and the currently accumulated result. There is a dual function (foldr), which behaves in a similar way, but traverses the list from right to left: foldr f z [] = z foldr f z (x:xs) = f x (foldr f z xs) As can be easily seen: foldl (+) 0 xs is the sum of all elements in a list xs. Therefore we could write: sum xs = foldl (+) 0 xs Analogously: product xs = foldl (*) 1 xs And this one liner is the solution to one of the problems in Part I, section 1. We can ommit the last operand(s) from an equation, in case it is the same and we still get a valid function definition. Therefore, the above function definitions could be simplified even further and written as: sum = foldl (+) 0 product = foldl (*) 1 Reversing the elements of a list (solution of another problem in Part I, section 1.) is simply defined as: reverse = foldl (flip (:)) [] Concatenating all elements of a list (solution of the next problem) is simply: concat = foldr (++) [] where (++) is the concatenation operator for lists. Combining two lists with equal length into one can be performed using the zipWith() function: zipWith f (a:as) (b:bs) = f a b : zipWith f as bs zipWith _ _ _ = [] The function f() is applied on every pair of elements at position N from the two lists, and the result forms the element at position N in the result list. zipWith() solves directly all the problems from Part I, section 3: - (a1 + b1, a2 + b2, ..., aN + bN) is just: zipWith (+) as bs - (a1 * b1, a2 * b2, ..., aN * bN) is just: zipWith (*) as bs - (a1 and b1, a2 and b2, ..., aN and bN) is just: zipWith and as bs - (a1 or b1, a2 or b2, ..., aN or bN) is just: zipWith or as bs A very useful function is scanl: scanl f q xs = q : (case xs of [] -> [] x:xs -> scanl f (f q x) xs) It is similar to foldl, but creates a list, every element of which contains the intermediate accumulated result. The first element of the result-list is q. In case the list is guaranteed to be non-empty, then the following function can be defined: scanl1 f (x:xs) = scanl f x xs It behaves like scanl(), but doesn’t use a “zero” argument. As can be easily seen, scanl1 `op` xs produces a list of the intermediate accumulated results of performing op() on the list xs. For example: scanl1 (+) [1, 2, 3] = [1, 3, 6] = [1, 1+2, 3 + 3] scanl1() combined with zipWith() solves directly the problem of calculating the running total from Part I, section 1: scanl1 (+) (zipWith (*) [1,2,3] [2,2,2]) returns: [2, 6, 12] A direct solution to the filtering problems defined in Part I, section 2, is provided by the function filter(): filter p xs = [ x | x <- xs, p x ] it takes a function p() defined on the type of elements of its second argument - a list xs, and returns a list of those elements of xs, for which p(x) = true. Using it, we can write: - Return all items in a sequence, for which f(item) has minimal value minvals f xs = filter (= fmin) ys where ys = map f xs, fmin = minimum ys - For some function f() test whether all the values of f(item) on a sequence are equal (> 0, etc.) allFPositive f xs = foldl and ys where ys = map (( > 0) . f) xs - For some function f() test whether all values f(item) on a sequence are in increasing order. allFIncreasing f xs = foldl and ys where ys = zipWith (<) (init xs) (tail xs) In the last solution we used the init() function, which is the dual of tail() - from a list xs it produces another, containing all elements of the first list , but the last one: init (x:xs) = x : init xs Finally, here’s an example how to keep a language small and simple: Whenever a programmer needs a “mapping operator”, she could produce it immediately herself, without having to ask a working group for including it in the “standard language”, as follows: 1. She defines the function map(): map f xs = foldl ( (:) . f ) [ ] xs 2. Because she needs to apply the “mapping operator” repeatedly, for convenience she defines a multiMap() function: multiMap xs fs = foldr map xs fs multiMap [1, 2, 3] [(1+), (2*), (5+)] produces: [13,15,17] , that is [(1 + 5)*2 + 1, (2 + 5)*2 + 1, (3 + 5)*2 + 1] 3. The multiMap() function as defined above applies the functions starting from the last one in the list. The programmer wants them applied starting from the first function in the list. She re-defines multiMap by changing “foldr” to “foldl” as follows: multiMap xs fs = foldl (flip map) xs fs Now multiMap [1, 2, 3] [(1+), (2*), (5+)] produces: [9, 11, 13] that is: [(1 + 1)*2 + 5, (2 + 1)*2 + 5, (3 + 1)*2 + 5] 4. She is ready to use the new function. For example, she specifies a series of SVG coordinate transformations (this is not in Haskell, but uses XPath 2.0 expressions as lambda expressions): multiMap ($coordinates, ((. *2), (if (position() mod 2) then . + 50 else .) ..................... ) ) III. Conclusion --------------- XPath 2.0 as specified in the current working draft has the problems described in Part I. A language with support for higher-order function is free of these problems. IV. Recommendation ------------------ Based on the above conclusion, I recommend that higher-order functions support be implemented in Xpath 2.0. __________________________________________________ Do You Yahoo!? Send FREE video emails in Yahoo! Mail! http://promo.yahoo.com/videomail/
Received on Wednesday, 16 January 2002 15:16:00 UTC