- From: Ray Whitmer <rayw@netscape.com>
- Date: Wed, 11 Jul 2001 16:05:20 -0700
- To: Curt Arnold <carnold@houston.rr.com>
- CC: www-dom@w3.org
Well, here is my 2 cents -- apologies for grafitiing your proposal with my own opinions. Curt Arnold wrote: > XPathEvaluator interface > > > > The name seems inconsistent with the names for other Document related > interfaces. Other specs typically use Document[SpecName], > DocumentEvent, DocumentRange, DocumentViews, DocumentTraversal, for > their augmentation of the Document. Though it may not be the most > intuitive naming convention, naming the interface DocumentXPath would > be consistent with the other DOM specs. > The evaluation of XPath expressions is available in XPathEvaluator . A Document which implements the XPath module will be castable using language-specific mechanisms to XPathEvaluator, which will provide evaluation of XPath expressions with no special extension functions or variables. XPathEvaluator implementations may be available from other sources that provide extension functions or variables. The choice was made because Document is only one source of the interface, just like we don't call the Node interface DocumentNode, because it is available from other sources, as is explicitly mentioned in the specification. > I would suggest that in anticipation of XPath 2.0 that: > > > > a) Types corresponding to the 19 primitive XML Schema data types be > defined for each of the bindings. For example, that for the Java > binding, xs:dateTime be represented by java.util.Date for the Java > binding, that xs:date be represented by the starting instant in a > java.util.Date. Since the identity of the 19 types are fixed since > XML Schema datatypes is a recommendation, these bindings could be > defined now, though it would not be possible to get a java.util.Date > from evaluate without an XPath 2.0 compatible query. > It is not clear to me that there is a real advantage in forcing people to invent so many different new types (in most cases) to handle all these different schema types in Java, ECMAScript, etc. Where there are not already native language types created to handle them, I do not believe it does much good to map them to anything. Many, for example, work much better as string or the specific application may have some specific type it wants to put them into. If we are going to go to all the trouble of converting the text which is in the DOM into 19 different objects representing schema types, I would think this should be done in a library outside of DOM XPath. We have the 1.0 common types covered as conveniences for XPath 1.0, but I say let additional XPath 2.0 simple types just evaluate as a String and then call some schema conversion routine if something else is desired, that can also be called by those not using XPath. > b) That feature strings are specified so that you can determine the > version of XPath expression supported. > > > > For example, "DOMXPath" could be used to determine what level of XPath > DOM interfaces are supported and "XPath" what version of expression > syntax is supported (alternatively XPath for interface version and > XPathExpression for XPath expression version). > > > > // if processor supports DOM XPath 3.0 > > if(impl.hasFeature("DOMXPath","3.0")) { > > // if XPath 2.0 expression syntax > > if(impl.hasFeature("XPath","2.0")) { > > } > > else { > > if(impl.hasFeature("XQuery","1.0")) { > > ... > > } > > else { > > if(impl.hasFeature("XPath","1.0")) { > Possibly, but I am not yet confident enough that XPath 2.0 can even be evaluated using these interfaces, given significant differences. > Use Case 1: Immediate evaluation with static result or node set > > > > Processing can not continue until the results are known and once > determined the result should be available regardless of document mutation. > > > > The evaluateAsNumber(), evaluateAsString(), evaluateAsBoolean, > and evaluateAsNode only address this use case. If you do, > > > > String address = xeval.evaluateAsString("address"...); > > > > The query must be complete before the function returns and the return > value is not invalidated or changed by document mutation after the > function returns. > > > > evaluate and evaluateAsNodeSet should also address this use case, > however they also address other use cases and the desired behavior > should be specified. > As long as we keep observed/documented API behavior separate from implementation and performance. Use case 1 could easily be incrementally implemented, as long as it had a way of snapshotting in case of mutation. > Use Case 2: immediate evaluation with result set invalidated on > document mutation > > > > No substantial work can be done until the query is complete and > > all members of the result set are anticipated to be iterated, > > so lazy evaluation or asynchronous evaluation would not be beneficial. > > > > The source document may be changing and the application desires the result > > set to be invalidated if there are any document mutation. > > > > This case could be accomplished by wrapping the node set from use case > 1 by > > an object that implemented both NodeSet and EventListener and > registered itself > > as an eventListener on the document node. However, this use case can be > > addressed in the spec as a convienience to users and to > implementations that > > don't support DOM Events. > Again, on the implementation side, there is a great advantage in getting this result incrementally, because a list can be huge. Wrapping an object that implemented NodeSet could turn out to be much less efficient, because case 1 has to still have a snapshot after mutation, whereas case 2 does not. Having this expressed as a specific interface is not equivalent to a NodeSet and an EventListener, because there is no way for an interface to specify that it needs a NodeSet that is not stale but reflects an expression evaluated on the current hierarchy, as is essential for many use cases unless this constraint is expressible as an interface. > Use Case 3: lazy evaluation with static result set accessed in > document order > > Use Case 4: lazy evaluation with result set accessed in document order > invalidated on document mutation > > Use Case 5: lazy evaluation with static result set accessed in > indeterminant order > > Use Case 6: lazy evaluation with result set accessed in indeterminant > order invalidated on document mutation > Lazy evaluation should be up to the implementation, and as such it is not a use case itself. It is an implementation strategy to more-efficiently handle use cases, especially involving huge result sets and partial use of the results, which are part of the use cases. As long as it doesn't change behavior, the implementation can do whatever it wants. > No substantial work can be done before at least one node of the result > set is known. > > Only part of the result set may be iterated, so it may be beneficial > to incrementally determine the result set. The behavior for use case > 3 should be indistinguishable (other than timing) from that for use > case 1, the behavior for use case 4 should be indistinguishable from > use case 2. Calling length() would typically force the completion of > the evaluation > I do not know what you are referring to as "the behavior" that should be the same between 1 and 3. If you are referring to whether it invalidates or snapshots during mutation, I agree. If you are referring to how it is implemented or how much computation is required, then I think this is up to the implementation. Since 1 and 2 can easily be lazily evaluated, so 3 and 4 can as well, but only if the expression can be processed in such a way as to naturally produce all results incrementally in document order, the difficulty of which will depend upon the expression. 5 and 6 are extremely easy because there is no order to worry about. > Optimal behaviors for use case 3 and 5 may be too complex for > implementation since it would require an immediate completion of the > evaluation when the first mutation of the document is pending. > Basically, if an element was to be added, the outstanding node set > would have to block the mutation until it determined the complete > result set (or at least determined that the mutation would or would > not affect the incremental determination of the result set). However > the behavior for use case 1 would be acceptible. > 1, 3, and 5 I tend to group together, because I don't like huge static lists even if every node will be processed. An immediate completion of the evaluation when the first mutation is pending is one possibility for incremental implementation. Another is a DOM implementation I created with a data model which permits you to hold on to a specific snapshot of the hierarchy (sort of like a logged file system) as it continues to mutate. Neither of these are too difficult to seriously contemplate. > Use cases 5 and 6 allow an optimized behavior when the order of > processing is not significant, for example, if summing the line items > in an order to determine the total cost. > 5 and 6 certainly make it easier if you do not have to worry about an order. The difference depends upon the quality of the implementation. In high-quality implementations, there need be no significant difference for many common expressions. > Use Case 7: asynchronous evaluation with static result set accessed in > document order > > Use Case 8: asynchronous evaluation with result set accessed in > document order invalidated by document mutation > > Use Case 9: asynchronous evaluation with static result set accessed > in indeterminate order > > Use Case 10: asynchronous evaluation with result set accessed in > indeterminant order invalidated by document mutation > > > > These cases differ from their i-4 siblings in that some substantial > work could be accomplished by the caller before the first member is > accessed and in the intervals between accesses. Again other than > timing issues the behavior should be indistinguishable i-4 and i-6 > siblings. > I think you are talking about implementations, rather than use cases, which I would believe are defined by the application. So far, the behavioral differences I have seen are: snapshoting versus invalidating ordered versus unordered partial processing of the result from a single node to all And the implementation differences I have seen (indistinguishable from a behavioral perspective) are: Complete versus incremental versus asynchronous computation Caching a list or not caching a list The goal of the API is to accomodate each desired behavior while giving the implementation as much information and freedom as it needs to make the best choices. > Different uses may want different behavior when a node access is > attempted when the next node has not been determined. In some > instances, it may be preferrable to block until the next node (or null > if it was determined that no additional nodes matched) becomes > available, in other instances, it may be preferrable to continue to do > something else until a node becomes available. > > > > Use cases 1sort, 2sort, 7sort and 8sort > > > > Same behavior as 1, 2, 7 and 8 but with the nodes sorted. Doesn't > make any sense for lazy evaluation of a sorted set. > > > > Dimensions of the use cases: > > > > The desired behavior that a caller would prefer could be indicated by > combinations of and choice of NodeSet methods. > > > > XPathEvaluationCode > > > > XPATH_EVALUATE_IMMEDIATE > > XPATH_EVALUATE_LAZY > > XPATH_EVALUATE_ASYNC_BLOCK // block until next node available > > XPATH_EVALUATE_ASYNC_EXCEPTION // throw exception if node not available > > > > This code would only be a hint that the processor could ignore and > > use any behavior it wanted (except ASYNCH_EXCEPTION). > This is implementation, which should produce the same behavior, so making it a hint is OK, but I am not sure why the application should care much at all how the implementation chooses to do it. It seems to me that the application is very likely to make the wrong choice. Combine that likelyhood with the likelyhood that the implementation does not give the application a choice and I don't see the point. The application's primary concern is that it gets the result, and the implementation's concern is to make it available as efficiently as possible balancing various factors. > XPathMutationCode > > > > // result set is static, not affected by mutation, > > // possibly out of synch with current state of document > > XPATH_MUTATION_IGNORE > > // > > // result set in invalidated by any mutation of the document > > XPATH_MUTATION_INVALIDATE > > // > > // cause attempts to change document to throw exception > > // > > // attempts at mutations would force a garbage collect to > > // make sure that a unused NodeSet is not blocking them > > XPATH_MUTATION_PREVENT > > > > The implementation must support MUTATION_IGNORE, but if > > it cannot detect mutations, it needs to throw an exception > > if a query requests MUTATION_INVALIDATE. > This is not behavior, which is why it cannot be ignored. Making it optional greatly reduces its usefulness. People generally will not write two versions of code, one for if a feature is supported and one for if it is not. > Lazy evaluation [document order]: > > > > The traditional iterative loop: > > > > for(int i = 0; i < nodeset.length; i++) { > > node = nodeset.item(i); > > } > > > > would defeat lazy evaluation (since determining the length would > typically only be available after the query has been completed). Lazy > evaluation (in document order) would require a loop like: > > > > try { > > for(int i = 0; ; i++) { > > node = nodeset.item(i); > > } > > } > > catch(DOMException ex) { > > if(ex.code != INDEX_SIZE_ERR) { > > throw ex; > > } > > } > This is why people liked the non-snapshotting ActiveNodeSet as an iterator with no length function, because it seems less common for this to be cached into a list that has an available length than for the snapshotting StaticNodeSet. > Indeterminate order iteration: > > > > A complex query over a less traditional store (say a database or > distributed query system) may be able to determine members of the > result set before a document or sort order may be established. > Instead of creating a distinct interface for this, I think it would be > better just add to additional method to support this type of iteration > on NodeSet. To keep the object stateless (which avoids the need to > worry about cloning), I'd suggest using an integer argument to act as > a key with a special value (0) indicating start of iteration and -1 > indicating end of iteration (all other values are reserved for the > implementation to use as it sees fit to keep track of the position. > > > > For example: > > > > NodeSet nodeset = xeval.evaluteAsNodeSet("//person[@id]"...); > > Integer iterKey = new Integer(0); > > do { > > Node node = nodeset.iterate(iterKey); > > if(node == null) { > > break; > > } > > ... > > } > > while(iterKey.intValue != -1); > The issue of a stateful has been discussed and typically hasn't seemed to be a big concern to many people. In your example, you created a state object anyway -- although if the code is supposed to be Java, last time I checked, Integer was an immutable class. Try an integer array size 1 instead. I think it is simpler to keep the state in the iterator if we want to control order of access and not worry about what happens if someone keeps a random key value around and tries to use it later. > Sorted iteration: > > > > A sorting capability patterned after <xsl:sort> would be extremely > useful. What I would suggest would be a distinct XPathSortCriteria > interface with a corresponding create method on the > XPathEvaluator/DocumentXPath. If an implementation doesn't support > sorting, it could return null. > > > > Interface NamespaceResolver: > > > > This interface acts strictly as a map between two strings, a namespace > prefix that acts as a key and namespace URI as a value. Since most > platforms will already have an abstract interface, such as > java.util.Map or System.IDictionary, that performs this function and a > variety of concrete implementations available. A factory method > > should be provided that will generate a NamespaceResolver from an > object implementing the appropriate platform interface. > > > > In most use cases, the mapping between prefixes in the query is > independent of the prefixes in the document. Though getting a > NamespaceResolver interface that corresponds to the in-force prefixes > for a certain node is desirable, I'd expect it would be a secondary > use case to building the mapping within the application. > In an earlier version of the API, we used an Element instead of a NamespaceResolver. For many, getting it from the context node will be the only use case (whenever the XPath string comes from a document, such as XSL, XLink XPointers, etc.) Creating a custom factory for various bindings is a convenience of value for some applications, but it is not clear to me how many. Many applications may prefer to just parse a namespace definition tag from a constant string that has all the desired namespaces letting DOM supply the map or write a very small wrapper on their own favorite object. I think it is not burdensome to leave alternative map creation up to the application, which was the point of having the interface, so the application could do it in any way it needed to. > In summary, here is a Java-esqe sketch of the interfaces as outlined > in this document: > Let me re-comment it, then, in summary. > package org.w3c.dom.xpath; > > > > > > public class XPathException extends RuntimeException { > > public XPathException(short code, > > String queryLanguage, > > String query, > > String message) { > > super(message); > > this.code = code; > > this.query = query; > > this.queryLanguage = queryLanguage; > > } > > public short code; > > public String query; > > public String queryLanguage; > > > > public static final short XPATH_COERSCION_ERR = 1; > > public static final short XPATH_INVALID_QUERY_ERR = 2; > > public static final short XPATH_UNSUPPORTED_QUERY_LANG_ERR = 3; > > public static final short XPATH_UNSUPPORTED_SORT_DATATYPE_ERR = 4; > > public static final short XPATH_SORT_NOTIMPL_ERR = 5; > > public static final short XPATH_UNSUPPORTED_MUTATION_BEHAVIOR_ERR = 6; > > } > The invalid query error code is clearly needed. The other rely on extensions you propose that it is not clear to me should be standardized. Fortunately, they could be added in the future without breaking compatibility. > // > > // add values for DOMException.code for access to > > // an invalidated NodeSet and an access before a member > > // is ready on an asynch call > > public static final short SET_INVALID_ERR = ...; > > public static final short NOT_READY_ERR = ...; > We need the invalid error. I think any asynchronous implementation must behave as though it is synchronous, blocking until the necessary data is available. > // > > // previously XPathEvaluator > > // > > public interface DocumentXPath { > I think we do not want to limit this to just being a single interface on Document. > // evaluation codes > > // > > public static final short XPATH_EVALUATE_IMMEDIATE = 0; > > public static final short XPATH_EVALUATE_LAZY = 1; > > public static final short XPATH_EVALUATE_ASYNC_BLOCK = 2; > > public static final short XPATH_EVALUATE_ASYNC_EXCEPTION = 3; > It is not clear to me that in most cases the application should care how the evaluation occurs. The implementation should always do its best. I would forget these for now. > // result set is static, not affected by mutation, > > // possibly out of synch with current state of document > > public static final short XPATH_MUTATION_IGNORE = 0; > > // > > // result set in invalidated by any mutation of the document > > public static final short XPATH_MUTATION_INVALIDATE = 1; > > // > > // cause attempts to change document to throw exception > > // > > // attempts at mutations would force a garbage collect to > > // make sure that a unused NodeSet is not blocking them > > public static final short XPATH_MUTATION_PREVENT = 2; > It caling it "IGNORE" might be interpreted that the set would mutate with the document as happens everywhere else in DOM. It should be called a snapshot or static. I would rather have separate functions than a flag, because it is easier for most users to call the proper function than add a particular flag and it makes it easier to return different result types, which I also believe is important. I do not fully understand your intent with "PREVENT" but I find many problems with the idea. I know of no language where you can "force a garbage collect to make sure that an unused NodeSet is not blocking them." Java only permits hints, and if you have ever monitored what it does, it seldom collects all or even most of the available objects, even after dozens of calls in a tight loop. It is also historically ill-fated to try to tie anything to the garbage collector because it can cause bizarre deadlocks and other problems. > public NodeSet evaluateAsNodeSet(String queryLanguage, > > String expression, > > Node contextNode, > > NamespaceResolver > namespaceResolver, > > > XPathSortCriteria sort, > > short > evaluationCode, > > short mutationCode) > > throws > XPathException; > I think this has far too many arguments to be simple in simple cases. Let's use different interfaces for different query languages -- even perhaps for XPath 2.0. It will be simpler for users rather than trying to overload this API with things it doesn't need trying to anticipate. If there isn't a single document order most are happy with, then let's leave sorting up to some other layer. The reasons for supporting sorting is that it seems common so having it easily available is good and the implementation might be able to produce the nodes in that order anyway. Making the order user defined seems to make it not easy for the user and very unlikely that the implementation could know it was naturally producing them in an order that matched the order. The evaluation code isn't needed by most use cases I can think of. Instead of a mutation code, I like different methods, for simplicity and different return types corresponding to the different behaviors. > // > > // would like this, but debatable > > // > > public XPathExpression createExpression(String queryLanguage, > > String expression, > > NamespaceResolver > namespaceResolver, > > > XPathSortCriteria sort, > > short > evaluationCode, > > short mutationCode) > > throws > XPathException; > > > > } > You are clearly not the only one requesting an object to represent a compiled expression. Ray Whitmer rayw@netscape.com
Received on Wednesday, 11 July 2001 19:00:43 UTC