XPath on encoded SOAP documents

Encoded SOAP documents (that is, XML documents that are generated by 
applying SOAP's encoding rules as described in 
http://www.w3.org/TR/soap12-part2/#soapenc) are a fast-growing segment 
of XML.  Unfortunately, XPath is not well-equipped to process them.

To see why, let's look at a simple example:

   <?xml version='1.0' encoding='UTF-8'?>
   <soap:Envelope xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
     xmlns:xsd='http://www.w3.org/2001/XMLSchema'
     xmlns:soap='http://schemas.xmlsoap.org/soap/envelope/'
     xmlns:soapenc='http://schemas.xmlsoap.org/soap/encoding/'
     soap:encodingStyle='http://schemas.xmlsoap.org/soap/encoding/'>
     <soap:Body>

        <n:getOrderStatusResponse 
xmlns:n='http://tempuri.org/OrderStatusTable'>
          <Result href='#id0'/>
        </n:getOrderStatusResponse>

        <id0 id='id0' soapenc:root='0'
           xmlns:ns2='http://www.user.com/package/'   xsi:type='ns2:Status'>
           <orderNum xsi:type='xsd:string'>10002</orderNum>
           <accountName xsi:type='xsd:string'>IBM</accountName>
           <orderState xsi:type='xsd:string'>Open</orderState>
           <orderPriority xsi:type='xsd:string'>High</orderPriority>
           <orderManager xsi:type='xsd:string'>Barry Bonds</orderManager>
           <shipToInfo href='#id1'/>
       </id0>

       <id1 id='id1' soapenc:root='0'
          xmlns:ns2='http://www.user.com/package/' xsi:type='ns2:ShipTo'>
         <contact xsi:type='xsd:string'>Jeff Kent</contact>
         <address1 xsi:type='xsd:string'>123 Main Street</address1>
         <address2 xsi:type='xsd:string'>Suite 400</address2>
         <city xsi:type='xsd:string'>Oakland</city>
         <state xsi:type='xsd:string'>CA</state>
         <country xsi:type='xsd:string'>USA</country>
         <zip xsi:type='xsd:int'>94612</zip>
       </id1>

     </soap:Body>
   </soap:Envelope>

Note that the document's structure and the structure of the underlying data 
are quite different.  From the data point of view, we have the hierarchy:

	getOrderStatusResponse
	    Result
	        shipToInfo

In XML, the corresponding elements are peers.  The relationships are 
expressed by href attributes which link to id attributes.  (Since the data 
can form an arbitrary graph rather than a strict tree, it's necessary for 
SOAP to be able to use links in addition to containment, of course.)
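
To make the href-to-id linking concrete, here is a minimal sketch in Python 
(standard library only).  It is an illustration, not part of any XPath 
implementation: the helper name deref is hypothetical, and the document is a 
trimmed copy of the example above with the namespaces dropped.  Note that the 
href value carries a leading "#" that the id value does not.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example's Body content (namespaces dropped for brevity).
DOC = """
<Body>
  <getOrderStatusResponse><Result href='#id0'/></getOrderStatusResponse>
  <id0 id='id0'><orderNum>10002</orderNum><shipToInfo href='#id1'/></id0>
  <id1 id='id1'><city>Oakland</city></id1>
</Body>
"""

def deref(root, stub):
    """Follow stub's href to the element with the matching id (hypothetical helper)."""
    href = stub.get("href")
    if href is None or not href.startswith("#"):
        raise ValueError("not a local href stub")
    target_id = href[1:]  # "#id0" -> "id0"
    for elem in root.iter():
        if elem.get("id") == target_id:
            return elem
    return None  # dangling link

root = ET.fromstring(DOC)
stub = root.find("getOrderStatusResponse/Result")
target = deref(root, stub)
print(target.tag)  # id0
```

The target is found purely through its id attribute; its element name (id0) 
plays no part in the lookup.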

The dereferences introduced in XPath 2.0 cannot be used here for three 
reasons, of increasing complexity.

1. href is not strictly speaking an ID

An href attribute that points to an element with id "a" has value 
"#a".  This is a minor point, and requires only a minor adjustment.

2. The element names in these documents are meaningless.

Dereferences use NameTests, just as axes do, which makes sense when element 
names are meaningful.  In these documents, an expression like:

	n:getOrderStatusResponse/@Result #> id0

is useless.  It works on this particular document, but need not work on 
another document containing the same information created by a different 
implementation of the SOAP encoder.  A slight change to the data (e.g. 
adding a billTo address) could result in a different element name even with 
the same encoder.  What's needed is a syntax which doesn't refer to the 
element name of the target.

3. The structure of the document isn't entirely fixed.

The SOAP spec is a bit slippery here, but I think an encoder is within its 
rights to notice that the shipTo element only appears once in the document, 
and inline it, producing:

        <id0 id='id0' ...>
           ....
           <shipToInfo xsi:type='ns2:ShipTo'>
            ...
           </shipToInfo>
        </id0>
	
Now what's needed is a construct which finds a child element and:

	If it's a real element, returns it.
	
	If it's a stub (has an href attribute), traverses its href link and 
	returns the result.
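
The two cases above can be sketched in Python as follows.  This is only an 
illustration under simplifying assumptions: the helper names encoded_child and 
city_of are hypothetical, namespaces are dropped, and only local "#id" links 
are handled.

```python
import xml.etree.ElementTree as ET

# The same data encoded two ways: linked by href/id, and inlined.
LINKED = """<Body>
  <id0 id='id0'><shipToInfo href='#id1'/></id0>
  <id1 id='id1'><city>Oakland</city></id1>
</Body>"""

INLINED = """<Body>
  <id0 id='id0'><shipToInfo><city>Oakland</city></shipToInfo></id0>
</Body>"""

def encoded_child(root, parent, name):
    """Return the child called name, following its href if it is a stub."""
    child = parent.find(name)
    if child is None:
        return None
    href = child.get("href")
    if href is None:
        return child  # a real element: return it as is
    target_id = href[1:] if href.startswith("#") else href
    # a stub: return the element its href points to (None if dangling)
    return next((e for e in root.iter() if e.get("id") == target_id), None)

def city_of(xml_text):
    root = ET.fromstring(xml_text)
    ship_to = encoded_child(root, root.find("id0"), "shipToInfo")
    return ship_to.find("city").text

print(city_of(LINKED), city_of(INLINED))
```

Both encodings yield the same ship-to data, so a caller does not need to know 
which one the encoder chose.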

The obvious way to extend XPath to handle this is to introduce a 
special-purpose function.  I actually did this (starting with standard 
Xalan 2.2), calling it "href".  This works reasonably well when only one 
link needs traversing:

	href(n:getOrderStatusResponse, "Result")

But it gets ugly fast as the number of steps increases:

     href( href( href( n:getOrder, "n:order" ), "Items" )[1], "shipToInfo" )

This might be acceptable if all expressions are computer-generated, but 
definitely not otherwise.  XPath, rightly, expresses this kind of iteration 
with steps.  Accordingly, I added a new axis called "encoded-ref":

	n:getOrder/encoded-ref::Items[1]/encoded-ref::shipToInfo

and, since this became one of the most frequently used axes, added the 
abbreviation "^" :

	n:getOrder/^Items[1]/^shipToInfo
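
To illustrate why steps compose better than nested function calls, here is a 
rough Python sketch that applies one dereferencing step per name, left to 
right, over a node-set.  It is not the Xalan extension itself: encoded_step 
and encoded_path are hypothetical names, predicates such as [1] are omitted, 
and the document is a namespace-free trim of the example at the top.

```python
import xml.etree.ElementTree as ET

DOC = """<Body>
  <getOrderStatusResponse><Result href='#id0'/></getOrderStatusResponse>
  <id0 id='id0'><orderNum>10002</orderNum><shipToInfo href='#id1'/></id0>
  <id1 id='id1'><city>Oakland</city></id1>
</Body>"""

def encoded_step(root, nodes, name):
    """One step: for each node, take children called name, following href stubs."""
    ids = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    out = []
    for node in nodes:
        for child in node.findall(name):
            href = child.get("href")
            if href is None:
                out.append(child)          # inlined value
            elif href[1:] in ids:
                out.append(ids[href[1:]])  # linked value
    return out

def encoded_path(root, start, names):
    """Apply one encoded step per name, like a/^b/^c."""
    nodes = [start]
    for name in names:
        nodes = encoded_step(root, nodes, name)
    return nodes

root = ET.fromstring(DOC)
start = root.find("getOrderStatusResponse")
result = encoded_path(root, start, ["Result", "shipToInfo"])
print([e.find("city").text for e in result])  # ['Oakland']
```

Adding another link to traverse only adds another name to the list, which is 
the property the step syntax gives the XPath expression.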

This has found fairly good user acceptance (even though so far as I know 
none of the users are former Pascal programmers who would find ^ as a 
dereference operator mnemonic:-)   There were a few implementation 
difficulties, because the name following

	encoded-ref::

is *not* a NameTest.  Xalan's implementation of axes is:

	The axis is represented by an iterator that produces all nodes one 
	step along that axis from the starting point.

	The filter (NameTest or NodeTest) that follows filters the product 
	of the iterator.

	The predicates for the step filter the output further.

(I don't know how common this is in XPath implementations, but the XPath 
grammar is quite suited to it.)  This doesn't work here, since the name 
expresses how to navigate to the target nodeset, not any property of the 
target nodes themselves. In the example above, one would get to the 
response (the id0 element) with

    n:getOrderStatusResponse/^Result

but there is nothing about the id0 element itself which matches 
"Result".  Instead, the string "Result" has to be processed by the axis 
iterator itself.
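
The difference can be sketched with Python generators.  This models the 
design described above only loosely and is not Xalan's code: child_axis, 
name_test, ordinary_step and encoded_ref_axis are hypothetical names, and 
namespaces are dropped.

```python
import xml.etree.ElementTree as ET

DOC = """<Body>
  <getOrderStatusResponse><Result href='#id0'/></getOrderStatusResponse>
  <id0 id='id0'><orderNum>10002</orderNum></id0>
</Body>"""

def child_axis(node):
    """Ordinary axis: produce every node one step along the axis."""
    yield from node  # iterating an Element yields its children

def name_test(nodes, name):
    """Ordinary filter: keep nodes whose own name matches."""
    return (n for n in nodes if n.tag == name)

def ordinary_step(node, name):
    """child::name, as iterator followed by filter."""
    return list(name_test(child_axis(node), name))

def encoded_ref_axis(root, node, name):
    """encoded-ref::name, where the name is consumed by the iterator itself."""
    for child in node.findall(name):
        href = child.get("href")
        if href is None:
            yield child
        else:
            for e in root.iter():
                if e.get("id") == href[1:]:
                    yield e

root = ET.fromstring(DOC)
response = root.find("getOrderStatusResponse")
targets = list(encoded_ref_axis(root, response, "Result"))
print([t.tag for t in targets])            # ['id0']
print(list(name_test(targets, "Result")))  # []
```

Applying the name as a filter after the iterator would discard the target, 
since id0 does not match "Result"; the name has to be an argument of the 
iterator instead.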

To sum up:

XPath 2.0 as currently defined cannot process encoded SOAP documents.  It's 
possible to add a new type of axis to remedy this.   The implementation 
costs are non-zero but not prohibitive.  The definition of this axis is 
straightforward, but results in something subtly different from the other axes.

Received on Saturday, 16 March 2002 07:35:36 UTC