Re: Canonicalization xml:base processing

Dear Richard, 

This email is about the xml:base fix-up. First of all, I'd like to give 
some examples to make sure we agree on how the algorithm for xml:base 
fix-up shall behave.
These examples could potentially go into the final document as well. 
After the examples follows a suggestion for a new section 2.4.
By the way, when I tried to merge the algorithms I found that a lot of 
what you wrote was already there in the text, although in a very 
"xpathified" manner and with minor errors, much like most of the text in 
the document. I tried to untangle this bit a little and I hope it is 
more readable now.

Up front I'd like to mention that after talking to Jose Kahan and 
thinking about the issue a little longer, we'd still prefer to also 
perform "dot and dot-dot canonicalization" (a.k.a. remove_dot_segments). 
It will allow the reuse of existing implementations of relative URI 
resolution. More important from my point of view, however: "dot and 
dot-dot canonicalization" maps more equivalent documents onto the same 
serialized output and helps to avoid false negatives in XMLDSig.
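
To make the point about equivalent documents concrete, here is a minimal 
sketch of dot-segment removal, written after the remove_dot_segments 
routine of RFC 3986, section 5.2.4. The function name and the choice of 
Python are only for illustration. Note that the RFC routine drops 
leading ".." segments, so it is meant for the path of an already merged 
URI, not for a relative path that should keep them.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments, following RFC 3986, section 5.2.4."""
    output = []  # segments emitted so far, each with its leading "/" if any
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()  # ".." cancels the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)


# Two spellings of the same path end up as the same string:
print(remove_dot_segments("/four/./five/./../file.xsd"))  # /four/file.xsd
print(remove_dot_segments("/four/file.xsd"))              # /four/file.xsd
```

Since both spellings serialize identically after this step, a signature 
computed over one of them would also verify against the other.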

--- Examples for xml:base fixup ---

For the given input
  <a xml:base="one/two">
    <b xml:base="//three/four/./five/./../file.xsd">
      <c xml:base="a.file"/>
      <d>
        <e xml:base="#bare-name">
          <f xml:base=""/>
          <f1/>
        </e>
        <g xml:base="//five/"/>
      </d>
      <h xml:base="http://www.iaik.tugraz.at">
        <i xml:base="/aboutus/people/index.php">
          <j xml:base="lanz/index.php"/>
        </i>
      </h>
    </b>
  </a>

with <a> being clipped out, c14n shall output:

    <b xml:base="//three/four/./five/./../file.xsd">
      <c xml:base="a.file"/>
      <d>
        <e xml:base="#bare-name">
          <f xml:base=""/>
          <f1/>
        </e>
        <g xml:base="//five/"/>
      </d>
      <h xml:base="http://www.iaik.tugraz.at">
        <i xml:base="/aboutus/people/index.php">
          <j xml:base="lanz/index.php"/>
        </i>
      </h>
    </b>

with <b> being clipped out, c14n gives:
  <a xml:base="one/two">
   
      <c xml:base="//three/four/./five/./../a.file"/> 
("//three/four/a.file")
      <d xml:base="//three/four/./five/./../file.xsd"> 
("//three/four/file.xsd")
        <e xml:base="#bare-name">
          <f xml:base=""/>
          <f1/>
        </e>
        <g xml:base="//five/"/>
      </d>
      <h xml:base="http://www.iaik.tugraz.at">
        <i xml:base="/aboutus/people/index.php">
          <j xml:base="lanz/index.php"/>
        </i>
      </h>
  
  </a>

with <b> and <d> being clipped out:
  <a xml:base="one/two">
   
      <c xml:base="//three/four/./five/./../a.file"/>
     
        <e xml:base="//three/four/./five/./../file.xsd#bare-name"> 
("//three/four/file.xsd#bare-name")
          <f xml:base=""/>
          <f1/>
        </e>
        <g xml:base="//five/"/>
    
      <h xml:base="http://www.iaik.tugraz.at">
        <i xml:base="/aboutus/people/index.php">
          <j xml:base="lanz/index.php"/>
        </i>
      </h>
  
  </a>

with <b>, <d> and <e> being clipped out:
  <a xml:base="one/two">
   
      <c xml:base="//three/four/./five/./../a.file"/>
     
       
          <f xml:base="//three/four/./five/./../file.xsd"/> 
("//three/four/file.xsd")
          <f1 xml:base="//three/four/./five/./../file.xsd#bare-name"/>
       
        <g xml:base="//five/"/>
    
      <h xml:base="http://www.iaik.tugraz.at">
        <i xml:base="/aboutus/people/index.php">
          <j xml:base="lanz/index.php"/>
        </i>
      </h>
  
  </a>

--- Section 2.4 reworded ---


      2.4 Document Subsets

    Some applications require the ability to create a physical
    representation for an XML document subset (other than the one
    generated by default, which can be a proper subset of the document
    if the comments are omitted). Implementations of XML
    canonicalization that are based on XPath can provide this
    functionality with little additional overhead by accepting a
    node-set as input rather than an octet stream. The processing of an
    element node E MUST be modified slightly when an XPath node-set is
    given as input and the element's parent (direct ancestor) is omitted
    from the node-set. This is necessary because omitted nodes SHALL NOT
    break the inheritance rules of inheritable attributes defined in the
    xml namespace.

[Definition:] Simple inheritable attributes are attributes that have a 
value that requires at most a simple redeclaration. This redeclaration 
is done by supplying a new value in the child axis. The redeclaration of 
a simple inheritable attribute A contained in an element E is done by 
supplying a new value to an attribute with the same name contained in a 
descendant element of E. Simple inheritable attributes are xml:lang and 
xml:space.

    The method for processing the attribute axis of an element E in the
    node-set is enhanced. All element nodes along E's ancestor axis are
    examined for nearest occurrences of simple inheritable attributes in
    the xml namespace, such as xml:lang and xml:space (whether or not
    they are in the node-set). From this list of attributes, remove any
    simple inheritable attributes that are in E's attribute axis
    (whether or not they are in the node-set). Then, lexicographically
    merge this attribute list with the nodes of E's attribute axis that
    are in the node-set. The result of visiting the attribute axis is
    computed by processing the attribute nodes in this merged attribute
    list.
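
The processing of simple inheritable attributes described above could be 
sketched roughly as follows. The data model is an assumption made only 
for this illustration: each element's attributes are a plain dict, 
ancestors are listed nearest first, and "lexicographic" merging is 
simplified to sorting by qualified name (the real attribute ordering 
also takes namespace URIs into account). The function names are 
hypothetical.

```python
SIMPLE_INHERITABLE = ("xml:lang", "xml:space")


def inherited_simple_attrs(ancestor_attrs, own_attrs):
    """Collect the nearest xml:lang / xml:space from the ancestors.

    ancestor_attrs: list of attribute dicts, nearest ancestor first,
                    regardless of whether they are in the node-set.
    own_attrs:      all attributes in E's attribute axis,
                    regardless of whether they are in the node-set.
    """
    inherited = {}
    for attrs in ancestor_attrs:
        for name in SIMPLE_INHERITABLE:
            # Only the nearest occurrence of each attribute counts.
            if name in attrs and name not in inherited:
                inherited[name] = attrs[name]
    # Anything E declares itself is a redeclaration and wins.
    for name in own_attrs:
        inherited.pop(name, None)
    return inherited


def attribute_axis(ancestor_attrs, own_attrs, own_in_node_set):
    """Merge inherited attributes with E's attributes that are in the node-set."""
    merged = inherited_simple_attrs(ancestor_attrs, own_attrs)
    for name, value in own_attrs.items():
        if name in own_in_node_set:
            merged[name] = value
    return sorted(merged.items())  # simplified ordering, see above


ancestors = [{"xml:lang": "en"}, {"xml:lang": "de", "xml:space": "preserve"}]
print(attribute_axis(ancestors, {"id": "x"}, {"id"}))
# [('id', 'x'), ('xml:lang', 'en'), ('xml:space', 'preserve')]
```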

The xml:base attribute is not a simple inheritable attribute and 
requires special processing beyond a simple redeclaration. A "join URI" 
function is used, which takes any URI (uri-1) from an ancestor, joins a 
relative URI of E (rel-uri-2) onto it (in most cases after the last 
slash of uri-1) and then normalizes the result. We describe here a 
simple method for providing this function. This method uses a separate 
string buffer in a manner similar to that found in section 5.2 of RFC 
2396. Please refer to this source for the terms and definitions used in 
the following algorithm.

   1. If the first URI (uri-1) is null, continue with step 2; otherwise
      copy uri-1 to the buffer. In other words, all characters of uri-1
      are copied to the buffer.
   2. If the relative URI (rel-uri-2) is null, continue with step 5.
      Otherwise, if rel-uri-2 starts with a '#' (hash) or is the empty
      string "", continue with step 4. Otherwise remove the last segment
      of the path component of uri-1: anything after the last
      (right-most) slash character, if any, is removed from the buffer.
   3. If rel-uri-2 starts with a '/' (slash), delete all "<segment>/"
      from the buffer, where <segment> is a complete path segment. If
      rel-uri-2 starts with '//' (two slashes), additionally delete the
      '//<authority>/'.
   4. The relative URI is appended to the buffer string.
   5. All occurrences of "./", where "." is a complete path segment, are
      removed from the buffer string.
   6. If the buffer string ends with "." as a complete path segment,
      that "." is removed.
   7. All occurrences of "<segment>/../", where <segment> is a complete
      path segment not equal to "..", are removed from the buffer
      string. Removal of these path segments is performed iteratively,
      removing the leftmost matching pattern on each iteration, until no
      matching pattern remains.
   8. If the buffer string ends with "<segment>/..", where <segment> is
      a complete path segment not equal to "..", that "<segment>/.." is
      removed.
   9. If the resulting buffer string begins with
      "<scheme>://<authority>/" followed by one or more complete path
      segments of "..", then the resulting URI is considered to be in
      error and implementations SHOULD indicate this error and fail. If,
      however, there are no "<scheme>://<authority>/" components, the
      result is a relative URI starting with a relative path. If
      processing continues, implementations MUST handle this by
      retaining the leading ".." complete path segments in the resulting
      path (i.e., treating them as part of the final URI, e.g.
      ../../<segment>/<segment> )*.

This function may also be called with a null URI, i.e. when no xml:base 
attribute exists in E (not to be confused with xml:base="").
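
One possible reading of the "join URI" steps above, as a rough Python 
sketch rather than a reference implementation. Several points are 
assumptions that the steps do not spell out: an absolute rel-uri-2 (one 
with a scheme) simply replaces uri-1; an existing fragment on uri-1 is 
dropped before "" or "#..." is appended; query components get no special 
treatment; and an empty path after an authority is treated as "/". 
None stands for a null URI, and example.org below is a placeholder host.

```python
import re

_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")
_PREFIX = re.compile(r"(?:[A-Za-z][A-Za-z0-9+.-]*:)?(?://[^/?#]*)?")


def _split(uri):
    """Return ('<scheme>:' + '//<authority>', path, '?query#fragment' tail)."""
    end = _PREFIX.match(uri).end()
    prefix, rest = uri[:end], uri[end:]
    cut = re.search(r"[?#]", rest)
    if cut:
        return prefix, rest[:cut.start()], rest[cut.start():]
    return prefix, rest, ""


def _normalize(uri):
    """Steps 5 to 9: drop "." and "<segment>/.." from the path component."""
    prefix, path, tail = _split(uri)
    segments, out = path.split("/"), []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")  # "a/." becomes "a/"
        elif seg == ".." and out and out[-1] not in ("", ".."):
            out.pop()
            if last:
                out.append("")  # "a/b/.." becomes "a/"
        else:
            out.append(seg)  # ordinary segment, or a ".." that must be kept
    path = "/".join(out)
    climbs = path == "/.." or path.startswith("/../")
    if climbs and "//" in prefix and _SCHEME.match(prefix):
        raise ValueError("'..' climbs above the root of " + prefix)  # step 9
    return prefix + path + tail


def join_uri(uri1, rel2):
    """Join rel2 onto uri1; None means "no xml:base attribute"."""
    if rel2 is None:  # step 2: nothing to append, only normalize
        return None if uri1 is None else _normalize(uri1)
    if uri1 is None or _SCHEME.match(rel2):
        return _normalize(rel2)  # assumption: absolute rel2 replaces uri1
    prefix, path, tail = _split(uri1)
    if rel2 == "" or rel2.startswith("#"):
        tail = tail.split("#", 1)[0]  # assumption: old fragment is dropped
        return _normalize(prefix + path + tail + rel2)
    if rel2.startswith("//"):  # step 3: keep only the scheme, if any
        scheme = _SCHEME.match(prefix)
        return _normalize((scheme.group(0) if scheme else "") + rel2)
    if rel2.startswith("/"):  # step 3: drop the whole path
        return _normalize(prefix + rel2)
    if "//" in prefix and path == "":
        path = "/"  # assumption: empty path after an authority
    return _normalize(prefix + path[: path.rfind("/") + 1] + rel2)  # steps 2, 4


print(join_uri("//three/four/./five/./../file.xsd", "a.file"))
# //three/four/a.file
print(join_uri("//three/four/file.xsd", "#bare-name"))
# //three/four/file.xsd#bare-name
print(join_uri("one/two", "../../x"))
# ../x
```

Under these assumptions the sketch gives the normalized values shown in 
parentheses in the examples, and it raises an error when ".." segments 
would climb above "<scheme>://<authority>/".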

The method for processing the attribute axis of an element E in the 
node-set hence needs to be enhanced further.
The element nodes along E's ancestor axis are examined for all 
occurrences of omitted non-simple inheritable attributes in the xml 
namespace (i.e. attributes that are not in the node-set), such as 
xml:base, up to but excluding their first rendered occurrence (i.e. the 
first one that is in the node-set). Only if such attributes exist will 
E's xml:base attribute be changed (i.e. added or fixed up). The selected 
xml:base attributes are joined by iteratively calling the "join URI" 
function described previously, beginning with the two xml:base 
attributes closest to the document root, until the new value for E's 
xml:base attribute remains (it may also be null).

Then, lexicographically merge this fixed up attribute with the nodes of 
E's attribute axis that are in the node-set. The result of visiting the 
attribute axis is computed by processing the attribute nodes in this 
merged attribute list.
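
The selection and iterative joining of xml:base attributes might be 
sketched as below. The representation is hypothetical: ancestors are 
given nearest first as (in_node_set, xml_base) pairs, and the join 
function is passed in as a parameter so that the sketch stays 
independent of any particular "join URI" implementation. It also assumes 
that the walk stops at the nearest ancestor element that is in the 
node-set, since that element carries any fixed-up xml:base itself; this 
is how the <e> element behaves in the example where only <b> is clipped.

```python
def fixup_xml_base(ancestors, own_base, join):
    """Compute the fixed-up xml:base value for an element E.

    ancestors: list of (in_node_set, xml_base) pairs, nearest ancestor
               first; xml_base is None if the element has no xml:base.
    own_base:  E's own xml:base value, or None.
    join:      callable(uri1, rel2) implementing the "join URI" function.
    """
    omitted = []
    for in_node_set, base in ancestors:
        if in_node_set:
            break  # first rendered ancestor: stop, exclusive
        if base is not None:
            omitted.append(base)
    if not omitted:
        return own_base  # nothing was omitted, leave E untouched
    omitted.reverse()  # closest to the document root first
    result = omitted[0]
    for base in omitted[1:]:
        result = join(result, base)
    return join(result, own_base)


# A tracing stand-in for "join URI" that only shows the order of the calls.
trace = lambda uri1, rel2: f"J({uri1}, {rel2})"

# <f xml:base=""/> with <e>, <d> and <b> omitted and <a> rendered:
chain = [(False, "#bare-name"), (False, None), (False, "file.xsd"), (True, "one/two")]
print(fixup_xml_base(chain, "", trace))
# J(J(file.xsd, #bare-name), )
```

Passing a real join function instead of the tracing stand-in would 
yield the actual attribute value for E.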



best regards
Konrad


Richard Tobin wrote:  
> To fix up the xml:base attribute of an element E:
>
> If the base URI of the immediate container of E is known (and is
> therefore by definition absolute), determine the base URI of E
> according to xml:base.  Set the xml:base attribute to this value.
>
> If the base URI of E's container is not known (which can only be the
> case if the base URI of the document is unknown, and there is no
> ancestor element with an absolute xml:base attribute), proceed as
> follows:
>
>  - if there is no ancestor with an xml:base attribute, leave E's
>    xml:base attribute (if any) unchanged;
>
>  - if the nearest ancestor with an xml:base is not being omitted,
>    leave E's xml:base attribute (if any) unchanged;
>
>  - otherwise we must construct an xml:base attribute giving E's base
>    URI relative to the nearest non-omitted ancestor with an xml:base
>    attribute (call this ancestor A).  Find the xml:base attributes of
>    the omitted ancestor elements between A and E.  Take these in
>    outer-to-inner order, followed by E's xml:base attribute if it
>    has one.  This is a sequence of relative URIs.  Discard the last
>    segment - the characters after the last slash - of all but the last
>    of these.  If any of these URIs has no slash character, discard it
>    completely.  Concatenate the resulting strings, and use this as the
>    xml:base attribute of E.
>
> -- Richard
>
>   


-- 
Konrad Lanz, IAIK/SIC - Graz University of Technology
Inffeldgasse 16a, 8010 Graz, Austria
Tel: +43 316 873 5547
Fax: +43 316 873 5520
https://www.iaik.tugraz.at/aboutus/people/lanz
http://jce.iaik.tugraz.at

Certificate chain (including the EuroPKI root certificate):
https://europki.iaik.at/ca/europki-at/cert_download.htm

Received on Thursday, 18 May 2006 11:07:35 UTC