RE: Canonicalization xml:base processing from Grosso, Paul on 2006-05-18 (public-xml-core-wg@w3.org from May 2006)

From: Grosso, Paul <pgrosso@ptc.com>
Date: Thu, 18 May 2006 10:39:35 -0400
To: <public-xml-core-wg@w3.org>
Message-ID: <CF83BAA719FD2C439D25CBB1C9D1D3020357AD4F@HQ-MAIL4.ptcnet.ptc.com>
I'll leave it to others to comment on the substance, but I note your 
wording mentions RFC 2396, which has already been superseded. Can we avoid
referencing an old RFC?
 
paul


________________________________

	From: Konrad Lanz [mailto:Konrad.Lanz@iaik.tugraz.at] 
	Sent: Thursday, 2006 May 18 06:07
	To: Richard Tobin
	Cc: Grosso, Paul; public-xml-core-wg@w3.org
	Subject: Re: Canonicalization xml:base processing
	
	
	Dear Richard,  
	
	This email is about the xml:base fix up. First of all I'd like
to give some examples to make sure we have an agreement on how the
Algorithm for xml:base fix up shall behave.
	These examples could potentially go into the final document as
well. After the examples follows a suggestion for a new section 2.4.
	Btw., when I tried to merge the algorithms I found that a lot
of what you wrote was already there in the text, albeit in a very
"xpathified" manner and with minor errors, much like most
of the text in the document. I tried to untangle this bit a little and I
hope it is more readable now.
	
	Up front I'd like to mention that after talking to Jose Kahan
and thinking about the issue for a little longer, we'd still prefer to
also perform "dot and dot-dot canonicalization" (aka
remove_dot_segments). It will allow the reuse of existing
implementations for relative URI resolution. More important, from my
point of view, is that "dot and dot-dot canonicalization" maps
more equivalent documents onto the same serialized output and helps
to avoid false negatives in XMLDSig.
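
For illustration, here is a minimal Python sketch of dot-segment removal, following the steps of the remove_dot_segments routine in RFC 3986, section 5.2.4. It operates on the path component only; the function name and the list-based output buffer are choices made for this sketch. Note that this routine drops leading ".." segments of a relative path, whereas step 9 of the proposed section 2.4 below retains them.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URI path (RFC 3986, 5.2.4)."""
    out = []  # output buffer: one entry per segment, with its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # "/../x" becomes "/x"
            if out:
                out.pop()              # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            out.append(path[:end])
            path = path[end:]
    return "".join(out)


# Path component of "//three/four/./five/./../file.xsd" from the examples below.
print(remove_dot_segments("/four/./five/./../file.xsd"))  # /four/file.xsd
```

Two documents whose xml:base values differ only in such dot segments would then serialize to the same attribute value.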
	
	--- Examples for xml:base fixup ---
	
	For the given input
	  <a xml:base="one/two">
	    <b xml:base="//three/four/./five/./../file.xsd">
	      <c xml:base="a.file"/>
	      <d>
	        <e xml:base="#bare-name">
	          <f xml:base=""/>
	          <f1/>
	        </e>
	        <g xml:base="//five/"/>
	      </d>
	      <h xml:base="http://www.iaik.tugraz.at">
	        <i xml:base="/aboutus/people/index.php">
	          <j xml:base="lanz/index.php"/>
	        </i>
	      </h>
	    </b>
	  </a> 
	
	with <a> being clipped out, c14n shall output:
	
	    <b xml:base="//three/four/./five/./../file.xsd">
	      <c xml:base="a.file"/>
	      <d>
	        <e xml:base="#bare-name">
	          <f xml:base=""/>
	          <f1/>
	        </e>
	        <g xml:base="//five/"/>
	      </d>
	      <h xml:base="http://www.iaik.tugraz.at">
	        <i xml:base="/aboutus/people/index.php">
	          <j xml:base="lanz/index.php"/>
	        </i>
	      </h>
	    </b>
	
	with <b> being clipped out, c14n gives:
	  <a xml:base="one/two">
	    
	      <c xml:base="//three/four/./five/./../a.file"/> ("//three/four/a.file")
	      <d xml:base="//three/four/./five/./../file.xsd"> ("//three/four/file.xsd")
	        <e xml:base="#bare-name">
	          <f xml:base=""/>
	          <f1/>
	        </e>
	        <g xml:base="//five/"/>
	      </d>
	      <h xml:base="http://www.iaik.tugraz.at">
	        <i xml:base="/aboutus/people/index.php">
	          <j xml:base="lanz/index.php"/>
	        </i>
	      </h>
	   
	  </a> 
	
	with <b> and <d> being clipped out:
	  <a xml:base="one/two">
	    
	      <c xml:base="//three/four/./five/./../a.file"/>
	      
	        <e xml:base="//three/four/./five/./../file.xsd#bare-name"> ("//three/four/file.xsd#bare-name")
	          <f xml:base=""/>
	          <f1/>
	        </e>
	        <g xml:base="//five/"/>
	     
	      <h xml:base="http://www.iaik.tugraz.at">
	        <i xml:base="/aboutus/people/index.php">
	          <j xml:base="lanz/index.php"/>
	        </i>
	      </h>
	   
	  </a> 
	
	with <b>, <d> and <e> being clipped out:
	  <a xml:base="one/two">
	    
	      <c xml:base="//three/four/./five/./../a.file"/>
	      
	        
	          <f xml:base="//three/four/./five/./../file.xsd"/> ("//three/four/file.xsd")
	          <f1 xml:base="//three/four/./five/./../file.xsd#bare-name"/>
	        
	        <g xml:base="//five/"/>
	     
	      <h xml:base="http://www.iaik.tugraz.at">
	        <i xml:base="/aboutus/people/index.php">
	          <j xml:base="lanz/index.php"/>
	        </i>
	      </h>
	   
	  </a> 
	
	--- Section 2.4 reworded ---
	
	

	2.4 Document Subsets


		Some applications require the ability to create a
physical representation for an XML document subset (other than the one
generated by default, which can be a proper subset of the document if
the comments are omitted). Implementations of XML canonicalization that
are based on XPath can provide this functionality with little additional
overhead by accepting a node-set as input rather than an octet stream.
The processing of an element node E MUST be modified slightly when an
XPath node-set is given as input and the element's parent (direct
ancestor) is omitted from the node-set. This is necessary because
omitted nodes SHALL NOT break the inheritance rules of inheritable
attributes defined in the xml namespace. 

	[Definition:] Simple inheritable attributes are attributes that
have a value that requires at most a simple redeclaration. This
redeclaration is done by supplying a new value in the child axis. The
redeclaration of a simple inheritable attribute A contained in an
element E is done by supplying a new value to an attribute with the same
name contained in a descendant element of E. Simple inheritable
attributes are xml:lang and xml:space.
	

		The method for processing the attribute axis of an
element E in the node-set is enhanced. All element nodes along E's
ancestor axis are examined for nearest occurrences of simple inheritable
attributes in the xml namespace, such as xml:lang and xml:space (whether
or not they are in the node-set). From this list of attributes, remove
any simple inheritable attributes that are in E's attribute axis
(whether or not they are in the node-set). Then, lexicographically merge
this attribute list with the nodes of E's attribute axis that are in the
node-set. The result of visiting the attribute axis is computed by
processing the attribute nodes in this merged attribute list. 
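
The merge described in this paragraph could be sketched as follows. This is only an illustration under stated assumptions: attributes are modelled as plain dictionaries keyed by qualified name, the function and parameter names are hypothetical, the caller is assumed to invoke it only for an element E whose parent is omitted, and sorting by qualified name stands in for the full canonical attribute ordering.

```python
SIMPLE_INHERITABLE = ("xml:lang", "xml:space")


def attributes_to_render(ancestor_attrs, own_attrs, own_in_nodeset):
    """Compute the attribute list rendered for E.

    ancestor_attrs: attribute dicts of E's ancestors, nearest ancestor first.
    own_attrs:      all attributes of E (whether or not in the node-set).
    own_in_nodeset: names of E's attributes that are in the node-set.
    """
    # Nearest occurrence of each simple inheritable attribute on the
    # ancestor axis, whether or not it is in the node-set.
    inherited = {}
    for name in SIMPLE_INHERITABLE:
        for attrs in ancestor_attrs:
            if name in attrs:
                inherited[name] = attrs[name]
                break

    # Remove those that E declares itself (whether or not in the node-set).
    for name in own_attrs:
        inherited.pop(name, None)

    # Merge with E's attributes that are in the node-set and sort.
    merged = dict(inherited)
    for name in own_in_nodeset:
        merged[name] = own_attrs[name]
    return sorted(merged.items())


ancestors = [
    {"id": "p"},                                     # parent (omitted)
    {"xml:lang": "de", "xml:space": "preserve"},     # grandparent
    {"xml:lang": "en"},                              # root
]
own = {"xml:space": "default", "class": "x"}
print(attributes_to_render(ancestors, own, {"class"}))
# [('class', 'x'), ('xml:lang', 'de')]
```

Here xml:lang="de" is inherited from the nearest ancestor that declares it, while the inherited xml:space is dropped because E carries its own xml:space attribute, even though that attribute is not in the node-set.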

	The xml:base attribute is not a simple inheritable attribute and
requires special processing beyond a simple redeclaration.  A "join URI"
function is used which takes any URI (uri-1) from an ancestor, appends
a relative URI of E (rel-uri-2) to it (in most cases after the last slash
of uri-1), and then normalizes the result. We describe here a simple
method for providing this function. This method uses a separate string
buffer in a manner similar to that found in section 5.2 of RFC 2396.
Please refer to this source for the terms and definitions used in the
following algorithm.
	

	1.	If the first URI (uri-1) is null, continue with step 2;
otherwise copy uri-1 to the buffer. In other words, all characters of
uri-1 are copied to the buffer.
		
	2.	If the relative URI (rel-uri-2) is null, continue with
step 5; otherwise, if rel-uri-2 starts with a '#' (hash) or
is the empty string "", continue with step 4; otherwise remove the last
segment of the first URI's (uri-1) path component from the buffer, i.e.
anything after the last (right-most) slash character, if any, is removed. 
	3.	If the relative URI (rel-uri-2) starts with a '/' (slash),
delete all "<segment>/" from the buffer, where <segment> is a complete
path segment. If rel-uri-2 starts with '//' (two slashes),
additionally delete the '//<authority>/'. 
		
	4.	The relative URI is appended to the buffer string. 
	5.	All occurrences of "./", where "." is a complete path
segment, are removed from the buffer string. 
	6.	If the buffer string ends with "." as a complete path
segment, that "." is removed. 
	7.	All occurrences of "<segment>/../", where <segment> is a
complete path segment not equal to "..", are removed from the buffer
string. Removal of these path segments is performed iteratively,
removing the leftmost matching pattern on each iteration, until no
matching pattern remains. 
	8.	If the buffer string ends with "<segment>/..", where
<segment> is a complete path segment not equal to "..", that
"<segment>/.." is removed. 
	9.	If the resulting buffer string begins with
"<scheme>://<authority>/" followed by one or more complete path segments
of "..", then the resulting URI is considered to be in error, and
implementations SHOULD indicate this error and fail. If, however, there
are no "<scheme>://<authority>/" components, the result is a relative
URI starting with a relative path. If processing continues,
implementations MUST handle this by retaining the leading ".." complete
path segments in the resulting path (i.e., treating them as part of the
final URI, e.g. ../../<segment>/<segment>)*. 

	This function may also be called with a null URI, i.e. when no
xml:base attribute exists in E (not to be confused with xml:base=""). 
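
To make the nine steps more concrete, here is one possible reading of them as a Python sketch. It is not a normative implementation. The names join_uri and _split are hypothetical, and several points are assumptions that the text above does not spell out: an existing fragment of uri-1 is dropped before rel-uri-2 is appended (which matches the <f xml:base=""/> example), a rel-uri-2 that carries its own scheme replaces the buffer, a uri-1 with an authority but an empty path is treated as having the path "/", and dot segments are resolved with a segment stack rather than by repeated string deletion.

```python
import re

_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")
_PARTS = re.compile(
    r"((?:[A-Za-z][A-Za-z0-9+.-]*:)?)"   # scheme, including the colon
    r"((?://[^/?#]*)?)"                  # authority, including the leading "//"
    r"([^?#]*)"                          # path
    r"(.*)",                             # tail: query and/or fragment
    re.S,
)


def _split(uri):
    """Return (scheme, authority, path, tail) for a URI or relative reference."""
    return _PARTS.fullmatch(uri).groups()


def join_uri(uri1, rel):
    """Join rel (rel-uri-2) onto uri1 (uri-1); None stands for a null URI."""
    if uri1 is None and rel is None:
        return None
    buf = uri1 or ""                                        # step 1
    if rel is not None:
        scheme, authority, path, tail = _split(buf)
        query = tail.split("#", 1)[0]                       # assumption: drop old fragment
        if _SCHEME.match(rel):
            buf = rel                                       # assumption: absolute rel wins
        elif rel == "" or rel.startswith("#"):
            buf = scheme + authority + path + query + rel   # step 2, then step 4
        else:
            path = path[: path.rfind("/") + 1]              # step 2: drop last segment
            if authority and not path:
                path = "/"                                  # assumption, see lead-in
            if rel.startswith("/"):                         # step 3: drop all segments
                path = ""
            if rel.startswith("//"):                        # step 3: drop authority too
                authority = ""
            buf = scheme + authority + path + rel           # step 4
    scheme, authority, path, tail = _split(buf)
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):                      # steps 5 to 8
        is_last = i == len(segments) - 1
        at_root = out == [""]                               # only the leading "/" so far
        if seg == ".":
            if is_last:
                out.append("")                              # keep the trailing slash
        elif seg == ".." and out and out[-1] != ".." and not at_root:
            out.pop()
            if is_last:
                out.append("")
        else:
            out.append(seg)                                 # includes unresolvable ".."
    if scheme and authority and out[:2] == ["", ".."]:      # step 9
        raise ValueError("'..' climbs above the root of an absolute URI")
    return scheme + authority + "/".join(out) + tail


base_b = "//three/four/./five/./../file.xsd"
print(join_uri(base_b, "a.file"))        # //three/four/a.file
print(join_uri(base_b, "#bare-name"))    # //three/four/file.xsd#bare-name
print(join_uri("one/two", "../../../x")) # ../../x
```

The first two results correspond to the normalized values given in parentheses in the examples earlier in this message; the third shows leading ".." segments being retained for a relative result, as step 9 requires.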
	

	The method for processing the attribute axis of an element E in
the node-set hence needs to be enhanced further.
	The element nodes along E's ancestor axis are examined for all
occurrences of omitted non-simple inheritable attributes in the xml
namespace (i.e. ones that are not in the node-set), such as xml:base, up to
but excluding their first rendered occurrence (i.e. one that is in the
node-set). Only if such attributes exist will E's xml:base attribute be
changed (i.e. added or fixed up).  The selected xml:base attributes are
joined by calling the "join URI" function described previously
iteratively, beginning with the two xml:base attributes closest to the
document root, until only the new value for E's xml:base attribute remains
(which may also be null). 
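
The ancestor walk and the iterative joining could look roughly like the following sketch. The data model (a list of (rendered, xml_base) pairs, nearest ancestor first) and the names fix_up_xml_base and toy_join are hypothetical. The "join URI" function is passed in as a parameter; toy_join is only a stand-in for this demonstration and ignores dot segments, leading slashes and fragments.

```python
from functools import reduce


def fix_up_xml_base(ancestors, own_base, join):
    """Return the xml:base value to render on E (None means no attribute).

    ancestors: (rendered, xml_base) pairs from E's parent up to the root;
               xml_base is None where an element has no xml:base attribute.
    own_base:  E's own xml:base value, or None.
    join:      the "join URI" function, join(uri_1, rel_uri_2).
    """
    omitted = []
    for rendered, base in ancestors:
        if base is None:
            continue
        if rendered:
            break                # first rendered occurrence, exclusive
        omitted.append(base)
    if not omitted:
        return own_base          # nothing was lost, so nothing to fix up
    omitted.reverse()            # start with the value closest to the root
    return reduce(join, omitted + [own_base])


def toy_join(base, rel):
    """Stand-in join: replace the last segment of base with rel."""
    if rel is None:
        return base
    return base[: base.rfind("/") + 1] + rel


# <c xml:base="a.file"> with its parent <b> omitted and <a> rendered.
ancestors_of_c = [(False, "//three/four/file.xsd"), (True, "one/two")]
print(fix_up_xml_base(ancestors_of_c, "a.file", toy_join))  # //three/four/a.file
```

If the nearest ancestor carrying an xml:base attribute is itself rendered, the list of omitted values is empty and E's attribute is left untouched.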
	

	Then, lexicographically merge this fixed up attribute with the
nodes of E's attribute axis that are in the node-set. The result of
visiting the attribute axis is computed by processing the attribute
nodes in this merged attribute list.



	best regards
	Konrad
	
	
	Richard Tobin wrote:   
	

		To fix up the xml:base attribute of an element E:
		
		If the base URI of the immediate container of E is known (and is
		therefore by definition absolute), determine the base URI of E
		according to xml:base.  Set the xml:base attribute to this value.
		
		If the base URI of E's container is not known (which can only be the
		case if the base URI of the document is unknown, and there is no
		ancestor element with an absolute xml:base attribute), proceed as
		follows:
		
		 - if there is no ancestor with an xml:base attribute, leave E's
		   xml:base attribute (if any) unchanged;
		
		 - if the nearest ancestor with an xml:base is not being omitted,
		   leave E's xml:base attribute (if any) unchanged;
		
		 - otherwise we must construct an xml:base attribute giving E's base
		   URI relative to the nearest non-omitted ancestor with an xml:base
		   attribute (call this ancestor A).  Find the xml:base attributes of
		   the omitted ancestor elements between A and E.  Take these in
		   outer-to-inner order, followed by E's xml:base attribute if it
		   has one.  This is a sequence of relative URIs.  Discard the last
		   segment - the characters after the last slash - of all but the last
		   of these.  If any of these URIs has no slash character, discard it
		   completely.  Concatenate the resulting strings, and use this as the
		   xml:base attribute of E.
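
The concatenation rule in the last bullet can be illustrated with a short sketch. It covers only that final case, assumes every input is a relative, path-only URI as the text says, and reads "discard it completely" as applying to the URIs whose last segment is being discarded. The function name relative_base is hypothetical.

```python
def relative_base(omitted_bases, own_base=None):
    """Build E's xml:base relative to the nearest non-omitted ancestor A.

    omitted_bases: xml:base values of the omitted ancestors between A and E,
                   in outer-to-inner order.
    own_base:      E's own xml:base value, or None if it has none.
    """
    uris = list(omitted_bases)
    if own_base is not None:
        uris.append(own_base)
    parts = []
    for i, uri in enumerate(uris):
        if i < len(uris) - 1:
            # Keep everything up to and including the last slash; a URI
            # without any slash becomes the empty string, i.e. is discarded.
            uri = uri[: uri.rfind("/") + 1]
        parts.append(uri)
    return "".join(parts)


print(relative_base(["a/b", "c", "d/e"], "f.xml"))  # a/d/f.xml
```

Unlike the "join URI" proposal above, this rule performs no dot-segment normalization and does not treat leading slashes or fragments specially.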
		
		-- Richard
		
		  



	-- 
	Konrad Lanz, IAIK/SIC - Graz University of Technology
	Inffeldgasse 16a, 8010 Graz, Austria
	Tel: +43 316 873 5547
	Fax: +43 316 873 5520
	https://www.iaik.tugraz.at/aboutus/people/lanz
	http://jce.iaik.tugraz.at
	
	Certificate chain (including the EuroPKI root certificate):
	https://europki.iaik.at/ca/europki-at/cert_download.htm
Received on Thursday, 18 May 2006 14:41:49 UTC