proposal for bug 4687 on handling DTDs when composing SML-IF documents

A while ago, I took an action to outline possible solutions to bug
4687 "Handling of DTDs when composing an IF document".  This is my
attempt to discharge that action.

--Michael


EXECUTIVE SUMMARY

We can say

   A If a document to be embedded inline in an SML model has a DTD,
     it should first be made a standalone document that does not
     depend upon a DTD and then embedded as described.

or

   B If a document to be embedded inline in an SML model has a DTD,
     the relevant parts of its DTD should be copied into the DTD of
     the SML-IF package, with name-mangling as required.

or

   C If a document to be embedded inline in an SML model has a DTD,
     it may be embedded inline in base64Binary or hexBinary form.



THE PROBLEM

Section 3.3.1 Embedded Documents now reads:

     If a document is to be embedded in the SML-IF document, the octet
     stream representing it MUST first be processed as follows:

         - The XML declaration and document type declaration (DTD)
           are removed.

         - The stream is converted to the encoding of the SML-IF
           document into which it will be packaged.

         Note:

         If the SML-IF document uses UTF-8 encoding, the octet-stream
         result of XML Canonicalization [Canonical XML] is more than
         sufficient to accomplish this processing.

     The resulting octet stream MUST be embedded as the content of the
     data child of the corresponding document element.

This is not guaranteed to produce well-formed or correct output.

For example, consider the following XML documents:

     <!DOCTYPE a [ <!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'> ]><a/>

     <!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>

If we follow the rule given in section 3.3.1, the SML-IF package will
end up with something like the following form:

     <?xml version="1.0" encoding="UTF-8"?>
     <model xml:base="http://www.university.example.org/sml/models/"
           xmlns="http://www.w3.org/2007/09/sml-if"
           xmlns:sml="http://www.w3.org/2007/09/sml"
           xmlns:xml="http://www.w3.org/XML/1998/namespace"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           version="1.0">
       ...

       <instances>
         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/a.xml</alias>
             </aliases>
           </docInfo>
           <data>
             <a/>
           </data>
         </document>
         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/b.xml</alias>
             </aliases>
           </docInfo>
           <data>
             <b>&c;</b>
           </data>
         </document>
       </instances>
     </model>

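This isn't quite what one wants: the 'a' element really should have
the defaulted attribute specification type="ordered", and the 'b'
element really should have some content.  Worse, the reference &c;
no longer has any declaration, so the package as a whole is not
well-formed.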
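
The effect can be reproduced with a few lines of Python.  This is only
a sketch: strip_prolog and embed are hypothetical helpers of mine, the
regular expressions are a naive stand-in for the 3.3.1 processing, and
the bare <data> wrapper stands in for the real SML-IF markup.

```python
import re
import xml.etree.ElementTree as ET

DOC_A = "<!DOCTYPE a [ <!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'> ]><a/>"
DOC_B = "<!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>"

def strip_prolog(text):
    """Naively drop the XML declaration and the document type declaration."""
    text = re.sub(r"^\s*<\?xml\b.*?\?>", "", text, flags=re.DOTALL)
    text = re.sub(r"<!DOCTYPE[^\[>]*(\[.*?\])?\s*>", "", text,
                  count=1, flags=re.DOTALL)
    return text.strip()

def embed(text):
    """Wrap the stripped document in a stand-in for the data element."""
    return "<data>" + strip_prolog(text) + "</data>"

# Document a still parses, but the defaulted attribute has disappeared.
print(ET.fromstring(embed(DOC_A))[0].attrib)

# Document b no longer parses: the entity c has lost its declaration.
try:
    ET.fromstring(embed(DOC_B))
except ET.ParseError as err:
    print("not well-formed:", err)
```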

We need to deal with this, even though SML says that it uses XSDL and
Schematron as schema languages, not DTDs, because there is no
requirement that every document in a model be governed by any schema,
and because in fact both XSDL and Schematron are designed to be usable
both independently of DTDs and together with DTDs.

Several possible solutions are probably worth considering.


SOLUTION 1:  STAND-ALONE DOCUMENTS

We can make the removal of the DTD be (relatively) harmless by
specifying that before the document type declaration is removed, the
document must be transformed into an equivalent standalone document
with no internal DTD subset.

For our purposes, a standalone document is one for which all of the
following are true:

   - No elements for which attributes are declared with default values
     appear in the document instance without value specifications for
     those attributes.

   - No entity references occur in the document, other than for the
     pre-defined entities amp, lt, gt, apos, quot.

   - No attributes declared with a tokenized type appear in the document
     with a value such that normalization will change the attribute's
     value.

   - No element declared in the DTD as having element content has any
     white space characters contained directly within any instance of
     that element type.

Informally, this is equivalent to creating a document for which the
XML declaration <?xml ... standalone='yes' ?> is appropriate when the
internal DTD subset is removed and inserted at the beginning of the
external subset.

(Note that if the document is already a standalone document in this
sense, no changes need to be made to the instance before removing the
DTD.)

For this rule, the example given above turns into:

         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/a.xml</alias>
             </aliases>
           </docInfo>
           <data>
             <a type="ordered"/>
           </data>
         </document>
         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/b.xml</alias>
             </aliases>
           </docInfo>
           <data>
             <b>Aha!</b>
           </data>
         </document>
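
One way a producer might perform the transformation, sketched in
Python.  The function name to_standalone is mine, not anything in the
spec.  The sketch relies on the parser behind xml.etree.ElementTree
applying attribute defaults and expanding entities that are declared
in the internal subset; it does not read an external subset, and it
makes no attempt at the white-space condition, so it addresses only
the first two conditions listed above.

```python
import xml.etree.ElementTree as ET

DOC_A = "<!DOCTYPE a [ <!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'> ]><a/>"
DOC_B = "<!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>"

def to_standalone(text):
    """Parse with the DTD in place, then serialize the instance alone.

    Parsing applies the defaults and expands the entity references;
    serializing the resulting tree emits no document type declaration.
    """
    root = ET.fromstring(text)
    return ET.tostring(root, encoding="unicode")

for doc in (DOC_A, DOC_B):
    print(to_standalone(doc))
```

The output can then be embedded under the existing 3.3.1 rule, since
nothing in it depends on the DTD any longer.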


SOLUTION 2:  MERGING DTDS

We can merge the DTDs of the two documents.  In the simple case, the
names declared in the different DTDs don't conflict, so a simple
copy of the relevant DTDs suffices.

For this rule, the example given above turns into:

     <?xml version="1.0" encoding="UTF-8"?>
     <!DOCTYPE model [
       <!ELEMENT a ANY>
       <!ATTLIST a type CDATA 'ordered'>
       <!ELEMENT b ANY>
       <!ENTITY c 'Aha!'>
     ]>
     <model xml:base="http://www.university.example.org/sml/models/"
       ...
     </model>
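
A rough Python sketch of the simple case, which refuses to merge when
a name is declared in more than one DTD.  The names declared_names and
merge_subsets are hypothetical, and the regular expression is an
assumption that covers only ELEMENT and general ENTITY declarations
(no parameter entities, notations, or external subsets).

```python
import re
from collections import Counter

# Matches the keyword and declared name of ELEMENT and general ENTITY
# declarations; parameter entities ("<!ENTITY % ...") are not matched.
DECL = re.compile(r"<!(ELEMENT|ENTITY)\s+([A-Za-z_][\w.\-]*)")

SUBSET_A = "<!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'>"
SUBSET_B = "<!ELEMENT b ANY><!ENTITY c 'Aha!'>"

def declared_names(subset):
    """Return the set of (kind, name) pairs declared in one internal subset."""
    return set(DECL.findall(subset))

def merge_subsets(subsets):
    """Concatenate the subsets, or raise ValueError on any shared name."""
    seen = Counter()
    for subset in subsets:
        seen.update(declared_names(subset))
    clashes = sorted(key for key, count in seen.items() if count > 1)
    if clashes:
        raise ValueError("conflicting declarations: %r" % clashes)
    return "\n".join(subset.strip() for subset in subsets)

print("<!DOCTYPE model [\n" + merge_subsets([SUBSET_A, SUBSET_B]) + "\n]>")
```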


SOLUTION 2':  MERGING DTDS WITH NAME MANGLING

In some cases, solution 2 will lead to name conflicts for declarations
in the DTDs.  To merge them reliably, it's necessary either to
ascertain that there are no name conflicts, or to mangle the names.
If we mechanically insert a unique prefix to every name in the DTD
which must be unique, and make the appropriate changes in the
instances, we can produce an SML-IF package with embedded documents
not identical to, but isomorphic to, the originals.  The names can
then be unmangled by the consumer, if desired (it will be).

(Extension of this algorithm to handle qualified names is left as an
exercise for the reader.)

Using the prefixes p1. and p2. for documents a.xml and b.xml, the
SML-IF package would look something like this:

     <?xml version="1.0" encoding="UTF-8"?>
     <!DOCTYPE model [
       <!ELEMENT p1.a ANY>
       <!ATTLIST p1.a type CDATA 'ordered'>
       <!ELEMENT p2.b ANY>
       <!ENTITY p2.c 'Aha!'>
     ]>
     <model xml:base="http://www.university.example.org/sml/models/"
           xmlns="http://www.w3.org/2007/09/sml-if"
           xmlns:sml="http://www.w3.org/2007/09/sml"
           xmlns:xml="http://www.w3.org/XML/1998/namespace"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           version="1.0">
       ...

       <instances>
         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/a.xml</alias>
             </aliases>
           </docInfo>
           <data>
             <p1.a/>
           </data>
         </document>
         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/b.xml</alias>
             </aliases>
           </docInfo>
           <data>
             <p2.b>&p2.c;</p2.b>
           </data>
         </document>
       </instances>
     </model>
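
The mangling step might be sketched in Python roughly as follows.
Everything here is an assumption made for illustration: the function
mangle is hypothetical, it works by regular expression rather than by
real parsing, and it handles only unqualified names, internal subsets,
and content models (such as ANY) that mention no element names.
Comments, CDATA sections, and parameter entities are ignored.

```python
import re
import xml.etree.ElementTree as ET

PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
NAME = r"[A-Za-z_][\w.\-]*"

DOC_A = "<!DOCTYPE a [ <!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'> ]><a/>"
DOC_B = "<!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>"

def mangle(document, prefix):
    """Return (mangled internal subset, mangled instance) for one document."""
    m = re.fullmatch(r"\s*<!DOCTYPE\s+" + NAME + r"\s*\[(.*?)\]\s*>(.*)",
                     document, re.DOTALL)
    if m is None:
        raise ValueError("expected a document with an internal DTD subset")
    subset, instance = m.group(1), m.group(2)
    # Prefix the name declared by each ELEMENT, ATTLIST, or ENTITY declaration.
    subset = re.sub(r"(<!(?:ELEMENT|ATTLIST|ENTITY)\s+)(" + NAME + ")",
                    lambda d: d.group(1) + prefix + d.group(2), subset)
    # Prefix element names in start tags, end tags, and empty-element tags.
    instance = re.sub(r"(</?)(" + NAME + ")",
                      lambda t: t.group(1) + prefix + t.group(2), instance)
    # Prefix entity references, leaving the five predefined entities alone.
    def fix_reference(r):
        name = r.group(1)
        return r.group(0) if name in PREDEFINED else "&" + prefix + name + ";"
    instance = re.sub(r"&(" + NAME + ");", fix_reference, instance)
    return subset.strip(), instance.strip()

subset_a, instance_a = mangle(DOC_A, "p1.")
subset_b, instance_b = mangle(DOC_B, "p2.")
package = ("<!DOCTYPE model [" + subset_a + subset_b + "]>"
           "<model><data>" + instance_a + "</data>"
           "<data>" + instance_b + "</data></model>")
root = ET.fromstring(package)   # parses, with both DTDs' declarations in force
print(package)
```

Unmangling on the consumer side is the same substitution in reverse,
provided the consumer knows which prefix belongs to which document.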


SOLUTION 3: TUNNELING THROUGH BASE64BINARY OR HEXBINARY

Non-standalone documents can also be transmitted as embedded documents
without loss of information by tunneling them through one of the two
binary datatypes of XSDL.  If, for example, we embed the instances in
base64 encoding, the package would take a form like this:

     <?xml version="1.0" encoding="UTF-8"?>
     <model xml:base="http://www.university.example.org/sml/models/"
           xmlns="http://www.w3.org/2007/09/sml-if"
           xmlns:sml="http://www.w3.org/2007/09/sml"
           xmlns:xml="http://www.w3.org/XML/1998/namespace"
           xmlns:xs ="http://www.w3.org/2001/XMLSchema"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           version="1.0">
       ...

       <instances>
         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/a.xml</alias>
             </aliases>
           </docInfo>
           <data xsi:type="xs:base64Binary">
             PCFE T0NU WVBF IGEg WyA8 IUVM RU1F TlQg YSBB Tlk+ PCFB
             VFRM SVNU IGEg dHlw ZSBD REFU QSAn b3Jk ZXJl ZCc+ IF0+
             PGEv Pg==
           </data>
         </document>
         <document>
           <docInfo>
             <aliases>
               <alias>http://example.org/b.xml</alias>
             </aliases>
           </docInfo>
           <data xsi:type="xs:base64Binary">
             PCFE T0NU WVBF IGIg WyA8 IUVM RU1F TlQg YiBB Tlk+ PCFF
             TlRJ VFkg YyAn QWhh ISc+ IF0+ PGI+ JmM7 PC9i Pg==
           </data>
         </document>
       </instances>
     </model>
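
For what it's worth, the round trip is short in Python.  The helper
names tunnel and untunnel are mine, and the grouping into blocks of
four characters, eleven to a line, is just a layout choice for
readability, not a requirement of base64Binary.

```python
import base64
import xml.etree.ElementTree as ET

DOC_A = b"<!DOCTYPE a [ <!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'> ]><a/>"

def tunnel(octets, per_line=11):
    """Base64-encode the octets, in space-separated groups of four characters."""
    b64 = base64.b64encode(octets).decode("ascii")
    groups = [b64[i:i + 4] for i in range(0, len(b64), 4)]
    lines = [" ".join(groups[i:i + per_line])
             for i in range(0, len(groups), per_line)]
    return "\n".join(lines)

def untunnel(text):
    """Recover the original octets, ignoring white space in the encoded text."""
    return base64.b64decode("".join(text.split()))

payload = tunnel(DOC_A)
print(payload)

# The consumer gets the original octets back, DTD and all, so the
# attribute default is applied when the document is finally parsed.
restored = untunnel(payload)
print(ET.fromstring(restored).attrib)
```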

Received on Wednesday, 17 October 2007 16:03:40 UTC