- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Wed, 17 Oct 2007 09:04:03 -0700
- To: public-sml@w3.org
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>

A while ago, I took an action to outline possible solutions to
bug 4687 "Handling of DTDs when composing an IF document". This is
my attempt to discharge that action.

--Michael

EXECUTIVE SUMMARY

We can say

  A  If a document to be embedded inline in an SML model has a
     DTD, it should first be made a standalone document that does
     not depend upon a DTD and then embedded as described.

or

  B  If a document to be embedded inline in an SML model has a
     DTD, the relevant parts of its DTD should be copied into the
     DTD of the SML-IF package, with name-mangling as required.

or

  C  If a document to be embedded inline in an SML model has a
     DTD, it may be embedded inline in base64Binary or hexBinary
     form.

THE PROBLEM

Section 3.3.1 Embedded Documents now reads:

    If a document is to be embedded in the SML-IF document, the
    octet stream representing it MUST first be processed as
    follows:

    - The XML declaration and document type declaration (DTD)
      are removed.
    - The stream is converted to the encoding of the SML-IF
      document into which it will be packaged.

    Note: If the SML-IF document uses UTF-8 encoding, the
    octet-stream result of XML Canonicalization [Canonical XML]
    is more than sufficient to accomplish this processing.

    The resulting octet stream MUST be embedded as the content
    of the data child of the corresponding document element.

This is not guaranteed to produce well-formed or correct output.

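The failure is easy to reproduce. What follows is only a sketch in Python, assuming the standard library's expat-based ElementTree parser (which reads the internal DTD subset) and a deliberately naive regular expression standing in for "remove the XML declaration and document type declaration". The names strip_prolog and describe are hypothetical and come from no SML tooling; the two sample documents are the same two used in the rest of this note.

```python
import re
import xml.etree.ElementTree as ET

# The two sample documents, each with an internal DTD subset.
DOC_A = "<!DOCTYPE a [ <!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'> ] ><a/>"
DOC_B = "<!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>"

# Naive stand-in for the rule in section 3.3.1.
# Assumption: no ']' followed by '>' occurs inside the internal subset.
_PROLOG = re.compile(
    r"^\s*(<\?xml[^>]*\?>)?\s*(<!DOCTYPE[^\[>]*(\[.*?\]\s*)?>)?\s*", re.S)


def strip_prolog(text):
    """Drop a leading XML declaration and document type declaration."""
    return _PROLOG.sub("", text, count=1)


def describe(text):
    """Return (tag, attributes, text) of the root, or a marker on failure."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    return root.tag, dict(root.attrib), root.text


for doc in (DOC_A, DOC_B):
    print(describe(doc), "->", describe(strip_prolog(doc)))
```

With the parser assumed here, the 'a' document silently loses its defaulted type attribute, and the 'b' document no longer parses at all once its declaration is gone.
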
For example, consider the following XML documents:

  <!DOCTYPE a [ <!ELEMENT a ANY><!ATTLIST a type CDATA 'ordered'> ] ><a/>

  <!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>

If we follow the rule given in section 3.3.1, the SML-IF package
will end up with something like the following form:

  <?xml version="1.0" encoding="UTF-8"?>
  <model xml:base="http://www.university.example.org/sml/models/"
    xmlns="http://www.w3.org/2007/09/sml-if"
    xmlns:sml="http://www.w3.org/2007/09/sml"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="1.0">
    ...
    <instances>
      <document>
        <docInfo>
          <aliases>
            <alias>http://example.org/a.xml</alias>
          </aliases>
        </docInfo>
        <data>
          <a/>
        </data>
      </document>
      <document>
        <docInfo>
          <aliases>
            <alias>http://example.org/b.xml</alias>
          </aliases>
        </docInfo>
        <data>
          <b>&c;</b>
        </data>
      </document>
    </instances>
  </model>

This isn't quite what one wants: the 'a' document really should
have the defaulted attribute specification type="ordered", and the
'b' element really should have some content.

We need to deal with this, even though SML says that it uses XSDL
and Schematron as schema languages, not DTDs, because there is no
requirement that every document in a model be governed by any
schema, and because in fact both XSDL and Schematron are designed
to be usable both independently of DTDs and together with DTDs.

Several possible solutions are probably worth considering.

SOLUTION 1: STAND-ALONE DOCUMENTS

We can make the removal of the DTD be (relatively) harmless by
specifying that before the document type declaration is removed,
the document must be transformed into an equivalent standalone
document with no internal DTD subset.

For our purposes, a standalone document is one for which all of
the following are true:

  - No elements for which attributes are declared with default
    values appear in the document instance without value
    specifications for those attributes.
  - No entity references occur in the document, other than for
    the pre-defined entities amp, lt, gt, apos, quot.
  - No attributes declared with a tokenized type appear in the
    document with a value such that normalization will change the
    attribute's value.
  - No element declared in the DTD as having element content has
    any white space characters contained directly within any
    instance of that element type.

Informally, this is equivalent to creating a document for which
the XML declaration <?xml ... standalone='yes' ?> is appropriate
when the internal DTD subset is removed and inserted at the
beginning of the external subset.

(Note that if the document is already a standalone document in
this sense, no changes need to be made to the instance before
removing the DTD.)

For this rule, the example given above turns into:

  <document>
    <docInfo>
      <aliases>
        <alias>http://example.org/a.xml</alias>
      </aliases>
    </docInfo>
    <data>
      <a type="ordered"/>
    </data>
  </document>
  <document>
    <docInfo>
      <aliases>
        <alias>http://example.org/b.xml</alias>
      </aliases>
    </docInfo>
    <data>
      <b>Aha!</b>
    </data>
  </document>

SOLUTION 2: MERGING DTDS

We can merge the DTDs of the two documents. In the simple case,
the names declared in the different DTDs don't conflict, so a
simple copy of the relevant DTDs suffices.

For this rule, the example given above turns into:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE model [
    <!ELEMENT a ANY>
    <!ATTLIST a type CDATA 'ordered'>
    <!ELEMENT b ANY>
    <!ENTITY c 'Aha!'>
  ]>
  <model xml:base="http://www.university.example.org/sml/models/"
    ...
  </model>

SOLUTION 2': MERGING DTDS WITH NAME MANGLING

In some cases, solution 2 will lead to name conflicts for
declarations in the DTDs. To merge them reliably, it's necessary
either to ascertain that there are no name conflicts, or to mangle
the names.

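Ascertaining that there are no conflicts can be done mechanically. A minimal sketch in Python, assuming each embedded document carries its declarations in the internal subset, and using the declaration handlers of the standard library's expat binding. The helper names declared_names and conflicts are hypothetical. Attribute-list declarations are left out, on the assumption that they only clash once their element type names clash.

```python
import xml.parsers.expat


def declared_names(document):
    """Collect (kind, name) pairs declared in a document's internal subset."""
    names = set()
    parser = xml.parsers.expat.ParserCreate()

    def on_element(name, model):
        names.add(("element", name))

    def on_entity(name, is_parameter, value, base,
                  system_id, public_id, notation):
        # General and parameter entities live in separate name spaces.
        names.add(("parameter entity" if is_parameter else "entity", name))

    def on_notation(name, base, system_id, public_id):
        names.add(("notation", name))

    parser.ElementDeclHandler = on_element
    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.Parse(document, True)
    return names


def conflicts(documents):
    """Map each name declared by more than one document to its aliases."""
    seen = {}
    for alias, text in documents.items():
        for key in declared_names(text):
            seen.setdefault(key, []).append(alias)
    return {key: aliases for key, aliases in seen.items() if len(aliases) > 1}


DOCS = {
    "a.xml": "<!DOCTYPE a [ <!ELEMENT a ANY>"
             "<!ATTLIST a type CDATA 'ordered'> ] ><a/>",
    "b.xml": "<!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>",
}
print(conflicts(DOCS))
```

An empty result means a plain copy of the declarations (solution 2) is safe for that set of documents; anything else calls for mangling.
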
If we mechanically insert a unique prefix to every name in the DTD
which must be unique, and make the appropriate changes in the
instances, we can produce an SML-IF package with embedded
documents not identical to, but isomorphic to, the originals. The
names can then be unmangled by the consumer, if desired (it will
be).

(Extension of this algorithm to handle qualified names is left as
an exercise for the reader.)

Using the prefixes p1. and p2. for documents a.xml and b.xml, the
SML-IF package would look something like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE model [
    <!ELEMENT p1.a ANY>
    <!ATTLIST p1.a type CDATA 'ordered'>
    <!ELEMENT p2.b ANY>
    <!ENTITY p2.c 'Aha!'>
  ]>
  <model xml:base="http://www.university.example.org/sml/models/"
    xmlns="http://www.w3.org/2007/09/sml-if"
    xmlns:sml="http://www.w3.org/2007/09/sml"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="1.0">
    ...
    <instances>
      <document>
        <docInfo>
          <aliases>
            <alias>http://example.org/a.xml</alias>
          </aliases>
        </docInfo>
        <data>
          <p1.a/>
        </data>
      </document>
      <document>
        <docInfo>
          <aliases>
            <alias>http://example.org/b.xml</alias>
          </aliases>
        </docInfo>
        <data>
          <p2.b>&p2.c;</p2.b>
        </data>
      </document>
    </instances>
  </model>

SOLUTION 3: TUNNELING THROUGH BASE64BINARY OR HEXBINARY

Non-standalone documents can also be transmitted as embedded
documents without loss of information by tunneling them through
one of the two binary datatypes of XSDL.

If, for example, we embed the instances in base64 encoding, the
package would take a form like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <model xml:base="http://www.university.example.org/sml/models/"
    xmlns="http://www.w3.org/2007/09/sml-if"
    xmlns:sml="http://www.w3.org/2007/09/sml"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="1.0">
    ...
    <instances>
      <document>
        <docInfo>
          <aliases>
            <alias>http://example.org/a.xml</alias>
          </aliases>
        </docInfo>
        <data xsi:type="xs:base64Binary">
          XDxc IURP Q1RZ UEUg YSBc WyBc PFwh RUxF TUVO VCBh IEFO
          WVw+ XDxc IUFU VExJ U1Qg YSB0 eXBl IENE QVRB ICdv cmRl
          cmVk J1w+ IFxd Plw8 YS9c Pg==
        </data>
      </document>
      <document>
        <docInfo>
          <aliases>
            <alias>http://example.org/b.xml</alias>
          </aliases>
        </docInfo>
        <data xsi:type="xs:base64Binary">
          PFwh RE9D VFlQ RSBi IFsg PFwh RUxF TUVO VCBi IEFO WT48
          XCFF TlRJ VFkg YyAn QWhh ISc+ IF0+ PGI+ JmM7 PC9i Pg==
        </data>
      </document>
    </instances>
  </model>
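
The tunneling step itself is small. A rough sketch in Python, using only the standard base64 module; the function names tunnel and untunnel are hypothetical, and the consumer side assumes (on my reading of the xs:base64Binary lexical rules) that white space between characters may simply be discarded before decoding.

```python
import base64


def tunnel(document, width=64):
    """Encode a document's octets as base64 text, broken into short lines."""
    encoded = base64.b64encode(document).decode("ascii")
    return "\n".join(encoded[i:i + width]
                     for i in range(0, len(encoded), width))


def untunnel(data_content):
    """Recover the original octets from the content of a data element."""
    # Remove all white space (line breaks, indentation, grouping spaces).
    return base64.b64decode("".join(data_content.split()))


original = b"<!DOCTYPE b [ <!ELEMENT b ANY><!ENTITY c 'Aha!'> ]><b>&c;</b>"
assert untunnel(tunnel(original)) == original
```

Because the octets travel unchanged, the XML declaration, the document type declaration, and the original encoding all survive; the cost is that the embedded document is opaque to anything processing the package as XML.
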
Received on Wednesday, 17 October 2007 16:03:40 UTC