Implementation experience: Unparsed entity and notation information items with XInclude, DOM, SAX, and JDOM

As you know, I've written partial XInclude processors in DOM, JDOM, and SAX. All three of these APIs have major problems handling unparsed entity and notation information items that come in from included, parsed documents.

JDOM does not expose unparsed entities or notations at all. The information is simply not available.

DOM2 informs you of the notations and unparsed entities declared in a given document.  However, it provides no means for associating them with particular attributes, processing instructions or elements. That is, it does not tell you the type of any attribute. DOM3 may allow you to hack this together, but only with the assistance of the abstract schemas module. The DOM3 core still does not include attribute type information. (I should probably suggest adding this to the DOM3 group too.)

SAX is the only major API that does include enough information to theoretically tell which attributes refer to notations and unparsed entities. However, SAX is designed as a streaming API. Including notations and unparsed entities requires modifying the DOCTYPE declaration and/or DTD which would normally be done at the start of processing. However, the complete list of notations and unparsed entities referenced is not available until the end of the document.

I noticed this first in the context of notations and unparsed entities in the document information item. However, it really applies type information item.

I am not sure how to handle this. I would like to at least consider the possibility of not requiring implementations to maintain notation and unparsed entities. This may in fact be what section 5.3 is trying to say; i.e. I can ignore anything that's not in 5.3. However, that's not totally clear. For one thing, taking that interpretation would further imply that the input infoset could ignore character information items, though the output set could not. 

If this is what 5.3 is trying to say, then I think some discussion much earlier in the spec of how an input XML document is converted to an XML infoset is called for. In particular, an explicit statement that other kinds of info items can be dropped out would be helpful. 
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+ 
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      | 
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+

Received on Sunday, 2 September 2001 11:15:09 UTC