Re: The DOM is not a model, it is a library! from Stephen R. Savitzky on 1999-10-06 (www-dom@w3.org from October to December 1999)

From: Stephen R. Savitzky <steve@rsv.ricoh.com>
Date: 06 Oct 1999 13:43:17 -0700
To: keshlam@us.ibm.com
Cc: www-dom@w3.org
Message-ID: <qcr9j8qkyi.fsf@congo.crc.ricoh.com>

keshlam@us.ibm.com writes:

> >A conforming DOM implementation will render this as &lt;steve@rsv.ricoh.com>,
> >defeating my intentions.
> 
> I presume you meant  <steve@rsv.ricoh.com> ... 

I do not -- I'm referring to what will be seen in the output after the nodes
are converted back into a string or written to a file, after the DOM has
``helpfully'' removed the distinction between &gt; and >.  

> character entities are converted to characters as the document is read in.

Yep, that's the problem. 
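
As a minimal sketch of that flattening, assuming Python's xml.dom.minidom
as a stand-in for a DOM implementation (it is not one of the parsers
discussed in this thread, and the helper name text_of is hypothetical):

```python
from xml.dom.minidom import parseString

def text_of(doc):
    """Concatenate the character data of the root element's text children."""
    return "".join(node.data for node in doc.documentElement.childNodes)

# Two sources that differ only in how ">" is spelled.
escaped = parseString("<p>x &gt; y</p>")
literal = parseString("<p>x > y</p>")

# Once parsed, both trees hold identical character data...
same_data = text_of(escaped) == text_of(literal)

# ...so the serializer has nothing left with which to tell them apart.
same_output = escaped.documentElement.toxml() == literal.documentElement.toxml()

print(same_data, same_output)
```

Whichever spelling the serializer picks, at least one of the two sources
cannot come back out the way it went in.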

>     Even if you define your own entities, the DOM has allowed the parser to
> flatten entity references, so some parsers may expand those too before the DOM
> has any say in the matter.

Most parsers do, and that's a problem for me. 

>   If you want to preserve the distinction between &lt; and the <
> character, you have to process your document as something other than XML.

The XML specification, like the DOM, like SGML, only covers the behavior of
an application that _reads_ a document, converts it into a parse tree, and
does something with the tree, such as rendering it on a screen.  To such an
application, only the data matters, and the specifications are very clear
about what does and does not matter to such an application.

None of them says anything about an application that _creates_ or
_transforms_ a document.  Presumably, to such an application the output
format _does_ matter; distinctions between &gt; and > may indeed be very
significant in such cases.  

In particular, I expect an application that transforms documents to be
capable of implementing the null transform.  Putting it another way, I
expect "diff" applied to the input and output documents to show me exactly
what I told the application to change, and nothing else. 
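
The "diff" test can be sketched roughly as follows, again assuming
Python's xml.dom.minidom as an illustrative DOM and a hypothetical
null_transform helper that parses and re-serializes without touching
anything:

```python
import difflib
from xml.dom.minidom import parseString

def null_transform(source: str) -> str:
    """Parse and immediately re-serialize, deliberately changing nothing."""
    return parseString(source).documentElement.toxml()

# A literal ">" and a numeric character reference, both legal XML.
source = "<p>a > b &#169; c</p>"
output = null_transform(source)

# A true null transform would leave this list empty.
changes = list(difflib.unified_diff([source], [output], lineterm=""))

for line in changes:
    print(line)
```

The two strings describe the same data, yet the diff is not empty: the
literal ">" and the character reference are both respelled on the way out.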

I don't think that's so very unreasonable, do you?  But it means that I
can't use the DOM and that the application, even though the input and output
documents may be XML, is not an XML application!

> I don't know if HTML entity expansion follows similar rules; I'm not an
> authority on HTML4 by any means. If it does, there's nothing the DOM can do
> about it, and you simply have a proposed solution that doesn't work in today's
> web tools. 

HTML, and even SGML, do not make the distinction.  Browsers do not make the
distinction.  But my text editor does. 

> If they _do_ allow late resolution of these, there's no reason the
> parser couldn't create EntityReference nodes in the DOM for LT and GT. Whether
> your parser will agree to do that for you or not is outside the DOM's scope.

My parser does, of course, but the DOM specification requires it not to.
 
> >The DOM has no way to represent the fact that the end tags have been
> >omitted.
> 
> That's true. The solution I'd recommend for this, if it's important to you, is
> to (a) use an output routine that understands how to omit end-tags, and (b)
> extend the DOM by adding a custom flag to Elements which indicates whether in
> this case you'd like the end-tag omitted (if possible) or not.
> 
> A similar flag could be used to distinguish "empty by accident" and "empty by
> intention".

... all of which I've done, of course.  As you and others have pointed out,
it's quite easy to create something that extends the DOM interfaces and
leaves out parts of the required semantics.
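
A minimal sketch of the flag-plus-output-routine idea, using a toy tree
rather than any real DOM binding.  The names Elem, omit_end_tag and
serialize are hypothetical and are not taken from the DOM specification
or from my own code:

```python
from dataclasses import dataclass, field

@dataclass
class Elem:
    tag: str
    children: list = field(default_factory=list)  # each child is a str or an Elem
    omit_end_tag: bool = False  # hypothetical non-DOM flag recorded by the parser

def serialize(node) -> str:
    """Write the tree back out, honoring the omit_end_tag flag."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize(child) for child in node.children)
    end = "" if node.omit_end_tag else f"</{node.tag}>"
    return f"<{node.tag}>{inner}{end}"

# An HTML-style list whose source left off the </li> tags.
doc = Elem("ul", [
    Elem("li", ["one"], omit_end_tag=True),
    Elem("li", ["two"], omit_end_tag=True),
])
print(serialize(doc))
```

The flag lives outside the DOM interfaces, so only an output routine that
knows about it will reproduce the original markup.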

> The distinction here is that you're extending your model implementation,
> but not changing the behavior of its DOM API.

On the contrary, I _am_ changing the behavior, in many places.  Representing
character entities as EntityReference nodes is only one change among many;
another is that my parser is a unidirectional TreeWalker that returns
totally unlinked (and un-owned) Nodes, which my application then puts into
a tree or not, as appropriate. 
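
Both departures can be sketched in a few lines.  This is an illustration
under assumptions, not my actual parser: TextNode, EntityRef, walk and
serialize are hypothetical names, and the tokenizer handles only character
data and entity references, not tags:

```python
import re
from dataclasses import dataclass

@dataclass
class TextNode:
    data: str

@dataclass
class EntityRef:
    name: str  # e.g. "gt" or "#169"; kept unexpanded

_ENTITY = re.compile(r"&(#?\w+);")

def walk(source: str):
    """Yield unlinked nodes in document order, in a single forward pass."""
    pos = 0
    for match in _ENTITY.finditer(source):
        if match.start() > pos:
            yield TextNode(source[pos:match.start()])
        yield EntityRef(match.group(1))
        pos = match.end()
    if pos < len(source):
        yield TextNode(source[pos:])

def serialize(nodes) -> str:
    """Re-emit each node exactly as it was spelled in the source."""
    return "".join(
        n.data if isinstance(n, TextNode) else f"&{n.name};" for n in nodes
    )

source = "a &gt; b > c &#169;"
print(serialize(walk(source)) == source)
```

Because the references are never expanded, the caller decides whether to
link the nodes into a tree, and the null transform falls out for free.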

> I'm sure that if we work hard enough we can come up with an example that
> really does absolutely require alterations to the DOM API's behavior, but
> I think a lot of these problems can be solved via either clever
> application of the existing DOM or non-DOM behaviors attached to DOM
> nodes.

I think implementing the null transform on documents requires _both_
extensions and non-DOM behavior.  Sure, you could have the parser map
everything in the document into some horrendous combination of elements that
the output routine can then re-assemble, <element tagname="em">
<entity expand-on-output="yes">amp</entity> I think that's going way too
far</element>, don't you?

> >But don't expect _every_ application to find it a good match.
> 
> Speaking only for myself, that's not my expectation. I do expect it to be
> a good match for a wide range of applications, and an adequate match for
> others bolstered by the benefits of modularity and code reuse. 

The trap is that it looks superficially like a good match for any
application where a parse tree is needed, and it's not.  My guess is that,
whenever you have to do something that the DOM doesn't support, it's better
to scrap the DOM than to extend or modify an existing DOM implementation.

It's not that it isn't _easier_ to take somebody's DOM (even your own) and
hack on it.  It's just that at the end of the day after doing that you have
something that quacks like a duck, waddles like a duck, and has poisoned
ankle spurs like a platypus.  It violates the expectations of anyone who
comes after you and tries to read or maintain the code.

-- 
Stephen R. Savitzky  <steve@rsv.ricoh.com>  <http://rsv.ricoh.com/~steve/>
Platform for Information Applications:      <http://RiSource.org/PIA/>
Chief Software Scientist, Ricoh Silicon Valley, Inc. Calif. Research Center
 voice: 650.496.5710  front desk: 650.496.5700  fax: 650.854.8740 
  home: <steve@theStarport.org> URL: http://theStarport.org/people/steve/
Received on Wednesday, 6 October 1999 16:43:50 UTC