Re: Using the DOM with Java

Jonathan Robie <jonathan@texcel.no> writes:

> At 02:29 PM 12/2/98 -0800, Stephen R. Savitzky wrote:
> 
> >Examples of things that don't round-trip include choice of quotes for
> >attributes, named vs. numeric character entities, omitted start-tags and
> >end-tags in HTML documents, presence of line breaks before and inside of
> >tags, and whether an explicit end tag or "/>" was used on an empty tag in
> >XML.  
> 
> Each of these is an example where there is more than one way to encode the
> same document structure. HTML is slated to become XML conformant in the
> future, so omitted start-tags and end-tags will become obsolete, but some
> of these others might really matter in an authoring application that
> persists the output.

... or in an application that _generates_ documents from something else,
e.g. a database.  Or, in my case, a proxy that reads documents, does some
processing, then passes the result along.

It's not clear to me that HTML is _ever_ going to be XML conformant, unless
you think that nobody is ever going to construct HTML files in a text editor
in the future.  I can see things going the other way, with the web being
opened up to full-bore SGML documents.

> Currently, of course, we can't persist documents, so it doesn't really
> cause any problems. 

Java object serialization does a pretty good job.  For that matter, simply
converting a document to external form, storing it in a file, and parsing it
yields a very robust form of persistence (e.g. robust across changes in
internal representation, which Java serialization is not).  

So the inability to reconstruct documents is already a problem for me, and
my DOM implementation contains the necessary extensions.  

> >DOM level 1 loses information -- it is not possible to reconstruct the
> >original document from the "equivalent" DOM tree.  This is one of the
> >most serious problems with it, by the way.  Another is the inability to
> >represent generic SGML documents.  They're related.
> 
> The DOM isn't *supposed* to represent generic SGML documents, it's not in
> our charter. I don't think we should spend our time figuring out how to do
> record ends and other obscure things few people are using.

I'd settle for the ability to handle omitted start and end tags, character
entities, and full (not XML-restricted) DTD's.  What I'd prefer, though, is
a defined way of extending a DOM document to include meta-information, and
an approved method for adding new node types. 

> >I ran into one of these recently, when I was contemplating using &#64; for 
> >@ in documents containing my e-mail address to foil spammers (at least until
> >they start using DOM-based address-suckers).
> 
> Now waitaminit, they can't use our DOM to do anything so despicable as
> writing address-suckers...

What was that about your charter?
	
-- 
 Stephen R. Savitzky   Chief Software Scientist, Ricoh Silicon Valley, Inc., 
<steve@rsv.ricoh.com>                            California Research Center
 voice: 650.496.5710   fax: 650.854.8740    URL: http://rsv.ricoh.com/~steve/
  home: <steve@starport.com> URL: http://www.starport.com/people/steve/

Received on Thursday, 3 December 1998 13:44:28 UTC