Re: RS/RE, again (quite sorry)

From: Derek Denny-Brown <ddb@criinc.com>
Date: Fri, 13 Dec 1996 08:33:38 -0800
Message-Id: <2.2.32.19961213163338.0069dc5c@MAILHOST.criinc.com>
To: w3c-sgml-wg@w3.org

At 09:34 AM 12/13/96 -0500, Gavin wrote:
>>>We seem to be confusing parsing XML, and parsing the grammar defined
>>>by the DTD if you ask me...
>>
>>But one of the important points about SGML (of which XML is a subset) is a
>>contract between the parser and the application: "I will not hand you data
>>which does not conform to the DTD." 

Is it expected that the parser will parse the instance differently if it has
a DTD vs. if it does not?  i.e. if I were to construct a grove from the
result of the parse, would the portion of the grove representing the
instance be equivalent (isomorphic?) or would the presence of the DTD imply
the (strong) likelihood of a differing parse?  It would seem to me that it
would be best if I could expect the result from the parses to be the same,
regardless of DTD, for some applications.  Parsing with regard to the DTD
could be viewed as a filter of the parse w/o the DTD (a subset, almost).  It
may be easier to handle this issue if the parse is treated as a (potential)
2nd step which requires a DTD.
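
A rough sketch of the two-step model, to make it concrete.  This is only an
illustration under assumptions of my own: the event tuples, the names
raw_parse and dtd_filter, and the element_content set (standing in for what a
real DTD's content models would tell us) are all hypothetical, and expat is
used merely as a convenient DTD-less parser.

```python
from xml.parsers import expat


def raw_parse(text):
    """Step 1: DTD-less parse. All character data, white space included, is reported."""
    events = []
    parser = expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous character data as one chunk
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(text, True)
    return events


def dtd_filter(events, element_content):
    """Step 2: drop white-space-only text inside elements that the (assumed)
    DTD declares as having element content."""
    out, stack = [], []
    for kind, value in events:
        if kind == "text":
            if stack and stack[-1] in element_content and value.strip() == "":
                continue
        elif kind == "start":
            stack.append(value)
        else:
            stack.pop()
        out.append((kind, value))
    return out


doc = "<list>\n  <item>a b</item>\n</list>"
raw = raw_parse(doc)
filtered = dtd_filter(raw, element_content={"list"})
```

Here the filtered stream contains only events that were already in the raw
stream, which is what I mean by the DTD parse being (almost) a subset of the
parse without the DTD.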

>>This is *central*. Without it, we can seldom do intelligent things
>>with documents.

(Gavin cut this from the above, but I assume it went with the previous quote.)
The problem, as I see it, is that the application may have an idea what the
DTD would be, but the parser does not.  So long as the application knows
exactly what it is going to get, this should not be a problem.


>>Your solution would leave it up entirely to applications, which will (IMO)
>>almost inevitably lead to incompatibility.
>
>Depends. At least all the applications will know *exactly* what
>they'll be handed.

Unfortunately, I agree on both fronts.  Since we do have a number of
concrete ideas about how the parser should report the document to the
application, if it has a DTD, why not consolidate those ideas, define what
the parser should return if it is parsing a document relative to a DTD, then
say it is up to the application to treat an instance which it knows to be
conforming to a DTD as if it were parsed with regard to that DTD.  Going
back to my 2 step parser model, an application with a fixed set of DTDs
would always take the raw parse and then have hard coded into it some
procedures which filtered the parser's events appropriately (with regard to
the DTD).  A generic DSSSL style sheet might include some extension which
would tell the parser whether to return the raw parse or to require a DTD....
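
As a sketch of what such an application might look like (purely
illustrative: the "memo" document type, the event tuples, and the names
filter_memo, KNOWN_DTDS, deliver and require_dtd are all made up for this
example, and the root element name is used as a stand-in for knowing the
document type):

```python
# Hypothetical event format: ("start", name), ("end", name), ("text", data).

def filter_memo(events):
    """Hard-coded knowledge of an imagined 'memo' DTD: <memo> has element content."""
    out, stack = [], []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif stack and stack[-1] == "memo" and value.strip() == "":
            continue  # white space between <memo>'s children is formatting only
        out.append((kind, value))
    return out


# The application's fixed set of DTDs, each with its hard-coded procedure.
KNOWN_DTDS = {"memo": filter_memo}


def deliver(events, require_dtd=False):
    """Return the filtered parse for a known document type; otherwise return
    the raw parse, or refuse if the caller insists on a DTD."""
    doctype = events[0][1]  # assumption: root element name identifies the DTD
    procedure = KNOWN_DTDS.get(doctype)
    if procedure is None:
        if require_dtd:
            raise LookupError("no DTD knowledge for document type %r" % doctype)
        return list(events)
    return procedure(events)


memo = [("start", "memo"), ("text", "\n  "), ("start", "to"), ("text", "Gavin"),
        ("end", "to"), ("text", "\n"), ("end", "memo")]
note = [("start", "note"), ("text", " "), ("end", "note")]
```

The require_dtd flag plays the role of the style sheet extension: it decides
whether an unrecognized document is handed over raw or rejected.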

I see an almost unanimous agreement that there is no clean way to tell how
to handle white space in the document without a DTD.  I also hear people
pounding that they want DTDless operation. I don't see any easy way to
resolve this short of making all white-space relevant, which throws everyone
who wants readable XML into fits (with very good reason, if that is your
criterion).

The only way to know how to deal with white space is to have the DTD, so one
end of the transaction needs to know the DTD in order to normalize white
space.  The only other solution I can seriously think of is to say that
all white space is significant in a document, unless you have a DTD.  This
opens the door for all sorts of mess (why does it look different over there
vs here....) though, and I hesitate even on that.  The view of a document as
a formatted (for human readability) text file vs. a hierarchical encoding of
some data are fundamentally different views, which happen to correspond
exactly to the view of the human author/viewer vs the parser/application.
Reconciling these views is one of the problems I have had with SGML all along...
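
To illustrate the "why does it look different over there vs here" problem,
here is a small sketch.  The shape function and the element_content set
(a stand-in for what a DTD would say) are my own hypothetical names; the
point is only that two instances differing in human-oriented formatting are
different documents under "all white space is significant", and the same
document once one end knows the DTD.

```python
import xml.etree.ElementTree as ET


def shape(elem, element_content=frozenset()):
    """Describe elem as a nested tuple (tag, leading text, children and the
    text following each child). White-space-only text is dropped only inside
    elements listed in element_content."""
    def keep(text, parent_tag):
        if text is None:
            return ""
        if parent_tag in element_content and text.strip() == "":
            return ""
        return text

    children = []
    for child in elem:
        children.append(shape(child, element_content))
        # A child's tail is character data belonging to the parent's content.
        children.append(keep(child.tail, elem.tag))
    return (elem.tag, keep(elem.text, elem.tag), tuple(children))


compact = ET.fromstring("<list><item>a b</item></list>")
pretty = ET.fromstring("<list>\n  <item>a b</item>\n</list>")

# No DTD: every white-space character counts, so the two instances differ.
same_without_dtd = shape(compact) == shape(pretty)

# With (assumed) DTD knowledge that <list> has element content, they agree.
same_with_dtd = shape(compact, {"list"}) == shape(pretty, {"list"})
```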

-derek
"that which is not slightly distorted lacks sensible appeal: from which it
 follows that irregularity - that is to say, the unexpected, surprise, and
 astonishment, are an essential part and characteristic of beauty"
    - Charles Baudelaire
Received on Friday, 13 December 1996 11:37:23 EST
