Element content the real issue?...

This is an analysis spurred by Paul's posting "Re: Newlines in element
content (i.e TABLES)".
At 9:51 AM 9/30/96, Paul Prescod wrote:
>The point of Charles' True Information spiel is that the application should
>never see data that has not been normalized according to SGML/XML's rules.
>If a data character (such as a newline under the banish RS/RE proposal)
>occurs in element content, it should be an ERROR, and in an SGML parser's
>interpretation it will be. So an SGML-based application (i.e. Panorama) will
>report an error (if it supports remapping RS/RE).
>That's why RS/RE must either remain as it stands or must be banished from
>element content and replaced by a convention like this:
>>A new paragraph</P>

This same problem occurs for spaces and tabs in element content. My
original proposal (for SGML) avoids this problem because \n and \r
characters would be declared as SPACE characters, and thus would be ignored
in element content. But for XML, we have a problem with any kind of space
elimination in element content when used with DTD-less processing. It's
easy to use my approach with SGML, but with XML, there is a real problem
because without a DTD, we can't tell the difference between element content
and other content.
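
   To make the DTD-less case concrete, here is a minimal sketch using
Python's standard xml.etree.ElementTree, which reads no content models.
The TABLE/ROW/CELL instance is a made-up example of mine, not something
taken from the list. With no declaration saying that TABLE has element
content, the newline and indentation around ROW simply come back as data:

```python
import xml.etree.ElementTree as ET

# Hypothetical instance with no DTD: nothing tells the parser that
# TABLE is supposed to contain only elements.
doc = "<TABLE>\n  <ROW><CELL>a</CELL></ROW>\n</TABLE>"

root = ET.fromstring(doc)
row = root[0]

# The "formatting" whitespace is reported as ordinary character data.
leading = root.text   # text between <TABLE> and <ROW>
trailing = row.tail   # text between </ROW> and </TABLE>

print(repr(leading), repr(trailing))
```

   A parser that had the content model in hand could discard both
strings; one without it has no basis for doing so.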

  So, contra your claim, and my previous assumptions, RE handling is not
the key issue here.  Whatever we decide on RE processing, we will still have
to deal with element content in a nasty way because of other whitespace
being treated as data. In fact, it's not clear to me how XML and SGML can
be compatible when processing element content in the absence of a DTD,
since we don't know in that case whether or not we have element content.

   We basically cannot afford to process element and non-element content
differently with regard to whitespace or anything else.

   ==> So we can't allow any ignored whitespace anywhere without resorting
to quoting, because of the non-DTD parsing requirement.

   Perhaps the correct approach to DTD-less processing is to say that the
information returned _is_ different in that case. Then, if an instance
had whitespace in element content, the sender would be required to send
the DTD (or at least the content model for the relevant elements). I don't
like this at all, because we now have two possible correct abstract
syntaxes for the document. This should be a non-starter.

   I hate the quoting syntax with a passion, and I suspect that selling it
would be pretty hard. I'd rather just outlaw whitespace in element content,
and live with the problem (which is at least already familiar from HTML).
We can leave it to applications to implement whitespace-ignoration based on
stylesheets, but the parse tree should simply make it _all_ significant.
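
   As a rough illustration of leaving this to the application, the sketch
below (Python, standard library only) assumes the parser has handed over
every character and that the application knows, from a stylesheet or
otherwise, which elements hold only other elements. The function name
drop_ignorable and the element names are my own inventions for the
example, not part of any proposal:

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"

def drop_ignorable(elem, element_only):
    """Remove whitespace-only text directly inside elements that the
    application (not the parser) says contain only elements."""
    if elem.tag in element_only:
        if elem.text is not None and not elem.text.strip(XML_WS):
            elem.text = None
        for child in elem:
            if child.tail is not None and not child.tail.strip(XML_WS):
                child.tail = None
    for child in elem:
        drop_ignorable(child, element_only)

# Hypothetical instance; the parse keeps all whitespace as data.
doc = "<TABLE>\n  <ROW><CELL> a </CELL></ROW>\n</TABLE>"
root = ET.fromstring(doc)

# Assumption: the stylesheet tells us TABLE and ROW are element-only.
drop_ignorable(root, {"TABLE", "ROW"})
result = ET.tostring(root, encoding="unicode")
print(result)
```

   The whitespace inside CELL survives because CELL is not in the set,
so the decision about what is ignorable stays entirely outside the
parser.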

   If we do implement some kind of quoting, why not go all the way, with
one of the radical syntaxes that were proposed earlier on the list, which
make markup syntax isomorphic to LISP syntax? In any case, we might want to
figure out a way to not SGML-ify the '"' character as well as the '<>'
characters. So I guess stupid NET tricks might be useful after all.

   Of course, stupid NET tricks have the same concrete disadvantage as the
other variants of the quoting proposal, i.e. there is ZERO surface-level
compatibility with current practice in tagging document instances. I was
just reading Richard Gabriel's excellent new book "Patterns of Software,"
and his comments on programming language design make me think that any
syntax that looks unfamiliar will severely endanger the acceptance of XML.
I commend to your attention the chapters "The End of History and the Last
Programming Language" and "Money Through Innovation Reconsidered." His
take on how products and languages (i.e. standards) get accepted in the
computing community seems very compelling in the light of history.

   My application of his theories says that we should change as little as
possible from HTML (the market leader), while adding the minimum we can
manage to get the most useful new functionality. I must say that I don't
see the point of targeting only the SGML community, because they already
have SGML.

   -- David

   RE delenda est.

David Durand                  dgd@cs.bu.edu | david@dynamicDiagrams.com
Boston University Computer Science          | Dynamic Diagrams
http://www.cs.bu.edu/students/grads/dgd/    | http://dynamicDiagrams.com/