Re: those predeclared entity refs from altheim on 1997-03-26 (w3c-sgml-wg@w3.org from March 1997)

From: altheim <Murray.Altheim@Eng.Sun.COM>
Date: Wed, 26 Mar 1997 11:02:01 -0800 (PST)
To: tbray@textuality.com
Cc: w3c-sgml-wg@w3.org
Message-ID: <libSDtMail.9703261102.27801.altheim@mehitabel/jurassic>
Tim Bray <tbray@textuality.com> writes:
> [various folks are edging towards various arcane methods to solve the
>  "problem" of the Big 5 predeclared entities].
> 
> I have *got* to disagree.  The spec is totally crystalline on
> this.  If you need a '<' in character data or attribute value,
> use '&lt;'.  What could be clearer?  In implementing Lark, 
> there is [after compilation] a state in the automaton that 
> recognizes &lt; - this causes a '<' to be put in the data for
> eventual delivery to the app.  Idiotically simple.

It's only 'idiotically simple' to those who believe XML is an HTML
variant. Without any declaration, '&lt;' has no meaning in an SGML
document; that is idiotically simple. If the XML spec has defined
&lt; without a declaration, the spec is wrong. Even in HTML, the
'big 5' *were* defined in the HTML DTD (of course, there's only
a 'big 4' there). So the question resolves to one we've been hearing
around here a fair bit: is XML to be SGML compliant?

> So on technical grounds there are no problems in XML.  On cultural
> grounds, it is de facto the case that the use of &amp; and &lt; 
> is nearly universal, which greatly promotes interoperability, and
> is a fact of life we should be glad of.

Universal only insofar as XML is defined vis-a-vis HTML. Technically,
there are plenty of problems, as witnessed by the disagreement over
proper form in this group. Just because you disagree doesn't make
the contra view 'arcane' or the product of 'purists' (used in the
pejorative sense), or your views 'idiotically simple' to understand.

We have a disagreement over the direction for XML; it's as simple as that.
You (apparently) believe XML should be extremely lightweight rather than
capable of rather complex processing. I think it should be both.

> If we no longer predeclare these, then my minimal XML parser
> has to learn how to read and interpret entity declarations.  Kiss
> the DPH goodbye.

I've written an entity declaration parser in Hypertalk to handle
both literal and SYSTEM entities. It wasn't that large a task. It's
pretty much a simple A=B declaration, and if there are 'arcane' variants
to that, maybe we should disallow those variants. Declaring character
entities shouldn't be outside the scope of XML, otherwise all the ISO
entity sets are off limits, and therefore XML can't do HTML, DocBook,
or any other 8879-compatible application that uses those ISO sets. As
has been discussed elsewhere, leaving which sets are defined to a
specific document is both a boon and bane, depending on perspective.
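
To illustrate the "simple A=B declaration" point, here is a minimal sketch
in Python (not the Hypertalk parser mentioned above). It assumes only two
declaration forms, a quoted literal and a SYSTEM identifier, and ignores
parameter entities, PUBLIC identifiers, and NDATA. The names ENTITY_DECL
and parse_entity_decls are hypothetical.

```python
import re

# Matches <!ENTITY name "literal"> or <!ENTITY name SYSTEM "identifier">.
# Assumption: only these two forms are supported in this sketch.
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+([A-Za-z_][\w.\-]*)\s+'   # entity name
    r'(?:(SYSTEM)\s+)?'                    # optional SYSTEM keyword
    r'(?:"([^"]*)"|\'([^\']*)\')\s*>'      # quoted literal or identifier
)

def parse_entity_decls(text):
    """Return {name: (kind, value)} for each declaration found in text."""
    table = {}
    for match in ENTITY_DECL.finditer(text):
        name, system, double_quoted, single_quoted = match.groups()
        value = double_quoted if double_quoted is not None else single_quoted
        kind = "system" if system else "literal"
        # The first declaration of a name is the binding one.
        table.setdefault(name, (kind, value))
    return table

decls = '<!ENTITY eacute "&#233;">\n<!ENTITY chap1 SYSTEM "chap1.xml">'
print(parse_entity_decls(decls))
```

The whole thing is one regular expression and a dictionary, which is the
scale of effort being argued for here.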

> If I want to process an XML doc with Full SGML, all I need to do
> is declare the entities.  So maybe we need a rule saying that a
> declaration of these entities with any other value than those
> given by XML makes a document non-well-formed and thus non-XML.

Now this sounds reasonable. We don't declare the entities, but we
state the minor namespace used by them. BTW, how did &apos; get in
the list? 
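
The rule Tim proposes (a redeclaration with any other value makes the
document non-well-formed) could be checked roughly as below. This is a
sketch under assumptions: declarations have already been collected into a
name-to-replacement-text mapping, and a value counts as matching if it is
the character itself or a decimal or hexadecimal numeric reference to it.
The function names are hypothetical.

```python
import re

# The five names and the single character each is fixed to mean.
PREDECLARED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

def expand_char_refs(value):
    """Replace numeric character references (&#60; or &#x3C;) with characters."""
    def replace(match):
        body = match.group(1)
        if body[0] in "xX":
            return chr(int(body[1:], 16))
        return chr(int(body))
    return re.sub(r"&#([xX][0-9A-Fa-f]+|[0-9]+);", replace, value)

def conflicting_redeclarations(decls):
    """Return the predeclared names whose declared value differs from the fixed one."""
    return [
        name for name, value in decls.items()
        if name in PREDECLARED and expand_char_refs(value) != PREDECLARED[name]
    ]

# "lt" is redeclared consistently, "amp" is not, "eacute" is not predeclared.
print(conflicting_redeclarations({"lt": "&#60;", "amp": "and", "eacute": "&#233;"}))
```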

>  IN ALL XML DOCUMENTS, &amp; SHOULD MEAN '&' AND NOTHING ELSE, EVER.  
>  IN ALL XML DOCUMENTS, &lt; SHOULD MEAN '<' AND NOTHING ELSE, EVER.  
>  IN ALL XML DOCUMENTS, &gt; SHOULD MEAN '>' AND NOTHING ELSE, EVER.  
>  IN ALL XML DOCUMENTS, &quot; SHOULD MEAN '"' AND NOTHING ELSE, EVER.  
>  IN ALL XML DOCUMENTS, &apos; SHOULD MEAN "'" AND NOTHING ELSE, EVER.  

Without a DTD, we need a mechanism to allow a declaration subset to 
define charents, text ents, etc. The TC provides some of that with a 
profile, but this seems like somewhat of a corruption as well.

> Can someone explain in simple terms what the problem is that is causing
> us to consider these measures that will greatly increase the
> difficulty of minimal XML parsing and the amount of explanation
> necessary in the spec?  I think I've been paying attention, and I
> certainly haven't seen anything raised that comes close to being
> serious enough to justify these measures, which would simultaneously
> increase complexity and decrease interoperability.

I can't agree with this last statement. While we don't need to allow the
entire gamut of <!ENTITY ...> declarations, allowing charent and literal
declarations isn't a huge task, and disallowing them practically makes
XML useless when trying to use it with other SGML applications
(translation, conversion, etc.). I think it's pretty safe to say that
embedded instances of charent refs are easier to read and debug than
a bunch of Unicode numeric refs, which require an inch and a half thick
book and some time flipping pages just to locate. ISO Latin 1, 2, etc.
are very well known and available by comparison.
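
The readability difference is easy to see side by side. The sketch below
rewrites named references as decimal numeric ones. As an assumption, it
uses the HTML entity table shipped in Python's standard library
(html.entities.name2codepoint) as a stand-in for a declared ISO entity
set; the function name to_numeric_refs is hypothetical.

```python
import re
from html.entities import name2codepoint

def to_numeric_refs(text):
    """Rewrite known named references (&eacute;) as decimal ones (&#233;)."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return match.group(0)  # leave unknown names untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

named = "caf&eacute; cr&egrave;me"
print(named)                   # caf&eacute; cr&egrave;me
print(to_numeric_refs(named))  # caf&#233; cr&#232;me
```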

As to being allowed to declare which of the ISO entity sets are being
used in a particular document, I can't imagine why that would *decrease*
interoperability. Even an HTML browser that doesn't include a parse tree
(speaking from some experience) is a rather large and complex code base. I
don't think XML can possibly be based on even more lightweight code unless we
are willing to give up enormous functionality, also at the expense of 
maintaining any compatibility with SGML. I realize that is not a priority
for some people, but it is in our charter.

I really do think it's time we began discussing *conformance levels* so
that people can build lightweight apps, intermediate apps, and the big
monster processing apps that do TEI-like linking, DSSSL-o stylesheets, etc.

If we want extensibility, linking and stylesheets, XML is going to be bigger
than a minimal HTML browser, and therefore is going to require the hooks of
extensibility. One of the most important of them is entity declarations.

Murray

...........................................................................
Murray Altheim, SGML Grease Monkey                    <altheim@eng.sun.com>
Member of Technical Staff, Tools Development & Support
Sun Microsystems, 2550 Garcia Ave., MS UMPK17-102, Menlo Park, CA 94043 USA
         "Give a monkey the tools and he'll build a typewriter."
Received on Wednesday, 26 March 1997 14:02:26 UTC