- From: altheim <Murray.Altheim@Eng.Sun.COM>
- Date: Wed, 26 Mar 1997 11:02:01 -0800 (PST)
- To: tbray@textuality.com
- Cc: w3c-sgml-wg@w3.org
Tim Bray <tbray@textuality.com> writes: > [various folks are edging towards various arcane methods to solve the > "problem" of the Big 5 predeclared entities]. > > I have *got* to disagree. The spec is totally crystalline on > this. If you need a '<' in character data or attribute value, > use '<'. What could be clearer? In implementing Lark, > there is [after compilation] a state in the automaton that > recognizes < - this causes a '<' to be put in the data for > eventual delivery to the app. Idiotically simple. It's only 'idiotically simple' to those who believe XML is an HTML variant. Without any declaration, '<' has no meaning in an SGML document; that is idiotically simple. If the XML spec has defined < without a declaration, the spec is wrong. Even In HTML, the 'big 5' *were* defined in the HTML DTD (of course, there's only a 'big 4' there). So the question resolves to one we've been hearing around here a fair bit: is XML to be SGML compliant? > So on technical grounds there are no problems in XML. On cultural > grounds, it is de facto the case that the use of & and < > is nearly universal, which greatly promotes interoperability, and > is a fact of life we should be glad of. Universal only insofar as XML is defined vis-a-vis HTML. Technically, there are plenty of problems, as witnessed by the disagreement over proper form in this group. Just because you disagree, doesn't make the contra view 'arcane' or the product of 'purists' (used in the perjorative sense), or your views 'idiotically simple' to understand. We have a disagreement over the direction for XML; it's as simple as that. You (apparently) believe XML should be extremely lightweight rather than capable of rather complex processing. I think it should be both. > If we no longer predeclare these, then my minimal XML parser > has to learn how to read and interpret entity declarations. Kiss > the DPH goodbye. I've written an entity declaration parser in Hypertalk to handle both literal and SYSTEM entities. It wasn't that large a task. It's pretty much a simple A=B declaration, and if there are 'arcane' variants to that, maybe we should disallow those variants. Declaring character entities shouldn't be outside the scope of XML, otherwise all the ISO entity sets are off limits, and therefore XML can't do HTML, DocBook, or any other 8879-compatible application that uses those ISO sets. As has been discussed elsewhere, leaving which sets are defined to a specific document is both a boon and bane, depending on perspective. > If I want to process an XML doc with Full SGML, all I need to do > is declare the entities. So maybe we need a rule saying that a > declaration of these entities with any other value than those > given by XML makes a document non-well-formed and thus non-XML. Now this sounds reasonable. We don't declare the entities, but we state the minor namespace used by them. BTW, how did ' get in the list? > IN ALL XML DOCUMENTS, & SHOULD MEAN '&' AND NOTHING ELSE, EVER. > IN ALL XML DOCUMENTS, < SHOULD MEAN '<' AND NOTHING ELSE, EVER. > IN ALL XML DOCUMENTS, > SHOULD MEAN '>' AND NOTHING ELSE, EVER. > IN ALL XML DOCUMENTS, " SHOULD MEAN '"' AND NOTHING ELSE, EVER. > IN ALL XML DOCUMENTS, ' SHOULD MEAN "'" AND NOTHING ELSE, EVER. Without a DTD, we need a mechanism to allow a declaration subset to define charents, text ents, etc. The TC provides some of that with a profile, but this seems like somewhat of a corruption as well. > Can someone explain in simple terms what the problem is that is causing > us to consider these measures that will greatly increase the > difficulty of minimal XML parsing and the amount of explanation > necessary in the spec? I think I've been paying attention, and I > certainly haven't seen anything raised that comes close to being > serious enough to justify these measures, which would simultaneously > increase complexity and decrease interoperability. I can't agree with this last statement. While we don't need to allow the entire gamut of <!ENTITY ...>declarations, allowing charent and literal declarations isn't a huge task, and disallowing them practically makes XML useless when trying to use them with other SGML applications (translation, conversion, etc.). I think it's pretty safe to say that embedded instances of charent refs are easier to read and debug than a bunch of Unicode numeric refs, which require a inch and a half thick book and some time flipping pages just to locate. ISO Latin 1, 2, etc. are very well known and available by comparison. As to why being allowed to declare which of the ISO entity sets are being used in a particular document, I can't imagine why that would *decrease* interoperability. Even an HTML browser that doesn't include a parse tree (speaking from some experience) is a rather large and complex code base. I don't think XML can possibly be based on even more lightweight code unless we are willing to give up enormous functionality, also at the expense of maintaining any compatibility with SGML. I realize that is not a priority for some people, but it is in our charter. I really do think it's time we began discussing *conformance levels* so that people can build lightweight apps, intermediate apps, and the big monster processing apps that do TEI-like linking, DSSSL-o stylesheets, etc. If we want extensibility, linking and stylesheets, XML is going to be bigger than a minimal HTML browser, and therefore is going to require the hooks of extensibility. One of most important of them is entity declarations. Murray ........................................................................... Murray Altheim, SGML Grease Monkey <altheim@eng.sun.com> Member of Technical Staff, Tools Development & Support Sun Microsystems, 2550 Garcia Ave., MS UMPK17-102, Menlo Park, CA 94043 USA "Give a monkey the tools and he'll build a typewriter."
Received on Wednesday, 26 March 1997 14:02:26 UTC