- From: <lee@sq.com>
- Date: Tue, 24 Sep 96 00:15:01 EDT
- To: kimber@passage.com, w3c-sgml-wg@w3.org
What follows is a little lengthy, but I hope that it will be of use. I should say at the outset that in saying that SGML can't or doesn't do something, I am stating (what I believe to be) a fact -- that is, there is no implied criticism. We must work with what we have, and not ask for it to have been otherwise. >> * Should XML restrict the use of #PCDATA to content models of the form >> (#PCDATA), or of the form (#PCDATA | A | B | ...)*, as a way of avoiding >> the mixed-content trap (11.2.4) and/or of simplifying RE handling (7.6.1)? > [Eliot...] > I think this would be a good thing (assuming we allow mixed content at > all). SGML effectively requires this anyway, so why not formalize it? Agreed. >> * Should XML prohibit the use of inclusion and exclusion exceptions in >> element declarations? (11.2.4, 11.2.5)? > > Yes. Exclusions are, with very rare exceptions, a bad thing, causing much > more trouble than they're worth. Inclusions, while expressing a useful > semantic, also cause more trouble then they're worth and require DTD > processing. I don't agree here. Some things are hard to do in other ways. Exclusions are good for preventing recursive inclusions. I noticed that in the GCAPAPER DTD, the only way to get text in a caption of a figure was to use HIGHLIGHT, which was an inclusion... but I think it's a mistake. Inclusions are hard to use. >> * Should XML forbid use of the '&' connector in content models >> (11.2.4.1)? > > How much does AND complicate validation? There are essentially three approaches I've seen. (1) build an NDFA ind compile down to a DFA using the normal published algrithms. If there are n terms and'd together, you get n! states to handle them. This means that using a & b & c & d & e & f uses 1 * 2 * 3 * 4 * 5 * 6 = 720 states. But if the model was more complex, e.g. (A,B,C) & d & e & f you are likely to get many more states than that, because you have to account for all the possible insertion points. (2) treat it as an or group and then do a check afterwards (requires either lookahead or, probably more efficiently if most documents are valid, backtracking) (3) modified DFA, essentially using annotations (in the graph theory sense) to note where such intrusions may occur, and pushing another automaton to deal with them. Probably there are other ways. Number (3) is the most efficient of the three, as far as I can tell, but I don't know a good way to deal with this. Unfortunately, as far as I know, there is no way to express (A,B,C) & d* to mean that (A, d, B, d, C) is acceptable. We ended up using inclusions and exceptions to model the HTML HEAD in the IETF HTML WG, because and-groups weren't expressive enough. A lot of the difficulty of SGML content models stems from (1) the lack of non-terminal symbols -- e.g. consider the YACC Chapter: TITLE p_list ; p_list: P | P p_list ; Here, p_list is a non-terminal that would never appear in the input. The (D S) problem recently posted to comp.text.sgml is easily solved using this paradigm, but adds the cost of having a shift-reduce stack. Another way to look at this is to say that SGML doesn't offer an expression language for recognising text. SHORTREF and DATATAG are not general tools for this. For example, you can't specify the syntax of a C expression using SGML as you can with YACC. (2) the lack of scope consider: <!Element (TITLE within Bibliography) - - (#PCDATA|SmallCaps) > <!Element (TITLE) - - (#PCDATA|FnRef|Emphasis|SmallCaps) > You could use exclusions to do this: <!Element Bibliography - - (....) -(FnRef|Emphasis)> and then adding Emphasis and FnRef as inclusions back where they made sense, e.g. in Bibliography.Remarks in an annotated bibliography. (3) the ambiguity rules Consider a report that may have Scope, Requirements, Proposal and Summary, but may have any number of other sections between them: <!Element Report - - ( Scope?, Section*, Requirements?, Section*, --* actually this is illegal *-- Proposal?, Section*, --* actually comments are illegal here too! *-- Summary? ) > Well, this is illegal because of the multiple sections. Of course, the model could cause rather large amounts of look-ahead, too. In many applications that's fine. But for delivering documents over the internet, avoiding look-ahead may be a good thing, reducing the delay before The User Sees The First Screen. Another way to avoid lookahead is to give up the notion that you need to distinguish between a Section in any of the different groups for any reason other than to match the content model. To me, this goes along with dropping the ambiguity rule. Then you cando this withoug look-ahead, as described below. I imagine that the bug forbidding comments in model groups will be fixed in the next update (assuming it's just a bug). Mail me privately on this rather thanbothering the list; suffice it here to note that a comment is not in fact a special case of whitespace, and is not invisible to the parser, the two more usual choices. Now, it is not clear to me that XML can or should address these issues. But if you are talking about removing any one feature, you must understand clearly (as I am sure you do, Eliot, but we are not all Eliots) how that affects the expressivity of the language. Posix regular expressions allow e\{n,m\} to represent between n and m occurrences (inclusive) of expression e. With that, you can eliminate "?", and if you have a notation for infinity, you can also eliminate * and +. But I would not claim that CHAPTER: (TITLE{0,1}, P{1, LOTS}) is sufficiently clearer than CHAPTER: (TITLE?, P*) to justify the simplification. On the other hand, you cannot simulate \{...\} in terms of + and * without getting into some very ugly situations. In the same way, getting rid of inclusions and exclusions will reduce the set of expressible content models, as will removing &. So the important question is whether the loss is worth the gain. >> * Should XML allow nondeterministic content models (11.2.4.3)? > > Again, how much does this complicate validation? I'm not ambiguity expert, > but could the problem be solved simply by stipulating that a token is > always matched to the first place in the content model it can match, > without lookahead? Yes, you could do that, but my content model above would then never work, because if you omitted a Requirements, you'd be stuck on the first Section*, and the Proposal would be an error. I think that my Report can be matched with no lookahead if you don't care about distinguishing sections in the various groups, but you'd have to use a somewhat more sophisticated approach, I'm afraid, than shortest match. An NDFA -> DFA conversion ala Aho et. al. would do it. Here's the model again, followed by the reduced DFA: <!Element Report - - ( Scope?, Section*, Requirements?, Section*, Proposal?, Section*, Summary? )> start: Section -> start Scope -> seenScope Requirements -> seenRequirements Proposal -> seenProposal Summary -> seenSummary seenScope: Section -> seenScope Requirements -> seenRequirements Proposal -> seenProposal Summary -> seenSummary seenRequirements: Section -> seenRequirements Proposal -> seenProposal Summary -> seenSummary seenProposal: Section -> seenProposal Summary -> seenSummary seenSummary: EOF -> accept * -> error So although this is ambiguous in SGML terms, there is no look-ahead needed. Question to brave sould who made it to here: are there any cases where look-ahead _would_ be needed? If so, do they matter? Is it a problem with common prefixes, as in (A B B)* A B C No, I can reduce that to (A B) (B->start | C). Hmmm. Would it help to change the rule for XML to be look-ahead of more than one token may not be practical. Therefore, the parser may forbid content models that require it. Parsers accepting such a model shall produce a warning; the parser may provide an option to disable such warnings, but the default must be to issue the warning. ? If ambiguous content models are allowed, does that affect + - and &, and if so how? If scoping rules were introduced, could + and - be eliminated? What is the expressive loss in elmiminating "&"? I hope this rather lengthy analysis helps. Lee
Received on Tuesday, 24 September 1996 00:44:01 UTC