- From: C M Sperberg-McQueen <cmsmcq@tigger.cc.uic.edu>
- Date: Fri, 21 Mar 1997 07:32:15 -0600
- To: ht@cogsci.ed.ac.uk
- CC: w3c-sgml-wg@w3.org, cmsmcq@uic.edu
Henry S. Thompson <ht@cogsci.ed.ac.uk> asked on Fri, 21 Mar 97 12:00:47 GMT:

>1a) Shouldn't the two occurrences of '<' in production 16 (the
>definition of QuotedCData) be replaced with '&', and if not, why not?

I think '<' is forbidden in attribute values and thus is correctly
included in the negated character class. As various people have
pointed out, literal ampersands should also be prohibited, so the rule
should be something like

  QuotedCData ::= '"' ([^"<&] | Reference)* '"'
                | "'" ([^'<&] | Reference)* "'"

>1b) Shouldn't production 15 (the definition of Literal) prohibit '&'
>and '%' as well as the relevant quote character, for consistency with [16]?

If we want the regular expressions to be formally unambiguous, I think
you're right. And I think we do want them to be unambiguous, since
regexp routines vary so much in how they resolve ambiguity (greedy and
not greedy, earliest match vs. longest match, etc.) -- but it does make
for some ugly expressions.

Tim has discovered, in some treatment of re, a subtraction operator
that means "all but", so we could write

  comment ::= '<!--*' ((.*) - '*-->') '*-->'

I've never seen this operator, but it certainly helps a lot here. Of
course, the implementors will translate it wrong when they / we
translate this into RE tools lacking the subtraction operator, but
that's better than having errors enshrined in the spec. (And a natural
translation into lex will work nicely:

  "<!--*"       { BEGIN(COM); }
  <COM>.        ;
  <COM>"*-->"   { BEGIN(INIT); }

As long as the subtracted string is longer than the match for the
default rule, lex will do the Right Thing.)

>2) 4.3, the discussion of entity treatment, is somewhat
>unsatisfactory. '[P]arsed character data' is misleading, since by the
>syntax PCData cannot contain references! If it means 'content and
>QuotedCData' (which are the places entity references are allowed), it
>should say so. Also, parameter entity processing is not discussed at all.

Yes.
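For illustration only, the two regular-expression points under 1a and 1b above (the tightened QuotedCData rule and the "all but" comment rule) can be sketched in Python. Several things here are assumptions and not part of the original message: the simplified REFERENCE pattern (the draft's real Reference and Name productions are richer), the function names, and the reading of the subtraction as "any text that does not contain '*-->'". Python's re module has no subtraction operator, so a negative lookahead stands in for it.

```python
import re

# Assumption: a simplified Reference (named entity or decimal character
# reference). The draft's actual Name production allows more characters.
REFERENCE = r"&(?:#[0-9]+|[A-Za-z_][A-Za-z0-9._-]*);"

# QuotedCData ::= '"' ([^"<&] | Reference)* '"'
#               | "'" ([^'<&] | Reference)* "'"
# Because '&' is excluded from the character class, an ampersand can only
# be consumed as the start of a Reference, so the two branches of the
# inner alternation never compete for the same character.
QUOTED_CDATA = re.compile(
    rf'"(?:[^"<&]|{REFERENCE})*"'
    rf"|'(?:[^'<&]|{REFERENCE})*'"
)

# comment ::= '<!--*' ((.*) - '*-->') '*-->'
# Substitute technique: the lookahead (?!\*-->) refuses to consume a
# character at any position where the closing delimiter begins.
COMMENT = re.compile(r"<!--\*(?:(?!\*-->).)*\*-->", re.DOTALL)


def is_quoted_cdata(text: str) -> bool:
    """Return True if the whole string is a quoted value under the rule."""
    return QUOTED_CDATA.fullmatch(text) is not None


def match_comment(text: str):
    """Return the comment at the start of text, or None if there is none."""
    m = COMMENT.match(text)
    return m.group(0) if m else None
```

Under these assumptions, is_quoted_cdata accepts '"a &amp; b"' and rejects '"a & b"' and '"a < b"', and match_comment stops at the first '*-->' it meets. This mirrors the lex translation, where the longer closing-delimiter rule wins over the one-character default rule.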
I repeat my plea for suggestions for the best way to discuss the
handling of PE references. I see the following possibilities:

A  add prose describing how they are handled

B  add PERef and EE to the grammar, with prose describing the rules
   governing PE contents within declarations, while retaining the
   PEReference rule for describing the constraints on PE references
   between declarations

C  add PEReference and EE to the white space tokens for the rules
   defining declarations, and continue to use PEReference -- but add
   EE -- in the rules defining the internal subset, and describe the
   relevant rules in prose (i.e. do it just the way 8879 does)

D  provide a wholly distinct grammar for the entity structure of
   declarations and the two subsets -- with prose describing what we
   are trying to convey

Any way we cut this, we are looking at serious amounts of normative
prose; the rules governing PE references just do not fit neatly into
context-free grammars as I know how to write them. As far as I can
tell, possibilities B and D would allow us to put more of the rules
into the grammar and have less to describe in the prose. But D is at
best very eccentric, and B will puzzle some readers who will not see
why there are two productions for PE references.

I repeat: does anybody have a good idea?

>4.3.6 also needs careful attention, since as it stands it doesn't give
>enough weight to the consequences of 2.1, and might lead the naive to
>suppose that ". . .three companies: L&M; B&W; Imperial Tobacco"
>is invalid, presuming M and W are not themselves defined as entities.

Good point; at least an example would be useful.

>Indeed taken literally 4.3.6 might lead one to suppose that ANY use of
>& is illegal, since PCData may not contain &, and 4.3.6 says
>"processing this replacement data (which may contain both text and
>markup) . . ." This needs to be clarified, in my view.
>
>Here's a candidate redraft of the relevant bits:
>--------------
> ...
Thanks; I'll have to look at this on paper to have a rational
reaction. Tim and I will be working on this this afternoon and can
look at it together then.

>Note the use of the label 'content' for production [39] is extremely
>infelicitous.

At the risk of appearing extremely dim: why? N.B. the term 'element
content' has a special meaning in 8879 and does not mean 'content of
an element' there, so using the term 'element content' for production
39 will confuse those familiar with SGML; the Validity Constraint
called 'Element Content' is badly named from this point of view.

>Hope this helps.

It does; many thanks.

-C. M. Sperberg-McQueen
Received on Friday, 21 March 1997 08:35:59 UTC