Re: Two more points for cleanup in existing draft

Henry S. Thompson <ht@cogsci.ed.ac.uk> asked on Fri, 21 Mar 97
12:00:47 GMT:

>1a) Shouldn't the two occurrences of '<' in production 16 (the
>definition of QuotedCData) be replaced with '&', and if not, why not?

I think '<' is forbidden in attribute values and thus is
correctly included in the negated character class.  As various
people have pointed out, literal ampersands should also be 
prohibited, so the rule should be something like

QuotedCData ::= '"' ([^"<&] | Reference)* '"'
              | "'" ([^'<&] | Reference)* "'"
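
As a rough check of the rule above, here is a minimal Python sketch.
The shape of Reference is an assumption made only for illustration
(an entity reference like &name; or a decimal character reference
like &#38;), and the names REFERENCE, QUOTED_CDATA and
is_quoted_cdata are hypothetical.

```python
import re

# Assumption: a Reference is &name; or a decimal character reference
# such as &#38;. The draft's actual production is not reproduced here.
REFERENCE = r"&(?:#[0-9]+|[A-Za-z_][A-Za-z0-9._-]*);"

# Direct transcription of the proposed rule: the negated classes
# exclude the quote character, '<', and '&'.
QUOTED_CDATA = re.compile(
    rf'"(?:[^"<&]|{REFERENCE})*"'
    rf"|'(?:[^'<&]|{REFERENCE})*'"
)

def is_quoted_cdata(text: str) -> bool:
    """Return True if the whole string matches the proposed rule."""
    return QUOTED_CDATA.fullmatch(text) is not None
```

Under these assumptions, '"a &amp; b"' is accepted, while '"a & b"'
and '"a < b"' are rejected.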

>1b) Shouldn't production 15 (the definition of Literal) prohibit '&'
>and '%' as well as the relevant quote character, for consistency with [16]?

If we want the regular expressions to be formally unambiguous, 
I think you're right.  And I think we do want them to be
unambiguous, since regexp routines vary so much in how they
resolve ambiguity (greedy and not greedy, earliest match vs.
longest match, etc.) -- but it does make for some ugly expressions.
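
To illustrate the kind of ambiguity at issue, here is a small Python
sketch.  It assumes, hypothetically, a literal body that admits both
arbitrary non-quote characters and references of the form &name;
(the draft's production 15 is not reproduced here).  With '&' left
in the character class, the result depends on which alternative the
engine tries first; with '&' excluded, both orderings agree.

```python
import re

body = "a&amp;b"

# Ambiguous: '&' is allowed by the character class AND starts a
# reference, so the order of the alternatives decides the outcome.
chars_first = re.findall(r'[^"]|&\w+;', body)
refs_first = re.findall(r'&\w+;|[^"]', body)

# Unambiguous: with '&' removed from the class there is one parse.
strict_a = re.findall(r'[^"&]|&\w+;', body)
strict_b = re.findall(r'&\w+;|[^"&]', body)

print(chars_first)  # the reference is split into single characters
print(refs_first)   # the reference is kept as one token
print(strict_a == strict_b)
```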

Tim has discovered, in some treatment of regular expressions, a
subtraction operator that means "all but", so we could write

  comment ::= '<!--*' ((.*) - '*-->') '*-->'

I've never seen this operator, but it certainly helps a lot here.
Of course, implementors (ourselves included) will get it wrong when
translating this for RE tools that lack the subtraction operator, but
that's better than having errors enshrined in the spec.  And 
a natural translation into lex will work nicely:

  %x COM
  %%
  "<!--*"       { BEGIN(COM); }
  <COM>.|\n     ;  /* discard comment text, newlines included */
  <COM>"*-->"   { BEGIN(INITIAL); }

As long as the subtracted string is longer than the single character
the default rule matches, lex's preference for the longest match
will do the Right Thing.
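
For RE tools that lack a subtraction operator but do support
lookahead, one possible workaround is sketched below in Python.  It
reads the subtraction as "any text not containing '*-->'", using the
comment delimiters from the rule above; the names COMMENT and
comment_bodies are hypothetical.

```python
import re

# Each character of the body is consumed only if the closing
# delimiter '*-->' does not start at that position.
COMMENT = re.compile(r"<!--\*((?:(?!\*-->).)*)\*-->", re.DOTALL)

def comment_bodies(text: str) -> list:
    """Return the body of every comment found in text."""
    return [m.group(1) for m in COMMENT.finditer(text)]

sample = "a<!--* one *-->b<!--* two\n* --> still *-->c"
print(comment_bodies(sample))
```

The second comment in the sample contains '* -->', which is not the
closing delimiter, so the body runs on to the real '*-->'.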

>2) 4.3, the discussion of entity treatment, is somewhat
>unsatisfactory.  '[P]arsed character data' is misleading, since by the
>syntax PCData cannot contain references!  If it means 'content and
>QuotedCData' (which are the places entity references are allowed), it
>should say so.  Also, parameter entity processing is not discussed at all.


I repeat my plea for suggestions for the best way to discuss
the handling of PE references.  I see the following possibilities:

  A add prose describing how they are handled
  B add PERef and EE to the grammar, with prose describing the
    rules governing PE contents within declarations, while
    retaining the PEReference rule for describing the constraints
    on PE references between declarations
  C add PEReference and EE to the white space tokens for the rules 
    defining declarations, and continue to use PEReference -- but add
    EE -- in the rules defining the internal subset, and describe
    the relevant rules in prose (i.e. do it just the way 8879 does)
  D provide a wholly distinct grammar for the entity structure of
    declarations and the two subsets -- with prose describing what
    we are trying to convey

Any way we cut this, we are looking at serious amounts of normative
prose; the rules governing PE references just do not fit neatly into
context-free grammars as I know how to write them.  As far as
I can tell, possibilities B and D would allow us to put more of 
the rules into the grammar and have less to describe in the prose.
But D is at best very eccentric, and B will puzzle some readers
who will not see why there are two productions for PE references.

I repeat:  does anybody have a good idea?

>4.3.6 also needs careful attention, since as it stands it doesn't give
>enough weight to the consequences of 2.1, and might lead the naive to
>suppose that ". . .three companies: L&amp;M; B&amp;W; Imperial Tobacco" 
>is invalid, presuming M and W are not themselves defined as entities.

Good point; at least an example would be useful.
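
One way such an example might be sketched, in Python: this assumes
the reading that replacement text for &amp; is treated as data and
is not rescanned for further references.  The entity table and the
name expand_once are hypothetical, and this is not the draft's
processing model.

```python
import re

# Hypothetical entity table: only 'amp' is declared; M and W are not.
ENTITIES = {"amp": "&"}

def expand_once(text: str) -> str:
    """Replace each &name; in a single pass, without rescanning."""
    def replace(match):
        name = match.group(1)
        if name not in ENTITIES:
            raise KeyError("undeclared entity: " + name)
        return ENTITIES[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

source = "three companies: L&amp;M; B&amp;W; Imperial Tobacco"
expanded = expand_once(source)
print(expanded)
```

A single pass succeeds because the '&' produced by &amp; is never
re-read as the start of a reference.  Running expand_once on the
result a second time would fail on the undeclared M, which is the
naive reading the quoted text warns about.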

>Indeed taken literally 4.3.6 might lead one to suppose that ANY use of
>&amp; is illegal, since PCData may not contain &, and 4.3.6 says
>"processing this replacement data (which may contain both text and
>markup) . . ."  This needs to be clarified, in my view.
>Here's a candidate redraft of the relevant bits:
> ...

Thanks; I'll have to look at this on paper to have a rational
reaction.  Tim and I will be working on this this afternoon and
can look at it together then.

>Note the use of the label 'content' for production [39] is extremely

At the risk of appearing extremely dim:  why?

N.B. the term 'element content' has a special meaning in 8879 and does
not mean 'content of an element' there, so using the term 'element
content' for production 39 will confuse those familiar with SGML; the
Validity Constraint called 'Element Content' is badly named from this
point of view.

>Hope this helps.

It does; many thanks.

-C. M. Sperberg-McQueen