XML 1.0 - clarification - deciphering [51] Mixed from Kent M Pitman on 1998-05-05 (xml-editor@w3.org from April to June 1998)

From: Kent M Pitman <kmp@harlequin.com>
Date: Mon, 4 May 98 23:03:17 EDT
To: xml-editor@w3.org
Cc: kmp@harlequin.com
Message-Id: <9805050303.AA00538@excel.harlequin.com>

Right now you have:

 [51] Mixed ::= '(' S? '#PCDATA' ( S? '|' S? Name )* S? ')*' |
                '(' S? '#PCDATA'                     S? ')'


As nearly as I can tell, the only point of separating Mixed's definition
into two parts is to control the '*' (making it required when a Name is
given), but this is an awfully weird way to say that.  In the #PCDATA-only
case, the above makes the final '*' be 'either required or not' (i.e.,
optional).  It seems to me it'd be (approximately) 10,000% clearer to say:

 [51] Mixed ::= '(' S? '#PCDATA' ( S? '|' S? Name )+ S? ')*'      |
                '(' S? '#PCDATA'                     S? ')' '*'?

saying that when the #PCDATA has no following names, the '*' is optional.
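
As a rough check that the two formulations accept the same strings, here
is a minimal sketch that transcribes both into Python regular expressions
and compares them on a few samples.  NAME and OPT_S are simplified,
ASCII-only stand-ins that I am assuming for illustration; the real Name
and S productions are broader.

```python
import re

# Assumption: simplified ASCII-only stand-ins for the Name and S? productions.
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
OPT_S = r"[ \t\r\n]*"

# ( S? '|' S? Name )
ALT = rf"(?:{OPT_S}\|{OPT_S}{NAME})"

# Current production [51]: zero or more names, then ')*', or bare '(#PCDATA)'.
ORIGINAL = re.compile(
    rf"\({OPT_S}#PCDATA{ALT}*{OPT_S}\)\*"
    rf"|\({OPT_S}#PCDATA{OPT_S}\)"
)

# Proposed production: one or more names with a required '*',
# or no names with an optional '*'.
PROPOSED = re.compile(
    rf"\({OPT_S}#PCDATA{ALT}+{OPT_S}\)\*"
    rf"|\({OPT_S}#PCDATA{OPT_S}\)\*?"
)

SAMPLES = [
    "(#PCDATA)",
    "(#PCDATA)*",
    "(#PCDATA|Foo)*",
    "( #PCDATA | Foo | Bar )*",
    "(#PCDATA|Foo)",      # names given but '*' missing
    "(#PCDATA|)*",        # '|' with no Name
    "(Foo|#PCDATA)*",     # #PCDATA not first
]

def accepts(pattern, text):
    """True if the whole string matches the production."""
    return pattern.fullmatch(text) is not None

for sample in SAMPLES:
    print(f"{sample!r:30} original={accepts(ORIGINAL, sample)} "
          f"proposed={accepts(PROPOSED, sample)}")
```

Agreement on a handful of samples is only a sanity check, not a proof of
equivalence.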

The second formulation also has the important difference that it does not
require lookahead in order to successfully parse the first part (the part
in parens) knowing deterministically which branch you went through.
(Something I think is a tremendously important part of the grammar even
though you've already acknowledged you don't.)
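
One way to picture the point about knowing the branch: a hand-written
recognizer for the second formulation can decide which branch it is in
purely from whether it saw any Name inside the parentheses, before it
looks at what follows the ')'.  The sketch below is hypothetical;
parse_mixed, NAME_RE and WS_RE are names I made up, and the Name pattern
is a simplified ASCII-only assumption.

```python
import re

# Assumption: simplified ASCII-only stand-ins for Name and S?.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")
WS_RE = re.compile(r"[ \t\r\n]*")

def parse_mixed(text):
    """Parse a Mixed declaration (second formulation); return its Names.

    Raises ValueError if the text does not match the production.
    """
    pos = 0

    def skip_ws():
        nonlocal pos
        pos = WS_RE.match(text, pos).end()

    def expect(token):
        nonlocal pos
        if not text.startswith(token, pos):
            raise ValueError(f"expected {token!r} at offset {pos}")
        pos += len(token)

    expect("(")
    skip_ws()
    expect("#PCDATA")

    names = []
    while True:
        skip_ws()
        if not text.startswith("|", pos):
            break
        pos += 1
        skip_ws()
        match = NAME_RE.match(text, pos)
        if match is None:
            raise ValueError(f"expected a Name at offset {pos}")
        names.append(match.group())
        pos = match.end()

    expect(")")

    # The branch is already known here: it depends only on `names`.
    if names:
        expect("*")                      # first branch: '*' is required
    elif text.startswith("*", pos):
        pos += 1                         # second branch: '*' is optional

    if pos != len(text):
        raise ValueError(f"unexpected text at offset {pos}")
    return names

print(parse_mixed("( #PCDATA | Foo|Bar )*"))
print(parse_mixed("(#PCDATA)"))
```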

- - - - 

BTW, I don't understand why "*" is permitted at all in the case 
of just
 (#PCDATA)
since there are no other elements permitted.  The point of a *
is so that in
 (#PCDATA|Foo)
you can do 
 ..pcdata..<Foo>..foodata..</Foo>..pcdata..<Foo>..foodata..</Foo>..pcdata..
allowing repeated pcdatas or foos.  But with
 (#PCDATA)
the entire thing is just one big
 ..pcdata..
and there can't be two blocks of parsed character data as in
 ..pcdata....pcdata..
since there would be no uniquely identified point at which to make 
the division (without making the parse nondeterministic).

If you were going to disallow the '*' in the #PCDATA-only case,
you would DEFINITELY want to use my reformulation using "+" rather
than "*" for the set containing the Names.
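
To illustrate why the "+" matters for such a change, here is a small
sketch of a hypothetical stricter grammar (not anything in the current
draft) whose second branch has no '*'.  If the Names group keeps "*", the
first branch still lets "(#PCDATA)*" through with zero names; with "+" it
does not.  As before, NAME and OPT_S are simplified ASCII-only
assumptions, and mixed_pattern is a made-up helper.

```python
import re

# Assumption: simplified ASCII-only stand-ins for Name and S?.
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
OPT_S = r"[ \t\r\n]*"
ALT = rf"(?:{OPT_S}\|{OPT_S}{NAME})"

def mixed_pattern(quantifier):
    """Build a hypothetical Mixed pattern whose #PCDATA-only branch has no '*'.

    `quantifier` is applied to the ( S? '|' S? Name ) group: "*" or "+".
    """
    return re.compile(
        rf"\({OPT_S}#PCDATA{ALT}{quantifier}{OPT_S}\)\*"
        rf"|\({OPT_S}#PCDATA{OPT_S}\)"
    )

WITH_STAR = mixed_pattern("*")   # names group may be empty
WITH_PLUS = mixed_pattern("+")   # at least one Name required before ')*'

for text in ["(#PCDATA)", "(#PCDATA)*", "(#PCDATA|Foo)*"]:
    print(f"{text!r:18} star={WITH_STAR.fullmatch(text) is not None} "
          f"plus={WITH_PLUS.fullmatch(text) is not None}")
```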

Received on Monday, 4 May 1998 22:59:55 UTC