References and prod. [39] "VC: Element Valid"

Prod. [39] "VC: Element Valid" seems to have a somewhat surprising
implication for the validity of documents that use character
references expanding to white space.  For example, the following
document from the XML test suite (ID = 'rmt-e2e-15g') is invalid:

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
]>
<foo><foo/>&#32;<foo/></foo>

while this document seems to be valid:

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY bar "<foo/>&#32;<foo/>">
]>
<foo>&bar;</foo>

The implications of this VC for implementing an XML parser are
considerable: character reference expansion has to be performed
during validation, not before it.  If character references are
expanded in an earlier, separate step, the information whether a
given white space character was written as a character reference or
literally is lost.  In contrast, entity reference expansion must take
place before validation.
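
To illustrate the point, here is a minimal Python sketch.  It is not
taken from any existing parser; the names tokenize_text,
allowed_in_element_content and expand_early are made up for this
example, and the check follows my reading of the VC as described
above.  Each piece of character data carries a flag recording whether
it came from a character reference, and the flag is gone once the
references have been expanded into a plain string.

```python
import re

# Matches decimal (&#32;) and hexadecimal (&#x20;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")
XML_WHITESPACE = set(" \t\r\n")


def tokenize_text(text):
    """Split character data into (chars, from_char_ref) pairs."""
    tokens = []
    pos = 0
    for m in CHAR_REF.finditer(text):
        if m.start() > pos:
            tokens.append((text[pos:m.start()], False))
        code = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        tokens.append((chr(code), True))
        pos = m.end()
    if pos < len(text):
        tokens.append((text[pos:], False))
    return tokens


def allowed_in_element_content(tokens):
    """Accept only literal white space between child elements."""
    return all(
        not from_ref and set(chars) <= XML_WHITESPACE
        for chars, from_ref in tokens
    )


def expand_early(text):
    """Expand references in a separate step; the flag is thrown away."""
    return "".join(chars for chars, _ in tokenize_text(text))
```

With this sketch, allowed_in_element_content(tokenize_text("&#32;"))
is false while the same call on a literal space is true, yet
expand_early returns the same one-character string for both inputs,
so a validator running after it cannot tell them apart.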

However, there is one exception to this rule.  For an element
declared EMPTY, validation has to take place before entity reference
expansion, while in all other cases it must take place after entity
reference expansion.  This is demonstrated by the following
(invalid) test case (ID = 'rmt-e2e-15a'):

<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ENTITY empty "">
]>
<foo>&empty;</foo>

Validation of <foo> cannot take place after expanding &empty;, because
the result, <foo></foo>, would be valid.  In the earlier example, by
contrast, <foo>&bar;</foo> must first be expanded before it can be
validated.
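
The difference in ordering can be shown with a small Python sketch.
The function names and the naive string replacement are assumptions
made for this example only; a real processor works on parsed
references, not on strings.

```python
def valid_empty_before_expansion(raw_content):
    """EMPTY check on the content as written, references unexpanded."""
    return raw_content == ""


def valid_empty_after_expansion(raw_content, entities):
    """EMPTY check after a naive textual expansion of entity references.

    `entities` maps entity names to replacement text (hypothetical helper
    structure, standing in for the declarations in the DTD).
    """
    expanded = raw_content
    for name, replacement in entities.items():
        expanded = expanded.replace("&" + name + ";", replacement)
    return expanded == ""


entities = {"empty": ""}
content = "&empty;"  # what stands between <foo> and </foo> in rmt-e2e-15a
```

Here valid_empty_before_expansion(content) is false, which matches the
expected result of the test case, while
valid_empty_after_expansion(content, entities) is true.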

I would like to advocate a change in prod. [39] "VC: Element Valid",
requiring validation to take place after all character and entity
references have been expanded.  The consequences would be very
limited: no formerly valid document would become invalid, and the
only formerly invalid documents that would become valid are those
using references that expand to white space or to an empty
replacement text, such as the following (currently invalid) document
from the XML test suite (ID = 'rmt-e2e-15h'):

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY space "&#38;#32;">
]>
<foo><foo/>&space;<foo/></foo>

On the other hand, the benefits of a changed VC for the design of XML
parsers would be substantial.  The whole parsing process could then be
split into two clearly distinct stages:  1. well-formedness testing
and reference expansion, 2. validation.
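
A rough Python sketch of such a two-stage design follows.  Stage 1
uses the standard library's xml.etree.ElementTree, a non-validating
parser that checks well-formedness and expands character references
and internal entities.  Stage 2 is a hand-written check, hard-coded
for the content model <!ELEMENT foo (foo*)> used in the examples; the
function names are invented for this illustration, and the check
implements the proposed rule, not the current VC.

```python
import xml.etree.ElementTree as ET

DOC_15G = """<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
]>
<foo><foo/>&#32;<foo/></foo>"""

DOC_15H = """<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY space "&#38;#32;">
]>
<foo><foo/>&space;<foo/></foo>"""


def stage1_parse(document):
    """Well-formedness check plus reference expansion.

    Raises xml.etree.ElementTree.ParseError if the document is not
    well-formed.
    """
    return ET.fromstring(document)


def stage2_validate(element):
    """Validate the expanded tree against <!ELEMENT foo (foo*)>."""
    if element.tag != "foo":
        return False
    # Character data before the first child and after each child.
    texts = [element.text] + [child.tail for child in element]
    if any(t and t.strip(" \t\r\n") for t in texts):
        return False  # non-white-space character data in element content
    return all(stage2_validate(child) for child in element)
```

Under the proposed rule, stage2_validate(stage1_parse(DOC_15G)) and
stage2_validate(stage1_parse(DOC_15H)) both accept the document,
because stage 2 sees only a white space character in either case and
no longer needs to know how it was written.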

Of course, this proposed modification would temporarily (until a fix
is released) break existing XML processor implementations (although I
suspect that not all prominent XML processors in fact implement this
VC correctly).  On the other hand, the modification would bring XML
closer to its design goal no. 4:  "It shall be easy to write programs
which process XML documents."

Dieter Köhler

Received on Wednesday, 22 September 2004 15:27:00 UTC