XHTML DTD revisited: entity declarations and the MSXML/XJParser

Apologies if this has been discussed before.

This is in reference to Message-Id: <199909151443.KAA00876@dark.brown.edu>,
archived at
http://lists.w3.org/Archives/Public/www-html/1999Sep/0026.html
The original poster complained about IE5 choking on an XHTML DTD. The
respondent speculated that the problem was not in the DTD. I ran into the
same behavior in IE5 recently and found that the cause is the way the XML
parser that IE5 uses handles a few entity declarations, combined with what
may be some redundancy in the DTD.

To demonstrate the behavior, simply attempt to view this XHTML document in
IE5:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/transitional.dtd">
<html>
  <head>
    <title>hello</title>
  </head>
  <body>
    <p>hello world</p>
  </body>
</html>

The problem occurs when parsing the DTD.
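
As a cross-check that the document instance itself is fine, here is a minimal
sketch, assuming Python 3 and its bundled expat-based xml.etree.ElementTree
module (which is not the parser IE5 uses). By default that module does not
fetch the external DTD named in the DOCTYPE, so this only tests the
well-formedness of the instance:

```python
import xml.etree.ElementTree as ET

# The sample document from above, as bytes so the encoding declaration applies.
DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/transitional.dtd">
<html>
  <head>
    <title>hello</title>
  </head>
  <body>
    <p>hello world</p>
  </body>
</html>
"""

# fromstring() raises ET.ParseError if the instance is not well-formed.
root = ET.fromstring(DOC)
paragraph = root.find("body/p")
print(root.tag, "/", paragraph.text)
```

If this parses cleanly, whatever IE5 objects to has to come from the DTD and
the entity sets it pulls in, not from the few lines of markup.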

IE5 uses the Datachannel-Microsoft XML Parser, which can also be obtained
independently at
http://msdn.microsoft.com/downloads/tools/xmlparser/xmlparser.asp (COM) or
http://msdn.microsoft.com/xml/IE4/jparser.asp (Java), or as the Datachannel
XJParser from http://xdev.datachannel.com. This parser treats the following
characters *and* their numeric character references specially: < > & " '

This reference explains what's going on, although I don't fully understand
it:
http://xdev.datachannel.com/downloads/xjparser/documentation/#pgfId-1001590

The consequences of this situation appear to be that if a DTD contains
<!ENTITY foo "&#38;">, the &#38; is going to be treated as the beginning of
an entity reference. Similarly, &#60; and &#62; (angle brackets) are going
to be treated as the beginning and end of tags. Single and double quotes
seem to be unaffected by this behavior.
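
For comparison, the same kind of declaration can be fed to a different
parser. This is a minimal sketch, assuming Python 3's expat-based
xml.etree.ElementTree; the helper name expand and the wrapper element p are
my own, and the result describes expat only, not the MSXML/XJParser:

```python
import xml.etree.ElementTree as ET

def expand(entity_value):
    """Declare <!ENTITY foo "..."> in an internal subset, reference it once,
    and return the resulting text, or None if the parser rejects the document."""
    doc = (
        '<!DOCTYPE p [<!ENTITY foo "%s">]>'
        '<p>&foo;</p>' % entity_value
    )
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None

# Try the three replacement values that the XHTML entity sets use.
for value in ("&#38;", "&#60;", "&#62;"):
    print(value, "->", repr(expand(value)))
```

With expat I would expect the first two to come back as None (rejected) and
only &#62; to come through as a plain >. That would suggest that the IE5
behavior for & and < is not unique to one parser, though a second parser
agreeing does not by itself settle what the specification requires.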

The XHTML DTDs refer to a set of entity declarations that include the
following:
<!ENTITY amp     "&#38;"> <!--  ampersand, U+0026 ISOnum -->
<!ENTITY lt      "&#60;"> <!--  less-than sign, U+003C ISOnum -->
<!ENTITY gt      "&#62;"> <!--  greater-than sign, U+003E ISOnum -->

The parser will not allow &amp;, &lt; or &gt; to be redefined anyway, so
simply removing these declarations will allow the parser to function. The
other "solution" is to replace the & in each character reference with &amp;,
like this:
<!ENTITY amp     "&amp;#38;"> <!--  ampersand, U+0026 ISOnum -->
<!ENTITY lt      "&amp;#60;"> <!--  less-than sign, U+003C ISOnum -->
<!ENTITY gt      "&amp;#62;"> <!--  greater-than sign, U+003E ISOnum -->

http://msdn.microsoft.com/xml/general/xmlfaq.asp#issues-Entities suggests
using a DTD that defines the HTML entities:
http://msdn.microsoft.com/xml/general/htmlentities.dtd
Take a look at this DTD and you will see that it uses both solutions: &amp;,
&gt; and &lt; are not redefined at all, but it defines the unnecessary
&AMP;, &GT; and &LT; by putting &amp; in the replacement text.


So I have the following questions:

1. Is the MS/Datachannel XML parser violating XML 1.0 by not allowing &amp;
&lt; or &gt; to be redefined? (I would think not, as their immutability is
crucial to the operation of an XML parser).

2. Is the MS/Datachannel XML parser violating XML 1.0 by treating &#38;
&#60; and &#62; in entity replacement text as if they were markup?

3. What does "&amp;#38;" as replacement text mean -- 1 character '&', or 5
characters '&#38;', or 9 characters '&amp;#38;'? Is the suggested approach
of using "&amp;#38;" as the entity replacement text wrong?

4. Is it redundant/unnecessary to have entity declarations for these
characters in XHTML at all, since XHTML is XML, and thus must have those
immutable entities defined by default?

-Mike

Received on Thursday, 28 October 1999 13:09:18 UTC