[whatwg] External document subset support

I really like this proposal, because entities are not the only thing
you can do with DTDs. You also get attribute tokenization and
normalization, attribute defaulting, and content models.
In particular, people in this group often say that namespaces are
difficult for authors to use. Given the appropriate DOCTYPE
declaration (for example XHTML11 plus MathML 2 plus SVG11), namespaces
and their attributes are no longer a problem for authors, since the
DTD can supply the xmlns attributes as defaults.
Secondly, attribute normalization at the language level should provide
consistent processing of special attributes (id and class in
XHTML10/11).
Further, content models could be used for warnings in the developer
console (though XML schemas are probably better here) and surely could
be used for better well-formedness error messages. E.g., an unclosed
<img> tag would be reported immediately after the opening tag, and not
at the location of the parent's close tag. (This only applies if the
XML fragment is not well-formed.)
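
To make the attribute defaulting point concrete, here is a minimal
sketch in Python. It only uses an internal subset, and the element
name "note" and the attribute "lang" are made up for the illustration;
it is not meant to show how a browser would implement this.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal subset declares a default value
# for the "lang" attribute of <note>; the element itself does not
# specify that attribute.
DOC = """<!DOCTYPE note [
  <!ATTLIST note lang CDATA "en">
]>
<note>hello</note>"""

root = ET.fromstring(DOC)

# The parser reports the defaulted attribute as if the author had
# written it on the element.
defaulted_lang = root.get("lang")
print(defaulted_lang)
```

The author never wrote lang="en", yet the application sees it. A DTD
in the external subset could carry such declarations for a whole
language instead of repeating them in every document.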

On the other hand, we have legacy XML content and the fact that many
pages refer directly to W3C DTDs. Luckily, the XML specification has a
feature to allow the page to indicate that external declarations are
not needed: the "standalone" declaration.
- standalone=yes means that no external subset is needed, nor are
external entities. Processing of internal subsets stops at the
first unread (external) parameter entity. General entity references
(other than amp,gt,lt,quot,apos and those declared in the internal
subset) are a well-formedness error. This is the minimum required
behaviour of a non-validating parser.
- standalone=no means that this document relies on external data, and
cannot be processed without such data. All subsets must be read and
processed (including attribute and element declarations) and all
parameter entities resolved (either internal or external). External
general entities referenced in the document are replaced with the
appropriate content.
- no standalone declaration could mean "standalone=yes" (not
conforming with XML), "standalone=no" (not backward compatible) or
could mean a third way, such that only internal entities and entities
with a known public identifier are used. The DOCTYPE is processed if
and only if it is a known entity and there are no unread parameter
entities in the internal subset.
Entity retrieval is based on the public identifier, if that is known to
the application, or on the system identifier if "standalone=no".
Entities that cannot be retrieved (because of network errors or
unsupported/malformed IRIs) are handled as follows: a general entity
is kept as an EntityReference node in the DOM (this means that the
ampersand followed by the entity name followed by a semicolon is
rendered, as per XHTML1.0), while a parameter entity stops the
processing of the DTD.
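
The three cases above can be sketched as a small decision function.
This is only an illustration in Python under my own assumptions: the
function name, the "catalog"/"network" labels and the contents of the
catalogue are hypothetical, and the "third way" is just one possible
reading of the missing standalone declaration.

```python
from typing import Optional

# Hypothetical built-in catalogue: public identifiers for which the
# application ships a local copy (the entries are only examples).
KNOWN_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.1//EN",
})


def external_subset_source(
    standalone: Optional[str],
    public_id: Optional[str],
    system_id: Optional[str],
    internal_subset_complete: bool = True,
) -> Optional[str]:
    """Say where the external subset would come from, if anywhere.

    standalone is "yes", "no" or None (no declaration).
    internal_subset_complete is False when the internal subset
    contains an unread parameter entity.
    Returns "catalog", "network" or None (do not read it).
    """
    known = public_id in KNOWN_PUBLIC_IDS

    if standalone == "yes":
        # Nothing external is needed.
        return None

    if standalone == "no":
        # The document relies on external data: prefer the local copy,
        # otherwise fall back to the system identifier.
        if known:
            return "catalog"
        return "network" if system_id else None

    # No standalone declaration: only known entities are used, and only
    # if the internal subset has been read completely.
    if known and internal_subset_complete:
        return "catalog"
    return None
```

With this reading, a legacy page that has no standalone declaration
and points at a W3C DTD never causes a network request: it either uses
the local copy or skips the external subset.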

This proposal should solve many of the problems described above,
unlocking the full potential of XML1.0 while avoiding a DoS on w3.org
and keeping existing content working.

Giovanni

Received on Monday, 25 May 2009 08:19:50 UTC