[whatwg] External document subset support from Brett Zamir on 2009-06-19 (public-whatwg-archive@w3.org from June 2009)

From: Brett Zamir <brettz9@yahoo.com>
Date: Fri, 19 Jun 2009 11:53:04 +0800
Message-ID: <4A3B0BA0.3070607@yahoo.com>
Ian Hickson wrote:
> On Mon, 18 May 2009, Brett Zamir wrote:
>    
>> Section 10.1, "Writing XHTML documents" observes: "According to the XML
>> specification, XML processors are not guaranteed to process the external
>> DTD subset referenced in the DOCTYPE."
>>
>> While this is true, since no doubt the majority of web browsers are
>> already able to process external stylesheets or scripts, might the very
>> useful feature of external entity files, be employed by XHTML 5 as a
>> stricter subset of XML (similar to how XML Namespaces re-annexed the
>> colon character) in order to allow this useful feature to work for XHTML
>> (to have access to HTML entities or other useful entities for one, as
>> well as enable a poor man's localization, etc.)?
>>      
>
> While there are arguments on both sides of whether this is a good idea or
> not, I think the more important concern in this case is whether we can
> extend XML in this way. I think in practice we should leave this up to the
> XML specs and their successors. I don't think it would be appropriate for
> us to profile the XML spec in this way.
>
>    

While it is not my purpose to extend the debate on external DTDs, I 
wanted to bring up the following points (brought to light by a recent 
re-review of the spec) because they raise a few serious issues on which 
I believe current browsers are failing. If the browsers do not address 
these issues, claims of real XHTML 5 support (as with XHTML 1.* and 
plain XML support) will be unworkable. While I agree that any changes 
to XML itself should be up to the XML specs, from what I can now tell, 
closer adherence to the existing spec would solve most of the existing 
problems, provided the browsers make the required changes.

I was pleasantly surprised to find that the spec seems to recommend 
solutions which I believe avoid the more serious issue of single point 
of failure problems.

(The other complaints about DTDs, such as avoiding cross-domain DTDs for 
the sake of security or to avoid DoS attacks, could be handled as an 
optional restriction if that, in combination with adhering to existing 
recommendations, would satisfy concerns, though I personally do not 
think the risk is comparable to that of including cross-domain scripts.)

So what follows is what I have gleaned from these various statements as 
applied to current browsers. I can provide specific citations, but I did 
not wish to expand this post unnecessarily (though I list references at 
the end).

The major issues which I think certain browsers ought to resolve, since 
their current behavior does not seem to be in accord with the XML spec 
and, as a result, creates interoperability problems:

1) Firefox and WebKit should not treat a missing entity as a single 
point of failure, as they do now (unless they switch to a validating 
parser which finds no declaration in the external file and the user is 
in validation mode), since such failures in a document with an external 
DTD are NOT well-formedness errors unless the document deliberately 
declares standalone="yes".
2) Explorer, which as of IE8 no longer seems to require that the 
document be completely described by the DTD, as I believe it did earlier 
(though it will report errors if the document violates rules which are 
specified), should, per the spec, really only report validation errors 
at user option (ideally, I would say, off by default, and activatable 
both case by case and by preference). Being able to disable the option 
would possibly speed things up, as well as let the browser work with 
documents which violate validity constraints. But this issue is not as 
serious as #1, since #1 prevents even valid documents from being 
interoperably viewed on the web.
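
To make the distinction in #1 concrete, here is a minimal sketch using 
Python's expat bindings as a stand-in for a non-validating processor (it 
is not what any browser actually runs). The file name "entities.dtd", 
the entity name myEnt, and the helper parse() are only illustrative 
assumptions. The external DTD is never loaded here, yet the undeclared 
reference is only fatal when there is no external subset or when the 
document declares standalone="yes":

```python
from xml.parsers import expat


def parse(document):
    """Parse with expat (non-validating).

    Returns (names of skipped entities, error message or None).
    """
    skipped = []
    parser = expat.ParserCreate()
    # Called when a reference cannot be expanded because its
    # declaration may live in an external subset that was not read.
    parser.SkippedEntityHandler = (
        lambda name, is_parameter_entity: skipped.append(name)
    )
    try:
        parser.Parse(document, True)
    except expat.ExpatError as err:
        return skipped, expat.ErrorString(err.code)
    return skipped, None


# Hypothetical document: "entities.dtd" is never fetched in this sketch.
BODY = '<!DOCTYPE html SYSTEM "entities.dtd">\n<html>&myEnt;</html>'

# External subset declared, not standalone: reported as skipped, not fatal.
external = parse(BODY)

# Same document declaring standalone="yes": a well-formedness error.
standalone = parse('<?xml version="1.0" standalone="yes"?>\n' + BODY)

# No DTD at all: also a well-formedness error.
no_dtd = parse('<html>&myEnt;</html>')

print(external, standalone, no_dtd)
```

The first case is the one where, as I read the spec, a browser should 
carry on rather than halt on the missing entity.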

If these issues are addressed by those aiming for compliance, the only 
disadvantages which will remain (and which are inherent in XML by 
allowing the co-existence of validating and non-validating parsers) are 
those issues described in http://www.w3.org/TR/REC-xml/#safe-behavior 
and http://www.w3.org/TR/REC-xml/#proc-types , namely that:

1) Some (entity-related) /well-formedness/ errors (e.g., if an entity is 
not defined but is used) will go undetected by a non-validating parser, 
since such a parser need not load the entity's replacement text. This is 
not a big problem, since a document author should presumably have 
checked (with an application which does external entity substitution) 
that their entities integrate properly with the text. It is less 
important that they check for /validation/ errors, since, as mentioned 
above, these need only be reported optionally.
2) The application may possibly not be notified by its processor of, 
e.g., entity replacement values, if it is a non-validating processor 
(though non-validating processors can also make such replacements). But 
since these are, as mentioned above, not to produce well-formedness 
errors, there is no single point of failure here either (though some 
content may be missing, with an entity reference shown in its place in 
the output display).
3) A few validation issues, such as duplicate declarations (which might 
include attribute defaults) can lead to undefined behavior (though given 
that validation is only optional even for validating applications, it 
seems all applications will have to deal with this).
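
As a rough illustration of point 2 (missing content indicated in the 
output rather than a fatal error), the following sketch again uses 
Python's expat bindings as a stand-in non-validating processor. The 
marker format, the name render_text(), and the file name "entities.dtd" 
are my own assumptions, not anything a browser does today. The unread 
entity is shown with a visible marker, which also keeps it distinct from 
literal text written with an escaped ampersand:

```python
from xml.parsers import expat


def render_text(document):
    """Return the document's text, marking references left unresolved."""
    parts = []
    parser = expat.ParserCreate()
    # Ordinary text, including the "&" produced by &amp;.
    parser.CharacterDataHandler = parts.append
    # References whose declaration was not read (external DTD not loaded).
    parser.SkippedEntityHandler = (
        lambda name, is_parameter_entity:
            parts.append("[unresolved entity: %s]" % name)
    )
    parser.Parse(document, True)
    return "".join(parts)


# Hypothetical document; "entities.dtd" is never fetched here.
DOC = ('<!DOCTYPE p SYSTEM "entities.dtd">'
       '<p>&myEnt; versus &amp;myEnt;</p>')

text = render_text(DOC)
print(text)
```

The rest of the paragraph remains readable, and the reader can tell that 
something was left out at that spot.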

In other words, as the spec seems to indicate from my reading, users 
going from one browser to the other will not face problems, unless:
1) They visit invalid documents and have the option to validate the 
document turned on (it is only supposed to be an option) and expect 
other browsers to report the same errors as well (not a big issue, since 
a document which describes its validation constraints and then breaks 
them is basically asking for trouble--and even here, the user is 
supposed to have the option to view the document without validation).
2) They expect to see the entity replacement text (and at least, this is 
not a single point of failure, and in many cases, such as when entities 
are merely used to represent symbols, the text can be fully read without 
any disruption in the document flow). Of course, doing the replacements 
would be even better to avoid this problem, and the solution does not 
require supporting validation.

There are also the following optional issues which browsers might wish 
to consider (though if these are not implemented, the above fixes alone 
would address the most serious problems):

1) Since even a non-validating processor is to inform the application 
that it recognized but did not read an entity (if it does not replace 
the references with content found in an external DTD), a browser like 
Opera (the only one which, as far as I can tell, does not report such 
issues, even though it correctly avoids a single point of failure) might 
(if not implementing #2 below) wish to consider doing so, since a 
compliant processor is at least supposed to report such issues to the 
application (to do with them as it sees fit). Admittedly, the 
application is under no obligation to act on the report, and in any 
case, such reporting is not to be a single point of failure. But it 
still might be nice to distinguish the display of entities which are not 
found from deliberately escaped entities (e.g., the &myEnt; produced by 
a missing entity currently appears the same in Opera (except in source 
view) as a deliberately escaped &amp;myEnt;).
2) Opera, Firefox, and Webkit (after the latter two fix the more serious 
issue mentioned above) might also wish to consider expanding their XML 
support for their users to:
     a) Show a link to optionally expand each external parsed entity 
reference or other entity (if they don't do the following)
     b) Build on a non-validating parser to do automatic entity and 
default attribute value replacement, and attribute value normalization 
using an external DTD (at least a same-domain one). The XML spec only 
warns against relying on this for the sake of an application having the 
freedom to switch between non-validating parsers which may or may not 
all take these actions--this issue doesn't impact interoperability for 
users (it only improves it), however, so even if there is no desire to 
support validation, they can still offer entity replacement, etc. to 
their users.
     c) Implement a validating parser which can do entity and default 
attribute value replacement, and attribute value normalization from an 
external DTD, as well as optionally validate the document at user 
discretion. This should not slow things down for the user, since the 
spec itself indicates that reporting of validation errors is required 
"at user option". This would give the user the best of both worlds--the 
opportunity to fully read XML/XHTML files online (and without any 
requirement to face a validation performance cost), and if they are, for 
example, a document author, they could choose to take a client-side 
performance hit to optionally check for validation. Of course, they'll 
need to load the external files in either of these cases to be able to 
do the replacements, but the document author will NOT need to provide 
full DTD validation in the external DTD, so users will not be forced to 
download DTDs reflecting the whole document structure (unless the 
document author wishes to reference such files). Indeed, authors might 
be encouraged not to include such content in their DTDs (performing 
validation offline) so that they and their users can reduce bandwidth, 
unless their purpose is to transparently show the validation (though DTD 
validation is of course not very strong).
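
For what option 2b could look like, here is a hedged sketch, once more 
with Python's expat bindings standing in for a non-validating parser. 
The EXTERNAL_DTDS dictionary is a hypothetical substitute for a 
same-domain fetch, and the names parse_with_external_dtd() and 
load_external() are mine. The parser reads only an entity declaration 
from the external subset, does the replacement, and performs no 
validation at all:

```python
from xml.parsers import expat

# Hypothetical stand-in for fetching a same-domain external DTD subset.
# Note that it declares one entity and nothing about document structure.
EXTERNAL_DTDS = {"entities.dtd": '<!ENTITY myEnt "replacement text">'}


def parse_with_external_dtd(document):
    """Return the document's text with external-DTD entities expanded."""
    parts = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = parts.append
    # Ask expat to pass the external subset to ExternalEntityRefHandler.
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE
    )

    def load_external(context, base, system_id, public_id):
        dtd_text = EXTERNAL_DTDS.get(system_id)
        if dtd_text is None:
            return 1  # unavailable: continue parsing without it
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(dtd_text, True)
        return 1

    parser.ExternalEntityRefHandler = load_external
    parser.Parse(document, True)
    return "".join(parts)


text = parse_with_external_dtd(
    '<!DOCTYPE p SYSTEM "entities.dtd"><p>Say: &myEnt;</p>'
)
print(text)
```

The external file stays small because it carries only the entity 
declarations the author wants to share, not a full content model.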

References:
http://www.w3.org/TR/REC-xml/#wf-entdeclared
http://www.w3.org/TR/REC-xml/#proc-types
http://www.w3.org/TR/REC-xml/#safe-behavior
http://www.w3.org/TR/REC-xml/#dt-vc (validity constraint definition)
http://www.w3.org/TR/REC-xml/#include-if-valid

regards,
Brett

Received on Thursday, 18 June 2009 20:53:04 UTC