On Henry's comment about documents with DOCTYPE but without markup declaration

I am replying on the XML Core WG list to a message Henry
wrote to the xml-editor list.  My action item was to draft
a response to the commenter, drawing on Henry's comments
as well as others'.

I am also posting my current draft response, but in this
message I comment on Henry's comments.

Henry's message comes first, followed by my comments on it.


On 2014-01-21 12:27, Henry S. Thompson wrote:
> Leif Halvard Silli writes:
>
>> A document that lacks DTD is simply "not valid"
>> <http://www.w3.org/TR/REC-xml/#sec-prolog-dtd>. And, as not valid,
>> whether it has validation errors is a question that is out of the
>> question.
> I presume you're referring here to these lines near the beginning:
>
>    [Definition: XML documents SHOULD begin with an XML declaration
>    which specifies the version of XML being used.] For example, the
>    following is a complete XML document, _well-formed_ but not _valid_:
>
>    <?xml version="1.0"?>
>    <greeting>Hello, world!</greeting>
>
>    and so is this:
>
>    <greeting>Hello, world!</greeting>
>
>    [emphasis in original]
>
> It's not *valid*, but it's not *invalid* either:
>
>    XML provides a mechanism, the document type declaration, to define
>    constraints on the logical structure and to support the use of
>    predefined storage units. [Definition: An XML document is *valid* if
>    it has an associated document type declaration and if the document
>    complies with the constraints expressed in it.]
>
> Each of your examples, i.e.
>
>    <!DOCTYPE html>
>    <html/>
> and
>    <!DOCTYPE html SYSTEM "about:legacy-compat">
>    <html/>
>
> clearly does have an "associated document type declaration", and equally
> clearly contain "failures to fulfill the validity constraints given in
> this specification" [1], so I conclude they are not only not valid,
> but invalid (although that, interestingly, is not a term defined in
> the spec.  What we find at [1] is an obligation on *validating
> processors* to _report_ "failures to fulfill the validity constraints
> given in this specification".)
>
> The validity constraint they both fail to fulfill is VC: Element Valid [2],
> which requires a declaration for every element in a document.
>
> It's unfortunate that the definition of *valid* is less explicit than
> the definition of conforming validating processor, but my guess is
> that the way the Core WG is most likely to fix that is by making the
> definition of *valid* stronger, not by making the Conformance section
> weaker.
>
> It would be possible to expand the definition of *validating
> processors* to be clearer about their responsibilities in the absence
> of a document type declaration, and that might be a good idea.
>
> It would also probably be a good idea to clarify that as things stand
>
>    <!DOCTYPE html>
>    <html/>
>
> is, using the usual convention, _invalid_, where
>
>    <html/>
>
> is neither valid _nor_ invalid, and to provide a definition of
> 'invalid' as "given a document type declaration, violating one or more
> of the constraints expressed by the declarations in the DTD, and
> failing to fulfill one or more of the validity constraints given in
> this specification".
>
> But to take account of the behaviour you cite of xmllint,
> likewise of rxp,
> (which treat the two cases above, and the even simpler
>   <html/>
> case, all as instances of an idiosyncratic validity error w/o
> precedent in the XML spec.), we would have to define what it meant to
> have an _empty_ document type declaration, which would be rather more
> difficult, and potentially backward incompatible.
>
> Consider, for example
>
>    <!DOCTYPE html []>
>    <html/>
>
> which causes both to report the 'ordinary' undeclared element error, but
> xmllint to complain of a missing DTD.
>
> Note also that
>
>    <!DOCTYPE html>
>    <hmtl/>
>
> _is_ invalid, and we wouldn't want to lose that. . .
>
> ht
>
> [1] http://www.w3.org/TR/REC-xml/#sec-conformance
> [2] http://www.w3.org/TR/REC-xml/#elementvalid

On the telcon, Henry indicated (as he outlines above) that he
felt there may be something we could do to the XML spec to
improve it in this regard.

But I'm not sure I quite follow, and for the most part I don't
think I agree.

I don't understand or particularly like the idea of introducing
the term "invalid".  I gather from what Henry says above that
he is using the term to mean "not well-formed" (or just not XML)
when he says that a well-formed (but not valid) document is
neither valid nor invalid.  But I don't see the point of
introducing "invalid" to mean "not well-formed" at this late
date.

On the other hand, Henry says that <!DOCTYPE html><html/> is
"invalid", which confuses me, since that document is well-formed.

I don't think the XML spec should dictate when a validating
parser versus a non-validating parser should be used.  (I'm
not saying Henry suggests that either; I'm just trying to
outline what changes we might consider to the spec.)

What we have now in the spec is:

  Definition: An XML document is valid if it has an associated
  document type declaration and if the document complies with
  the constraints expressed in it.

  . . .

  validity constraint

   [Definition: A rule which applies to all valid XML documents.
   Violations of validity constraints are errors; they MUST, at
   user option, be reported by validating XML processors.]

Per the definition of validity constraint, a document is not
valid if it violates a validity constraint.  Perhaps that could
be made clearer in the definition of "valid" by augmenting its
definition to say:

  Definition: An XML document is valid if it has an associated
  document type declaration and if the document complies with
  the constraints expressed in it and the document violates no
  validity constraints.

Other than that, I don't see that the current issue leads us
to any other worthwhile changes to the spec.
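
To make the cases under discussion concrete, here is a rough Python
sketch of the three-way distinction as Henry describes it above.  It
is only an illustration, not anything from the spec: the function name
classify() and its return strings are made up for this message, it looks
only at element declarations in the internal subset (an external subset
is never fetched), and it checks only the "every element must be
declared" part of VC: Element Valid, so it is in no sense a validator.

```python
import xml.parsers.expat


def classify(document: str) -> str:
    """Hypothetical helper: sort a document into the cases discussed above."""
    has_doctype = False
    declared = set()  # element types declared in the internal subset
    used = set()      # element types that actually occur in the document

    def on_doctype(name, system_id, public_id, has_internal_subset):
        nonlocal has_doctype
        has_doctype = True

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError:
        return "not well-formed"

    if not has_doctype:
        # No document type declaration at all: Henry's "neither" case.
        return "neither valid nor invalid"
    if used - declared:
        # A DOCTYPE is present, but some element has no declaration.
        return "invalid"
    # Assumption: this only means the one check above passed.
    return "no undeclared elements found"


for doc in (
    "<html/>",
    "<!DOCTYPE html><html/>",
    '<!DOCTYPE html SYSTEM "about:legacy-compat"><html/>',
    "<!DOCTYPE html [<!ELEMENT html EMPTY>]><html/>",
):
    print(classify(doc), "<-", doc)
```

Under those assumptions, the two DOCTYPE examples from the original
comment land in the "invalid" bucket and the bare <html/> lands in
"neither", which is the distinction Henry draws.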

paul

Received on Monday, 27 January 2014 22:33:34 UTC