Re: Clarify that documents with DOCTYPE but without markup declaration are not subject to validation

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 6 Feb 2014 04:38:36 +0100
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
Cc: Jirka Kosek <jirka@kosek.cz>, xml-editor@w3.org
Message-ID: <20140206043836709942.b5ba32f5@xn--mlform-iua.no>
Sorry for the delayed answer. See below.

Henry S. Thompson, Tue, 21 Jan 2014 18:27:29 +0000:
> Leif Halvard Silli writes:
> 
>> A document that lacks DTD is simply ”not valid”
>> <http://www.w3.org/TR/REC-xml/#sec-prolog-dtd>. And, as not valid,
>> whether it has validation errors is a question that is out of the
>> question.
> 
> I presume you're referring here to these lines near the beginning:
> 
>   [Definition: XML documents SHOULD begin with an XML declaration
>   which specifies the version of XML being used.] For example, the
>   following is a complete XML document, _well-formed_ but not _valid_:
> 
>   <?xml version="1.0"?>
>   <greeting>Hello, world!</greeting> 
> 
>   and so is this:
> 
>   <greeting>Hello, world!</greeting>
> 
>   [emphasis in original]

But, it is pretty obvious - to me - that what that section wants to 
point out is that the *XML* declaration has nothing to do with ”valid” 
or ”not valid”. Nor has it anything to do with well-formed or not 
well-formed.

> It's not *valid*, but it's not *invalid* either:

What is your point here? Is there a third category, you say? What should 
a validating XML processor say if it parses the above document? 

It has always been pretty obvious - to me - that XML avoids ”invalid” 
simply because “invalid” has so many negative - and wrong - 
connotations. ”Valid” simply means ”conforming to a spec [expressed 
via DTD grammar]”. And ”not valid” thus simply means that it does not 
conform to a spec expressed via a DTD grammar.

Thus the document above *is* invalid because invalid is just XML’s 
unspeakable synonym for ”not valid”.

>   XML provides a mechanism, the document type declaration, to define
>   constraints on the logical structure and to support the use of
>   predefined storage units. [Definition: An XML document is *valid* if
>   it has an associated document type declaration and if the document
>   complies with the constraints expressed in it.]
> 
> Each of your examples, i.e.
> 
>   <!DOCTYPE html>
>   <html/>
> and
>   <!DOCTYPE html SYSTEM "about:legacy-compat">
>   <html/>
> 
> clearly does have an "associated document type declaration", and equally
> clearly contain "failures to fulfill the validity constraints given in
> this specification" [1], so I conclude they are not only not valid,
> but invalid (although that, interestingly, is not a term defined in
> the spec.

The first validity constraint expressed in XML is that the DOCUMENT has 
a *DTD*. A grammar. A DOCTYPE without a grammar has no grammar. It is 
just an empty shell.

>  What we find at [1] is an obligation on *validating
> processors* to _report_ "failures to fulfill the validity constraints
> given in this specification".)

What we also find there is an emphasis on the fact that, quote: ”it is 
possible to construct a well-formed document containing a doctypedecl 
that neither points to an external subset nor contains an internal 
subset”. Clearly, such a document would be ”well-formed” but also 
”not valid”.

Note how the spec here says ”doctypedecl” - it refers to the formal 
grammar. I read this as deliberately *avoiding* the term ”document type 
declaration”.

Which is logical, when we consider that the spec, a little before that 
quote, says (my emphasis): ”The XML document type declaration 
**contains** or **points** to markup declarations that provide a 
grammar for a class of documents”. Neither of my examples contains or 
points to any such declarations. (No, about:legacy-compat is a URL that 
points to nowhere, so there is no DTD file anywhere, not even an empty 
one.)

> The validity constraint they both fail to fulfill is VC: Element Valid [2],
> which requires a declaration for every element in a document.

That requirement is likewise not met by ”<greeting>Hello, 
world!</greeting>”.

> It's unfortunate that the definition of *valid* is less explicit than
> the definition of conforming validating processor, but my guess is
> that the way the Core WG is most likely to fix that is by making the
> definition of *valid* stronger, not by making the Conformance section
> weaker.

I have not suggested making the conformance section weaker. My 
understanding is that you seek to insert a third category, while the 
XML spec has only ever had two categories.

> It would be possible to expand the definition of *validating
> processors* to be clearer about their responsibilities in the absence
> of a document type declaration, and that might be a good idea.
> 
> It would also probably be a good idea to clarify that as things stand
> 
>   <!DOCTYPE html>
>   <html/>
> 
> is, using the usual convention, _invalid_, where
> 
>   <html/>
> 
> is neither valid _nor_ invalid, and to provide a definition of
> 'invalid' as "given a document type declaration, violating one or more
> of the constraints expressed by the declarations in the DTD, and
> failing to fulfill one or more of the validity constraints given in
> this specification".

If so, then my message would seem to have resulted in the opposite of 
my intention.

What is the benefit of this proposal of yours? I see none. It would 
only seem to strengthen the belief that it is correct to use an empty 
DOCTYPE declaration as a trigger for switching on the XML 1.0 
validating processor mode.

Because, in my case, I have a tool which supports both XSD and DTD. XSD 
mode can be triggered by the very presence of an XHTML namespace 
declaration. However, as soon as my tool notices the HTML 5 doctype (the 
short variant) it disables its XSD feature and starts its validation 
mode.
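
What I would prefer instead can be sketched as follows. This is a 
hypothetical policy written with Python's standard expat bindings, not 
the actual logic of my tool or of any other. The function name, the 
mode labels and the special-casing of about:legacy-compat are all 
assumptions made for illustration. The idea is that DTD mode is chosen 
only when the doctypedecl contains or points to something, so that a 
bare HTML 5 doctype leaves the namespace trigger in charge.

```python
import xml.parsers.expat

XHTML_NS = "http://www.w3.org/1999/xhtml"


def choose_mode(xml_text):
    """Pick a checking mode from the doctypedecl and the root namespace."""
    info = {"has_grammar_ref": False, "root_ns": None, "seen_root": False}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Assumption: about:legacy-compat points nowhere, so it is ignored.
        info["has_grammar_ref"] = bool(has_internal_subset) or (
            system_id is not None and system_id != "about:legacy-compat")

    def on_start(name, attrs):
        if not info["seen_root"]:
            info["seen_root"] = True
            # With namespace_separator=" ", a namespaced name is "uri local".
            info["root_ns"] = name.rsplit(" ", 1)[0] if " " in name else None

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)

    if info["has_grammar_ref"]:
        return "dtd"
    if info["root_ns"] == XHTML_NS:
        return "xsd"
    return "well-formedness-only"


print(choose_mode('<!DOCTYPE html>\n<html xmlns="%s"/>' % XHTML_NS))
```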

> But to take account of the behaviour you cite of xmllint,
> likewise of rxp,
> (which treat the two cases above, and the even simpler
>  <html/>
> case, all as instances of an idiosyncratic validity error w/o
> precedent in the XML spec.), we would have to define what it meant to
> have an _empty_ document type declaration, which would be rather more
> difficult, and potentially backward incompatible.
> 
> Consider, for example
> 
>   <!DOCTYPE html []>
>   <html/>
> 
> which causes both report the 'ordinary' undeclared element error, but
> xmllint to cmplain of a missing DTD.

Which is an OK complaint provided the user/author knows that 
xmllint/cmplain runs in XML 1.0 validation mode!

> Note also that
> 
>   <!DOCTYPE html>
>   <hmtl/>
> 
> _is_ invalid, and we wouldn't want to lose that. . .

It is “not valid”. If it is invalid then it is only in the ”not valid” 
sense. I believe I have not proposed anything that could make us lose 
the fact that it is not valid.

> [1] http://www.w3.org/TR/REC-xml/#sec-conformance

> [2] http://www.w3.org/TR/REC-xml/#elementvalid

-- 
leif halvard silli
Received on Thursday, 6 February 2014 03:39:06 UTC