Re: Proposal: Don't do XML wf-ness checks on text/html pages [was: Validator timeout and XML-LibXML bug] from Michael(tm) Smith on 2010-06-20 (public-qa-dev@w3.org from June 2010)

From: Michael(tm) Smith <mike@w3.org>
Date: Sun, 20 Jun 2010 13:05:25 +0900
To: Ville Skyttä <ville.skytta@iki.fi>
Cc: public-qa-dev@w3.org, ted@w3.org, Dominique Hazael-Massieux <dom@w3.org>, jean-gui@w3.org, tgambet@w3.org
Message-ID: <20100620040519.GB6370@sideshowbarker>

Ville Skyttä <ville.skytta@iki.fi>, 2010-06-19 00:06 +0300:

> On Friday 18 June 2010, Michael(tm) Smith wrote:
> 
> > About the idea of disabling XML wellformedness checks, I want to
> > raise something for discussion here that I've already also brought
> > up off-list, which is: I don't think we should do XML
> > wellformedness checking on pages that are served as text/html.
> 
> If we don't do that (for non-XML docs or at all) and leave it to 
> SGML::Parser::OpenSP, validator will be bitten by OpenSP's XML limitations.  I 
> gather this is pretty much the reason the "extra" XML wellformedness check 
> exists in the first place; it was added in April 2007, in validator 0.8.0 beta 
> 1.  More info: http://openjade.sourceforge.net/doc/xml.htm

Looking at that one-by-one:

- "XML constrains processing instructions with a target matching
  [Xx][Mm][Ll], both in terms of where they can occur and their
  content."

  That one, to me, does not seem important enough to justify
  adding an additional dependency (on XML::LibXML or whatever).

- "XML does not allow a parameter separator that is adjacent to a
  delimiter to be omitted."

  I don't know what that means. I see it's mentioned also in
  http://www.w3.org/TR/NOTE-sgml-xml-971215 but I still don't
  know what it means.

- "XML has constraints on the use of & in parameter literals. In
  SGML terms, XML says that the ero delimiter is recognized in a
  parameter literal, and that it must be followed by an entity
  reference, but the entity reference is not expanded."

- "Line ends are normalized using SGML conventions to a CR/LF
  character pair rather than using the XML convention of a single
  LF character."

  I think that does not make any difference as far as validation
  is concerned.

> One example of this is that XHTML documents (no matter with what content type 
> they are served with) containing something like:
> 
>     <p id="foo"class="bar">

That's an error in the non-XML HTML language, as well -- even in
HTML4. The HTML4 spec says:

  http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
  Any number of (legal) attribute value pairs, separated by
  spaces, may appear in an element's start tag.

I realize that's not how normative conformance requirements are
typically stated these days, but I suspect there are other
nominal HTML4 requirements which the validator is already
enforcing that are stated in the HTML4 spec itself less clearly
than that.

> (missing space between "foo" and class) will start to go unnoticed and 
> declared valid by the validator,

Couldn't we patch our copy of OpenSP to always report a lack of
space between attributes as an error? Isn't that something that
could be caught and reported in the lexer/tokenizer (or
whatever else it might be called in SGML terms) part of the OpenSP
code? (Rather than introducing another dependency on XML::LibXML
or whatever.)
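
A minimal sketch of the kind of tokenizer-level check meant here,
written in Python purely for illustration. It is not OpenSP code
(OpenSP is C++), and the function name missing_attr_space is
hypothetical. It assumes attribute values are quoted and it does
not skip comments, CDATA sections or marked sections, so it only
shows the idea: flag any name character that directly follows the
closing quote of an attribute value.

```python
def missing_attr_space(markup):
    """Return offsets where an attribute name directly follows a
    quoted attribute value with no whitespace in between.

    Illustrative only: comments, CDATA and marked sections are
    not handled.
    """
    offsets = []
    i, n = 0, len(markup)
    while i < n:
        # A start tag begins with '<' followed by a letter.
        if markup[i] == "<" and i + 1 < n and markup[i + 1].isalpha():
            i += 1
            while i < n and markup[i] != ">":
                ch = markup[i]
                if ch in "\"'":
                    # Skip over the quoted attribute value.
                    close = markup.find(ch, i + 1)
                    if close == -1:
                        return offsets  # unterminated literal; give up
                    i = close + 1
                    # A name character right after the closing quote
                    # means the separating whitespace was omitted.
                    if i < n and (markup[i].isalnum() or markup[i] in "_:"):
                        offsets.append(i)
                else:
                    i += 1
        else:
            i += 1
    return offsets


if __name__ == "__main__":
    print(missing_attr_space('<p id="foo"class="bar">'))
    print(missing_attr_space('<p id="foo" class="bar">'))
```

On the example quoted above, this reports a single offset, 11,
which is the "c" of class; with the space present it reports
nothing.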

> of course assuming there are no other errors the validator does
> catch.  I think this would be such a serious problem that it
> should be considered only as a last resort, and if done, the
> note about XML support limitations that was there in validator <
> 0.8.0 should be brought back.

Yeah, I think everybody would agree that not being able to report
XML conformance violations in documents served with an XML MIME
type would not be acceptable.

  --Mike

-- 
Michael(tm) Smith
http://people.w3.org/mike

Received on Sunday, 20 June 2010 04:05:31 UTC