Re: validator.nu

Henri Sivonen wrote:
 
> Validator.nu checks the combination of the protocol 
> entity body and the Content-Type header. Pretending
> that Content-Type didn't matter wouldn't make sense
> when it does make a difference in terms of processing
> in a browser.

I checked whether the W3C validator servers still claim that
files which should be application/xml-external-parsed-entity
are chemical/x-pdb.

This was either fixed or it is an intermittent problem, so I
can continue my I18N tests today.  XHTML 1, like HTML 4, wants
URIs in links.  For experiments with IRIs I created a
homebrewed XHTML 1 i18n document type.

It is actually the same syntax, only renaming URI to IRI
everywhere, updating RFC 2396 + 3066 to RFC 3987 + 4646 in the
DTD comments, and using absolute links to some entity files
hosted by the W3C validator - those links caused the
chemical/x-pdb trouble.

To get some results related to the *content* of my test
files I have to set three options explicitly:

* Be "lax" about HTTP content - whatever that is, XHTML 1
  does not really say "anything goes", but validator.nu
  apparently considers obscure "advocacy" pages instead
  of the official XHTML 1 specification as "normative".

* Parser "XML; load external entities" - whatever it is,
  validator.nu cannot handle the <?xml etc. intro for
  XHTML 1 otherwise.  But that is required depending on
  the charset, and certainly always allowed for XHTML 1.

* Preset "XHTML 1 transitional" - actually the test is
  not realy XHTML 1 transitional, but a uses a homebrewn
  XHTML 1 i18n DTD, but maybe that's beside the point for
  a validator not supporting DTDs to start with.

With those three explicitly set options it could finally
report that my test page is "valid" XHTML 1 transitional.
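
For comparison, here is a rough local sketch of the DTD-based
validation with external entities that I am after, assuming
Python with lxml is available (the file name is a placeholder,
and this is of course not how validator.nu works internally):

  from lxml import etree

  # Load the external DTD subset and any entity files it pulls in,
  # and validate against it - roughly what the "XML; load external
  # entities" option has to do for my test.
  parser = etree.XMLParser(load_dtd=True, dtd_validation=True,
                           resolve_entities=True, no_network=False)
  try:
      etree.parse("test-i18n.xhtml", parser)   # placeholder file name
      print("valid against its own DOCTYPE")
  except etree.XMLSyntaxError as err:
      print("not valid:", err)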

But it's *not*: it uses real IRIs in places where only URIs are
allowed, a major security flaw in DTD-based validators:
<http://omniplex.blogspot.com/2007/11/broken-validators.html>

I know why DTD validators have issues checking URI syntax; it's
beyond me why schema validators don't get this right.  IMO "get
something better than CDATA for attribute types" is the point of
not using DTDs, and "can do STD 66 syntax for URIs", a full
Internet Standard, is the very minimum I'd expect from something
claiming to be better than DTDs.
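
Just to illustrate the kind of check I mean - not the full STD 66
grammar, only the character-level test that separates a URI from
a raw IRI - a sketch assuming Python (the href values are made up):

  import re

  # Characters allowed in an RFC 3986 (STD 66) URI: unreserved,
  # gen-delims, sub-delims, plus "%" for percent-encoded octets.
  # Raw non-ASCII makes the value an IRI at best, not a URI.
  URI_CHARS = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*$")

  for href in ("http://example.org/caf%C3%A9",   # percent-encoded: URI
               "http://example.org/caf\u00e9"):  # raw e-acute: IRI only
      ok = bool(URI_CHARS.match(href))
      print(href, "->", "URI characters" if ok else "raw IRI, not a URI")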

The broken URIs starting with "calc" (on XP with IE7 installed)
from various applications were a hot topic for some months in
2007, until Adobe, Mozilla, MS, etc. finally arrived at the
conclusion that the question of whose fault that was isn't
relevant.  If all parties simply follow STD 66 it is okay.

Four more related XHTML 1 I18N tests likely can't fly with
validator.nu not supporting the (very) basic idea of DTDs;
out of curiosity I tried them anyway:

| Warning: XML processors are required to support the UTF-8
| and UTF-16 character encodings. The encoding was KOI8-R
| instead, which is an incompatibility risk.

Untested, but I hope US-ASCII wouldn't trigger this warning, as
a mobile-ok prototype did some months ago (and maybe still does).

Validator.nu accepts U-labels (UTF-8) in system identifiers, the
W3C validator doesn't, and I also think they aren't allowed in
XML 1.0 (all editions).  Martin suggested they are okay, see
<http://www.w3.org/Bugs/Public/show_bug.cgi?id=5279>.

Validator.nu rejects percent-encoded UTF-8 labels in system
identifiers, like the W3C validator does.  I think that is okay,
*unless* you believe in a non-DNS STD 66 <reg-name>, where it
might be syntactically okay.  Hard to decide, potentially a bug:
<http://www.w3.org/Bugs/Public/show_bug.cgi?id=5280>.
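
For the record, the two flavours discussed in those bug reports
look roughly like this - a sketch assuming Python, with a made-up
host name (the idna codec implements the older RFC 3490 mapping):

  from urllib.parse import quote

  host_ulabel = "b\u00fccher.example"            # U-label, raw non-ASCII
  host_alabel = host_ulabel.encode("idna").decode("ascii")
  print(host_alabel)                             # xn--bcher-kva.example

  # Percent-encoding the label instead keeps it ASCII, but whether
  # %C3%BC is allowed in a DNS-backed <reg-name> is exactly the
  # open question of bug 5280.
  print(quote(host_ulabel, safe="."))            # b%C3%BCcher.example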

 [back to the general "HTML5 considered hostile to users"]
> What are you trying to achieve?

As mentioned about ten times in this thread, I typically try to
validate content as the author of the relevant document, or in a
position to edit (in)valid documents.

The total number of HTTP servers under my control at this second
(counting servers where I can edit dot-files used as configuration
files by a popular server) is *zero*.  That is a perfectly normal
scenario for many authors and editors.

Of course I'm not happy if files are served as chemical/x-pdb
or similar crap, but it is outside my sphere of influence, and
not what I'm interested in when I want to know what *I* did to
make the overall picture worse *within* documents edited by me.

Of course MediaWiki *could* translate IRIs to equivalent URIs
when it claims to produce XHTML 1 transitional, etc., just to
mention another example.  They are IMO in an ideal position to
do this on the fly, for compatibility with almost all browsers,
and IRIs are designed to have equivalent URIs.
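
Just to show what that on-the-fly mapping amounts to, a simplified
sketch assuming Python (the function name and example IRI are made
up; it ignores userinfo and assumes the input isn't already partly
percent-encoded):

  from urllib.parse import urlsplit, urlunsplit, quote

  def iri_to_uri(iri: str) -> str:
      # Map an IRI to an equivalent URI: IDNA for the host,
      # UTF-8 percent-encoding for the other components.
      parts = urlsplit(iri)
      host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
      netloc = host + (":%d" % parts.port if parts.port else "")
      return urlunsplit((parts.scheme,
                         netloc,
                         quote(parts.path, safe="/%"),
                         quote(parts.query, safe="=&%"),
                         quote(parts.fragment, safe="%")))

  print(iri_to_uri("http://b\u00fccher.example/caf\u00e9?q=\u00fc"))
  # http://xn--bcher-kva.example/caf%C3%A9?q=%C3%BC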

Where "outside my sphere of influence" is negotiable, e.g. I'd
have reported chemical/x-pdb as bug today, but it was already 
fixed.  My "plan B" was to use the "official" absolute URIs on
a W3C server instead of the validator's SGML library, "plan C"
would be to copy these files and put them on the same server as
the homebrewn DTD.  While googlepages won't try chemical/x-pdb
I fear they'll never support the correct type for *.ent files,
that is rather obscure.

> Are you trying to check that your Web content doesn't have 
> obvious technical problems?

Normally, yes.  Of course we are mainly discussing my validator
torture test pages, intentionally *abnormal* pages.  I don't use
HTML 2 strict or HTML i18n elsewhere; I don't use "raw" IRIs on
"normal" XHTML 1 transitional pages, because I know that's invalid;
I use obscure colour names in legacy markup, working more or less
with any browser, only on a single test page.  And when you find
*hundreds* of "&" instead of "&amp;" on my blogger page, that is
no test but a blogger bug, and I reported it months ago.  Maybe
they don't care, or are busy with other stuff like "open-id", or,
the most likely case:  for products with thousands of users such
bug reports NEVER reach developers, because they are filtered by
folks drilled to suppress^H^H^H^Hort technically clueless users.

> Or are you just trying to game a tool to say that your page is
> valid

Rarely.  I use image links hidden by a span within a pre on one
page; at some point validators will tell me that this is a hack,
no matter whether it works with all browsers I've ever tested.
Sanity check with validator.nu: your tool says that this is an
error.

Maybe HTML5 could permit it, but I'm not hot about it unless
somebody produces a browser where this fails horribly.

> Why are you validating pages?

To find bugs.  And for some years I also used the W3C validator
and its mailing list as a way to learn XHTML 1 beyond the level
offered by an O'Reilly book - until I could read DTDs, had read
the XML spec often enough for a vague impression, and had figured
out the relevant parts of HTML history.  Using a legacy "3.2"
browser also helped.

 Frank

Received on Sunday, 17 February 2008 21:15:15 UTC