
Re: validator.nu

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Mon, 18 Feb 2008 22:52:03 +0100
To: www-validator@w3.org
Message-ID: <fpcuiu$etj$1@ger.gmane.org>
Cc: www-archive@w3.org

Henri Sivonen wrote:
 
> Disclaimer: Still not a WG response.

This disclaimer business is odd.  Is the WG supposed to agree
on answers for all public comments?  Should I add disclaimers
stating "speaking for myself as of today, and not necessarily
still insisting on what I said a year ago"?

> Changed W3C list to www-archive, because this reply isn't
> feedback about HTML 5.

That is apparently a list for getting mails on public record,
not exactly for discussions.  I have added the W3C validator
list, and may remove www-archive.

Thanks for reporting and fixing the *.ent issues on the W3C
validator servers.

> HTML 4.01 already defined IRI-compatible processing for the
> path and  query parts, so now that there are actual IRIs,
> making Validator.nu complain about them doesn't seem 
> particularly productive.

Getting IRIs right outside of <reg-host> is not too complex,
and for UTF-8 pages it is trivial.  But HTML 4, based on
RFC 2396, could not foresee how simple RFC 3987 would make
these parts.
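For the path and query parts it really is almost a one-liner;
a minimal Python sketch (the path below is made up for
illustration):

```python
from urllib.parse import quote

# Percent-encode the UTF-8 octets of an IRI path; the ASCII
# characters that RFC 3986 allows in a path are left untouched.
def iri_path_to_uri(path):
    return quote(path, safe="/-._~!$&'()*+,;=:@")

print(iri_path_to_uri("/wiki/Überseehafen"))
# /wiki/%C3%9Cberseehafen
```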

However, HTML 4 and RFC 2396 (1998) are seven years older than
RFC 3987 (2005), and older implementations supporting HTML 4
won't get it right.  In other words, raw IRIs do *not* work in
almost all deployed browsers.

The accessibility issue should be obvious.  A few months ago
FF2 failed to support <ipath> on pages with legacy charsets.
It got the non-trivial <ihost> right; so much for one popular
browser.

Accessibility is not defined by "upgrade your browser"; it has
"syntactically valid" as a premise.  New "raw" IRIs in HTML 4
or XHTML 1 documents are not valid: they are not URLs as
specified in RFC 3986, let alone RFC 2396.

Using "raw" IRIs on pages supposed to be "accessible" by some
applicable laws is illegal for HTML 4 (or older) documents, it
is also illegal for all XHTML 1 versions mirroring what HTML 4
does.  

Schema-based validation not supporting URLs is really strange;
if it is too complex, use DTDs, or maybe try to create an
HTML5 DTD starting from the existing XHTML 1.1 modules.

"Won't do URLs because they are not productive" is also odd, 
thousands of browsers and other tools can't handle "raw" IRIs.
 
By design all "raw" IRIs can be transformed into proper URIs,
it is unnecessary to break backwards compatibility if all URI
producers do what the name says.

After that you have a simple split in the spirit of MIME: old
clients (URI consumers) get something they can handle, URLs,
and the deployment of IRIs can begin where it should start, on
the stronger side (URI producers and servers).

Forcing the weaker side to upgrade is just wrong; the stronger
side has to begin with this job.  Leave IE6 (etc.) alone until
it vanishes voluntarily in about ten years.

If the IRI design forced us to do all or nothing, things could
be different.  But that is not the case: IRIs are designed for
a smooth transition by defining an equivalent URI for any IRI.
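That transition is mechanical.  A rough Python sketch of the
RFC 3987 mapping: the host goes through IDNA (Punycode), and
the rest is UTF-8 percent-encoded, so legacy URI consumers
never see raw non-ASCII octets.  Userinfo and port are ignored
here, and the hostname is a made-up example:

```python
from urllib.parse import urlsplit, urlunsplit, quote

SAFE = "/-._~!$&'()*+,;=:@"

def iri_to_uri(iri):
    p = urlsplit(iri)
    # Host: U-label -> A-label (Punycode), per IDNA.
    host = p.hostname.encode("idna").decode("ascii") if p.hostname else ""
    # Path and query: percent-encode the UTF-8 octets of non-ASCII chars.
    return urlunsplit((p.scheme, host,
                       quote(p.path, safe=SAFE),
                       quote(p.query, safe=SAFE + "?"),
                       p.fragment))

print(iri_to_uri("http://bücher.example/wegweiser?stück=1"))
# http://xn--bcher-kva.example/wegweiser?st%C3%BCck=1
```

A pure-ASCII URL passes through this mapping unchanged, which
is exactly the backwards compatibility argued for above.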

>> Actually the same syntax renaming URI to IRI everywhere,
>> updating RFC 2396 + 3066 to 3987 + 4646 in DTD comments,
 
> That's a pointless exercise, because neither browsers nor
> validators ascribe meaning to DTD comments or production
> identifiers.

Human users, including judges deciding what "accessible" means
and implementors deciding what URL means, can see a difference.

For implementors building schema validators I'd say that they
MUST support URLs in a professional tool for various reasons
related to security, accessibility, ethics, professionalism.

The W3C validator has an excuse: it uses DTDs.  But even here
the issue is bug 4916, reported by Olivier on 2007-08-07, at
the time when various URL security flaws (XP + IE7) hit the
fan.

Related to ethics and professionalism: how should, say, Martin
or Richard create (X)HTML test pages for IRIs, or submit an
RFC 3987 implementation and interoperability report, without
an XHTML document type permitting the use of "raw" IRIs?

HTML5 is still a draft and a moving target at the moment, and
using Atom or XMPP (designed with IRIs in mind, no backwards
compatibility issues) might not be what they need.

  [Back to validator.nu]
>> * Be "lax" about HTTP content - whatever that is, XHTML 1
>>  does not really say "anything goes", but validator.nu
>>  apparently considers obscure "advocacy" pages instead
>>  of the official XHTML 1 specification as "normative".
 
> Validator.nu treats HTML 5 as normative and media type-based  
> dispatching in browsers as congruent de facto guidance.

If you want congruent de facto guidance I trust that <embed>
and friends will pass as "valid" without warning (untested).

However, when I want congruent de facto guidance I use an
actual browser, ideally Lynx, not a validator.

Some weeks ago you quoted an ISO standard I hadn't heard of
before for your definition of "valid".  If that ISO standard
has "congruent de facto guidance" in its definition, trash it,
or maybe put it where you keep DIS 29500.

> I've fixed the schema preset labeling to say "+ IRI".

Good, documented bugs can be features.  Somewhat suspicious,
I hope your validator can parse and check IRIs based on the
RFC 3986 syntax.  There are some ugly holes in RFC 3987 wrt
spaces and a few other characters.  Spaces can cause havoc
in space-separated URI lists and similar constructs.  

 [legacy charsets]
> US-ASCII and ISO-8859-1 (their preferred IANA names only)
> don't trigger that warning, because I don't have evidence
> of XML processors that didn't support those two in addition
> to the required encodings.

That's an odd decision.  US-ASCII clearly is a "proper subset"
of UTF-8 for any sound definition of "proper subset" in this
field.  But Latin-1 is no proper subset of the UTFs required
by XML.  (Arguably Latin-1 is a subset of "UTF-4", but that
is mainly a theoretical construct and no registered charset.)
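A two-line check illustrates the difference (Python sketch):

```python
# Any pure US-ASCII byte sequence is already valid UTF-8 and decodes
# to the same text; a Latin-1 byte above 0x7F is not well-formed UTF-8.
assert b"plain ascii".decode("utf-8") == b"plain ascii".decode("ascii")

e_acute = "é".encode("latin-1")        # b'\xe9'
try:
    e_acute.decode("utf-8")
except UnicodeDecodeError:
    print("0xE9 on its own is not well-formed UTF-8")
```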

My intuitive interpretation of the HTML5 draft is that it is
done with Latin-1 and proposes windows-1252 as its heir, in
the spirit of congruent de facto guidance.  If you see no
compelling use cases for NEL / SS2 / SS3 in ISO-8859-1 today,
you could ditch the ISO 6429 C1 controls, in the same way that
UTF-8 replaced UTF-1 fourteen years ago.

>> Validator.nu accepts U-labels (UTF-8) in system identifiers,
>> W3C validator doesn't, and I also think they aren't allowed
>> in XML 1.0 (all editions).  Martin suggested they are okay,
>> see <http://www.w3.org/Bugs/Public/show_bug.cgi?id=5279>.
 
> Validator.nu URIfies system ids using the Jena IRI library
> set to the XML system id mode.

"Needs some external library doing some obscure stuff" is IMO
precisely the problem with XML 1.0 (3rd and 4th editions).

> Considering that the XML spec clearly sought to allow IRIs
> ahead of the IRI spec, would it be actually helpful to 
> change this even if a pedantic reading of specs suggested
> that the host part should be in Punycode?

Your interpretation differs from mine: the XML spec apparently
does *not* suggest Punycode.  It is not based on RFC 3987; it
invents its very own pseudo-IRIs for the system identifiers,
requiring that clients use additional libraries to retrieve
external entities via these constructs.

In other words, a simple XML application using curl, wget, or
whatever else supports URLs could *break*.  IMNSHO this is the
attitude towards backwards compatibility which makes me think
that "the W3C is hostile to users".  Sometimes it produces
good stuff, but often with an ugly "upgrade your browser"
catch-22.

Forcing XML applications to support punycode based on Unicode
3.2 (for IDNA2003, the IDNA200x stuff is not yet ready) only
to retrieve system identifiers is just wrong.
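The "additional library" in question boils down to the IDNA
2003 ToASCII step.  In Python the built-in "idna" codec does
it (its stringprep tables come from RFC 3454, i.e. Unicode
3.2); the host below is a hypothetical system-identifier host:

```python
# Convert a U-label host from a system identifier into the A-label
# (Punycode) form that plain URL fetchers such as curl or wget expect.
host = "bücher.example"                     # hypothetical example
print(host.encode("idna").decode("ascii"))  # xn--bcher-kva.example
```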

I'd understand it for XML 1.1, but why on earth in XML 1.0?
Who needs or wants this, ignoring any commercial interests?

> I don't believe in non-DNS host names.

Nor do I, in this discussion.  In a draft about, say, the
file: URI scheme I'd try to dance around the issue... ;-)  I
dare not propose rejecting bug 5280; maybe it deserves a
WONTFIX.

For bug 5279, somebody feeling responsible for XML 1.0 should
figure out what is wrong: your validator, the W3C validator,
or the XML 1.0 spec.  IMO XML 1.0 got it wrong, and the W3C
validator got it right.

> you seem to be hostile to the idea of fixing how your 
> documents are served

I'd certainly prefer it if, say, "googlepages" served KML
files as application/vnd.google-earth.kml+xml, and where I
have href links to such beasts I add an explicit type
attribute.
  
But note that I also have a "Kedit Macro Library" KML file
on another server, it's not as simple as it sounds.

Fixing servers is not under my control, and clearly servers
have no chance of knowing what, say, KML is.  Expecting Google
to support their own inventions is already quite a stretch.

Quick test, trying to get the sitemap.xml from googlepages:
they say text/xml; charset=UTF-8, not too shabby.  Now I
recall where I saw space-separated URIs: schemaLocation.
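schemaLocation is exactly such a construct: a whitespace-
separated list of namespace/location pairs, so one unescaped
space inside a "raw" IRI shifts every pair after it.  A
minimal sketch of the pairing, with illustrative sitemap URLs:

```python
# xsi:schemaLocation holds whitespace-separated (namespace, location)
# pairs; an unescaped space inside any IRI would misalign the pairing.
value = ("http://www.sitemaps.org/schemas/sitemap/0.9 "
         "http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd")
tokens = value.split()
pairs = list(zip(tokens[0::2], tokens[1::2]))
print(pairs)  # one (namespace, schema-location) pair
```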

> if you want to be in the business of creating test suites,  
> getting hosting where you can tweak the Content-Type is
> generally a good way to start.

Yeah, so far I have gotten away without it for the purpose of
W3C validator torture tests.  This failed with your validator
because of chemical/x-pdb, which turned out to be a bug on the
W3C validator server, so that's still "in the family".

Maybe I should have insisted on IANA's server getting it
right for the HTML i18n DTD.  But after waiting a year for
its registration, when it finally worked "as is" with the
W3C validator, I wrote "thanks, and don't worry about it".

After all it's a historic document type; it's just that I
missed it.

> A validator can't know what parts you can edit and what
> parts you can't.

The most likely case is "can fix issues *within* the text";
the more or less dubious metadata sent by HTTP servers is a
different topic.  Offer to check it optionally.  Forcing users
to click three options before the task at hand, finding bugs
*within* the document, can even start is a nightmare.

And "upgrade your Web hoster" is also a different topic: an
ordinary user would not know that a W3C validator Bugzilla
exists, or how to use it.  IMO Bugzilla is also a nightmare.

> However, if you care about practical stuff, you shouldn't
> even enable external entity loading, since browsers don't
> load external entities from the network.

For xml2rfc this works to some degree; bad things can happen
with invalid xml2rfc sources.  When I'm using a validator I
try to find bugs in documents, and when I wish to know what
browsers do I use a browser.

 [image links within <pre> hidden by <span>] 
> Could you provide a URL to a demo page?

<http://purl.net/xyzzy/xedit.htm>  After setting all options
to get around "congruent de facto guidance" in an advocacy
page overruling IETF and W3C standards (from your POV), it
finds 29 invalid uses of <img> within <pre>.  One of the
advantages of schema validation: a <span> cannot confuse your
validator.

 Frank
Received on Monday, 18 February 2008 21:50:48 GMT
