Re: validator.nu from Henri Sivonen on 2008-02-20 (www-validator@w3.org from February 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 20 Feb 2008 20:05:26 +0200
To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: W3C Validator Community <www-validator@w3.org>
Message-Id: <65E8313D-A81C-4FCB-B2E9-488E253914A4@iki.fi>

On Feb 18, 2008, at 23:52, Frank Ellermann wrote:

> Henri Sivonen wrote:
>
>> Disclaimer: Still not a WG response.
>
> This disclaimer business is odd.

As Anne already pointed out to you, the disclaimers are there due to a  
request from one of the WG Chairs:
http://lists.w3.org/Archives/Public/public-html/2008Jan/0196.html

> Is the WG supposed to agree on answers for all public comments ?

I don't know how that's supposed to work.

>> Changed W3C list to www-archive, because this reply isn't
>> feedback about HTML 5.
>
> That is apparently a list to get mails on public record, not
> exactly for discussions, I add the W3C validator list, maybe
> remove www-archive.

I don't expect readers of www-validator to like off-topic discussions  
about other validation services, but I guess it would be silly to  
switch lists for each message.

>> HTML 4.01 already defined IRI-compatible processing for the
>> path and  query parts, so now that there are actual IRIs,
>> making Validator.nu complain about them doesn't seem
>> particularly productive.
>
> Getting IRIs outside of <reg-host> right is not too complex,
> for UTF-8 pages it is trivial.  But HTML 4 based on RFC 2396
> couldn't foresee how simple RFC 3987 is for these parts.
>
> However HTML 4 and RFC 2396 (1998) are seven years older than
> RFC 3987 (2005), and older implementations supporting HTML 4
> won't get it right.  In other words raw IRIs do *not* work on
> almost all browsers.

Do they not work in the latest versions of, say, the top four browsers  
(IE7, Firefox 3, Safari 3 and Opera 9.5)? (I've noticed that you've  
used a legacy browser on a legacy OS until recently, but I'm not  
interested in supporting authoring of new pages for the deep legacy.)

The XHTML 1.0 / HTML 4.01 functionality in Validator.nu uses HTML5  
datatypes where DTD-based validation didn't constrain datatypes at  
all. The main reason to offer these enhanced legacy schemas is to make  
use of the HTML5 validation advancements while offering familiar and  
stable content models and attribute permissibility while HTML5 itself  
is still in flux.

> The accessibility issue should be obvious.  A few months ago
> FF2 failed to support <ipath> on pages with legacy charsets.
> It got the non-trivial <ihost> right, so far for one popular
> browser.
>
> Accessibility is not defined by "upgrade your browser", it has
> "syntactically valid" as premise.

I think the premise of defining whether something is accessible should  
be whether a person with a given disability can actually access the  
page with reasonable effort using the kind of tools that are commonly  
used by persons with the disability.

> New "raw" IRIs in HTML 4 or
> XHTML 1 documents are not valid, they are no URLs as specified
> in RFC 3986, let alone 2396.
>
> Using "raw" IRIs on pages supposed to be "accessible" by some
> applicable laws is illegal for HTML 4 (or older) documents, it
> is also illegal for all XHTML 1 versions mirroring what HTML 4
> does.

Validator.nu is not a legal tool. It most certainly isn't marketed as  
one.

> Schema based validation not supporting URLs is really strange,
> if it is too complex use DTDs, maybe try to create a HTML5 DTD
> starting with the existing XHTML 1.1 modules.

That doesn't make sense.

> "Won't do URLs because they are not productive" is also odd,
> thousands of browsers and other tools can't handle "raw" IRIs.
>
> By design all "raw" IRIs can be transformed into proper URIs,
> it is unnecessary to break backwards compatibility if all URI
> producers do what the name says.

Is that an indictment of IRIs in general as used in Web documents? If  
IRIs are bad, should HTML5 require plain URIs? Surely the  
compatibility situation doesn't change from the browser point of view  
when a page is supposed to be HTML5 instead of HTML 4.01. Should the  
IETF close down IRI activities or say that IRIs are only for typing  
into the browser address field?
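
For concreteness, the IRI-to-URI mapping discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Validator.nu code: iri_to_uri is a hypothetical helper, it ignores userinfo and IPv6 literals, and it uses Python's built-in "idna" codec (IDNA 2003) for the host instead of a full RFC 3987 implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: common URI delimiters plus "%", so that
# existing percent-escapes are not double-encoded (an assumption of this sketch).
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI to a plain ASCII URI."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Host: convert each label to its ASCII (Punycode) form.
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Path, query, fragment: percent-encode the UTF-8 bytes of every
    # character outside the unreserved and safe sets.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

A URI that is already plain ASCII passes through unchanged, which is the backwards-compatibility property the quoted text relies on.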

[...]
>>> Actually the same syntax renaming URI to IRI everywhere,
>>> updating RFC 2396 + 3066 to 3987 + 4646 in DTD comments,
>
>> That's a pointless exercise, because neither browsers nor
>> validators ascribe meaning to DTD comments or production
>> identifiers.
>
> Human users including judges deciding what "accessible" means
> and implementors deciding what URL means can see a difference.

So things become more or less accessible if you change human-readable  
comments in a DTD?

> For implementors building schema validators I'd say that they
> MUST support URLs in a professional tool for various reasons
> related to security, accessibility, ethics, professionalism.

One might also say that IRIs must be supported for  
internationalization (which some people associate with ethics and  
professionalism).

[...]
> Related to ethics and professionalism, how should say Martin
> or Richard create (X)HTML test pages for IRIs, or submit an
> RFC 3987 implementation and interoperability report without
> an XHTML document type permitting to use "raw" IRIs ?

Just by doing it regardless of what an "XHTML document type" is  
claimed to permit?

[...]
>  [Back to validator.nu]
>>> * Be "lax" about HTTP content - whatever that is, XHTML 1
>>> does not really say "anything goes", but validator.nu
>>> apparently considers obscure "advocacy" pages instead
>>> of the official XHTML 1 specification as "normative".
>
>> Validator.nu treats HTML 5 as normative and media type-based
>> dispatching in browsers as congruent de facto guidance.
>
> If you want congruent de facto guidance I trust that <embed>
> and friends will pass as "valid" without warning (untested).

<embed> is valid in HTML5, yes.

> However when I want congruent de facto guidance I would use
> another browser, ideally Lynx, not a validator.

Testing your pages in Lynx is a good idea.

> Some weeks ago you quoted an ISO standard I haven't heard of
> before for your definition of "valid".  If that ISO standard
> has "congruent de facto guidance" in its definition trash it
> or maybe put it where you have DIS 29500.

Validator.nu claims to contain a RELAX NG validator. Referring to the  
standard family that defines ISO RELAX NG (and ISO Schematron and  
NVDL) is not inappropriate, in my opinion, even if you haven't heard  
of the standard family before.

[...]
> [legacy charsets]
>> US-ASCII and ISO-8859-1 (their preferred IANA names only)
>> don't trigger that warning, because I don't have evidence
>> of XML processors that didn't support those two in addition
>> to the required encodings.
>
> That's an odd decision, US-ASCII clearly is a "proper subset"
> of UTF-8 for any sound definition of "proper subset" in this
> field.  But Latin-1 is no proper subset of the UTFs required
> by XML.  (Arguably Latin-1 is a subset of "UTF-4", but that
> is mainly a theoretical construct and no registered charset).

The warning is about actual potential interoperability issues. There's  
a very popular XML parser--expat--that supports only UTF-8, UTF-16,  
ISO-8859-1 and US-ASCII in its default configuration. However, no one  
has shown me an XML parser that didn't support ISO-8859-1 (or  
ISO-8859-1 treated as an alias of Windows-1252), even though  
supporting ISO-8859-1 is not required. Given this situation and how  
common it is to mark ASCII-only XML as ISO-8859-1 in practice, it  
wouldn't make sense to complain about ISO-8859-1.
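
The rule described above can be sketched as a simple lookup. This is a hypothetical illustration, not the actual Validator.nu implementation; the function name and the exact comparison are assumptions.

```python
# Encodings that, per the message, do not trigger the interoperability
# warning: the two required by XML plus US-ASCII and ISO-8859-1.
NO_WARNING_NAMES = {"utf-8", "utf-16", "us-ascii", "iso-8859-1"}


def needs_encoding_warning(declared_name: str) -> bool:
    """Hypothetical check: True if the declared XML encoding should be flagged."""
    # Only the preferred IANA names are exempt, so aliases such as
    # "latin1" are still flagged. Names are compared case-insensitively.
    return declared_name.strip().lower() not in NO_WARNING_NAMES
```

Under this sketch, "ISO-8859-1" passes silently while "latin1" and "windows-1252" are flagged.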

> My intuitive interpretation of the HTML5 draft is that they
> are ready with Latin-1 and propose windows-1252 as its heir
> in the spirit of congruent de facto guidance.

For text/html, ISO-8859-1 must be treated as an alias for Windows-1252  
in order not to Break the Web.
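
A small sketch of why the alias matters: bytes in the 0x80-0x9F range are printable characters in Windows-1252 but C1 controls in ISO-8859-1. The helper name decode_text_html is hypothetical, and Python's codec is stricter than browsers in that it rejects the five bytes Windows-1252 leaves undefined.

```python
def decode_text_html(data: bytes, label: str) -> str:
    """Hypothetical helper: decode text/html bytes, treating a declared
    ISO-8859-1 label as Windows-1252."""
    if label.strip().lower() == "iso-8859-1":
        label = "windows-1252"
    return data.decode(label)


raw = b"\x93quoted\x94"
# Strict ISO-8859-1 maps 0x93 and 0x94 to C1 control characters.
as_latin1 = raw.decode("iso-8859-1")
# The aliased decoding yields the curly quotation marks authors intended.
as_aliased = decode_text_html(raw, "ISO-8859-1")
```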

> If you see no
> compelling use cases for NEL / SS2 / SS3 in ISO-8859-1 today
> you could ditch the ISO 6429 C1 controls, in the same way as
> UTF-8 replaced UTF-1 fourteen years ago.

The XML side of Validator.nu warns about C1 controls.
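
As an illustration of what such a check looks for, the C1 controls are the code points U+0080 through U+009F. The function below is a hypothetical sketch, not the validator's code.

```python
def find_c1_controls(text: str) -> list[tuple[int, str]]:
    """Hypothetical check: return (index, code point) for each C1 control."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F  # C1 control range, includes NEL, SS2, SS3
    ]
```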

[...]
>> Considering that the XML spec clearly sought to allow IRIs
>> ahead of the IRI spec, would it be actually helpful to
>> change this even if a pedantic reading of specs suggested
>> that the host part should be in Punycode?
>
> Your interpretation differs from what I get:  The XML spec.
> apparently does *not* suggest punycode,

Right, so if you want to refer to an IDN host, per strict reading of  
the spec, the system id needs to have a pre-punycoded hostname.

[...]
>> A validator can't know what parts you can edit and what
>> parts you can't.
>
> The most likely case is "can fix issues *within* the text",
> the more or less dubious meta data sent by HTTP servers is
> a different topic.  Offer to check it optionally.

The check is optional but on by default. However, when the check is  
relaxed, it only allows typical misconfigurations in order to avoid  
accidental downloading of image and video files. You just happened to  
come across a case where the misconfigured type was exceptionally weird.

[...]
> And "upgrade your Web hoster" is also a different topic, an
> ordinary user would not know that a W3C validator bugzilla
> exists, or how to use it, IMO bugzilla is also a nightmare.

Ordinary users don't publish test suites.

>>> [image links within <pre> hidden by <span>]
>> Could you provide a URL to a demo page?
>
> <http://purl.net/xyzzy/xedit.htm>  After setting all options
> to get around a "congruent de facto guidance" in an advocacy
> page overruling IETF and W3C standards from your POV it finds
> 29 invalid uses of <img> within <pre>.  One of the advantages
> of schema validation, a <span> cannot confuse your validator.

Clearly, %pre.exclusion; in HTML 4.01 was meant to exclude  
descendants--not just children, but DTDs couldn't express this.
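
The difference between checking children and checking descendants can be sketched as follows. This is an illustration only: it assumes well-formed, namespace-free XHTML input, the function name is hypothetical, and the excluded set is the %pre.exclusion; list from the HTML 4.01 Strict DTD.

```python
import xml.etree.ElementTree as ET

# Element names listed in %pre.exclusion; in the HTML 4.01 Strict DTD.
PRE_EXCLUDED = {"img", "object", "big", "small", "sub", "sup"}


def excluded_in_pre(xhtml: str) -> list[str]:
    """Hypothetical check: excluded elements anywhere below a <pre>."""
    root = ET.fromstring(xhtml)
    found = []
    for pre in root.iter("pre"):
        # iter() walks the <pre> itself and then every descendant,
        # so an intervening <span> does not hide an <img>.
        for el in pre.iter():
            if el is not pre and el.tag in PRE_EXCLUDED:
                found.append(el.tag)
    return found


sample = '<pre><span><img src="x.png" alt=""/></span></pre>'
# A children-only check sees just the <span> and misses the <img>.
children_only = [child.tag for child in ET.fromstring(sample)]
```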

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 20 February 2008 18:05:47 UTC