Re: Validator output and standards conformance

Terje Bless wrote:

> Olle Olsson <olleo@sics.se> wrote:
>
> >Maybe this has been noted before, but being new to this list...
>
> It has, but don't let that stop you. Better multiple reports of the same
> problem than none at all. We appreciate all feedback so please don't
> hesitate to send something our way!
>
> >Just for fun, I asked the HTML-validator to validate the document
> >returned by the HTML validator [and it failed to validate].
> >
> >This is from an ordinary user's point of view a mere curiosity, but from
> >the point of view of whether W3C lives up to its own standards, it can
> be regarded as an embarrassing deficiency.
>
> It is. That's why I'm working hard to fix it right now and, presumably, why
> Ville Skyttä has been subtly sending me patches for the HTML output
> lately. Right, Ville? :-)
>
> >Whether these problems will be repaired can be of interest, as I am
> >presently dissecting the page by software to extract some information
> >from it. If the HTML structure of the validators output is  changed,
> >then I have to change my software. No big deal, as  long as I know
> >if/when it is going to happen.
>
> I _strongly_ recommend against attempting to parse the output of the
> Validator at the moment! It isn't structured, it isn't consistent, and it's
> virtually /guaranteed/ to change frequently and at whim.

The background is that I wanted some "validation service" accessible from
software, and I made the simple choice to use the W3C HTML validation service,
hoping to make my software communicate with it. At the moment that service is
aimed at "visual consumption", i.e., it tries to present humanly digestible
output. As I was looking for some quick results, I decided to parse the output
document.

I would not trust my software to be fool-proof in any sense. As the validator
output is not documented, the only thing I could do was to construct a small
set of test cases and see what the validator returned. I have no idea whether I
have seen all types of pages that could be returned, and hence whether there
are pages that I fail to parse.

I have actually used the validation service in this automated way during the
last few days. But I do regard it as a concept demonstrator (i.e., a temporary
quick hack), used to gather some statistics about error rates in HTML pages.



>
>
> The real solution is for me to get my backside in gear and produce machine
> parseable output from the Validator (we have at least two previous requests
> for this feature). It's been on my TODO for a long time, but has been
> pushed back for one reason or another.
>
> When this is in place you'll optionally get XML output from the Validator
> that can be processed by any conformant XML Processor. With any kind of
> luck the implementation will be robust enough that I can guarantee
> well-formed output (IOW, provably wf'ed in the CS sense).
>
> If I get carried away it may even be exposed as an XML-RPC interface. :-)

The ideal is of course that the validator can return information that can be
parsed in a well-determined way: as if it were a web service, with a
well-defined parametric call signature and a well-defined type of result.

What you write gives me hope that this will eventually appear.
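As a thought experiment, suppose the Validator returned something like the XML
below. The element and attribute names are purely hypothetical (no such format
has been published), but any conformant XML processor could then consume the
result without guesswork:

```python
# A guess at what a machine-readable validation result could look like.
# The vocabulary is hypothetical; the point is that a fixed XML format
# makes the consuming code trivial and robust.
import xml.etree.ElementTree as ET

HYPOTHETICAL_RESULT = """
<validation uri="http://example.org/" valid="false">
  <error line="12" col="4" type="undefined-element">element "FOO" undefined</error>
  <error line="30" col="1" type="omitted-end-tag">end tag for "P" omitted</error>
</validation>
"""

root = ET.fromstring(HYPOTHETICAL_RESULT)
is_valid = root.get("valid") == "true"
errors = [(int(e.get("line")), e.get("type"), e.text)
          for e in root.findall("error")]
print(is_valid, errors)
```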

But what should be returned? What kind of information is to be regarded as the
"output" of the validation? As the validator is performing a complete parse of
the document in question, there is a lot of information that could be
accumulated. On the one hand one may identify things like
- number of lines
- size of document in number of characters
- number of elements
- maximum element nesting
- etc
i.e., things that do not necessarily have a direct relationship to the
validation per se.
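Statistics of this incidental kind can be accumulated in a single parse. A
minimal sketch using only the Python standard library, glossing over SGML
subtleties such as implied end tags and empty elements:

```python
# Accumulate the "incidental" statistics listed above: document size,
# element count, maximum nesting depth.  Void elements (<br>, <img>, ...)
# and implied end tags are ignored; a real implementation needs the DTD.
from html.parser import HTMLParser

class Metrics(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = 0
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.elements += 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1

doc = "<html><body><p>Hello <em>world</em></p></body></html>"
m = Metrics()
m.feed(doc)
print(len(doc), m.elements, m.max_depth)
```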

Then there are the validation results. Here one would like to see problems
classified into a number of distinct categories, so one may say that the
document has problems of types like:
 - illegal characters
 - unbalanced tags
 - illegal nesting of tags
 - references to undefined attributes
 - etc.
Each identified problem should be associated with one such problem type. This
would then make it possible to extract error statistics from the tree that
describes the full validation result.
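One possible shape for such a classification, sketched in Python: map raw
validator messages onto a small fixed set of problem types, then count them.
The category names and matching patterns are illustrative guesses on my part,
not any agreed ontology:

```python
# Classify raw validator messages into a fixed set of problem types
# and compute error statistics.  Categories and patterns are invented
# for illustration -- exactly the "ontology" that remains to be agreed.
from collections import Counter
import re

CATEGORIES = [
    ("illegal-character",   re.compile(r"non SGML character|illegal character", re.I)),
    ("unbalanced-tags",     re.compile(r"end tag for .* omitted|no start tag", re.I)),
    ("illegal-nesting",     re.compile(r"not allowed (here|in)", re.I)),
    ("undefined-attribute", re.compile(r"attribute .* (undefined|not a member)", re.I)),
]

def classify(message):
    for name, pattern in CATEGORIES:
        if pattern.search(message):
            return name
    return "other"

messages = [
    'end tag for "P" omitted',
    'there is no attribute "BORDER"',   # would need a pattern of its own
    'element "BLINK" not allowed here',
]
stats = Counter(classify(m) for m in messages)
print(stats)
```

The interesting (and hard) part is agreeing on the left-hand column, not
writing the code.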

To make this into a widely usable service, one should try to get agreement
among users about the ontology of problem types, where these types are known or
expected to be of use in some context. Unfortunately (sic!) standards documents
only define an ontology for talking about _correct_ instances of this and that,
not an ontology that describes things that deviate from the standards. So there
is an amount of research/invention involved here. That kind of investigation
could be of more general use (outside of the context of HTML-validations), as
the problem of validation against this or that standard will be more and more
urgent in the future.

So, can we expect that people come up with wishes and dreams about an "ontology
for deviations/discrepancies"? Given that such an ontology is defined, it
should be a piece of cake to build a validator that delivers results expressed
in that ontology :-)

>
>
> >[Svenska W3C-kontoret: olleo@w3.org]
>
> Give me a holler when you decide to branch out to Norway! :-)

We are actually thinking about treating Norway and Sweden as part of a
Scandinavian region of W3C.


/olle

--
------------------------------------------------------------------
Olle Olsson   olleo@sics.se   Tel: +46 8 633 15 19  Fax: +46 8 751 72 30
 [Svenska W3C-kontoret: olleo@w3.org]
SICS [Swedish Institute of Computer Science]
Box 1263
SE - 164 29 Kista
Sweden
------------------------------------------------------------------

Received on Thursday, 4 April 2002 09:25:29 UTC