
Re: A simpler Web service response format

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 18 Dec 2006 12:13:17 +0200
Message-Id: <AE64299D-DDA4-4EE7-A27F-31F9AB2F4219@iki.fi>
Cc: www-validator <www-validator@w3.org>
To: Karl Dubost <karl@w3.org>

On Dec 18, 2006, at 08:42, Karl Dubost wrote:

> On Dec 15, 2006, at 01:01, Henri Sivonen wrote:
>> I had a look at the SOAP and Unicorn response formats for the W3C  
>> Validator in case I could reuse one of them. They both seemed  
>> unnecessarily complex. Also, generating the formats requires  
>> buffering.
>
> Could you explain what part is complex?
> Or what makes their complexity?

  * Messages are grouped by type (error, warning, misc) instead of  
just being lumped together in the order they were generated during  
the validation process. (The grouping is redundant and requires  
buffering.)

  * The message groups have double containers (errors and errorList).

  * For each message type, the generator of the messages has to count  
the messages and indicate the count before the messages. (The message  
counts are redundant data and generating them requires buffering.)

  * The formats echo information that the client already knows such  
as the URI of the validator, the URI of the input or in the case of  
the Unicorn format, the date.

  * The formats have unnecessary telescoping envelope elements. (A  
SOAP 1.2 format message ends with  
</m:markupvalidationresponse></env:Body></env:Envelope>, where  
</env:Body></env:Envelope> is just cruft.)

  * The formats represent line and column numbers as text content of  
elements as opposed to attributes.

  * The SOAP format has SOAP namespace cruft. The Unicorn format has  
XSD cruft.

  * The formats require a boolean pass/fail proclamation near the  
start of the format. (This is redundant and requires buffering.)
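The buffering point above can be made concrete with a small sketch (names and structure are illustrative, not taken from any of the specs discussed): a format that states a count or a pass/fail verdict before the messages cannot write anything until validation has finished, whereas a flat in-order stream can be flushed message by message.

```python
# Illustrative sketch: grouped-with-counts emission vs. flat streaming.
import io
from xml.sax.saxutils import escape

def emit_grouped(messages, out):
    # SOAP-response style: the error count must precede the errors,
    # so nothing can be written until all messages are known.
    errors = [text for kind, text in messages if kind == "error"]
    out.write("<errors><errorcount>%d</errorcount><errorlist>" % len(errors))
    for text in errors:
        out.write("<error>%s</error>" % escape(text))
    out.write("</errorlist></errors>")

def emit_streaming(messages, out):
    # Draft-format style: one flat container, messages in the order
    # the validator produced them; each write can be flushed at once.
    out.write('<messages xmlns="http://hsivonen.iki.fi/validator/messages/">')
    for kind, text in messages:
        out.write("<%s>%s</%s>" % (kind, escape(text), kind))
    out.write("</messages>")

msgs = [("info", "Using the XML parser."),
        ("error", "need whitespace between attributes")]
buf = io.StringIO()
emit_streaming(msgs, buf)
print(buf.getvalue())
```

The streaming emitter needs no state beyond the current message, which is the whole point of dropping the counts and the grouping.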

EARL, which I initially missed, also has problems:

  * It requires an RDF processing layer on the consumption side.  
(Unless the consumer cheats, but if the consumer cheated and did not  
use an RDF processing layer, pretending that RDF is being used would  
be pointless.)

  * The graph model and the RDF/XML syntax are mostly overhead when  
used with validators/checkers that in practice just produce a list of  
messages and don't care about graphs (let alone merging them).

  * The concept of "assertor" is unnecessary in the case where the  
client knows what Web service it is accessing and it is obvious that  
the assertor is the service.

  * The concept of a "test criterion" presupposes an implementation  
strategy that makes it possible for the checker/validator to cite a  
particular criterion by a well-known URI that is used by different  
tools for the same criterion. This looks good in theory, but it  
doesn't work well in practice unless the checker implementation is  
based on hand-crafted per-criterion checks *and* the implementation  
has a mechanism for citing well-known URIs. It turns out that when  
multiple criteria are embodied in a grammar-based schema, a  
validation engine cannot cite per-criterion URIs when a particular  
document tree doesn't have a derivation in the grammar. The EARL  
output from the W3C Validator illustrates this point rather nicely:  
It uses "http://www.w3.org/MarkUp/" as an all-encompassing testCase,  
which defeats the whole point of EARL's granular and inter-tool  
comparable test case URIs. Also, even when an implementation is  
assertion-based but uses an off-the-shelf Schematron engine, the  
tooling likely won't have a mechanism for citing a criterion URI.

  * EARL has a lot of machinery for expressing things that aren't  
applicable to the Web service use case where the client knows what  
service it is invoking and with what input.

  * The producers of the reports have too much freedom in expressing  
things (e.g. different pointer alternatives), so implementing general- 
purpose EARL consumers becomes hard. On the other hand, implementing  
a consumer for the EARL subset emitted by a particular service  
defeats the point of having a spec like EARL in the first place.

>> I wrote up a quick format draft, which I may implement in the future:
>> http://hsivonen.iki.fi/validator-ws-ideas/#xml
>
> Interesting.
> Could you give an output example?

When the input document passes successfully, the output would be:
<messages xmlns="http://hsivonen.iki.fi/validator/messages/"></messages>

For http://hsivonen.iki.fi/validator/?doc=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fno-space-between-attributes.xhtml
the output would be:
<messages xmlns="http://hsivonen.iki.fi/validator/messages/"><info>The
Content-Type was “application/xhtml+xml”. Using the XML parser (not
resolving external entities).</info><warning line='1' column='109'
uri='http://hsivonen.iki.fi/test/no-space-between-attributes.xhtml'>skipping
entity: [dtd]</warning><info>Using the preset for XHTML 1.0 Strict
based on the root namespace.</info><error type='fatal' line='7'
column='13'
uri='http://hsivonen.iki.fi/test/no-space-between-attributes.xhtml'>need
whitespace between attributes</error></messages>
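A hedged sketch of the consumption side (the sample document is an abridged version of the output above): because the format is a flat stream of messages, a client using a streaming parser can act on each message as soon as its end tag arrives, without buffering the whole report.

```python
# Consuming the draft messages format incrementally with iterparse.
import io
import xml.etree.ElementTree as ET

NS = "{http://hsivonen.iki.fi/validator/messages/}"
sample = (
    '<messages xmlns="http://hsivonen.iki.fi/validator/messages/">'
    "<info>Using the XML parser.</info>"
    "<warning line='1' column='109'>skipping entity: [dtd]</warning>"
    "<error type='fatal' line='7' column='13'>"
    "need whitespace between attributes</error>"
    "</messages>"
)

seen = []
for event, elem in ET.iterparse(io.BytesIO(sample.encode("utf-8"))):
    kind = elem.tag[len(NS):]       # strip the namespace prefix
    if kind == "messages":          # the root's end tag comes last
        continue
    seen.append((kind, elem.get("line"), elem.text))
    elem.clear()                    # discard the element once handled

for kind, line, text in seen:
    print("%s at line %s: %s" % (kind, line or "-", text))
```

Line and column being attributes (rather than child elements) is what keeps this consumer a three-line loop body.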

> one comment:
> I see
> 	The elements in this XML vocabulary are in the namespace  
> “http://hsivonen.iki.fi/validator/messages/”.
>
> Will this namespace survive in the future? I'm just wondering  
> because there have been troubles related to namespace changes (for  
> example the Atom WG's move from 0.3 to 1.0).

It is an unimplemented draft, so anything can happen. If I implement  
the draft, I need to pick a namespace URI and then stick to it. At  
the moment, the URI quoted above looks like the most likely candidate.

(BTW, since software and formats outlive organizations and move  
between organizations, I think it is a bad idea that namespace URIs  
and Java package names are supposed to have a domain name in them.  
This opens up a bikeshedding problem when people are uncomfortable  
with using a namespace URI or a Java package name that contains a  
domain name considered non-neutral.)

> As a side note, I often wonder if sending the line number is always  
> the best strategy for validation. Line number is very useful
> 	for fixing one file one time.
> But as soon as we modify the file, it might change. The CSS  
> Validator gives two bits of information when possible the line  
> number and the context.

In the case of the CSS validator, the context is the selector, right?  
Would a markup validator have to extract a piece of source text by  
having a back door in the parser at the point where the bytes have  
been decoded into characters but the text is still unparsed?

> I think we maybe make mistakes when characterizing validation by  
> its results: information, error or warning. A thing can  
> alternatively be in error, in warning or have information attached  
> to it but stays the same thing. Though far from easy.

I think I don't understand what you are saying.

However, it did occur to me that I/O errors, schema loading errors  
and internal errors, which aren't the fault of the input document,  
should probably have a separate element (e.g. <incidental>). The  
presence of one or more such errors could be considered an  
indeterminate result. (That is, the document did not have a chance  
to pass or fail in its own right.) Having a separate element would  
make the format forward-compatible with new types of errors  
pertaining to the document and new types of incidental errors.
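For concreteness, a report with such an element might look like the following (the element name is only the tentative suggestion above, and the failure text is a made-up example; nothing of this is implemented):

```xml
<messages xmlns="http://hsivonen.iki.fi/validator/messages/">
  <info>Using the preset for XHTML 1.0 Strict based on the root namespace.</info>
  <!-- hypothetical I/O failure that is not the document's fault -->
  <incidental>Could not load the schema for the preset.</incidental>
</messages>
```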

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Monday, 18 December 2006 10:13:26 GMT
