Re: proposal to have sequential / grouped messages in soap output from Henri Sivonen on 2007-10-31 (www-validator@w3.org from October 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 31 Oct 2007 18:29:42 +0200
To: olivier Thereaux <ot@w3.org>
Cc: W3C Validator Community <www-validator@w3.org>, "Chris. Parrish" <chris.forummail@swankinnovations.com>, Brett Bieber <brett.bieber@gmail.com>, Struan Donald <struandonald@gmail.com>
Message-Id: <6C618939-FE96-40F6-B929-BD1486073359@iki.fi>
Hi,

On Oct 30, 2007, at 21:19, olivier Thereaux wrote:

> On Oct 29, 2007, at 12:37 , Henri Sivonen wrote:
>> Does the sequential output require a rewrite of client code? If it  
>> does anyway, it might make sense to drop the SOAPness and make it  
>> plain old XML. Or are clients actually benefiting from the SOAP  
>> envelope in terms of tool support in a way that would break with POX?
>
> As far as I can tell most implementations just parse the XML of the  
> SOAP output. I think one of them does build upon a SOAP library and  
> thus expects the format to be in a SOAP envelope.
>
> One option I am pondering about is to leave the SOAP output as it  
> is (that is, with its oddly grouped messages) and revive an XML  
> output.

Makes sense.

> I looked at:
> http://wiki.whatwg.org/wiki/Validator.nu_XML_Output
> and it does look usable. The more I look at it, the more I think  
> the W3C validator could adopt this as XML output

Cool.

> (we used to have one but never really documented and since then  
> deprecated, we could revive it) provided we can make a few  
> (backward compatible) changes.
>
> * adding a warning element to info and error - would be nicer IMHO  
> than having warnings a type of info

Making warnings a type of info was a careful forward-compatibility  
design decision. There are three main classes of messages that  
clients need to know about in order to compute the validation outcome  
from message classes in a forward-compatible way. These main classes  
map to elements. The repertoire of message elements is not extensible  
without breaking the outcome computation semantics. The type  
attribute values are extensible in a forward-compatible way without  
breaking outcome computation in clients that do not know about a  
particular type attribute value.

All kinds of messages that do not imply invalidity and do not imply a  
non-document failure have the same element (<info>). Since warnings  
are a special case of this general class of messages, the warningness  
is in the type attribute, since the distinction between warning and  
other informative messages does not participate in the outcome  
computation.
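A minimal sketch of how a client might compute the outcome from element names alone, ignoring type values it does not recognize. The sample document is made up, the namespace is omitted for brevity, and the precedence (non-document-error before error) is an assumption rather than something stated in this message.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response: element names follow this thread,
# the message texts and the "some-future-type" value are invented.
SAMPLE = """<messages>
  <info type="warning"><message>Something looks odd.</message></info>
  <info type="some-future-type"><message>A note of an unknown kind.</message></info>
  <error><message>Stray end tag.</message></error>
</messages>"""


def local_name(tag):
    # Drop a "{namespace}" prefix if the document declares one.
    return tag.rsplit("}", 1)[-1]


def outcome(xml_text):
    """Return 'success', 'failure' or 'indeterminate' from element names only."""
    names = {local_name(child.tag) for child in ET.fromstring(xml_text)}
    if "non-document-error" in names:
        return "indeterminate"  # assumed precedence: checker could not finish
    if "error" in names:
        return "failure"
    # <info> never affects the outcome, whatever its type attribute says.
    return "success"


print(outcome(SAMPLE))  # failure
```

Because the type attribute is never consulted, a client written this way keeps computing the same outcome when a validator starts emitting new type values.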

> * checkedby

I think this is information that the client should already know, but  
adding a URI that points to the checker would be harmless except for  
the response size increase. (It should probably be called checked-by  
for consistency with the other hyphenated names.)

Instinctively, I'd make checked-by an optional attribute on the root  
element (taking a URI as the value). This assumes that the producer  
of the result format is writing out its own identity that it always  
knows in advance.

If this format were to be used by Unicorn as an output format, would  
it be necessary to mark checked-by on a per-message basis? If yes,  
then message elements should have a checked-by attribute as well and  
in the absence of the attribute, the checked-by attribute on the root  
element would be taken as the indication of source of the message.
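The fallback rule could be sketched like this on the consumer side. The checked-by attribute is only a proposal in this thread, and the URIs below are placeholders, not real checker addresses.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: checked-by on the root, overridden on one message.
SAMPLE = """<messages checked-by="http://checker-a.example/">
  <error><message>From the default checker.</message></error>
  <info checked-by="http://checker-b.example/"><message>From another tool.</message></info>
</messages>"""


def message_sources(xml_text):
    """Return (source URI, message text) pairs for each child of the root."""
    root = ET.fromstring(xml_text)
    default = root.get("checked-by")  # None when the optional attribute is absent
    return [
        # A per-message checked-by wins; otherwise use the root's value.
        (child.get("checked-by", default), child.findtext("message"))
        for child in root
    ]


for source, text in message_sources(SAMPLE):
    print(source, text)
```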

> validity,

An earlier draft of the format had an explicit tri-state
(success/failure/indeterminate) outcome indicator element, but I removed it  
before I started implementing, because the format is otherwise  
designed to support forward-compatible computation of the outcome  
from the message data. Therefore, a validity indicator would always  
be either redundant or in disagreement with the messages due to a  
bug. For the latter case, the processing model would have to define  
what clients are to do if they get inconsistent data, which would  
complicate the spec.

> doctype,

I forgot this when I said earlier that there were only two things  
that the W3C Validator HTML output had but the Validator.nu XML  
format couldn't capture in its current form.

How about adding an optional element <doctype> that has two optional  
attributes: public and system? The content of the element could  
optionally contain a human-readable characterization of the doctype  
(e.g. "HTML 4.01 Strict"). The <doctype> element would be allowed as  
a child of the <messages> element (in any position relative to its  
siblings; that is, the validator could emit it as early or late as  
implementation-wise practical).
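To make the proposal concrete, here is a sketch of what such an element might look like and how a client could read it. The element is hypothetical at this point, and the sample response is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <doctype> appears after a message, since its
# position relative to its siblings is not fixed.
SAMPLE = """<messages>
  <error><message>Stray end tag.</message></error>
  <doctype public="-//W3C//DTD HTML 4.01//EN"
           system="http://www.w3.org/TR/html4/strict.dtd">HTML 4.01 Strict</doctype>
</messages>"""


def read_doctype(xml_text):
    """Return the doctype details as a dict, or None if no <doctype> was sent."""
    element = ET.fromstring(xml_text).find("doctype")  # direct child, any position
    if element is None:
        return None
    return {
        "public": element.get("public"),  # both attributes are optional
        "system": element.get("system"),
        "label": (element.text or "").strip() or None,  # human-readable text, optional
    }


print(read_doctype(SAMPLE))
```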

> charset,

Validator.nu currently reports the HTTP-level charset when source  
code is included in the response, since the HTTP-level charset is  
considered a metadatum of the source code.

However, the actual character encoding used for decoding the document  
is not reported anywhere when there's no HTTP-level declaration and  
the encoding is determined from the content.

An encoding name would naturally be the kind of data that goes in an  
attribute if you consider how the format otherwise puts things that  
aren't human-readable messages or source code in attributes. This  
raises the question of what element should host the attribute.  
Suppose there was an element called <metadata> for hanging various  
metadata attributes onto. An encoding attribute (to avoid "charset"  
per charmod) could go onto that element. The element could also have  
an attribute stating the root namespace. But then that raises a  
question why doctype would be an element on its own instead of its  
attributes being part of this new element.

The easy way out is to ask: Does the charset really need to be  
stated? :-)

> errorcount, warningcount etc

These are redundant data, but if they are added, they should probably  
be error-count and warning-count for consistency. When redundant data  
like error-count or warning-count is optional (I agree they should  
not be required), it isn't particularly useful. A consumer cannot  
trust optional data to be there. Therefore, a robust consumer that  
needs the error or warning count needs to be able to count the errors  
or warnings on its own. Once a consumer is able to count them anyway,  
it doesn't need the counts to be explicitly stated.
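For illustration, counting on the consumer side takes only a few lines. The sample below is made up; it treats a warning as an <info> element whose type attribute is "warning", as described earlier in this message.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response with two errors, one warning and one plain info.
SAMPLE = """<messages>
  <error><message>First error.</message></error>
  <info type="warning"><message>A warning.</message></info>
  <error><message>Second error.</message></error>
  <info><message>Just a note.</message></info>
</messages>"""


def count_messages(xml_text):
    """Return (error count, warning count) computed from the messages themselves."""
    root = ET.fromstring(xml_text)
    errors = sum(1 for child in root if child.tag == "error")
    warnings = sum(
        1 for child in root
        if child.tag == "info" and child.get("type") == "warning"
    )
    return errors, warnings


print(count_messages(SAMPLE))  # (2, 1)
```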

> let's make them optional, but I think they are useful.

I agree that the features you suggested are best left optional.

> They aren't a problem for a streaming response, if sent at the very  
> end, anyway.

Agreed. This could be relaxed a bit by saying that this new stuff can  
occur as late as the generator chooses.

> * some kind of identifier for the errors. I realize this may bring  
> some headaches if the format is shared by various tools, but for  
> localization and/or customization, it'd be extremely useful.

I guess every message element (<info>, <error>, <non-document-error>)  
could be given an attribute called e.g. message-id that gives an  
implementation-specific message identifier. (It should not be called  
id, since implying IDness would be bad as there can be multiple  
instances of a given message.) It could further be stipulated that  
since message-id is implementation-specific, checked-by SHOULD be  
used (on the root or on the message) when message-id is used. If a  
client does not recognize the checked-by value and, thus, is unable  
to use implementation-specific semantics, it could still compare the  
message-id values for strict string equality to discover which  
messages are instances of the "same" message. Hence, equivalence  
classes could be established without knowing the semantics of the  
equivalence classes.

Aside: I realize that the W3C Validator wants to communicate message  
ids, and I'm not trying to fight that. However, I'm probably not  
going to emit message ids from Validator.nu in the foreseeable  
future. Validator.nu emits errors from many different places  
including an HTTP client wrapper, parsers, RELAX NG validator(s),  
Schematron validator(s) and custom Java code. There is no error  
identification scheme even inside Validator.nu itself let alone  
between different online validators. Moreover, I have doubts about  
the usefulness of message ids for localization: you don't get  
parameters that went into a message formatter in their unformatted  
form, so you might as well run pattern matching against the error  
message itself directly.

> The output format you created is sequential, which is a good basis  
> for what we need. We'd also need a way to group errors by type, but  
> that can be an alternative format with a similar base. The main  
> issue is that your locator elements give their location as  
> attributes, which makes it hard to represent that a tool found  
> several instances of a given message.
>
> What do you think?

I think different grouping options are a UI feature. A  
software-to-software Web service API format should merely communicate  
sufficient data for the consumer to be able to group messages for its UI.  
The data format does not need to change its data ordering when a consumer  
wants to show a grouped UI. Assuming a message-id attribute,  
consumers could group by that if they so choose. In the absence of  
the attribute, consumers could group by comparing the text content of  
the <message> element.
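A sketch of such consumer-side grouping over a sequential response: group by message-id when present, otherwise by the text of the <message> element. The message-id attribute is only a proposal in this thread, the sample is invented, and the last-line locator attribute is used here on the assumption that it matches the attribute name in the Validator.nu format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sequential response; the first two messages carry a
# message-id, the last two can only be matched by their text.
SAMPLE = """<messages>
  <error message-id="stray-end-tag" last-line="4"><message>Stray end tag p.</message></error>
  <info type="warning" last-line="2"><message>Legacy doctype.</message></info>
  <error message-id="stray-end-tag" last-line="9"><message>Stray end tag div.</message></error>
  <info type="warning" last-line="7"><message>Legacy doctype.</message></info>
</messages>"""


def group_messages(xml_text):
    """Map each group key to the list of last-line values of its instances."""
    groups = defaultdict(list)
    for child in ET.fromstring(xml_text):
        text = child.findtext("message")
        if text is None:
            continue  # not a message element
        message_id = child.get("message-id")
        # Tag the key so an id can never collide with a message text.
        key = ("id", message_id) if message_id is not None else ("text", text)
        groups[key].append(child.get("last-line"))
    return dict(groups)


for key, lines in group_messages(SAMPLE).items():
    print(key, lines)
```

The response stays in document order; only the consumer decides whether and how to regroup it.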

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 31 October 2007 16:30:24 UTC