Evaluating Stuff With EARL from Sean B. Palmer on 2001-12-13 (w3c-wai-er-ig@w3.org from December 2001)

From: Sean B. Palmer <sean@mysterylights.com>
Date: Thu, 13 Dec 2001 00:26:27 -0000
To: <w3c-wai-er-ig@w3.org>
Message-ID: <00b301c1836c$d0857b80$ceb80150@localhost>
Sounds like it should be a simple thing, doesn't it? But we had all
that stuff to consider: years of background discussions on
identification and resources, the XPointer nightmare, equivalence
measures, trying to manage big sacks of terms, and so on. Defining
"stuff", and deploying a consistent language with which one can make
claims about what they want, isn't easy.

But I really want to stabilize EARL; to say, "here, it's done, go away,
stop bothering us". And we can, if we follow a plan...

There are some terms in EARL that simply do not change - they have
remained consistent since the EARLiest drafts for 0.9. These include
terms such as "earl:asserts", "earl:passes", and so on. These are core
terms; they are convincingly stable. On the next level are the model
terms that should have been consistent, but may need to change because
of the "identification" discussions - terms like "earl:testSubject",
which I shall refer to. On the last level are the utility terms, stuff
like "earl:name", terms that we can sprinkle liberally into the
vocabulary space: packing the schema.

We expect languages to be able to evolve, but when they do, there is
always some trade off. Natural languages especially evolve rapidly,
taking on colloquial forms, and integrating cultural idioms into the
mainstream of the language. With programming and computer
data-oriented languages, we have to make the changes "jerkier", since
tools are not as clever as humans... they don't expect things to
change, and when they do, they don't know what to do about it.

Part of the vision of the Semantic Web was to ease the pain of new
versions of languages. Part of the reason for choosing RDF in the
first place for EARL was so that we could upgrade more easily. However,
in practice, it doesn't quite work like that, for two reasons:

* there are few Semantic Web tools;
* people will always want to create non-SW tools that can still read
  EARL.

So there is a certain tension between the two ways of using EARL.
AFAICT, the latter method is bound to be more popular.

As languages evolve, the grammar and the structure change. EARL is no
exception to that rule. The aim is to let people add extensions to the
language, but make sure that these extensions can still be recognized
enough by current agents. This comes under the umbrella of two
phrases: forwards compatibility, and partial understanding. It is
fairly easy to ground these in terms of the EARL model.

Let's take the example of the result property. This is the kind of
property that says whether or not something has "passed", "failed",
or whatever. We had facilities in 0.95 for customizing the result
properties, perhaps adding a new type of result, or confidence levels.
However, we were thinking of dropping them, since they would probably
not be supported until some point way off in the future, and by then
would break current tools.

However, it should be possible to give tools some hope of recognizing
the new properties. Of course, from the Semantic Web/RDF POV it's
incredibly easy - not worth giving a second thought in some cases, but
we want to approach it from the POV of the general EARL user. What do
they have to sniff in order to come to conclusions about new result
properties?

Currently, the validity of a validity property is the most important
part. If EARL clients could simply search for the validity of any new
property that they did not understand, then it is possible for them to
work out roughly what is going on - partial understanding. Let me give
an example. The usual kind of model is the following:-

   :Sean earl:asserts { :MyPage earl:fails :MyCheckPoint } .

Now, we know (because it's a standard fact of the EARL language, in
0.95) that earl:fails has a validity of "fails". This kind of
information should be built into the clients, so that when they come
across an extension, they can roughly compare it to what they
currently know. Let's say that someone adds a property that lets them
give the confidence of a result. Such an example might follow the
following format:-

   :Sean earl:asserts { :MyPage :kindaFails :MyCheckPoint } .
   :kindaFails earl:validity earl:Fail; blargh:confidence :Low .

[Ignoring the fact that confidence levels were part of the language -
imagine we took them out, or didn't have them in the first place. I
can't predict what other extensions might be made, otherwise I'd add
them now]. A processor which didn't understand ":kindaFails" should
be made to look up the validity of this property. It could find
"earl:Fail", and roughly conclude that the page fails the checkpoint.
It's not particularly accurate, *but* sometimes getting along is
preferable to totally breaking.
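
The fallback described above can be sketched in a few lines of Python for a non-SW client. This is only an illustration under stated assumptions: triples are held as plain tuples of strings, the function name rough_validity is made up, and the built-in table (including "earl:Pass" for earl:passes) is a guess at what a client might hard-code rather than something taken from the EARL schema.

```python
# Hypothetical sketch: a report is a list of (subject, predicate, object)
# tuples of strings. None of these names come from an EARL specification.

# Validities the client is assumed to know out of the box.
BUILTIN_VALIDITY = {
    "earl:passes": "earl:Pass",  # assumed by analogy with earl:Fail
    "earl:fails": "earl:Fail",
}

def rough_validity(predicate, triples):
    """Return the validity of a result property, or None if unknown."""
    if predicate in BUILTIN_VALIDITY:
        return BUILTIN_VALIDITY[predicate]
    # Unknown extension: look for an explicit earl:validity statement.
    for s, p, o in triples:
        if s == predicate and p == "earl:validity":
            return o
    return None

report = [
    (":kindaFails", "earl:validity", "earl:Fail"),
    (":kindaFails", "blargh:confidence", ":Low"),
]
print(rough_validity(":kindaFails", report))
```

The client ignores blargh:confidence entirely, which is the point: it gets along with partial understanding instead of rejecting the report.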

And so there is the tension of just how much EARL clients should be
expected to know, just how much they need to be able to infer - what
are the core parts of EARL that we want processors to recognize? Of
course, the goal from my POV is to simply add inference to any
application that processes EARL, but that isn't practical. I'll leave
this as a semi-open question for now.

The other problem with stabilizing the language is that Web
architecture is a little bit screwy. Fragment IDs on URI-refs only
apply consistently when the URI-ref is being used in a retrieval
action, and the content-type of the representation can be known. I had
hoped that with a little ironing of the specifications (and this is a
matter of some contention) that genericity could be added to fragID
space, such that interoperability could be maintained between content
negotiation, etc. XPointer seemed to run contrary to what, to me, are
important principles, but then it is only following the current trend
of specifications.

The EARL "testSubject" property is interesting, because it attempts to
bulldoze over all of these problems, but it's more of a hack than
anything. Really, it means "the thing that I take to be represented
by", or sometimes, "a representation of" (using the word repr. in two
different senses there). Consider the following:-

   :MyPage earl:testSubject <http://example.org/> .
   :MyChunklet earl:testSubject <http://example.org/#blargh> .
   :MyTool earl:testSubject <urn:x-tools:SomeTool> .
   :WhatIsThis earl:testSubject <http://cam-seven-fish.ext/> .
   :XMLChunklet earl:testSubject <http://example.org/#xpointer([...])> .

Can anyone honestly tell me what is being identified by the subjects
of each of those statements? I certainly can't. I can give you popular
interpretations, but I can't say definitely, because it's a sea of
opinions out there. Then, we have the further questions:-

* How do I identify only the attribute values in some representation?
* How do I point to things in non-XML languages?

Without a very, very clear view of Web architecture and "what is
identified", you get a mess. This is a very important part of EARL, on
the top end of the assertion.

On the bottom end of the assertion (the TestCase) we have further
worries, that I won't get into a lot, but it's basically the question
of whether we *specify* or *point* to a TestCase (or both).

I don't really want to start using the word "context", but it seems as
if I may have to... in the "context" of an EARL report, there are two
considerations that need to be made: what is identified within the
report, and what is identified on the Web. Considering the
"testSubject" statements again, the subjects of those statements are
things that are very RDF/EARL model specific - we don't care what they
resolve to on the Web, we simply care what is said about them in EARL
reports. For the objects of the triples, the opposite is true: we care
most about what they actually mean, what is referenced by the object.

Aaron has argued for a long time that fragURIs are harmful to RDF: get
them out of the specification. I did not agree with him; to some
extent I still don't agree with him... but it does seem as if for EARL
test subjects, we need to be very careful about what we are pointing
at. Let me define the following categories of "things":-

WebContent - representations of a resource. This is simply a series of
bytes, perhaps with a content type and language type attached. The
thing about WebContent is that you can point inside them and say "this
bit", just as you can point inside a sentence and say "the fifth
word". You can use an XPointer expression on it if you know that the
MIME type is XML.

Tool - some program that evaluates, authors, fixes, displays. This is
generally an abstract concept, but it is a special type of concept. It
may have an online description, it may be identified by a URI already
(or it may not), it may have a version, a code repository, and so
forth.

Document - some kind of thing, generally with IP rights, that can be
evaluated. This may be an article. It is most certainly a resource,
and may have a number of representations attached to it. An example
"Document" is the W3C homepage - it can be rated as a work of art, or
whatever. It's a generalization of a set of WebContent.

and then, of course, there are all other resources - bananas, the
concept of love, dew on grass in springtime. Whatever. The above are
things that we can most easily ground in the Web, but that often get
confused for one another.

I recently proposed to get rid of "testSubject", for its vagueness,
and add reprOf. For reprOf, the domain would be WebContent. This means
that the object would have to be some kind of thing either slightly
more abstract than Document (seems to be Al's POV) or exactly document
(seems to be TimBL's POV). It would rule out having XPointers as an
object... which is fine: anything with an XPointer shoved onto the end
of it clearly does not identify anything abstract - it identifies a
chunk of XML.

So, what if we want to talk about a chunk of XML? How do we use an
XPointer? Well, now that I think about it, it kinda makes sense to use
the XPointered URI-ref as the subject itself:-

   <http://example.org/#xpointer([...])> a earl:WebContent .

I did at first want to make sure that all "WebContent" instances had a
"reprOf" property dangling from them... it still seems kinda necessary
to me: you have to say that the above is a bit of XML content,
otherwise it is meaningless. You also have to hack into the URI-ref
itself to get the xpointer. I would much prefer to use the following:-

   [ a earl:WebContent; earl:reprOf <http://example.org/>;
     earl:mime "text/xml"; earl:xpointer "xpointer([...])" ] .
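
For tools that are handed an XPointered URI-ref and want the second form, the "hacking into the URI-ref" step might look like the Python sketch below. It uses the standard library's urllib.parse.urldefrag; the function name, the dictionary layout, the default MIME type, and the sample XPointer expression are all assumptions made for illustration.

```python
from urllib.parse import urldefrag

def web_content_description(uri_ref, mime="text/xml"):
    """Split a URI-ref into the thing represented and the pointer into it.

    The fragment is carried along as an opaque string; nothing here
    parses or evaluates the XPointer expression itself.
    """
    url, fragment = urldefrag(uri_ref)
    return {
        "a": "earl:WebContent",
        "earl:reprOf": url,
        "earl:mime": mime,          # assumed default, supplied by the caller
        "earl:xpointer": fragment,
    }

# Hypothetical XPointer expression, used only as sample input.
desc = web_content_description("http://example.org/#xpointer(id('main'))")
print(desc["earl:reprOf"], desc["earl:xpointer"])
```

Keeping the pointer as a separate literal means the description still makes sense to a client that knows nothing about fragment identifiers.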

It means that we kinda boycott XPointers as FragIDs, but I think of
that more as a benefit for both EARL and Web architecture than
anything else :-)

So it is feasible to set a cardinality restriction on all instances of
WebContent such that they must have one and only one earl:reprOf arc
hanging from them.
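
A client without a schema-aware RDF processor could still check that restriction by hand. The following Python sketch assumes triples stored as tuples of strings and uses "rdf:type" where N3 writes "a"; the function name is hypothetical.

```python
from collections import Counter

def repr_of_violations(triples):
    """Return WebContent nodes that lack exactly one earl:reprOf arc."""
    web_content = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "earl:WebContent"
    }
    # Count the earl:reprOf arcs leaving each subject.
    counts = Counter(s for s, p, o in triples if p == "earl:reprOf")
    return sorted(node for node in web_content if counts[node] != 1)

triples = [
    ("_:good", "rdf:type", "earl:WebContent"),
    ("_:good", "earl:reprOf", "http://example.org/"),
    ("_:none", "rdf:type", "earl:WebContent"),
    ("_:two", "rdf:type", "earl:WebContent"),
    ("_:two", "earl:reprOf", "http://example.org/a"),
    ("_:two", "earl:reprOf", "http://example.org/b"),
]
print(repr_of_violations(triples))
```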

Jim immediately asked what happens about identifying all of the other
things that we want to talk about. Well, let's start with tools. It is
possible to give enough information about a tool to disambiguate it
from other tools. A "homepage" (for want of a better word), "version",
and "author" combination is probably enough to swing it. In fact,
"homepage" and "version/date" may constitute an UnambiguousPropertySet,
or at least, we can specify it as being so. So, an example test
subject which is a tool would be the following:-

   [ a earl:Tool; earl:homepage <http://mytool.org/>;
     earl:version "1.0" ] .
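
If "homepage" plus "version" really is treated as an unambiguous property set, then deciding whether two descriptions talk about the same tool reduces to comparing that pair. A rough Python sketch, assuming each anonymous node has been flattened into a dictionary (the helper names are made up):

```python
def tool_key(description):
    """The (homepage, version) pair assumed to identify a tool."""
    return (description["earl:homepage"], description["earl:version"])

def same_tool(a, b):
    """Two descriptions denote one tool if their keys match."""
    return tool_key(a) == tool_key(b)

first = {"earl:homepage": "http://mytool.org/", "earl:version": "1.0"}
second = {"earl:homepage": "http://mytool.org/", "earl:version": "1.0",
          "earl:name": "My Tool"}
newer = {"earl:homepage": "http://mytool.org/", "earl:version": "1.1"}

print(same_tool(first, second), same_tool(first, newer))
```

Extra properties such as a name do not affect identity here; only the chosen pair does.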

Now, we come to the more intriguing stuff - from the large (documents
and abstract concepts) to the small (all attributes in document x) to
a mixture (how to evaluate multiple things in EARL).

For the large stuff, we can just come up with a class which is
everything that Tool and WebContent aren't. We could further probe
some of the stuff that TimBL and Al et al. have been talking about,
but I don't think that it has all that much relevance to what we're
doing; or rather, not as much as I did. On the scale of the larger SW,
it's one of the most interesting topics... but here, I'm just going to
brush it off.

For the small stuff, and equivalence measures... this is interesting.
I managed to scribble something in the telecon on Monday that created
a good class for equivalence relations. Basically, they must be
reflexive, transitive, and symmetric. It is then a matter of some
simplicity to come up with an equivalence class for such a
relationship, where for each member a, and each other member a', where
the equivalence relationship is p, I believe the following always has
to be true: p(a, a'). There are some other things that you can
conclude from that, but I need to brush up on it a little... for now,
it is enough to work out how it can be deployed in EARL.
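
To make the three conditions concrete, here is a small Python sketch that tests whether a relation, given as a set of ordered pairs over a finite set of members, is reflexive, symmetric, and transitive. The function name and the sample data are illustrative only.

```python
from itertools import product

def is_equivalence(members, pairs):
    """Check reflexivity, symmetry and transitivity of a finite relation."""
    related = set(pairs)
    reflexive = all((a, a) in related for a in members)
    symmetric = all((b, a) in related for a, b in related)
    # For every chain a-b, b-d the pair a-d must also be present.
    transitive = all(
        (a, d) in related
        for (a, b), (c, d) in product(related, repeat=2)
        if b == c
    )
    return reflexive and symmetric and transitive

members = {"a1", "a2", "a3"}
p = {("a1", "a1"), ("a2", "a2"), ("a3", "a3"), ("a1", "a2"), ("a2", "a1")}
print(is_equivalence(members, p))
```

Dropping ("a2", "a1") from p breaks symmetry, and the check then fails.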

Equivalence relations are something which are, to a great extent,
implementor specific, but we can come up with a framework for
expressing the relationships, and some simple examples.

To create an equivalence relationship, we could set up a class in EARL
so that people can just say:-

   :sameAttributeContentAs a earl:EquivalenceRelationship .

then we could define a predicate - "earl:equivalenceClassOf" - with an
obvious usage. Of course, on many occasions, it will probably just be
easier to use the predicate a few times between test subjects, rather
than defining a custom class.
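
When the predicate is simply used a few times between test subjects, a client can recover the equivalence classes itself. The Python sketch below does this with a small union-find; it assumes every subject mentioned in a statement is also listed in subjects, and the function name is hypothetical.

```python
def equivalence_classes(subjects, statements):
    """Group subjects linked, directly or indirectly, by the relation."""
    parent = {s: s for s in subjects}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in statements:
        parent[find(a)] = find(b)

    classes = {}
    for s in subjects:
        classes.setdefault(find(s), set()).add(s)
    return sorted(sorted(c) for c in classes.values())

subjects = [":a", ":b", ":c", ":d"]
statements = [(":a", ":b"), (":b", ":c")]  # :x :sameAttributeContentAs :y
print(equivalence_classes(subjects, statements))
```

Symmetry and transitivity are taken for granted here, which is only safe if the predicate was declared to be an equivalence relationship.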

The main conclusion from this is that equivalence classes are rather
insignificant insofar as EARL is concerned. They're simply too local
to concern us too much. We'll put a basic framework into the language,
and then let people use it. It's not something that we need to spend a
great deal of time discussing.

The "mixture" question is quite simple to answer: you make the
statements one by one, or you provide some implementation-specific
method for pointing to a whole range of test subjects, and then define
some conversion to standard EARL. If you can't or don't define the
conversion, then there is no way that we can say "this is standard
EARL". I don't think that it makes good sense to start putting RegExp
syntax into a basic evaluation language. It certainly runs contrary to
the principle of least power, which itself is a derivative of KISS.
Bags are pointless too. We have a method for evaluating things one at a
time, and that's enough.
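
The "conversion to standard EARL" can be as dull as a loop. As a sketch, assuming a tool has some private way of producing a list of test subjects, this Python function expands it into one assertion per subject; the nested tuple stands in for the N3 formula, and all names are illustrative.

```python
def expand(assertor, subjects, result, test_case):
    """Turn one ranged claim into one standard assertion per subject."""
    return [
        (assertor, "earl:asserts", (subject, result, test_case))
        for subject in subjects
    ]

# Hypothetical output of an implementation-specific selector.
pages = ["http://example.org/a", "http://example.org/b"]
for statement in expand(":Sean", pages, "earl:fails", ":MyCheckPoint"):
    print(statement)
```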

So there you have it; EARL 1.0 architecture in a nutshell. Now that we
have that under our belt, I can go through a summary of the answers to
some of the recent "issues" that have presented themselves (as
summarized excellently in Wendy's upcoming talk, deriving from the
recent ER discussions):-

[[[
Issue 1: Identifying State Changes
Issue 2: Combining and Querying Results
Issue 3: Threading
Issue 4: Test Subjects
]]]

Identifying State Changes: this is basically the equivalence measures
thing. Since it varies from implementation to implementation, and is
information that is only necessary to have within a certain range of
data processing (the step after EARL - what happens to the EARL), we
simply don't care about it too much. We'll provide a framework for
interoperability, and perhaps collect some of the scenarios etc.

Combining and Querying Results: Go find a decent RDF processor, or
build one yourself. I had hoped that there would be millions of the
things by now, but I was wrong... there are few decent ones. I
recommend CWM, of course, but it's not for everyone, it seems. It's an
implementation question, at any rate.

Threading: Corollary to "Identifying State Changes". Doesn't concern
me.

Test Subjects: We now have a stable model for people to implement,
outlined above, which I shall incorporate into EARL 1.0, and which you
can all provide feedback on, until such a time as the chair decides
that the language is stable enough to release. I don't mind how many
iterations it takes, but I'm confident in myself that the above is
"it". Evolutionary measures are a different question.

Cheers,

--
Kindest Regards,
Sean B. Palmer
@prefix : <http://purl.org/net/swn#> .
:Sean :homepage <http://purl.org/net/sbp/> .
Received on Wednesday, 12 December 2001 19:27:30 UTC