Semantics, Basic, OWL 2 and POWDER

Stasinos and I have been discussing further the relationship between the 
different versions of POWDER. I think I now have a better understanding 
of what he's been saying and I'll try and convey that here.

This e-mail amounts to a small set of proposals to the group. Sorry it's 
on the long side but it's not trivial.

Up until recently, we have worked with the idea that our IRI constraints 
like includehosts, excludepathstartswith etc. would exist in both POWDER 
and POWDER-S. The basic point that Stasinos makes is that these are just 
simplifications of a regular expression. Actual processors are likely to 
use a full reg ex.

I explored this idea with the Regular Expression-based tool at [1]. What 
does this do? Let's take our favourite includehosts as the example.

1. It takes a white space separated list, like "example.org example.com", 
and turns it into a fragment of a regex, so it becomes 
"example\.org|example\.com" - call it $working

2. It then puts that into a template regular expression:

\:\/\/([^\:\/\?\#\@]+\.)?($working)(\:([0-9]+))?\/

And then matches the whole of the candidate URI against that reg ex.

3. Each IRI constraint has a different template reg ex but the initial 
processing is the same.
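
To make steps 1 and 2 concrete, here is a minimal sketch in Python of the includehosts case. The template is the one quoted above; the function names (hosts_to_regex, matches_hosts) are made up for illustration and are not taken from the tool at [1].

```python
import re

# Template quoted in step 2; "$working" is the placeholder for the host alternation.
HOST_TEMPLATE = r"\:\/\/([^\:\/\?\#\@]+\.)?($working)(\:([0-9]+))?\/"


def hosts_to_regex(includehosts: str) -> str:
    """Turn a white space separated host list into the includehosts regex."""
    # Step 1: escape each host and join the alternatives with '|'.
    working = "|".join(re.escape(host) for host in includehosts.split())
    # Step 2: drop that fragment into the template.
    return HOST_TEMPLATE.replace("$working", working)


def matches_hosts(candidate: str, includehosts: str) -> bool:
    """Test the whole candidate URI against the generated regex."""
    # The template is not anchored, so we search within the full URI string.
    return re.search(hosts_to_regex(includehosts), candidate) is not None


if __name__ == "__main__":
    print(hosts_to_regex("example.org example.com"))
    print(matches_hosts("http://www.example.org/foo", "example.org example.com"))
```

Note that the template ends in a slash, so under this sketch the candidate needs at least "/" after the host (and optional port) for the match to succeed.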

That's important because it means that we have a common algorithm that 
gets us from includehosts, includepathendswith etc. through to a single 
"match the candidate URI against this set of regular expressions. If it 
matches all of them, it's in the IRI set."

The version of the grouping doc I sent to the member list on 22 April 
[2] says that this is the approach we take but implementers may choose a 
different method as long as it comes to the same result. Stasinos's 
point is that we can formalise this and that we should call it 
POWDER-BASIC. In other words, say what we're already saying but in a 
more formal way - and let an XSLT do the heavy lifting.

So we have our POWDER documents that we're now getting very familiar 
with and that Andrea is tying down in the XML schema. They're all in the 
POWDER namespace of http://www.w3.org/2007/05/powder#.

Let's now associate an XSLT with that root namespace that does what I've 
been thinking the processor should do and convert the string-based 
elements into regular expressions, retaining all the attribution and 
descriptor sets stuff untouched. i.e. only the <iriset /> elements are 
transformed by this XSLT.

We now have a document that will be in a different namespace, let's say 
http://www.w3.org/2008/05/powder-basic# (the fact that it's taken us a 
year to get to this point is depressing and potentially confusing but 
this is only ever going to be processed by machines, if ever). That 
document would look _very_ similar to a POWDER doc. It's still XML but 
now we only have includeregex and excluderegex elements in the IRI set 
(includehosts has disappeared along the way) so just pick those up and 
match the candidate IRI against them. Again, a dedicated POWDER 
processor does not have to have an XSLT engine - you could pick up the 
values of includehosts and process them directly. You'd be doing 
yourself what an XSLT processor would otherwise do for you, which, in 
some circumstances, might be more efficient.
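
As a rough illustration of how little a processor has to do with such a document, here is a sketch in Python. The namespace is the one suggested above, and the element names follow the spelling used in the proposals further down (includeregex, excluderegex). The sample fragment, the /private/ pattern and the function name in_iri_set are invented for the example. It also assumes that a candidate must match every include pattern and no exclude pattern - the exclude half of that rule is my assumption here, not something the grouping doc is being quoted on.

```python
import re
import xml.etree.ElementTree as ET

# Namespace suggested in the message for POWDER-BASIC documents.
BASIC_NS = "http://www.w3.org/2008/05/powder-basic#"

# Hypothetical fragment; the exact document layout is an assumption.
SAMPLE = r"""
<iriset xmlns="http://www.w3.org/2008/05/powder-basic#">
  <includeregex>\:\/\/([^\:\/\?\#\@]+\.)?(example\.org)(\:([0-9]+))?\/</includeregex>
  <excluderegex>\/private\/</excluderegex>
</iriset>
"""


def in_iri_set(candidate: str, iriset_xml: str) -> bool:
    """Decide IRI set membership from include/exclude regex elements."""
    root = ET.fromstring(iriset_xml)
    ns = {"pb": BASIC_NS}
    includes = [e.text for e in root.findall("pb:includeregex", ns)]
    excludes = [e.text for e in root.findall("pb:excluderegex", ns)]
    # Every include pattern must match the candidate...
    matches_all = all(re.search(p, candidate) for p in includes)
    # ...and (assumed here) no exclude pattern may match it.
    matches_none = not any(re.search(p, candidate) for p in excludes)
    return matches_all and matches_none


if __name__ == "__main__":
    print(in_iri_set("http://www.example.org/index.html", SAMPLE))
    print(in_iri_set("http://www.example.org/private/x", SAMPLE))
```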

From this we can say that a POWDER Processor MUST understand POWDER and 
POWDER-BASIC. It is not a requirement that it must convert one into the 
other.

Now let's associate the GRDDL transform with that new namespace that 
will get us from POWDER-BASIC into POWDER-S (which we could just call 
OWL if we didn't need the semantic extension to match a string to a URI).

By now I'm starting to worry about having two XSLTs that need to be 
created and tested but remember that the first XSLT is only doing what 
the text in the grouping doc says the processor should do. We've already 
got the template regular expressions and a tool that proves them so 
actually, no, this doesn't create a lot of extra work.

Would anyone ever run either XSLT? Only if they wanted to access POWDER 
information as OWL data on the semantic web. You can still parse POWDER 
in any way you like in a specialised processor. So what's the gain? 
Can't we just go from POWDER to POWDER-S in one hop? Yes, but all that 
single step would do is to do what the two proposed ones do anyway and 
by splitting them up into two separate XSLT transforms we get increased 
flexibility - you can jump in mid way through.

For example...

We need to show that it's possible to use ISAN numbers in a POWDER way 
(because ISAN [3] is increasingly important in the audio-visual, i.e. 
movie, world). ISAN numbers are URIs but they're not URLs. They have 
roots and episodes but not hosts and paths. So to use POWDER to describe 
things that have ISAN numbers you'd need to define the XSLT that did 
something similar to our "chop up the strings and render them inside 
template regular expressions" but the output would be a POWDER-BASIC 
doc with includeregex and excluderegex elements ... which our POWDER 
processors can understand.
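
Purely as a sketch of what such an extension might do, here it is in Python rather than XSLT. Everything specific is an assumption for illustration and would need checking against the ISAN specification: that the identifier is written as "urn:isan:" followed by hyphen-separated hex groups, that the first three groups are the root and the fourth is the episode, and that a white space separated list of roots is the input. The identifiers and the function name isan_roots_to_regex are made up.

```python
import re

# Assumed template: 'urn:isan:' + root (three 4-digit hex groups) + '-' + 4-digit
# hex episode, followed by either another group or the end of the string.
ISAN_ROOT_TEMPLATE = r"^urn:isan:($working)-[0-9A-Fa-f]{4}(-|$)"


def isan_roots_to_regex(roots: str) -> str:
    """Turn a white space separated list of ISAN roots into one regex."""
    # Same initial processing as for includehosts: escape, then join with '|'.
    working = "|".join(re.escape(root) for root in roots.split())
    return ISAN_ROOT_TEMPLATE.replace("$working", working)


if __name__ == "__main__":
    pattern = isan_roots_to_regex("0000-0000-3A8D 0000-0001-8CFA")
    # Any episode of a listed root should match; other roots should not.
    print(bool(re.search(pattern, "urn:isan:0000-0000-3A8D-0001-Z")))
    print(bool(re.search(pattern, "urn:isan:0000-0000-9E59-0000-O")))
```

The resulting string would then be the value of an include regex element in the POWDER-BASIC doc.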

You could choose to jump in earlier in the chain. The WAF pattern 
matching would be a good candidate for this. An XSLT would split up the 
value of <includeiripattern /> and turn it into includeschemes, 
includehosts and includeports as required. Then you can transform that 
into POWDER-BASIC if you need to.

It would be nice if we could create XSLTs that did this. That would be 
a good exercise in the CR phase, I'd say, but it is not essential for 
the normative documents.

Now... I've remained silent on the issue of CIDR blocks and port ranges. 
That's because they're harder to deal with if you want to treat them as 
numbers. So here's a route to a lot less heartache and bother. Forget 
CIDR blocks and IP Ranges as numbers and just treat them as strings.

If you want to define an IRI set in terms of an IP address, or set of IP 
addresses, list them as hosts. If you really want to use CIDR blocks, 
write an XSLT that creates an enumerated list in a POWDER-BASIC doc from 
a CIDR block.
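
For what it's worth, the enumeration itself is a few lines in most languages. A sketch in Python, using the standard ipaddress module (the function name cidr_to_hosts is made up, and this stands in for the XSLT extension rather than being one):

```python
import ipaddress


def cidr_to_hosts(cidr: str) -> str:
    """Expand a CIDR block into a white space separated list of IP addresses."""
    # strict=True rejects blocks with host bits set, e.g. "192.0.2.1/30".
    network = ipaddress.ip_network(cidr, strict=True)
    # Iterating a network yields every address in it, first to last.
    return " ".join(str(address) for address in network)


if __name__ == "__main__":
    # The result can be used directly as a list of hosts.
    print(cidr_to_hosts("192.0.2.0/30"))
```

The list grows quickly with the size of the block (a /16 is 65536 addresses), so this is only sensible for small blocks.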

Likewise with ports. If we retain in/excludeports but ditch port ranges, 
then again, you can list port numbers. It's the idea of ports 76-300 
that causes the problem because you need to do arithmetic. Matching '70' 
against '70' is as easily performed as a string match as it is a 
numerical match. Again, one could write an extension (XSLT) that 
converted a port range into an enumerated list. Stasinos tells me that, 
at its longest, that could be a list of 32000 integers. That's a big 
number but, importantly, a finite one. Anyway, if we support ports and 
not port ranges, it ceases to be an issue.
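
Such an extension is small. A sketch in Python of the range-to-list step, assuming the range is written "low-high" as in the example above (the function name port_range_to_ports is made up):

```python
def port_range_to_ports(port_range: str) -> str:
    """Expand "low-high" into a white space separated list of port numbers."""
    low_text, high_text = port_range.split("-")
    low, high = int(low_text), int(high_text)
    if low > high:
        raise ValueError(f"empty port range: {port_range}")
    # The only arithmetic happens here, once, at transformation time.
    return " ".join(str(port) for port in range(low, high + 1))


if __name__ == "__main__":
    ports = port_range_to_ports("76-300")
    # From here on the processor only ever compares strings.
    print("80" in ports.split())
```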

If someone has a compelling use case for supporting CIDR blocks and/or 
port ranges, as opposed to port numbers and IP addresses, OK, we can 
think again, but the proposal here is to drop the complex in favour of 
the simple which appears to be sufficient.

So let's recap this as some proposals:

1. We introduce a new layer called POWDER-BASIC which has a new 
namespace and is identical to POWDER except that IRI sets are expressed 
solely in terms of includeregex and excluderegex elements.

2. The XSLT that transforms POWDER to POWDER-S is split in two. In the 
first 'half', associated with the POWDER root namespace, the values of 
elements like includehosts, excludeschemes etc. are processed to become 
regular expressions that are values for includeregex and excluderegex in 
a POWDER-BASIC document. The attribution, descriptor set and tag set 
elements remain unchanged. The template regular expressions are listed 
in the grouping doc of 22 April and can be tested at [1].

3. The second half of the XSLT completes the transformation to POWDER-S.

4. POWDER elements referring to CIDR blocks are deleted. The document 
will advise that an IP address is a legal value for a host in the 
in/excludehosts element. Where a host is defined in terms of an IP 
address, a POWDER processor MUST look up the IP address of the candidate 
IRI (that last bit sounds a little dodgy doesn't it? Can we improve on it?)

5. The in/exclude port range elements are replaced by in/excludeports 
elements, which take a white space separated list of ports, not port ranges.

If accepted, this has implications for the documents and our work flow 
as follows.

1. Stasinos will amend the draft formal semantics doc to reflect these 
changes. The term 'POWDER-FORMAL' goes. The horrendous complexity around 
CIDR blocks and port ranges goes. His intention is to refer to OWL 2 
(now in public draft) as this allows greater flexibility in the 
declaration of data types. I'll bring this up in the SW Coordination 
Group. The examples in the latest version are still valid.

The semantic extension defined by JJC to link a string to a URI is still 
required as originally written. POWDER-S still isn't quite OWL, although 
OWL 2 gets us closer to 'native semantics.'

2. I'll begin to edit the grouping doc to reflect the changes. The set 
theory stuff all stays, then it will define the in/excluderegex elements 
in the POWDER-BASIC namespace before going on to define includehosts 
etc. in the POWDER namespace. Several aspects of the doc will be 
significantly simpler.

3. The DR doc will need only minor changes since the descriptor set and 
attribution elements are untouched by any of this. (I'm working on that 
today, hoping to resolve to publish a new version on Monday's call).

4. The XML schema that Andrea's working on - no change needed. Copy 
and paste it, change the namespace and allow only includeregex and 
excluderegex in the iriset element, and that's the POWDER-BASIC schema done.

5. The XSLT that Kevin/Andrea are working on - no change to what has 
been done already.

6. Effect on POWDER processor - trivial.

Fly in the ointment. I'm attending a conference in Nuremberg Wed - Fri 
this week, next week ends on Thursday for me as I then go away Friday 
23-30 May. Hmmm... Lots to do.


[1] http://www.icra.org/regularexpression/
[2] http://lists.w3.org/Archives/Member/member-powderwg/2008Apr/0041.html
[3] http://www.isan.org/

Received on Tuesday, 13 May 2008 13:23:54 UTC