Re: Semantics, Basic, OWL 2 and POWDER from Andrea Perego on 2008-05-13 (public-powderwg@w3.org from May 2008)

From: Andrea Perego <andrea.perego@uninsubria.it>
Date: Wed, 14 May 2008 01:14:56 +0200
To: Public POWDER <public-powderwg@w3.org>
Message-ID: <482A20F0.2090403@uninsubria.it>
Thanks for this update, Phil.

My comments inline.

> [snip]
> 
> Up until recently, we have worked with the idea that our IRI constraints 
> like includehosts, excludepathstartswith etc. would exist in both POWDER 
> and POWDER-S. The basic point that Stasinos makes is that these are just 
> simplifications of a regular expression. Actual processors are likely to 
> use a full reg ex.
> 
> I explored this idea with the Regular Expression-based tool at [1]. What 
> does this do? Let's take our favourite includehosts as the example.
> 
> 1. Converts a white space separated list, like "example.org example.com" 
> and turns it into a fragment of a regex so it becomes 
> "example\.org|example\.com" - call it $working
> 
> 2. It then puts that into a template regular expression:
> 
> \:\/\/([^\:\/\?\#\@]+\.)?($working)(\:([0-9]+))?\/
> 
> And then matches the whole of the candidate URI against that reg ex.
> 
> 3. Each IRI constraint has a different template reg ex but the initial 
> processing is the same.
> 
> That's important because it means that we have a common algorithm that 
> gets us from includehosts, includepathendswith etc. through to a single 
> "match the candidate URI against this set of regular expressions. If it 
> matches all of them, it's in the IRI set."
>
> The version of the grouping doc I sent to the member list on 22 April 
> [2] says that this is the approach we take but implementers may choose a 
> different method as long as it comes to the same result. Stasinos's 
> point is that we can formalise this and that we should call it 
> POWDER-BASIC. In other words, say what we're already saying but in a 
> more formal way - and let an XSLT do the heavy lifting.

Fine by me.
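
As a rough illustration of steps 1 and 2 above, here is a minimal Python
sketch. The function names are hypothetical, and it assumes that Python's
"re" dialect is close enough to the one used by the tool at [1]. The
template is the one quoted above with the redundant backslashes dropped.

```python
import re


def hosts_to_regex(includehosts: str) -> str:
    """Turn a white space separated host list into a regular expression."""
    # Step 1: escape each host and join with '|' (the "$working" fragment).
    working = "|".join(re.escape(host) for host in includehosts.split())
    # Step 2: drop the fragment into the includehosts template.
    return r"://([^:/?#@]+\.)?(" + working + r")(:([0-9]+))?/"


def host_matches(includehosts: str, candidate: str) -> bool:
    """Match the whole candidate URI against the generated expression."""
    return re.search(hosts_to_regex(includehosts), candidate) is not None
```

With "example.org example.com" as the list, http://www.example.org/foo
matches, while http://notexample.org/ does not.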

> So we have our POWDER documents that we're now getting very familiar 
> with and that Andrea is tying down in the XML schema. They're all in the 
> POWDER namespace of http://www.w3.org/2007/05/powder#.
> 
> Let's now associate an XSLT with that root namespace that does what I've 
> been thinking the processor should do and convert the string-based 
> elements into regular expressions, retaining all the attribution and 
> descriptor sets stuff untouched. i.e. only the <iriset /> elements are 
> transformed by this XSLT.
> 
> We now have a document that will be in a different namespace, let's say 
> http://www.w3.org/2008/05/powder-basic# (the fact that it's taken us a 
> year to get to this point is depressing and potentially confusing but 
> this is only ever going to be processed by machines, if ever). That 
> document would look _very_ similar to a POWDER doc. It's still XML but 
> now we only have includeregexp and excluderegexp elements in the IRI set 
> (includehosts has disappeared along the way) so just pick those up and 
> match the candidate IRI against them. 

Well, this raises an issue. The current XML schema requires each 
<iriset /> element to contain exactly one <includehosts /> element, 
i.e., an IRI set definition must put a constraint on the host component 
of candidate IRIs. The purpose is to avoid IRI set definitions of the 
type "all the resources having an IRI path starting with /foo are blue." 
Of course, we can drop this constraint but, supposing that we want to 
keep it, we have the following problem:

The proposal for the POWDER-BASIC schema requires just includeregexp, 
i.e., it does not require any constraint on the IRI host. If a 
POWDER-BASIC document is obtained from a POWDER one, the constraint on 
the IRI host is preserved, since it is in the original POWDER document. 
However, if someone writes a POWDER-BASIC document directly, we cannot 
be sure that it will constrain the IRI host.

So, we have two options:

1. Drop the constraint on the IRI host: in that case, POWDER won't 
require it, and POWDER-BASIC will be as proposed.

2. Keep the constraint: in that case, the POWDER XML schema won't 
change, but we have to add the same constraint to the POWDER-BASIC XML 
schema. One possible option: the POWDER-BASIC XML schema requires 
exactly one instance each of <includehosts /> and <includeregexp /> in 
an <iriset />.

A last question. Do we really need two distinct XML schemas? Actually, 
POWDER-BASIC is a simplified version of POWDER, and thus it can be 
included as a possible variant in the POWDER XML schema. In other words, 
we can have a single XML schema, with two distinct namespace URIs.

> Again, a dedicated POWDER 
> processor does not have to have an XSLT engine - you could pick up the 
> values of includehosts and process them directly, but you'd be doing 
> what an XSLT processor would do for you which, in some circumstances, 
> might be more efficient.
> 
> From this we can say that a POWDER Processor MUST understand POWDER and 
> POWDER-BASIC. It is not a requirement that it must convert one into the 
> other.

Fine by me. BTW, if a processor understands POWDER, it follows that it 
also understands POWDER-BASIC, since in/excluderegexp is an IRI 
constraint that POWDER supports too.
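
To make the POWDER-BASIC evaluation concrete, here is a minimal Python
sketch of the membership test. The function name is hypothetical. The
"matches all include expressions" rule is the one quoted earlier in this
message; treating a match on any excluderegexp as disqualifying is my
assumption about the intended semantics.

```python
import re
from typing import Sequence


def in_iri_set(candidate: str,
               include: Sequence[str],
               exclude: Sequence[str]) -> bool:
    """Return True if candidate matches every include and no exclude regex."""
    # The candidate must match all of the includeregexp values...
    if not all(re.search(pattern, candidate) for pattern in include):
        return False
    # ...and (assumed here) none of the excluderegexp values.
    return not any(re.search(pattern, candidate) for pattern in exclude)
```

A processor that only understands POWDER-BASIC would need little more
than this plus the code that reads the two element types from the
<iriset />.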

Moreover, I agree that it is not a requirement for the processor to 
enforce the XSL transform rules. And I would add, it should not be in 
charge of validating a POWDER doc against the XML schema either. The 
only requirement for the processor is to return the description of a 
resource upon submission of its IRI.

That said, most programming languages currently provide built-in support 
for XSLT engines and XML validators, so it would be pretty easy to 
extend a POWDER processor with these features. Moreover, in order to be 
evaluated, a POWDER document must be valid. So an alternative is 
possible: before being evaluated by the processor, a POWDER document is 
validated and then transformed into the POWDER-BASIC format. In that 
case, a POWDER processor needs to understand only POWDER-BASIC.

> Now let's associate the GRDDL transform with that new namespace that 
> will get us from POWDER-BASIC into POWDER-S (which we could just call 
> OWL if we didn't need the semantic extension to match a string to a URI).

Fine by me.

> By now I'm starting to worry about having two XSLTs that need to be 
> created and tested but remember that the first XSLT is only doing what 
> the text in the grouping doc says the processor should do. We've already 
> got the template regular expressions and a tool that proves them so 
> actually, no, this doesn't create a lot of extra work.
> 
> Would anyone ever run either XSLT? Only if they wanted to access POWDER 
> information as OWL data on the semantic web. You can still parse POWDER 
> in any way you like in a specialised processor. So what's the gain? 
> Can't we just go from POWDER to POWDER-S in one hop? Yes, but all that 
> single step would do is to do what the two proposed ones do anyway and 
> by splitting them up into two separate XSLT transforms we get increased 
> flexibility - you can jump in mid way through.
> 
> For example...
> 
> We need to show that it's possible to use ISAN numbers in a POWDER way 
> (because ISAN [3] is increasingly important in the audio-visual, i.e. 
> movie, world). ISAN numbers are URIs but they're not URLs. They have 
> roots and episodes but not hosts and paths. So to use POWDER to describe 
> things that have ISAN numbers you'd need to define the XSLT that did 
> something similar to our "chop up the strings and render them inside 
> template regular expressions" but the output would be a POWDER-BASIC 
> doc with includeregexp and excluderegexp elements ... which our POWDER 
> processors can understand.
> 
> You could choose to jump in earlier in the chain. The WAF pattern 
> matching would be a good candidate for this. An XSLT would split up the 
> value of <includeiripattern /> and turn it into includeschemes, 
> includehosts and includeports as required. Then you can transform that 
> into POWDER-BASIC if you need to.
> 
> It would be nice, but not essential, if we could create XSLTs that did 
> this - that would be a good exercise in the CR phase I'd say but not 
> essential for the normative documents.

On this issue, I think it is important to make clear which processing 
steps are possible when dealing with extensions. If I've understood 
correctly what you've said, we have two processing options:

1. POWDER + extensions -> POWDER ( -> POWDER-BASIC [optional] )

2. POWDER + extensions -------------> POWDER-BASIC

Provided that we support both, are there any reasons why a specific 
extension should use the former or the latter? Some words on this might 
be useful to make clear how extensions can be implemented.

> Now... I've remained silent on the issue of CIDR blocks and port ranges. 
> That's because they're harder to deal with if you want to treat them as 
> numbers. So here's a route to a lot less heartache and bother. Forget 
> CIDR blocks and IP Ranges as numbers and just treat them as strings.
> 
> If you want to define an IRI set in terms of an IP address, or set of IP 
> addresses, list them as hosts. If you really want to use CIDR blocks, 
> write an XSLT that creates an enumerated list in a POWDER-BASIC doc from 
> a CIDR block.
> 
> Likewise with ports. If we retain in/excludeports but ditch port ranges, 
> then again, you can list port numbers. It's the idea of ports 76-300 
> that causes the problem because you need to do arithmetic. Matching '70' 
> against '70' is as easily performed as a string match as it is a 
> numerical match. Again, one could write an extension (XSLT) that 
> converted a port range into an enumerated list. Stasinos tells me that, 
> at its longest, that could be a list of 32000 integers. That's a big 
> number but, importantly, a finite one. Anyway, if we support ports and 
> not port ranges, it ceases to be an issue.
> 
> If someone has a compelling use case for supporting CIDR blocks and/or 
> port ranges, as opposed to port numbers and IP addresses, OK, we can 
> think again, but the proposal here is to drop the complex in favour of 
> the simple which appears to be sufficient.

Fine by me. Unless anybody disagrees, I'll update the XML schema 
accordingly.
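
For what it's worth, the two "enumerate it" extensions are small. Here
is a minimal sketch, written in Python rather than XSLT purely for
illustration. The function names and the "76-300" range syntax are
assumptions, and the CIDR expansion lists every address in the block,
including the network and broadcast addresses.

```python
import ipaddress


def expand_port_range(port_range: str) -> str:
    """Expand 'first-last' into a white space separated list of ports."""
    first, last = (int(part) for part in port_range.split("-"))
    return " ".join(str(port) for port in range(first, last + 1))


def expand_cidr(block: str) -> str:
    """Expand a CIDR block into a white space separated list of addresses."""
    network = ipaddress.ip_network(block)
    return " ".join(str(address) for address in network)
```

The outputs are plain white space separated strings, so they could be
used directly as values for in/excludeports and in/excludehosts.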

> So let's recap this as some proposals:
> 
> 1. We introduce a new layer called POWDER-BASIC which has a new 
> namespace and is identical to POWDER except that IRI sets are expressed 
> solely in terms of includeregex and excluderegex elements.

+1

> 2. The XSLT that transforms POWDER to POWDER-S is split in two. In the 
> first 'half', associated with the POWDER root namespace, the values of 
> elements like includehosts, excludeschemes etc. are processed to become 
> regular expressions that are values for includeregex and excluderegex in 
> a POWDER-BASIC document. The attribution, descriptor set and tag set 
> elements remain unchanged. The template regular expressions are listed 
> in the grouping doc of 22 April and can be tested at [1].

+1

> 3. The second half of the XSLT completes the transformation to POWDER-S.

+1

> 4. POWDER elements referring to CIDR blocks are deleted. The document 
> will advise that an IP address is a legal value for a host in the 
> in/excludehosts element. Where a host is defined in terms of an IP 
> address, a POWDER processor MUST look up the IP address of the candidate 
> IRI (that last bit sounds a little dodgy doesn't it? Can we improve on it?)

Actually, both DNS lookup and reverse DNS lookup should be required. For 
example, suppose the domain name www.example.org corresponds to IP 
address 123.123.123.123. Then the following IRIs:
- http://123.123.123.123/doc1.html
- http://www.example.org/doc1.html
denote the same resource. So, a DR with scope "all the resources hosted 
by www.example.org" should imply "all the resources hosted by 
123.123.123.123" (and vice versa).
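
One way a processor might approximate this, as a sketch only: resolve
both the candidate host and the host in the constraint, then compare
the resulting addresses. This is a simplification of what I describe
above: it uses forward lookups only (an IP literal resolves to itself),
handles IPv4 only, and ignores hosts with several addresses and shared
(virtual) hosting. The function name and the "resolve" parameter are
hypothetical.

```python
import socket


def same_host(host_a: str, host_b: str, resolve=socket.gethostbyname) -> bool:
    """Treat two host strings as equal if they resolve to the same address."""
    # gethostbyname returns an IPv4 address string; given an IPv4 literal
    # it returns that literal unchanged, so names and addresses can be mixed.
    return resolve(host_a) == resolve(host_b)
```

The resolver is a parameter so that the comparison can be exercised
without network access, e.g. with a lookup table.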

> 5. The in/excludeport ranges elements are amended to in/exclude ports 
> which take a white space separated list of ports, not port ranges.

+1

> If accepted, this has implications for the documents and our work flow 
> as follows.
> 
> [snip]
> 
> 4. The XML schema that Andrea's working on - no change needed. Then copy 
> and paste with a namespace difference and just allowing includeregex and 
> excluderegex in the iriset element and that's the POWDER-BASIC schema done.

See comments above.

> [snip]

Andrea
Received on Tuesday, 13 May 2008 23:15:47 UTC