- From: Phil Archer <parcher@icra.org>
- Date: Tue, 13 May 2008 14:23:13 +0100
- To: Public POWDER <public-powderwg@w3.org>
Stasinos and I have been discussing further the relationship between the different versions of POWDER. I think I now have a better understanding of what he's been saying and I'll try and convey that here. This e-mail amounts to a small set of proposals to the group. Sorry it's on the long side but it's not trivial.

Up until recently, we have worked with the idea that our IRI constraints like includehosts, excludepathstartswith etc. would exist in both POWDER and POWDER-S. The basic point that Stasinos makes is that these are just simplifications of a regular expression. Actual processors are likely to use a full reg ex.

I explored this idea with the Regular Expression-based tool at [1]. What does this do? Let's take our favourite includehosts as the example.

1. It takes a white space separated list, like "example.org example.com", and turns it into a fragment of a regex so it becomes "example\.org|example\.com" - call it $working

2. It then puts that into a template regular expression:

   \:\/\/([^\:\/\?\#\@]+\.)?($working)(\:([0-9]+))?\/

   And then matches the whole of the candidate URI against that reg ex.

3. Each IRI constraint has a different template reg ex but the initial processing is the same.

That's important because it means that we have a common algorithm that gets us from includehosts, includepathendswith etc. through to a single instruction: "match the candidate URI against this set of regular expressions. If it matches all of them, it's in the IRI set."

The version of the grouping doc I sent to the member list on 22 April [2] says that this is the approach we take but implementers may choose a different method as long as it comes to the same result.

Stasinos's point is that we can formalise this and that we should call it POWDER-BASIC. In other words, say what we're already saying but in a more formal way - and let an XSLT do the heavy lifting.

So we have our POWDER documents that we're now getting very familiar with and that Andrea is tying down in the XML schema.
They're all in the POWDER namespace of http://www.w3.org/2007/05/powder#. Let's now associate an XSLT with that root namespace that does what I've been thinking the processor should do and convert the string-based elements into regular expressions, retaining all the attribution and descriptor sets stuff untouched. i.e. only the <iriset /> elements are transformed by this XSLT.

We now have a document that will be in a different namespace, let's say http://www.w3.org/2008/05/powder-basic# (the fact that it's taken us a year to get to this point is depressing and potentially confusing but this is only ever going to be processed by machines, if ever).

That document would look _very_ similar to a POWDER doc. It's still XML but now we only have includeregexp and excluderegexp elements in the IRI set (includehosts has disappeared along the way) so just pick those up and match the candidate IRI against them.

Again, a dedicated POWDER processor does not have to have an XSLT engine - you could pick up the values of includehosts and process them directly, but you'd be doing what an XSLT processor would do for you which, in some circumstances, might be more efficient.

From this we can say that a POWDER Processor MUST understand POWDER and POWDER-BASIC. It is not a requirement that it must convert one into the other.

Now let's associate the GRDDL transform with that new namespace that will get us from POWDER-BASIC into POWDER-S (which we could just call OWL if we didn't need the semantic extension to match a string to a URI).

By now I'm starting to worry about having two XSLTs that need to be created and tested but remember that the first XSLT is only doing what the text in the grouping doc says the processor should do. We've already got the template regular expressions and a tool that proves them so actually, no, this doesn't create a lot of extra work.

Would anyone ever run either XSLT? Only if they wanted to access POWDER information as OWL data on the semantic web.
You can still parse POWDER in any way you like in a specialised processor.

So what's the gain? Can't we just go from POWDER to POWDER-S in one hop? Yes, but all that single step would do is what the two proposed ones do anyway, and by splitting them up into two separate XSLT transforms we get increased flexibility - you can jump in midway through. For example...

We need to show that it's possible to use ISAN numbers in a POWDER way (because ISAN [3] is increasingly important in the audio-visual (i.e. movie) world). ISAN numbers are URIs but they're not URLs. They have roots and episodes but not hosts and paths. So to use POWDER to describe things that have ISAN numbers you'd need to define the XSLT that did something similar to our "chop up the strings and render them inside template regular expressions" but the output would be a POWDER-BASIC doc with includeregexp and excluderegexp elements ... which our POWDER processors can understand.

You could choose to jump in earlier in the chain. The WAF pattern matching would be a good candidate for this. An XSLT would split up the value of <includeiripattern /> and turn it into includeschemes, includehosts and includeports as required. Then you can transform that into POWDER-BASIC if you need to.

It would be nice, but not essential, if we could create XSLTs that did this - that would be a good exercise in the CR phase I'd say but not essential for the normative documents.

Now... I've remained silent on the issue of CIDR blocks and port ranges. That's because they're harder to deal with if you want to treat them as numbers. So here's a route to a lot less heartache and bother.

Forget CIDR blocks and IP ranges as numbers and just treat them as strings. If you want to define an IRI set in terms of an IP address, or set of IP addresses, list them as hosts. If you really want to use CIDR blocks, write an XSLT that creates an enumerated list in a POWDER-BASIC doc from a CIDR block. Likewise with ports.
If we retain in/excludeports but ditch port ranges, then again, you can list port numbers. It's the idea of ports 76-300 that causes the problem because you need to do arithmetic. Matching '70' against '70' is as easily performed as a string match as it is a numerical match. Again, one could write an extension (XSLT) that converted a port range into an enumerated list. Stasinos tells me that, at its longest, that could be a list of 32000 integers. That's a big number but, importantly, a finite one. Anyway, if we support ports and not port ranges, it ceases to be an issue.

If someone has a compelling use case for supporting CIDR blocks and/or port ranges, as opposed to port numbers and IP addresses, OK, we can think again, but the proposal here is to drop the complex in favour of the simple which appears to be sufficient.

So let's recap this as some proposals:

1. We introduce a new layer called POWDER-BASIC which has a new namespace and is identical to POWDER except that IRI sets are expressed solely in terms of includeregex and excluderegex elements.

2. The XSLT that transforms POWDER to POWDER-S is split in two. In the first 'half', associated with the POWDER root namespace, the values of elements like includehosts, excludeschemes etc. are processed to become regular expressions that are values for includeregex and excluderegex in a POWDER-BASIC document. The attribution, descriptor set and tag set elements remain unchanged. The template regular expressions are listed in the grouping doc of 22 April and can be tested at [1].

3. The second half of the XSLT completes the transformation to POWDER-S.

4. POWDER elements referring to CIDR blocks are deleted. The document will advise that an IP address is a legal value for a host in the in/excludehosts element. Where a host is defined in terms of an IP address, a POWDER processor MUST look up the IP address of the candidate IRI (that last bit sounds a little dodgy doesn't it? Can we improve on it?)

5. The in/excludeport ranges elements are amended to in/excludeports elements, which take a white space separated list of ports, not port ranges.

If accepted, this has implications for the documents and our work flow as follows.

1. Stasinos will amend the draft formal semantics doc to reflect these changes. The term 'POWDER-FORMAL' goes. The horrendous complexity around CIDR blocks and port ranges goes. His intention is to refer to OWL 2 (now in public draft) as this allows greater flexibility in the declaration of data types. I'll bring this up in the SW Coordination Group. The examples in the latest version are still valid. The semantic extension defined by JJC to link a string to a URI is still required as originally written. POWDER-S still isn't quite OWL, although OWL 2 gets us closer to 'native semantics.'

2. I'll begin to edit the grouping doc to reflect the changes. The set theory stuff all stays, then it will define the in/excluderegex elements in the POWDER-BASIC namespace before going on to define includehosts etc. in the POWDER namespace. Several aspects of the doc will be significantly simpler.

3. The DR doc will need only minor changes since the descriptor set and attribution elements are untouched by any of this. (I'm working on that today, hoping to resolve to publish a new version on Monday's call).

4. The XML schema that Andrea's working on - no change needed. Then copy and paste with a namespace difference and just allowing includeregex and excluderegex in the iriset element and that's the POWDER-BASIC schema done.

5. The XSLT that Kevin/Andrea are working on - no change to what has been done already.

6. Effect on POWDER processor - trivial.

Fly in the ointment. I'm attending a conference in Nuremberg Wed - Fri this week, next week ends on Thursday for me as I then go away Friday 23 - 30 May. Hmmm... Lots to do.

[1] http://www.icra.org/regularexpression/
[2] http://lists.w3.org/Archives/Member/member-powderwg/2008Apr/0041.html
[3] http://www.isan.org/
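
As a rough illustration of the includehosts processing described near the top of this message (white space separated list, to regex fragment, to template, to match), here is a minimal Python sketch. It is not the XSLT and not the tool at [1]. The function names are hypothetical and chosen only for this example. It uses a regex search rather than a whole-string match, on the assumption that this is what is intended, since the template itself starts at "://". The two helpers at the end sketch the "enumerate a CIDR block or port range into a plain list" idea, again purely as an assumption about how such an extension might behave.

```python
import ipaddress
import re

# Template regular expression for includehosts, copied from the message.
# "$working" is the placeholder for the processed host list.
HOST_TEMPLATE = r"\:\/\/([^\:\/\?\#\@]+\.)?($working)(\:([0-9]+))?\/"


def includehosts_to_regex(value: str) -> str:
    """Turn a white space separated host list into a full regex string."""
    # Step 1: escape each host and join with "|" to get the fragment.
    working = "|".join(re.escape(host) for host in value.split())
    # Step 2: drop the fragment into the template.
    return HOST_TEMPLATE.replace("$working", working)


def in_iri_set(candidate: str, includehosts: str) -> bool:
    """Test a candidate IRI against the includehosts constraint only."""
    # Assumption: search, not full match, because the template has no
    # anchors and begins at the "://" separator.
    return re.search(includehosts_to_regex(includehosts), candidate) is not None


def expand_cidr(cidr: str) -> str:
    """Enumerate every address in a CIDR block as a host list string."""
    return " ".join(str(addr) for addr in ipaddress.ip_network(cidr))


def expand_port_range(first: int, last: int) -> str:
    """Enumerate an inclusive port range as a white space separated list."""
    return " ".join(str(port) for port in range(first, last + 1))
```

For example, in_iri_set("http://www.example.org/foo", "example.org example.com") is true, while a candidate on example.net is not in the set. The output of expand_cidr can be passed straight in as the includehosts value, so IP addresses are only ever compared as strings.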
Received on Tuesday, 13 May 2008 13:23:54 UTC