POWDER Formalisms 2 - this time with promised addendum from Phil Archer on 2008-04-25 (public-powderwg@w3.org from April 2008)

From: Phil Archer <parcher@icra.org>
Date: Fri, 25 Apr 2008 15:22:16 +0100
To: Public POWDER <public-powderwg@w3.org>
Message-ID: <4811E918.8010206@icra.org>
In two recent e-mails [1,2], Stasinos has presented a formalism of 
POWDER semantics. Since [2] revises a chunk of [1] I have copied and 
pasted the full revised text below for easy reference. The semantics of 
the descriptor set is clear and is in line with what we've been 
discussing recently, and for which I have tried to derive generalised 
rules for the GRDDL (XSLT) transform [3].

However, what Stas is proposing for the IRI set semantics differs 
substantially from what is in the latest (member only) version of the 
Grouping Doc [4] (and the currently published one, [5]). And of course, 
it's the semantics of our IRI sets that sets POWDER apart from other 
parts of the Semantic Web.

Jeremy's solution, proposed on this list and discussed further in 
Athens, is to define a Semantic Extension for a property wdr:hasIRI that 
effectively maps "http://example.com/" to <http://example.com> and then 
says of other properties "see Semantic Extension above for details". It 
uses the same approach and mathematical terminology as used to define 
the formal semantics of RDF itself [6].

Stas is suggesting something much more detailed - and arguably more 
precise as a result. For example, under the Jeremy model we'd retain 
terms like 'includehosts' in POWDER-S. Stas says we can do away with 
that and (programmatically) reduce such elements to a regular expression 
- which I've been working on and, I think, proved [7] - at least for the 
string-based ones.

As I understand it, Stas has gone a little further and established a 
framework for the expression of POWDER-S in which the processing steps 
to be taken are made explicit through the use of elements from the XSLT 
2 and XSD namespaces. For each element it says "extract /this value/ 
from the candidate IRI and match it against /this/ value" - with regular 
expressions etc. provided. The same approach works for the string parts 
of a URI and the numerical ones (port numbers and CIDR blocks) - 
although the latter are, of necessity, complex. Note, even this approach 
requires the semantic extension that maps strings to IRIs.

So, one reading of the Stasinos approach is that a POWDER Processor 
must support XSLT 2. Kevin clarified that this is not the case [8] 
(thankfully!). So it seems to me that the XSLT 2 and XSD elements 
formalise the semantics and processing model for POWDER, but do not, of 
themselves, create a constraint on implementations.

OK - our task now is to decide what to do with all this. Bearing in mind 
that at our face to face in Athens in January we said we'd be ready for 
Last Call 'by Easter'. We hoped that we meant Occidental Easter but 
given the location of that meeting we really meant Orthodox Easter - and 
that means today, 25th April which is Good Friday by the Orthodox 
calendar. Whatever we do - we're already late and we are seriously 
running out of time. Our already extended charter expires at the end of 
the year (31st December by Orthodox and Occidental calendars!) - and we 
have the small matter of CR and PR to get through yet.

I see three possibilities - if you have a fourth, please say so.

1. We carry on as we are. The Semantic Extension in the grouping doc is 
cited as the formal basis for an IRI set and we quietly leave Stasinos' 
work to one side, perhaps using sections where appropriate for defining 
what 'A POWDER Processor' must do.

2. We incorporate Stasinos' work into the two primary tech documents, 
probably replacing the text of the semantic extension section of the grouping 
document at [5] (I'm not sure how to do this).

3. We discuss and tidy up Stas's work a little but essentially we 
already have 90% of a new document called "POWDER: Formal Semantics". 
We re-phrase the relevant sections of the DR and Grouping docs, passing 
the formalism off to this new doc.

Options 2 and 3 have two possible variants:

Variant a) POWDER-S includes all the XSLT 2 and XSD elements Stas has used.

Variant b) POWDER-S looks like it does in my 'try Again' e-mail [3] but 
the semantics of the terms regex (and, I think, portranges and ipranges) 
are defined in the Stasinos style. In other words we have a new layer to 
our semantics:

1. POWDER - Nice and friendly, mostly XML
2. POWDER-S - RDF/OWL* - i.e. it's OWL if you know what you've got
3. POWDER-Formal - What POWDER-S means

Of these we would only fully implement POWDER 1 and 2 as part of our CR 
work (as we plan now) but 3 would provide further underpinning for 
POWDER-S and for the implementation of it. We would surely have to do at 
least some testing using an XSLT 2 tool to give the formalism some 
validity - if only to check the angle brackets.

Stasinos' paper is below the references.

WDYT?

Phil.


[1] http://lists.w3.org/Archives/Public/public-powderwg/2008Mar/0017.html
[2] http://lists.w3.org/Archives/Member/member-powderwg/2008Apr/0044.html
[3] http://lists.w3.org/Archives/Public/public-powderwg/2008Apr/0054.html
[4] http://lists.w3.org/Archives/Member/member-powderwg/2008Apr/0041.html
[5] http://www.w3.org/TR/2008/WD-powder-grouping-20080324/#formalSemantics
[6] http://www.w3.org/TR/rdf-mt/
[7] http://lists.w3.org/Archives/Public/public-powderwg/2008Apr/0013.html
[8] http://lists.w3.org/Archives/Member/member-powderwg/2008Apr/0048.html





Intro
=====

POWDER/XML documents receive formal semantics through a GRDDL
transform, associated with the POWDER namespace, that allows the XML
data to be rendered and processed as OWL/RDF. Or, rather, POWDER-S, a
fragment of OWL/RDF extended in a way that allows referring to and
operating upon the string representation of a resource.

The POWDER/XML format specifies a number of elements denoting
attribution, validity time, and other issues relating to the level of
trust assigned to a POWDER document. These fall through the transform
and are not meant to be interpreted in OWL/RDF; they are only
meaningful when used by POWDER tools that use them as input to an
extra-logical procedure which MAY use this data to decide whether the
POWDER document _as a whole_ should be taken into account or
discarded. We shall not deal with these elements any further, and
proceed under the assumption that our document has passed all relevant
tests.

Unqualified names should be assumed to be in the wdr: namespace.


DR Semantics
============

POWDER documents are used to describe sets of resources using
description vocabularies defined in RDF or plain string literals (tags).
POWDER/XML documents have <dr/> elements, each assigning all and every
member of a set of descriptors to a set of resources.

As an example, consider:

<dr>
  <iriset>...</iriset>
  <descriptorset>
    <voc:colour ref="http://rgb.org/colours.rdf#red"/>
    <voc:shape>square</voc:shape>
    <tag>red</tag>
    <tag>light red</tag>
    <taglist>light red</taglist>
  </descriptorset>
</dr>

where <iriset/> specifies a set of resources in a way that will be
dealt with later, and voc: is an arbitrary RDF vocabulary.

The <voc:colour/> element specifies that the <voc:colour/> relation
holds between all resources specified by <iriset/> and the
http://rgb.org/colours.rdf#red resource.

The content of <voc:shape/> is interpreted as a string literal. The
<voc:shape/> element specifies that all resources in <iriset/>
have the value "square" for the <voc:shape/> datatype property.

<tag/> is a string property defined by POWDER. Its content is a
single string literal, possibly including spaces.
<taglist/> is a string property defined by POWDER. Its content is a
space-separated list of string literals.

The overall description of the resources in <iriset/> is the union of
the descriptions in the <descriptorset/>. In our example:
  a voc:colour relation to http://rgb.org/colours.rdf#red
AND
  a voc:shape "square"
AND
  the tags "red", "light", and "light red"
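As a rough illustration of this reading, the following Python sketch expands a descriptor set into the flat list of (property, value) pairs given above. It is not part of POWDER: the function name expand_descriptors, the namespace URIs bound to wdr: and voc:, and the choice to drop repeated pairs are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, chosen only so that the example parses.
WDR = "http://example.org/wdr#"
VOC = "http://example.org/voc#"

DESCRIPTORSET = f"""
<descriptorset xmlns="{WDR}" xmlns:voc="{VOC}">
  <voc:colour ref="http://rgb.org/colours.rdf#red"/>
  <voc:shape>square</voc:shape>
  <tag>red</tag>
  <tag>light red</tag>
  <taglist>light red</taglist>
</descriptorset>
"""

def expand_descriptors(xml_text):
    """Return the (property, value) pairs asserted by a <descriptorset/>."""
    pairs = []
    for child in ET.fromstring(xml_text):
        prop = child.tag  # ElementTree spells qualified names as {namespace}local
        text = (child.text or "").strip()
        if prop == f"{{{WDR}}}taglist":
            # A taglist is a space-separated list: one wdr:tag per token.
            values = text.split()
            prop = f"{{{WDR}}}tag"
        elif "ref" in child.attrib:
            # A ref attribute points at a resource rather than a literal.
            values = [child.attrib["ref"]]
        else:
            # Anything else, including <tag/>, is a single string literal.
            values = [text]
        for value in values:
            if (prop, value) not in pairs:  # the description is a set: drop repeats
                pairs.append((prop, value))
    return pairs

for prop, value in expand_descriptors(DESCRIPTORSET):
    print(prop, "=", value)
```

The <tag/> content "light red" stays one literal, while the same content in <taglist/> contributes "light" and "red" separately.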

We formally interpret the above as follows: there is an OWL class
containing all resources that share all of these properties, and there
is an OWL class of all resources denoted by <iriset/>, and the latter
is a subset of the former. In OWL/RDF we say:

<RDF>

   <owl:Class rdf:ID="resourceset_1">
     all resources specified by <iriset>...</iriset>
   </owl:Class>

   <owl:Class rdf:ID="description_1">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:colour"/>
          <owl:hasValue rdf:resource="http://rgb.org/colours.rdf#red"/>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:shape"/>
          <owl:hasValue>square</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="wdr:tag"/>
          <owl:hasValue>red</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="wdr:tag"/>
          <owl:hasValue>light</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="wdr:tag"/>
          <owl:hasValue>light red</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
   </owl:Class>

   <owl:Class rdf:about="#resourceset_1">
     <rdfs:subClassOf rdf:resource="#description_1"/>
   </owl:Class>

</RDF>

It is possible to have more than one <iriset/> element, in which case
a resource receives all of the descriptions by belonging to any
one of the corresponding resource sets. For example:

<dr>
  <iriset>.1.</iriset>
  <iriset>.2.</iriset>
  <descriptorset>
    <voc:colour ref="http://rgb.org/colours.rdf#red"/>
    <taglist>light red</taglist>
  </descriptorset>
</dr>

receives the following semantics:

<RDF>

   <owl:Class rdf:ID="resourceset_1">
     all resources specified by <iriset>.1.</iriset>
   </owl:Class>

   <owl:Class rdf:ID="resourceset_2">
     all resources specified by <iriset>.2.</iriset>
   </owl:Class>

   <owl:Class rdf:ID="description_1">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:colour"/>
          <owl:hasValue rdf:resource="http://rgb.org/colours.rdf#red"/>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="wdr:tag"/>
          <owl:hasValue>red</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="wdr:tag"/>
          <owl:hasValue>light</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
   </owl:Class>

   <owl:Class>
     <owl:unionOf rdf:parseType="Collection">
       <owl:Class rdf:about="#resourceset_1"/>
       <owl:Class rdf:about="#resourceset_2"/>
     </owl:unionOf>
     <rdfs:subClassOf rdf:resource="#description_1"/>
   </owl:Class>

</RDF>

The ordering of irisets is not important. A POWDER/XML implementation
is free to choose any traversal policy for treating multiple <iriset/>
elements in a DR (in the order listed, first match wins, last match
wins, shortest irisets first, and so on), as long as all irisets are
tried before deciding that a resource is outside the scope of the DR.

DR authors may use the order of the irisets to suggest an efficient
scope evaluation strategy, by putting the irisets with the widest
coverage first, so that an implementation that chooses to follow the
suggested evaluation order is more likely to terminate the evaluation
after fewer checks.
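A small Python sketch of this point, under the assumption that each iriset can be modelled as a predicate over the IRI string (the predicates wide and narrow and the function dr_applies are invented for the illustration): the outcome does not depend on the order, only the number of checks does.

```python
def dr_applies(iri, irisets):
    """Return (applies, checks): whether any IRI set matches, and how many were tried."""
    checks = 0
    for in_set in irisets:
        checks += 1
        if in_set(iri):
            return True, checks  # one match is enough, so stop early
    return False, checks  # only now is the IRI known to be out of scope

# Two hypothetical IRI sets, modelled as predicates over the IRI string.
wide = lambda iri: iri.startswith("http://example.org/")
narrow = lambda iri: iri.startswith("http://example.org/images/")

iri = "http://example.org/about.html"
print(dr_applies(iri, [wide, narrow]))  # (True, 1)
print(dr_applies(iri, [narrow, wide]))  # (True, 2)
```

Both orders agree that the DR applies; listing the wider set first simply reaches that answer after one check instead of two.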


POWDER Semantics
================

A POWDER document may have any number of <dr> elements, all of which
are asserted simultaneously; their ordering is not important. So, for
example:

<powder>
   <dr>
    <iriset>.1.</iriset>
    <descriptorset>
      <voc:shape>square</voc:shape>
    </descriptorset>
   </dr>
   <dr>
    <iriset>.2.</iriset>
    <descriptorset>
      <voc:colour ref="http://rgb.org/colours.rdf#red"/>
    </descriptorset>
   </dr>
</powder>

receives the following semantics:

<RDF>
   <owl:Class rdf:ID="resourceset_1">
     all resources specified by <iriset>.1.</iriset>
   </owl:Class>

   <owl:Class rdf:ID="resourceset_2">
     all resources specified by <iriset>.2.</iriset>
   </owl:Class>

   <owl:Class rdf:ID="description_1">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:shape"/>
          <owl:hasValue>square</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
   </owl:Class>

   <owl:Class rdf:ID="description_2">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:colour"/>
          <owl:hasValue rdf:resource="http://rgb.org/colours.rdf#red"/>
        </owl:Restriction>
      </owl:intersectionOf>
   </owl:Class>

   <owl:Class rdf:about="#resourceset_1">
     <rdfs:subClassOf rdf:resource="#description_1"/>
   </owl:Class>

   <owl:Class rdf:about="#resourceset_2">
     <rdfs:subClassOf rdf:resource="#description_2"/>
   </owl:Class>
</RDF>

The <owl:intersectionOf/> of a singleton collection is just the
collection's single element, so the <owl:intersectionOf/> element is
redundant here. It is nevertheless better to keep it, in order to
keep the transform simple and not require the extra check.

Note that resourceset_1 and resourceset_2 are not necessarily
disjoint, so that some resources may be both red AND square.

A POWDER document may have an <ol/> element which is an ordered list of
<dr> elements, which receives a first-match semantics. <ol/> elements
are meant to be used to express exceptions to more general rules. So,
for example:

<powder>
   <ol>
     <dr>
      <iriset>.1.</iriset>
      <descriptorset>
        <voc:shape>square</voc:shape>
      </descriptorset>
     </dr>
     <dr>
      <iriset>.2.</iriset>
      <descriptorset>
        <voc:shape>round</voc:shape>
      </descriptorset>
     </dr>
     <dr>
      <iriset>.3.</iriset>
      <descriptorset>
        <voc:shape>triangle</voc:shape>
      </descriptorset>
     </dr>
   </ol>
</powder>

receives the following formal semantics, where belonging to
description_1 automatically precludes belonging to description_2 and
description_3; and belonging to description_2 automatically precludes
belonging to description_3:

<RDF>
   <owl:Class rdf:ID="resourceset_1">
     all resources specified by <iriset>.1.</iriset>
   </owl:Class>

   <owl:Class rdf:ID="resourceset_2">
     all resources specified by <iriset>.2.</iriset>
   </owl:Class>

   <owl:Class rdf:ID="resourceset_3">
     all resources specified by <iriset>.3.</iriset>
   </owl:Class>

   <owl:Class rdf:ID="description_1">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:shape"/>
          <owl:hasValue>square</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
   </owl:Class>

   <owl:Class rdf:ID="description_2">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:shape"/>
          <owl:hasValue>round</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
   </owl:Class>

   <owl:Class rdf:ID="description_3">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="voc:shape"/>
          <owl:hasValue>triangle</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
   </owl:Class>

   <owl:Class rdf:about="#resourceset_1">
     <rdfs:subClassOf rdf:resource="#description_1"/>
   </owl:Class>

   <owl:Class>
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Class rdf:about="#resourceset_2"/>
       <owl:complementOf>
         <owl:Class rdf:about="#resourceset_1"/>
       </owl:complementOf>
     </owl:intersectionOf>
     <rdfs:subClassOf rdf:resource="#description_2"/>
   </owl:Class>

   <owl:Class>
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Class rdf:about="#resourceset_3"/>
       <owl:complementOf>
         <owl:Class rdf:about="#resourceset_2"/>
       </owl:complementOf>
       <owl:complementOf>
         <owl:Class rdf:about="#resourceset_1"/>
       </owl:complementOf>
     </owl:intersectionOf>
     <rdfs:subClassOf rdf:resource="#description_3"/>
   </owl:Class>
</RDF>
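Read procedurally, the first-match semantics of <ol/> amounts to the following Python sketch. The prefix-based IRI sets and the function name first_match are assumptions made for the example; they merely stand in for whatever .1., .2. and .3. specify.

```python
def first_match(iri, ordered_drs):
    """Return the description of the first DR whose IRI set contains the IRI."""
    for in_set, description in ordered_drs:
        if in_set(iri):
            return description  # later DRs are not consulted
    return None  # no DR in the list applies

# Hypothetical IRI sets: the exception comes first, the general rule last.
ol = [
    (lambda iri: iri.startswith("http://example.org/a/b/"), {"voc:shape": "square"}),
    (lambda iri: iri.startswith("http://example.org/a/"), {"voc:shape": "round"}),
    (lambda iri: iri.startswith("http://example.org/"), {"voc:shape": "triangle"}),
]

print(first_match("http://example.org/a/b/c.html", ol))
print(first_match("http://example.org/a/x.html", ol))
print(first_match("http://example.org/index.html", ol))
```

A resource under /a/b/ is in all three sets but is only described as square, which is what the complementOf constructions in the OWL rendering express declaratively.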


IRI Sets
========


The last missing bit of the transformation now is the one that builds
the <owl:Class rdf:ID="resourceset_X"/> descriptions from <iriset/>
elements.

<iriset/> elements contain one or more child elements, each
representing a range of values for IRIs. An IRI is in the <iriset/> if
it is covered by ALL of the elements in <iriset/>. The following
range specifications MUST be supported:

  <includeIRItype/>,<excludeIRItype/>

<includeIRItype/> and <excludeIRItype/> elements have two child
nodes: an <xsl:analyze-string/> element, as defined in the XSLT2
specification [XSLT2] and an <xsd:simpleType/> element, as defined in
the XML Schema specification [1]. An IRI is in the range of
<includeIRItype/> if, after being transformed by
<xsl:analyze-string/>, the result of the transformation is within the
lexical space of the XSD type. An IRI is in the range of
<excludeIRItype/> if, after being transformed by
<xsl:analyze-string/>, the result of the transformation is outside the
lexical space of the XSD type.

The intended use of this mechanism is that <xsl:analyze-string/> is
used to tokenize the IRI into meaningful sub-strings, which can then
be checked against XSD facet restrictions. This allows POWDER to
handle situations where numerical comparisons are required, like port
ranges. For example:

<iriset>
   <includeIRItype>
     <xsl:analyze-string select="."
                         regex = 
"{'^http://([^:/?#@]*)\.example\.org:([0-9]+)'}">
       <xsl:matching-substring>
         <xsl:value-of select="regex-group(2)"/>
       </xsl:matching-substring>
       <xsl:non-matching-substring>
         0
       </xsl:non-matching-substring>
     </xsl:analyze-string>
     <xsd:simpleType>
       <xsd:restriction base="integer">
         <xsd:minInclusive value="80" />
         <xsd:maxInclusive value="100" />
       </xsd:restriction>
     </xsd:simpleType>
   </includeIRItype>
</iriset>

specifies all resources on any subdomain of example.org, fetched from
an explicitly given port in the range 80-100.
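The two steps of this example can be mimicked in Python as a sanity check. This is only a sketch: transform and in_type are made-up names standing in for the <xsl:analyze-string/> and <xsd:simpleType/> children, and a single match anchored at the start of the IRI is assumed instead of the repeated matching that xsl:analyze-string performs.

```python
import re

PATTERN = re.compile(r"^http://([^:/?#@]*)\.example\.org:([0-9]+)")

def transform(iri):
    """Stand-in for the xsl:analyze-string step: the port digits, or "0"."""
    m = PATTERN.match(iri)
    return m.group(2) if m else "0"

def in_type(lexical, lo=80, hi=100):
    """Stand-in for the xsd:simpleType step: an integer within [lo, hi]."""
    return lexical.isdigit() and lo <= int(lexical) <= hi

def in_include_iritype(iri):
    """An IRI is in range if its transformed value is in the type's lexical space."""
    return in_type(transform(iri))

for iri in ("http://www.example.org:80/index.html",
            "http://www.example.org:8080/",
            "http://www.example.org/"):
    print(iri, in_include_iritype(iri))
```

Note that with this regular expression an IRI without an explicit port falls through to "0" and is therefore excluded.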

It might sometimes be easier to concentrate on parts of an IRI and
specify constraints as a series of restrictions, all of which must
match. We shall revisit this point when discussing the wdrurl
extension.

The <iriset/> mechanism allows a DR to express any grouping of
resources whatsoever, no matter how complex:

(A) each include* and exclude* element expresses an atomic
     proposition. For all X, if includeX exists, excludeX also exists
     and vice versa; furthermore includeX and excludeX are mutually
     exclusive. Hence, one can negate all atomic propositions, although
     not complex propositions.

(B) An <iriset/> may contain multiple include* and exclude* tags, and
     all must hold for the iriset to hold. Hence one can express the
     conjunction of any set of atomic propositions and negations of
     atomic propositions.

(C) A DR may contain multiple <iriset/> elements, and if any of them
     holds, then the DR holds. Hence one can express the disjunction of
     conjunctions of sets of atomic propositions and negations of
     atomic propositions.

The three properties above allow the expression of any proposition in
Disjunctive Normal Form (DNF). Since arbitrarily complex propositions
can be brought into DNF, they allow the expression of any proposition
over the atomic ones.
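Points (A) to (C) can be sketched in Python as follows, assuming each include*/exclude* element is modelled as a predicate paired with a flag saying whether it must hold (include) or must not hold (exclude). The predicates and the function in_scope are invented for the illustration.

```python
def in_scope(iri, irisets):
    """Disjunction over IRI sets of the conjunction of their (possibly negated) atoms."""
    return any(
        all(test(iri) == wanted for test, wanted in iriset)
        for iriset in irisets
    )

# Hypothetical atomic propositions about an IRI.
on_example = lambda iri: "//example.org/" in iri
is_image = lambda iri: iri.endswith(".png")
on_mirror = lambda iri: "//mirror.example.net/" in iri

# (on_example AND NOT is_image) OR on_mirror
irisets = [
    [(on_example, True), (is_image, False)],  # include + exclude in one iriset
    [(on_mirror, True)],                      # a second iriset in the same DR
]

print(in_scope("http://example.org/a.html", irisets))
print(in_scope("http://example.org/a.png", irisets))
print(in_scope("http://mirror.example.net/a.png", irisets))
```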


IRI Set Semantics
=================

Providing OWL/RDF semantics for <iriset/> elements is not directly
possible, since RDF does not provide any means for accessing or
manipulating the string representation of an IRI. We extend OWL/RDF
with a built-in hasIRI datatype property as follows:

hasIRI rdf:type owl:DatatypeProperty .
hasIRI rdf:type rdf:Property .
hasIRI rdfs:domain owl:Thing .
hasIRI rdfs:range xsd:string .

and the further stipulation that
  R owl:hasIRI s .
iff the string representation of resource R is s.

Furthermore, we extend the RDF datatype map with a new datatype for
each <includeIRItype/> element in the POWDER/XML document. All these
datatypes d are subsumed by the wdr:iriType datatype, which is
subsumed by xsd:string :

wdr:iriType rdf:type rdfs:Datatype .
wdr:iriType rdfs:subClassOf rdfs:Literal .
wdr:iriType rdfs:subClassOf xsd:string .
d rdf:type rdfs:Datatype .
d rdfs:subClassOf wdr:iriType .

These iriType nodes have:
(a) a wdr:transform property with an xsl:analyze-string value,
(b) a wdr:hasType property with an xsd:simpleType value.

wdr:transform rdfs:domain wdr:iriType .
wdr:hasType rdfs:domain wdr:iriType .

The semantics of wdr:iriType nodes is:

(a) their lexical space is the subset of xsd:string that, after going
     through the transformation pointed at by wdr:transform, will be in
     the lexical space of the XSD type pointed at by wdr:hasType
(b) their lexical-to-value mapping is the same as for xsd:string
(c) their value space is the same as for xsd:string

It is now possible to provide semantics to <iriset/> by constructing
an RDF datatype from the <iriset/> and restricting the values of
hasIRI to the new datatype. So the example above becomes:

<owl:Class>
   <owl:Restriction>
     <owl:onProperty rdf:resource="&owl;hasIRI"/>
     <owl:allValuesFrom>
       <rdfs:Datatype>
         <rdfs:subClassOf rdf:resource="&wdr;iriType"/>
         <wdr:transform>
           <xsl:analyze-string
                   select="."
                   regex = 
"{'^http://([^:/?#@]*)\.example\.org:([0-9]+)'}">
             <xsl:matching-substring>
               <xsl:value-of select="regex-group(2)"/>
             </xsl:matching-substring>
             <xsl:non-matching-substring>0</xsl:non-matching-substring>
           </xsl:analyze-string>
         </wdr:transform>
         <wdr:hasType>
           <xsd:simpleType>
             <xsd:restriction base="integer">
               <xsd:minInclusive value="80" />
               <xsd:maxInclusive value="100" />
             </xsd:restriction>
           </xsd:simpleType>
         </wdr:hasType>
       </rdfs:Datatype>
     </owl:allValuesFrom>
   </owl:Restriction>
</owl:Class>

which describes the set of all abstract resources, the concrete IRI
string of which is such that when transformed as described by
wdr:transform will yield a literal which is in the lexical space of
the value of wdr:hasType.

An <excludeIRItype/> element would translate to:

<owl:Class>
   <owl:complementOf>
     <owl:Restriction>
       <owl:onProperty rdf:resource="owl:hasIRI"/>
       <owl:allValuesFrom>
         <rdfs:Datatype> ... </rdfs:Datatype>
       </owl:allValuesFrom>
     </owl:Restriction>
   </owl:complementOf>
</owl:Class>

to describe the set of all abstract resources, the concrete IRI string
of which is such that when transformed as described by wdr:transform
will yield a literal which is not in the lexical space of the value of
wdr:hasType.


IRISet Extensions
=================

In Sect "IRISet Semantics" above, a vocabulary of 6 tags was specified 
for defining sets of resources through their IRIs.

[[

PA: these got lost in the revision. The 6 referred to are

  <includepattern/>,<excludepattern/>,
  <includeports/>,<excludeports/>,
  <includeCIDRranges/>,<excludeCIDRranges/>
]]

Except for the numerical port and IP restrictions over URLs, the only 
operation supported over generic IRIs is regular expression matching.

Creators of POWDER documents may extend the vocabulary used in
specifying IRI Sets, by defining new <iriset/> elements. All such
extensions to the POWDER vocabulary MUST be defined by means of GRDDL
transformations [GRDDL] to terms of the basic POWDER vocabulary in the
wdr: namespace.

Extensions do not need to, but are well advised to, define pairs
of complementary vocabulary items (includeX and excludeX) for the
reasons explained above.

Developers of POWDER tools MAY directly implement extensions they know
about, but MUST include a mechanism for retrieving and applying the
GRDDL transformations to extensions they do not know about.


The URLSet Extension
====================

POWDER's basic use cases involve information resources available on
the Web, identified by URLs containing host names, directory paths, IP
addresses, port numbers, and so on. POWDER-WG provides the URLSet
extension to IRISet, by defining the following vocabulary items under
the wdrurl namespace:

<wdrurl:includeschemes/>        <wdrurl:excludeschemes/>
<wdrurl:includehosts/>          <wdrurl:excludehosts/>
<wdrurl:includeexactpaths/>     <wdrurl:excludeexactpaths/>
<wdrurl:includepathcontains/>   <wdrurl:excludepathcontains/>
<wdrurl:includepathstartswith/> <wdrurl:excludepathstartswith/>
<wdrurl:includepathendswith/>   <wdrurl:excludepathendswith/>
<wdrurl:includequerycontains/>  <wdrurl:excludequerycontains/>
<wdrurl:includeexactqueries/>   <wdrurl:excludeexactqueries/>
<wdrurl:includepattern/>        <wdrurl:excludepattern/>
<wdrurl:includeports/>          <wdrurl:excludeports/>
<wdrurl:includeCIDRranges/>     <wdrurl:excludeCIDRranges/>

pathcontains and querycontains may appear any number of times within
an IRI set definition, but the rest may appear at most once.

These receive semantics in terms of the POWDER IRISet vocabulary
through the Rabin regular expression [Rabin], which splits URIs into
their component parts:
   (([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?
We shall write rre to mean the string representation of the
Rabin regular expression.
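To make the group numbering concrete, here is a short Python sketch that applies the expression above to an IRI and names the groups used in the rest of this section (2 for the scheme, 4 for the host, 6 for the port; 7 and 9 for path and query). The function name split_iri and the dictionary keys are assumptions made for the example.

```python
import re

# The expression quoted above, unchanged.
RRE = re.compile(
    r"(([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?"
)

def split_iri(iri):
    """Split an IRI into the components captured by the expression."""
    m = RRE.match(iri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "host": m.group(4),
        "port": m.group(6),  # None when the IRI gives no port
        "path": m.group(7),
        "query": m.group(9),
    }

print(split_iri("http://www.example.org:8080/a/b?x=1#frag"))
print(split_iri("ftp://example.net/pub"))
```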

In this manner,

   <wdrurl:includeschemes>http ftp</wdrurl:includeschemes>

means:

<iriset>
   <includeIRItype>
     <xsl:analyze-string select="." regex = "{'rre'}">
       <xsl:matching-substring>
         <xsl:value-of select="regex-group(2)"/>
       </xsl:matching-substring>
       <xsl:non-matching-substring>
         0
       </xsl:non-matching-substring>
     </xsl:analyze-string>
     <xsd:simpleType>
       <xsd:restriction base="string">
         <xsd:enumeration value="http"/>
         <xsd:enumeration value="ftp"/>
       </xsd:restriction>
     </xsd:simpleType>
   </includeIRItype>
</iriset>

wdrurl:includehosts is more complicated, as it specifies the suffix
of the host group of the IRI, and not the whole group.

   <wdrurl:includehosts>example.org example.net</wdrurl:includehosts>

means:

<iriset>
   <includeIRItype>
     <xsl:analyze-string select="." regex = "{'rre'}">
       <xsl:matching-substring>
         <xsl:value-of select="regex-group(4)"/>
       </xsl:matching-substring>
       <xsl:non-matching-substring>
         0
       </xsl:non-matching-substring>
     </xsl:analyze-string>
     <xsd:simpleType>
       <xsd:restriction base="string">
         <xsd:pattern value="(.*\.)?((example\.org)|(example\.net))" />
       </xsd:restriction>
     </xsd:simpleType>
   </includeIRItype>
</iriset>
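The suffix test intended here can be sketched in Python as follows. XSD patterns are implicitly anchored at both ends, which re.fullmatch imitates; the function names are made up, and building the pattern from the space-separated list is an assumption about how the transform would do it.

```python
import re

def hosts_pattern(listed):
    """Build a whole-string pattern from a space-separated list of host names."""
    alternatives = "|".join(re.escape(h) for h in listed.split())
    # Either the listed host itself, or anything ending in "." + listed host.
    return re.compile(r"(.*\.)?(" + alternatives + r")")

def host_included(host, listed):
    """True if host is one of the listed hosts or a subdomain of one."""
    return hosts_pattern(listed).fullmatch(host) is not None

listed = "example.org example.net"
for host in ("example.org", "www.example.net", "badexample.org", "example.com"):
    print(host, host_included(host, listed))
```

Requiring the dot before the listed name is what keeps badexample.org out while letting www.example.net in.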

And so on for the various string parts.

<wdrurl:includepattern>some_reg_exp</wdrurl:includepattern> can be
used as a less verbose way of saying:

   <includeIRItype>
     <xsl:analyze-string select="." regex = "{'some_reg_exp'}">
       <xsl:matching-substring>yes</xsl:matching-substring>
       <xsl:non-matching-substring>no</xsl:non-matching-substring>
     </xsl:analyze-string>
     <xsd:simpleType>
       <xsd:restriction base="string">
         <xsd:enumeration value="yes"/>
       </xsd:restriction>
     </xsd:simpleType>
   </includeIRItype>

It might sometimes be easier to concentrate on parts of an IRI and
specify constraints as a series of restrictions, all of which must match.
For instance, the IRISet:

<iriset>
   <includehosts>example.org</includehosts>
   <includepattern>
     ^[^?]+\?(.*&amp;)?s=football(&amp;|$)
   </includepattern>
   <includepattern>
     ^[^?]+\?(.*&amp;)?c=gr(&amp;|$)
   </includepattern>
   <includepattern>
     ^[^?]+\?(.*&amp;)?l=first(&amp;|$)
   </includepattern>
</iriset>

is a way of requesting three query conjuncts in any order, and is much
shorter and clearer than having to list all possible permutations.
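A quick way to convince oneself of this is the Python sketch below, which requires all three patterns to match and leaves the host restriction aside. Two things are assumptions of the sketch: the helper name matches_all, and the spelling of "ampersand or end of string" as (&|$), since inside a character class $ would only match a literal dollar sign.

```python
import re

# One pattern per required query parameter; each must be found somewhere
# in the query string, followed by "&" or the end of the IRI.
PATTERNS = [
    re.compile(r"^[^?]+\?(.*&)?s=football(&|$)"),
    re.compile(r"^[^?]+\?(.*&)?c=gr(&|$)"),
    re.compile(r"^[^?]+\?(.*&)?l=first(&|$)"),
]

def matches_all(iri):
    """Conjunction of the three include patterns."""
    return all(p.search(iri) for p in PATTERNS)

print(matches_all("http://example.org/news?c=gr&l=first&s=football"))
print(matches_all("http://example.org/news?s=football&l=first&c=gr"))
print(matches_all("http://example.org/news?s=football&c=gr"))
```

Three patterns cover all six orderings of the parameters; a single regular expression would have to enumerate them.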


Port ranges are handled slightly differently, as they impose
numerical restrictions, so that:

<includeports>80 8080-8100</includeports>

translates to (noting that absence of a port in the IRI defaults
to port 80):

   <includeIRItype>
     <xsl:analyze-string select="." regex = "{'rre'}">
       <xsl:matching-substring>
         <xsl:value-of select="regex-group(6)"/>
       </xsl:matching-substring>
       <xsl:non-matching-substring>
         80
       </xsl:non-matching-substring>
     </xsl:analyze-string>
     <xsd:simpleType>
       <xsd:union>
         <xsd:simpleType>
           <xsd:restriction base="integer">
             <xsd:enumeration value="80"/>
           </xsd:restriction>
         </xsd:simpleType>
         <xsd:simpleType>
           <xsd:restriction base="integer">
             <xsd:minInclusive value="8080" />
             <xsd:maxInclusive value="8100" />
           </xsd:restriction>
         </xsd:simpleType>
       </xsd:union>
     </xsd:simpleType>
   </includeIRItype>
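The same check, written out as a Python sketch: the port list is parsed into inclusive ranges (the counterpart of the xsd:union of restrictions) and a missing port is read as 80. The function names parse_ports and port_included are invented for the illustration.

```python
def parse_ports(spec):
    """Turn "80 8080-8100" into a list of inclusive (low, high) ranges."""
    ranges = []
    for token in spec.split():
        low, _, high = token.partition("-")
        ranges.append((int(low), int(high or low)))  # a single port is a range of one
    return ranges

def port_included(port, spec, default=80):
    """port is the string captured from the IRI, or None/"" if absent."""
    value = int(port) if port else default
    return any(low <= value <= high for low, high in parse_ports(spec))

spec = "80 8080-8100"
print(parse_ports(spec))
print(port_included(None, spec), port_included("8090", spec), port_included("443", spec))
```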


CIDR ranges are even trickier, as they require some more
sophisticated calculations.

<includeCIDRranges>aaa.bbb.ccc.ddd/rr</includeCIDRranges>

means:

   <includeIRItype>
     <xsl:analyze-string select="." regex = 
"{'([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})'}">
       <xsl:matching-substring>
         <xsl:value-of select="xsd:integer(regex-group(1)) * 256 * 256 * 256 + 
xsd:integer(regex-group(2)) * 256 * 256 + xsd:integer(regex-group(3)) * 256 + 
xsd:integer(regex-group(4))"/>
       </xsl:matching-substring>
       <xsl:non-matching-substring>
         -1
       </xsl:non-matching-substring>
     </xsl:analyze-string>
     <xsd:simpleType>
       <xsd:restriction base="integer">
         <xsd:minInclusive value="minV" />
         <xsd:maxInclusive value="maxV" />
       </xsd:restriction>
     </xsd:simpleType>
   </includeIRItype>

where minV and maxV are replaced by appropriate numerical values at
the time of the wdrurl -> wdr transform as follows:
(UNTESTED, but you get the general gist: convert the 4-tuple of bytes
to a single integer, so one can do comparisons.)

<xsl:template match="includeCIDRranges">
   <includeIRItype>
     <!-- Emitted into the generated transform: turn a dotted-quad host
          into a single integer (base 256), or -1 if it is not an IP -->
     <axsl:analyze-string select="." regex = 
"{'([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})'}">
       <axsl:matching-substring>
         <axsl:value-of select="((xs:integer(regex-group(1)) * 256
                                + xs:integer(regex-group(2))) * 256
                                + xs:integer(regex-group(3))) * 256
                                + xs:integer(regex-group(4))"/>
       </axsl:matching-substring>
       <axsl:non-matching-substring>
         -1
       </axsl:non-matching-substring>
     </axsl:analyze-string>
     <xsd:simpleType>
       <xsd:restriction base="integer">
         <!-- Evaluated now: parse the CIDR range a.b.c.d[/nn] -->
         <xsl:analyze-string select="." regex = 
"{'([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})(/([0-9]{1,2}))?'}">
           <xsl:matching-substring>
             <xsl:variable name="ip" as="xs:integer"
                select="((xs:integer(regex-group(1)) * 256
                        + xs:integer(regex-group(2))) * 256
                        + xs:integer(regex-group(3))) * 256
                        + xs:integer(regex-group(4))"/>
             <!-- A missing /nn suffix means a single address, i.e. /32 -->
             <xsl:variable name="rr" as="xs:integer"
                select="if (regex-group(6) != '')
                        then xs:integer(regex-group(6)) else 32"/>
             <xsl:call-template name="minIP">
               <xsl:with-param name="ip" select="$ip"/>
               <xsl:with-param name="rr" select="$rr"/>
             </xsl:call-template>
             <xsl:call-template name="maxIP">
               <xsl:with-param name="ip" select="$ip"/>
               <xsl:with-param name="rr" select="$rr"/>
             </xsl:call-template>
           </xsl:matching-substring>
         </xsl:analyze-string>
       </xsd:restriction>
     </xsd:simpleType>
   </includeIRItype>
</xsl:template>

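<xsl:template name="minIP">
   <xsl:param name="ip" as="xs:integer"/>
   <xsl:param name="rr" as="xs:integer"/>

   <!-- Number of addresses in the block: 2^(32 - rr) -->
   <xsl:variable name="size" as="xs:integer">
     <xsl:call-template name="pow2">
       <xsl:with-param name="n" select="32 - $rr"/>
     </xsl:call-template>
   </xsl:variable>
   <!-- Clear the host bits to get the lowest address in the block -->
   <xsd:minInclusive value="{($ip idiv $size) * $size}" />
</xsl:template>

<xsl:template name="maxIP">
   <xsl:param name="ip" as="xs:integer"/>
   <xsl:param name="rr" as="xs:integer"/>

   <xsl:variable name="size" as="xs:integer">
     <xsl:call-template name="pow2">
       <xsl:with-param name="n" select="32 - $rr"/>
     </xsl:call-template>
   </xsl:variable>
   <!-- Lowest address plus block size minus one is the highest address -->
   <xsd:maxInclusive value="{($ip idiv $size) * $size + $size - 1}" />
</xsl:template>

<!-- Helper: 2 to the power n by tail recursion, since XPath 2.0 has no
     integer power function -->
<xsl:template name="pow2">
   <xsl:param name="n" as="xs:integer"/>
   <xsl:param name="acc" as="xs:integer" select="1"/>
   <xsl:choose>
     <xsl:when test="$n le 0">
       <xsl:sequence select="$acc"/>
     </xsl:when>
     <xsl:otherwise>
       <xsl:call-template name="pow2">
         <xsl:with-param name="n" select="$n - 1"/>
         <xsl:with-param name="acc" select="$acc * 2"/>
       </xsl:call-template>
     </xsl:otherwise>
   </xsl:choose>
</xsl:template>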
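
As a cross-check on the arithmetic, here is a minimal Python sketch of the
same idea: pack the 4-tuple of bytes into one integer, then derive the
lowest and highest address of a CIDR block so that membership becomes a
simple integer comparison. The function names are hypothetical and not
part of any POWDER draft, and the sketch assumes that the number after
the slash is the prefix length (a bare address is treated as /32).

```python
import re

# Hypothetical helpers for illustration only; not taken from the POWDER drafts.
DOTTED_QUAD = re.compile(
    r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?:/(\d{1,2}))?$"
)


def ip_to_int(a, b, c, d):
    """Pack four bytes into one integer, most significant byte first."""
    return ((a * 256 + b) * 256 + c) * 256 + d


def host_to_int(host):
    """Return the integer form of a dotted-quad host, or -1 if it is not one."""
    m = DOTTED_QUAD.match(host)
    if m is None or m.group(5) is not None:
        return -1
    octets = [int(g) for g in m.groups()[:4]]
    if any(o > 255 for o in octets):
        return -1
    return ip_to_int(*octets)


def cidr_bounds(cidr):
    """Return (min, max) integer addresses for 'a.b.c.d' or 'a.b.c.d/nn'."""
    m = DOTTED_QUAD.match(cidr)
    if m is None:
        raise ValueError("not an IPv4 CIDR range: %r" % cidr)
    octets = [int(g) for g in m.groups()[:4]]
    # Assumption: a missing /nn suffix denotes a single address (/32).
    prefix = int(m.group(5)) if m.group(5) is not None else 32
    if any(o > 255 for o in octets) or prefix > 32:
        raise ValueError("out-of-range value in %r" % cidr)
    size = 2 ** (32 - prefix)          # number of addresses in the block
    low = (ip_to_int(*octets) // size) * size   # clear the host bits
    return low, low + size - 1


def host_in_range(host, cidr):
    """True if the dotted-quad host falls inside the CIDR block."""
    low, high = cidr_bounds(cidr)
    return low <= host_to_int(host) <= high
```

A host that is not a dotted quad maps to -1, as in the transform above,
and therefore never falls inside any range.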




Multiple Layers of Extensions
=============================

It might sometimes be useful to also build upon already defined
extensions. For example, some content providers serve dynamic content
stored in a database, so that IRIs express queries to the database.
IRIs of this kind have a certain structure, but this structure is
neither obvious nor easily human-interpreted. Furthermore, conventional
grouping mechanisms cannot be used to group resources, as the site
structure does not match any directory hierarchy.

As an example, consider sport.example.com, a sports news site,
where IRIs look like the one shown in Example 3-2-1. The adopted
scheme is systematic so that sport=2&countryID=16 provides a front
page with news about Greek basketball and links to various Greek
basketball leagues, sport=3&countryID=16 a front page about Greek
volleyball, etc. For example:
   http://sport.example.com/matches.asp?sport=1&countryID=16&champID=2

A POWDER document providing metadata about this web site would have to
use regular expression matching with explicit reference to the
numerical values in the country and sport fields of the query. This
process is error-prone, and requires extensive changes if the
underlying database schema is modified or extended.

As an alternative, the site developer may provide a POWDER vocabulary
extension that abstracts away from the database schema to allow
reference to sports and countries. POWDER document authors can then
use the properties in this extension to create POWDER documents that
remain valid even if the site schema is modified, as long as the site
developer updates the relevant transformations.

So a POWDER/XML document might look like this:

<wdrsport:SportWDR
    xmlns:wdrsport="http://www.sports.example.com/resolvable#"
    xmlns:wdrurl="http://www.w3.org/2007/05/powder/resolvable#"
    xmlns:wdr="http://www.w3.org/2007/05/powder#"
    xmlns:voc="http://www.example.org/vocabulary.rdf#">

   <wdr:dr>
     <wdr:iriset>
       <wdrurl:includeschemes>http</wdrurl:includeschemes>
       <wdrurl:includehosts>sport.example.com</wdrurl:includehosts>
       <wdrsport:countries>Greece</wdrsport:countries>
       <wdrsport:sports>Football Basketball</wdrsport:sports>
     </wdr:iriset>
     <wdr:descriptorset>
       <voc:shape>round</voc:shape>
     </wdr:descriptorset>
   </wdr:dr>
</wdrsport:SportWDR>

A POWDER/XML tool specifically built for sport.example.com or other sites
following the same query patterns will immediately know how to handle
this information. Other POWDER tools will apply the GRDDL transform
associated with the wdrsport: namespace to get the following translation:

<wdrurl:POWDER
    xmlns:wdrurl="http://www.w3.org/2007/05/powder/resolvable#"
    xmlns:wdr="http://www.w3.org/2007/05/powder#"
    xmlns:voc="http://www.example.org/vocabulary.rdf#">

   <wdr:dr>
     <wdr:iriset>
       <includeschemes>http</includeschemes>
       <includehosts>sport.example.com</includehosts>
       <includequerycontains>countryID=16</includequerycontains>
       <includequerycontains>sport=1 sport=2</includequerycontains>
     </wdr:iriset>
     <wdr:descriptorset>
       <voc:shape>round</voc:shape>
     </wdr:descriptorset>
   </wdr:dr>

</wdrurl:POWDER>
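
The mapping performed by the wdrsport: transform can be sketched in
Python as follows. This is only an illustration: the lookup tables stand
in for the site's database schema and contain just the values mentioned
in this example, and the function names are hypothetical. The sketch
also assumes that white space separated values within one element are
alternatives, that separate elements must all be satisfied, and that a
value matches a whole name=value pair of the query.

```python
from urllib.parse import urlsplit

# Hypothetical lookup tables; only the IDs named in the example are listed.
COUNTRY_IDS = {"Greece": 16}
SPORT_IDS = {"Football": 1, "Basketball": 2, "Volleyball": 3}


def translate(countries, sports):
    """Turn white space separated names into includequerycontains values."""
    country_terms = " ".join(
        "countryID=%d" % COUNTRY_IDS[name] for name in countries.split()
    )
    sport_terms = " ".join(
        "sport=%d" % SPORT_IDS[name] for name in sports.split()
    )
    return [country_terms, sport_terms]


def query_matches(iri, constraints):
    """Check an IRI's query against a list of includequerycontains values.

    Assumption: every constraint must hold, and within one constraint any
    one of the white space separated alternatives is enough.
    """
    pairs = set(urlsplit(iri).query.split("&"))
    return all(
        any(alternative in pairs for alternative in constraint.split())
        for constraint in constraints
    )
```

With these tables, translate("Greece", "Football Basketball") gives the
two includequerycontains values shown in the translation above, and
updating the tables is all that is needed when the database IDs change.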

A web-oriented POWDER/XML tool will immediately know what to do with
wdrurl: vocabulary items. Other POWDER tools will apply the GRDDL transform
associated with the wdrurl: namespace to get the vanilla POWDER translation.
Finally, an even more generic RDF/OWL tool will apply the transform
associated with the wdr: namespace to get the even more verbose
RDF/OWL translation, as described above.


Non-URL Identifiers
===================

Although POWDER is mostly concerned with resources that are identified
by URLs, there are a number of other use cases; for example, one might
use POWDER to provide metadata about physical, off-line resources
like books or DVDs.

The International Standard Audiovisual Number [ISAN1] is a voluntary
numbering system for the identification of audiovisual works.
Following ISO 15706, the numbers are written as 24 hexadecimal
digits, plus check characters, in the following format [ISAN2]:

       -----root-----    episode        -version-
ISAN   1881-66C7-3420  -   0000    -7-  9F3A-0245  -U

The root of an ISAN number is assigned to a core work with the other
numbers being used for things like episodes, different language
versions, promotional trailers and so on.

Since ISAN numbers are URNs [URN], and hence IRIs of the urn: scheme
[URIS], a vocabulary can readily be defined to allow IRI Sets to be
defined based on ISAN numbers. The terms might be along the lines of:

includeroots: a white space separated list of strings of hexadecimal
digits and hyphens that would be matched against the first three
blocks of 4 digits in the ISAN number.

includeepisodes: a white space separated list of strings of
hexadecimal digits that would be matched against the 4th block of 4
digits in the ISAN number.

includeversions: a white space separated list of strings of
hexadecimal digits and hyphens that would be matched against the 5th
and 6th blocks of 4 digits in the ISAN number.

The set of all audio visual resources that relate to two particular
works might then be so defined:

Custom ISAN pattern:

<wdr:iriset>
   <isan:includeroots>1881-66C7-3420 1881-66C7-3421</isan:includeroots>
</wdr:iriset>

Corresponding vanilla POWDER/XML:


<iriset>
   <includeIRItype>
     <xsl:analyze-string select="."
                         regex = 
"{'^urn:isan:([0-9A-F]{4})-([0-9A-F]{4})-([0-9A-F]{4})-([0-9A-F]{4})-[0-9A-Z]-([0-9A-F]{4})-([0-9A-F]{4})-[0-9A-Z]'}">
       <xsl:matching-substring>
         <!-- Rejoin the three root blocks with hyphens so that the value
              can be compared with the enumerated roots; the check
              characters may be any of 0-9 or A-Z -->
         <xsl:value-of select="concat(regex-group(1), '-',
                               regex-group(2), '-', regex-group(3))"/>
       </xsl:matching-substring>
       <xsl:non-matching-substring>GGGG-GGGG-GGGG</xsl:non-matching-substring>
     </xsl:analyze-string>
     <xsd:simpleType>
       <xsd:union>
         <xsd:simpleType>
           <xsd:restriction base="string">
             <xsd:enumeration value="1881-66C7-3420"/>
           </xsd:restriction>
         </xsd:simpleType>
         <xsd:simpleType>
           <xsd:restriction base="string">
             <xsd:enumeration value="1881-66C7-3421"/>
           </xsd:restriction>
         </xsd:simpleType>
       </xsd:union>
     </xsd:simpleType>
   </includeIRItype>
</iriset>

This example demonstrates the extensibility offered by using XSLT2
transformations: other kinds of constraint, such as a numerical range
for, say, the 3rd block, can just as easily be defined using wdr:
primitives.
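
For readers who prefer to see the matching spelled out procedurally,
here is a minimal Python sketch of the includeroots test and of a
numerical range on the 3rd block. The function names are hypothetical,
the urn:isan: layout is taken from the regular expression above, and the
sketch assumes that check characters may be any of 0-9 or A-Z (the
example number ends in U).

```python
import re

# Layout assumed from the example: three root blocks, an episode block,
# a check character, two version blocks and a final check character.
ISAN_URN = re.compile(
    r"^urn:isan:([0-9A-F]{4})-([0-9A-F]{4})-([0-9A-F]{4})"  # root
    r"-([0-9A-F]{4})-[0-9A-Z]"                              # episode, check
    r"-([0-9A-F]{4})-([0-9A-F]{4})-[0-9A-Z]$"               # version, check
)


def root_of(urn):
    """Return the root 'XXXX-XXXX-XXXX' of an ISAN URN, or None."""
    m = ISAN_URN.match(urn)
    if m is None:
        return None
    return "-".join(m.group(1, 2, 3))


def matches_roots(urn, includeroots):
    """True if the URN's root is one of the white space separated roots."""
    root = root_of(urn)
    return root is not None and root in includeroots.split()


def third_block_in_range(urn, low, high):
    """True if the 3rd root block, read as a hex number, is in [low, high]."""
    m = ISAN_URN.match(urn)
    return m is not None and low <= int(m.group(3), 16) <= high
```

For instance, matches_roots("urn:isan:1881-66C7-3420-0000-7-9F3A-0245-U",
"1881-66C7-3420 1881-66C7-3421") holds, while a string that is not an
ISAN URN has no root and never matches.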




REFERENCES
==========

[1] http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#rf-pattern
[2] http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#Complex_Type_Definitions
[3] http://www.w3.org/TR/owl-semantics/syntax.html#2.1
[4] http://www.w3.org/TR/owl-semantics/mapping.html
[GRDDL] http://www.w3.org/TR/grddl/
[XSLT2] http://www.w3.org/TR/xslt20/
[Rabin] J. Rabin, URI Pattern Matching for Groups of Resources.
         Draft 0.1, 17 June 2006.
         http://www.w3.org/2005/Incubator/wcl/matching.html
[URN] http://www.iana.org/assignments/urn-namespaces
[ISAN1] http://www.isan.org/
[ISAN2] http://www.isan.org/portal/page?_pageid=166,41960&_dad=portal&_schema=PORTAL
Received on Friday, 25 April 2008 14:23:27 UTC