Copyright © 2007 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
The Protocol for Web Description Resources (POWDER) facilitates the publication of descriptions of multiple resources such as all those available from a Web site. This document describes how sets of resources may be defined, either for use in Description Resources or in other contexts. An OWL Class is to be interpreted as the Resource Set with its predicates and objects either defining the characteristics that elements of the set share, or directly listing its elements. Resources that are directly identified or that can be interpreted as being elements of the set can then be used as the subject of further RDF triples.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is a Public Working Draft, designed to aid discussion. The POWDER Use Cases and Requirements document [PUC] details the use cases and requirements that motivated the creation of this document. Changes since earlier versions of this document are recorded in the change log.
This document was developed by the POWDER Working Group. The Working Group expects to advance this Working Draft to Recommendation Status.
Please send comments about this document to public-powderwg@w3.org (with public archive); please include the text "comment" in the subject line.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The Protocol for Web Description Resources (POWDER) facilitates the publication of descriptions of multiple resources such as all those available from a Web site. These descriptions are attributable to a named individual, organization or entity that may or may not be the creator of the described resources. This contrasts with more usual metadata that typically applies to a single resource, such as a specific document's title, which is usually provided by its author.
Description Resources (DRs) are described separately [DR]. This document sets out how
groups (i.e. sets) of resources may be defined, either for use in DRs or in other contexts. Set theory has been
used throughout as it provides a well-defined framework that leads to unambiguous definitions. However, it is used
solely to provide a formal version of what is written in the natural language text. Companion
documents describe the RDF/OWL vocabulary [VOC] and XML data types [WDRD]
that are derived from this and the Description Resources document, setting out each term's domain, range and
constraints. As each term is introduced in this document, it is linked to its description in the vocabulary document.
The POWDER vocabulary namespace is http://www.w3.org/2007/05/powder# for which we use the prefix wdr.
In designing a system to define sets of resources we have drawn on earlier work [Rabin] carried out in the Web Content Label Incubator Activity [WCL-XG], and taken into account the following considerations.
Defining a Resource Set by specifying the characteristics that the resources in the set share is clearly an indirect approach, albeit a very useful one in the real world. In a logical sense, the definition must be interpreted to arrive at the full set. The implicit constraint on the resources in the set is that they exist. Newly created resources that match the set definition will become members of the Resource Set, even though at the time the definition was created, they didn't exist. Despite this, as stated above, Resource Set definitions must be unambiguous so that an application can always determine with certainty whether a specific resource is or is not within the defined set of resources.
More formally, a Resource Set definition $D$ denotes a set of resources $RS = D^{I}$, where $D^{I}$ is the interpretation of $D$, i.e., the set of resources sharing the characteristics denoted by $D$.
We take this further and allow a set definition to be built up in stages.
A Resource Set RS is denoted by a set definition DRS in terms of one or more characteristics that the elements of the set have in common. Each characteristic itself gives rise to a set definition D1, D2, …, Dn, so that the complete set definition DRS comprises D1, D2, …, Dn.
The Resource Set RS is the intersection of the sets denoted by the definitions in DRS.
Formally, $RS = D_{RS}^{I} = D_1^{I} \cap D_2^{I} \cap \dots \cap D_n^{I} = (D_1 \wedge D_2 \wedge \dots \wedge D_n)^{I}$.
For example, suppose that a resource set RS is denoted by the following definitions:
- the resources are available from example.org
- the resources have a URI with a path component beginning with "foo"
As already noted, there is a further definition here that is implicit, namely that the resources exist. Therefore, the complete set definition, DRS, denotes those resources that exist AND that have the characteristics of being available from example.org AND that have a URI with a path component beginning with foo.
We define an instance of an OWL class to take the place of the Resource Set and the properties of that Class are the set definitions D1, D2, …, Dn. The example can therefore be written using the following pseudo triples:
Subject | Predicate | Object |
---|---|---|
RS | rdf:type | Resource Set |
RS | is_available_from | example.org |
RS | has_a_URI_with_a_path_component_beginning_with | foo |
Whether a specific resource R, known as the candidate resource, is a member of Resource Set RS or not, is determined by comparing its characteristics with those denoted by the set definitions used in DRS. It must be an element of the intersection of the sets defined by the interpretation of D1, D2, …, Dn to be an element of RS.
If a set definition is empty, that is, if the Resource Set Class has no properties, then the set is undefined and RS MUST be considered as the Empty Set. Formally:
Let RS be a resource set, and let DRS be the set of resource set definitions denoting the resources in RS: if DRS = ∅, then RS = ∅.
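The intersection semantics, including the rule for an empty definition, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each set definition is modelled as a predicate function over a URI string, and the names `is_member`, `on_example_org` and `path_starts_with_foo` are invented for the example and are not part of POWDER.

```python
from urllib.parse import urlsplit

def on_example_org(uri):
    # Hypothetical predicate standing in for "is available from example.org".
    host = urlsplit(uri).hostname or ""
    return host.endswith("example.org")

def path_starts_with_foo(uri):
    # Hypothetical predicate for "has a URI path component beginning with foo".
    return urlsplit(uri).path.startswith("/foo")

def is_member(uri, definitions):
    # An empty definition denotes the empty set, so nothing is a member.
    # (Python's all() would return True for an empty list, hence the guard.)
    if not definitions:
        return False
    # Otherwise the candidate must satisfy every definition (intersection).
    return all(definition(uri) for definition in definitions)
```

Note the explicit guard for the empty case: without it, a definition with no properties would match every resource rather than none.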
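There are two ways in which a Resource Set may be defined:
- by address, that is, in terms of the IRIs, URIs or IP addresses of its member resources;
- by resource property, that is, in terms of characteristics of the resources themselves.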
A Resource Set may be defined using any combination of these methods. Furthermore, each may be negated so that, for example, it is possible to define a set as "all resources on example.com except those on video.example.com shot in widescreen format." This is shown in Example 4-6.
A Resource Set may be defined in terms of the IRIs, URIs or IP addresses of resources that are its members. Determining whether a candidate resource is or is not a member of the set can therefore be done by comparing its address with the data in the set definition. Importantly, if the set is defined solely in terms of IRIs or URIs, this can be done before deciding whether to fetch the candidate resource or perform a DNS lookup, thus maximizing processing efficiency in many environments.
We define a range of methods to support set definition by address, and provide support for methods defined in other Recommendations.
The syntax of a URI, as defined in RFC3986 [URIS], provides a generic framework for
identification schemes that goes beyond what is demanded by the POWDER use cases [PUC].
We therefore limit our work to IRIs and URIs with the syntax: scheme://host:port/path?query (as shown below). The user info and fragment components are not supported as it is felt that these are not useful in defining Resource Sets and may add a layer of unnecessary complexity. That said, it is noteworthy that Resource Sets may be defined using additional vocabularies as set out in Section 6. That extension method, or the use of the includeRegEx and excludeRegEx properties, means that user info and fragments can be used in Resource Set definitions if required.
    http://www.example.com:1234/example1/example2?query=help
    \  /   \             / \  /\                / \        /
     --     -------------   --  ----------------   --------
     |            |         |          |              |
    scheme      host       port       path          query
The following Regular Expression, elaborated from that offered in RFC 3986 [Rabin], provides a means of splitting URIs of this type into their component parts.
(([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?
This yields the components as shown in Table 1.
Component | RE variable |
---|---|
scheme | $2 |
host | $4 |
port | $6 |
path | $7 |
query | $9 |
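As an illustration of Table 1, the following Python sketch applies the Regular Expression above to the sample URI from the diagram. Python's `re` module is used here purely for demonstration; the helper name `split_uri` and the dictionary keys are assumptions made for the example.

```python
import re

# The Regular Expression given above, unchanged.
URI_RE = re.compile(
    r"(([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?"
)

def split_uri(uri):
    # Every part of the pattern is optional, so match() always succeeds;
    # components that are absent come back as None (or "" for the path).
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "host": m.group(4),
        "port": m.group(6),
        "path": m.group(7),
        "query": m.group(9),
    }

parts = split_uri("http://www.example.com:1234/example1/example2?query=help")
```

The group numbers passed to `m.group()` are exactly the RE variables listed in Table 1.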
For each URI component we define a corresponding RDF property, the value of which is a white space-separated list of strings, any one of which must match the relevant portion of the URI of the candidate resource.
Formally, we have a set definition D = URI component matches(?x, {string1 | string2 | … | stringn}), where ?x is a variable denoting the URI component under consideration, and {string1 | string2 | … | stringn} denotes a set consisting either of string string1, or string2, or … stringn.
Any number of set definitions D1, D2, …, Dn can be declared and, as stated in Section 1.2, the overall Resource Set is the intersection of the sets that can be interpreted from those definitions. However with some exceptions, each particular RDF property can only appear 0 or 1 times and some are mutually exclusive. Greater detail on this is provided as terms are introduced and in Section 4.
Strings are matched according to one of four rules:
- startsWith, meaning that the URI component starts with any of the strings listed in the value of the relevant property;
- endsWith, meaning that the URI component ends with any of the strings listed in the value of the relevant property;
- exact, meaning that there is an exact match between the candidate URI component and at least one of the strings listed in the value of the relevant RDF property;
- contains, meaning that at least one of the strings listed in the value of the relevant RDF property appears somewhere in the URI component.
Recognizing the great diversity of potential uses and set definition requirements, multiple properties are defined relating to the path and query components. Furthermore, for each property there is a 'negative' property, that is, a property whose value is a list of strings that must not be present in the relevant URI component.
RDF Property | URI component | Matching Rule | Negative RDF property |
---|---|---|---|
includeSchemes | scheme | exact | excludeSchemes |
includeHosts | host | endsWith | excludeHosts |
includePorts | port | exact | excludePorts |
includePortRanges † | port | exact | excludePortRanges |
includeExactPaths † | path | exact | excludeExactPaths |
includePathContains ‡ | path | contains | excludePathContains |
includePathStartsWith | path | startsWith | excludePathStartsWith |
includePathEndsWith | path | endsWith | excludePathEndsWith |
includeQueryContains ‡ | query | contains | excludeQueryContains |
includeExactQueries | query | exact | excludeExactQueries |
As a quick example, the set of all resources on example.org, whether fetched using http or https, where the path component of their URIs starts with foo, and where the path does not end with .png is defined thus:
    <wdr:ResourceSet>
      <wdr:includeSchemes>http https</wdr:includeSchemes>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includePathStartsWith>/foo</wdr:includePathStartsWith>
      <wdr:excludePathEndsWith>.png</wdr:excludePathEndsWith>
    </wdr:ResourceSet>
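To make the matching rules concrete, here is a minimal Python sketch of how a processor might test a candidate URI against the definition just shown. It is an illustration only, not a reference implementation: the helper names `matches_any` and `in_quick_example` are hypothetical, the candidate URI is assumed to be already in canonical form, and the four rules are implemented as plain string operations.

```python
import re

URI_RE = re.compile(
    r"(([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?"
)

# The four matching rules, each comparing a URI component with one string.
RULES = {
    "exact": lambda component, s: component == s,
    "startsWith": lambda component, s: component.startswith(s),
    "endsWith": lambda component, s: component.endswith(s),
    "contains": lambda component, s: s in component,
}

def matches_any(component, rule, value):
    # A property value is a white space-separated list; any one may match.
    return any(RULES[rule](component, s) for s in value.split())

def in_quick_example(uri):
    m = URI_RE.match(uri)
    scheme, host, path = m.group(2) or "", m.group(4) or "", m.group(7) or ""
    # Each property is its own set definition; the result is the intersection.
    return (
        matches_any(scheme, "exact", "http https")
        and matches_any(host, "endsWith", "example.org")
        and matches_any(path, "startsWith", "/foo")
        and not matches_any(path, "endsWith", ".png")
    )
```

The negative property is expressed simply as the negation of the corresponding positive test.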
The semantics and constraints of each of the terms in Table 2 are further defined in the POWDER Vocabulary document [VOC]. Precise details of how values for each term are combined are discussed in Section 4 below. However, it is worth noting the points made in the following sub-sections.
Ranges of ports are defined as x-y, where x < y, that is, the lower and upper values in the range are separated by a hyphen. Multiple ranges can, of course, be listed using white space as the separator. Specific ports can be included or excluded using the includePorts and excludePorts properties so that the set of all resources on example.org via ports 3125 to 5236 excluding ports 4345 and 5000 can be expressed as in Example 2-2.
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includePortRanges>3125-5236</wdr:includePortRanges>
      <wdr:excludePorts>4345 5000</wdr:excludePorts>
    </wdr:ResourceSet>
† includePorts and includePortRanges are mutually exclusive, that is, a Resource Set definition may include 0 or 1 of these RDF properties but not both. This is because, as has been noted, a candidate resource must share all of the characteristics defined in the Resource Set to be an element of it. Multiple definitions of port numbers would therefore require the URI of a candidate resource to have multiple ports (which is impossible).
‡ includePathContains and includeQueryContains may appear any number of times within a Resource Set definition so that it is easy to create one in which multiple strings must be present in paths and/or queries. This is in contrast to all other terms in Table 2 which can only occur 0 or 1 times since the URI of a candidate resource can only have one scheme, one host etc.
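The port test of Example 2-2 can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and it assumes that both ends of a range x-y are included in the range, which the text above does not state explicitly.

```python
def parse_ranges(value):
    # "3125-5236 8000-8100" -> [(3125, 5236), (8000, 8100)]
    ranges = []
    for item in value.split():
        low, high = (int(n) for n in item.split("-"))
        ranges.append((low, high))
    return ranges

def port_in_set(port, include_port_ranges, exclude_ports):
    # Assumption: range bounds are inclusive.
    included = any(
        low <= port <= high for low, high in parse_ranges(include_port_ranges)
    )
    excluded = port in {int(p) for p in exclude_ports.split()}
    # Both definitions must hold: in an included range AND not excluded.
    return included and not excluded
```

With the values from Example 2-2, `port_in_set(4000, "3125-5236", "4345 5000")` is true, while ports 4345 and 5000 are rejected by the exclusion list.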
Query strings typically contain a series of name-value pairs separated by ampersands thus:
?name1=value1&name2=value2
These are usually acted on by the server to generate content in real time and the order of the name-value pairs is unimportant. For practical purposes ?name1=value1&name2=value2 is equivalent to ?name2=value2&name1=value1. Therefore, if the candidate resource's URI includes a query string, and if the Resource Set definition refers to the query string then, for includeQueryContains, excludeQueryContains, includeExactQueries and excludeExactQueries, a POWDER processor must split the string into its constituent pairs at the ampersand character.*
* If a server is known to use a different delimiter then a different RDF property must be defined, see Section 6.
N.B. If using the RDF properties relating to the query string of a URI then the real-time generation of content should be taken into account. It may be difficult, if not impossible, to predict with certainty what the content of the resource will be and therefore the Resource Set may not be fully defined. It follows that query string-based RDF properties should be used with caution.
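The order-independent treatment of query strings can be sketched in Python as below. The comparison semantics are an assumption made for illustration: the text requires splitting at the ampersand but does not spell out how the resulting pairs are compared, so this sketch treats an exact match as "the same set of pairs" and a contains match as "every pair in the supplied string is present in the candidate". The function names are hypothetical.

```python
def query_pairs(query):
    # Split at '&' into name-value pairs; order (and repetition) is ignored.
    return {pair for pair in query.lstrip("?").split("&") if pair}

def exact_query_match(candidate_query, supplied):
    # Assumed reading of includeExactQueries: same pairs, in any order.
    return query_pairs(candidate_query) == query_pairs(supplied)

def query_contains(candidate_query, supplied):
    # Assumed reading of includeQueryContains: all supplied pairs present.
    return query_pairs(supplied) <= query_pairs(candidate_query)
```

Under this reading, `?name1=value1&name2=value2` and `?name2=value2&name1=value1` are treated as the same query.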
Before any IRI or URI matching can take place the following canonicalization steps should be applied to the candidate resource's IRI or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS] and URISpace [URISpace].
- If no scheme is given, http is assumed.
- If the path is empty, '/' is appended.
- Trailing '.' characters in the host are removed, i.e. http://www.example.com. becomes http://www.example.com
The following table gives some examples.
Input IRI/URI | Canonical form |
---|---|
www.example.com | http://www.example.com/ |
http://www.example.com | http://www.example.com/ |
HTTPS://WWW.EXAMPLE.COM/FOO | https://www.example.com/FOO |
http://www.example.com./foo | http://www.example.com/foo |
http://www.example.com:80/foo | http://www.example.com/foo |
Input IRI/URI | Canonical form |
---|---|
http://www.example.com/staff/Fran%c3%a7ois | http://www.example.com/staff/François |
http://www.example.com/my%20doc.doc | http://www.example.com/my doc.doc |
http://www.example.com/foo/his%2Fhers | http://www.example.com/foo/his%2Fhers |
In the last example the %2F is a literal slash, not a path separator, and so is left as %2F.
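The behaviour illustrated by the two tables above can be approximated with the following Python sketch. It is an illustration under stated assumptions rather than a normative algorithm: the function names are hypothetical, default ports are known only for http and https, percent-decoding is applied to the path only, and the set of characters left percent-encoded is taken from the reserved characters of RFC 3986 plus the percent sign itself.

```python
import re

URI_RE = re.compile(
    r"(([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?"
)
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")
DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumption: only these two
RESERVED = set(b":/?#[]@!$&'()*+,;=%")          # kept percent-encoded

def _decode_run(match):
    # Decode a run of %XX octets as UTF-8, leaving reserved characters encoded.
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    out, pending = [], bytearray()

    def flush():
        if pending:
            try:
                out.append(pending.decode("utf-8"))
            except UnicodeDecodeError:
                out.append("".join("%%%02X" % b for b in pending))
            pending.clear()

    for byte in raw:
        if byte in RESERVED:
            flush()
            out.append("%%%02X" % byte)
        else:
            pending.append(byte)
    flush()
    return "".join(out)

def canonicalize(uri):
    if "://" not in uri:
        uri = "http://" + uri                     # default scheme
    m = URI_RE.match(uri)
    scheme = m.group(2).lower()                   # lower-case scheme
    host = m.group(4).lower().rstrip(".")         # lower-case host, no trailing '.'
    port = m.group(6)
    path = PCT_RUN.sub(_decode_run, m.group(7) or "/")  # empty path becomes '/'
    query = m.group(8) or ""
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = host + ":" + port                  # keep only non-default ports
    return scheme + "://" + host + path + query
```

For instance, `canonicalize("HTTPS://WWW.EXAMPLE.COM/FOO")` yields `https://www.example.com/FOO`: the scheme and host are lower-cased while the path keeps its case.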
To complement the URI/IRI canonicalization steps described in the previous section, related processing steps must also be carried out on the strings supplied as set defining data, that is, the values for the RDF properties listed in Table 2.
Bear in mind that if the data is serialized in XML, URI strings specified in the resource set definition will be escaped according to the XML syntax using entity references for specific characters (escaping < with &lt; and & with &amp; is mandatory, others may also be used). Moreover, since Resource Set definition properties take a white space-separated list of URI strings as their value, whenever a URI string contains an unescaped white space (i.e., a white space not encoded as %20), it will be substituted by %20.
The following steps should therefore be applied to each item in the list separately.
- XML entity references are resolved, so that &amp; becomes &, etc.
- If the string is a value for the scheme or host, it is normalized to lower case.
- If the string is a value for the host, trailing '.' characters are removed.
- Values of includePathStartsWith, excludePathStartsWith, includeExactPaths or excludeExactPaths must begin with the '/' character which is pre-pended if absent.
If the set definition includes values related to the port then matching of the data against the candidate resource's URI/IRI must be carried out as follows:
- If the set definition includes includePorts or includePortRanges then, when matching, if the default port for the candidate resource's URI/IRI is present in the list of supplied values (or the specified ranges), but the candidate resource's URI/IRI does not specify the port, the candidate resource IS an element of the set IF all other conditions are met.
- If the set definition includes excludePorts or excludePortRanges then, when matching, if the default port for the candidate resource's URI/IRI is present in the list of supplied values (or the specified ranges), but the candidate resource's URI/IRI does not specify the port, the candidate resource is NOT an element of the Resource Set.
includeUriPattern and excludeUriPattern Properties
Enabling Read Access for Web Resources [WAF] defines a method for encoding the domains and sub-domains from which access to resources on a given Web site should be granted or denied. The includeUriPattern and excludeUriPattern properties support this syntax directly.
Domains and sub-domains may be substituted by a wildcard character (*) according to the following EBNF:
    access-item    ::= (scheme "://")? domain-pattern (":" port)? | "*"
    domain-pattern ::= domain | "*." domain
scheme and port are used as defined in RFC 3986.
domain is an internationalized domain name as defined in RFC 3490.
It follows that:
<wdr:includeHosts>example.com</wdr:includeHosts>
and
<wdr:includeUriPattern>example.com</wdr:includeUriPattern>
are equivalent. However, *.example.com, meaning resources on sub-domains of example.com but not on example.com itself, is not a valid value for includeHosts.
Note that paths and query strings MUST NOT be included in the pattern. If these are required in a Resource Set definition, the relevant properties from Table 2 can be used.
includeRegEx and excludeRegEx Properties
The RDF properties discussed above all take white space-separated lists of strings as their values. It is believed that these properties will be easy to use and cover the overwhelming majority of cases. However, the use of strings with fixed matching rules clearly presents a restriction on flexibility. To support fully flexible set definition by URI, the includeRegEx and excludeRegEx properties take a Regular Expression (RE) and should be applied to the candidate resource's complete URI (after following the canonicalization steps above).
The RE syntax used is that defined by XML Schema as modified by XQuery 1.0 and XPath 2.0 Functions and Operators [XQXP].
N.B. The value of the includeRegEx and excludeRegEx properties MUST be a single Regular Expression, not a white space-separated list.
As an example, the set of all the resources hosted either by example.org or example.net, where the path component of their URIs starts either with foo or bar, can be defined thus:
    <wdr:ResourceSet>
      <wdr:includeRegEx>^(([^:/?#]+):)?(//[^:/?#]+\.)*example.(org|net)/(foo|bar)</wdr:includeRegEx>
    </wdr:ResourceSet>
It is important to note that Example 2-3 does not take account of the need to escape certain characters.
The following characters are used as meta characters in Regular Expressions and MUST therefore be escaped if they are to be matched literally in an RE pattern given as the value of the includeRegEx property:
. \ ? * + { } ( ) [ ]
In addition, the < (less than) character MUST always be escaped since, if the set definition is given in RDF/XML, it could be mistaken for the beginning of the closing </wdr:includeRegEx> tag.
As a safeguard against unintended consequences, other characters that always or typically have special meaning within URI strings and/or XML SHOULD also be escaped, namely:
! " # % & ' , - / : ; = > @ [ ] _ ` ~
As a result, Example 2-3 should properly be written as shown in Example 2-4 below.
    <wdr:ResourceSet>
      <wdr:includeRegEx>^(([^\:\/\?\#]+)\:)?(\/\/[^\:\/\?\#]+\.)*example\.(org|net)\/(foo|bar)</wdr:includeRegEx>
    </wdr:ResourceSet>
includeRegEx
Example 2-4 uses a modified version of the RE given in Section 2.1, substituting individual portions with specific strings. This is the safest method but is not, perhaps, the most natural way to proceed. If a less rigorous approach is taken it is easy to make mistakes when specifying REs, and incorrect REs in set definitions will have one of two possible (and obvious) consequences: resources are included in the set unintentionally, or resources are excluded from it unintentionally.
Example 2-5 shows how this can happen.
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includeRegEx>https</wdr:includeRegEx>
    </wdr:ResourceSet>
The intention in the RE given in Example 2-5 is probably to say "all resources on example.org with a URI beginning with https." However, as the RE is not anchored at either end, what this actually means is "all resources on example.org where the URI includes https". Thus this Resource Set includes both:
- https://www.example.org/page.html
- http://www.example.org/why_we_use_https.html
Adding in anchors at the beginning and end of the RE can have equally undesirable consequences.
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includeRegEx>^https$</wdr:includeRegEx>
    </wdr:ResourceSet>
In Example 2-6, the intention is, again probably, to define the set of "all resources on example.org fetched using https only". However, adding both the ^ and $ anchors at the beginning and end of the RE means that the whole URI must be https from start to finish, which can never be true, so this Resource Set is equivalent to the empty set.
Example 2-7 shows one possible way to encode the intended set definition.
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includeRegEx>^https</wdr:includeRegEx>
    </wdr:ResourceSet>
Whilst Example 2-7 'works', the potential dangers of using REs mean that it is generally better to use component strings where possible. Example 2-7 is therefore better written as shown in Example 2-8 below.
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includeSchemes>https</wdr:includeSchemes>
    </wdr:ResourceSet>
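The effect of anchoring in Examples 2-5 to 2-7 can be checked with a short Python sketch. Python's `re` module is not the XML Schema/XPath RE dialect named above, but for patterns this simple the two are assumed to behave alike; `re.search` is used because it looks for a match anywhere in the string unless the pattern is anchored. The helper name `matching` is hypothetical.

```python
import re

uris = [
    "https://www.example.org/page.html",
    "http://www.example.org/why_we_use_https.html",
]

def matching(pattern):
    # Return the candidate URIs that the RE matches somewhere in the string.
    return [uri for uri in uris if re.search(pattern, uri)]

unanchored = matching(r"https")       # Example 2-5: matches both URIs
both_anchors = matching(r"^https$")   # Example 2-6: matches nothing
start_anchor = matching(r"^https")    # Example 2-7: only the https URI
```

Only the start-anchored pattern selects the intended set, which is the same set that includeSchemes expresses without any RE at all.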
A set of resources can be defined in terms of the IP address(es) from which the resources are served. To support this we define two RDF properties: includeIPs, which takes a white space-separated list of single IP addresses, and includeIpRanges, which takes a white space-separated list of CIDR blocks [CIDR]. Negative versions of these RDF properties are also defined: excludeIPs and excludeIpRanges respectively.
As with includePorts, and for similar reasons, includeIPs and includeIpRanges are mutually exclusive, that is, a Resource Set may include one or other, but not both of these RDF properties.
The includeIPs RDF property is simple enough: Example 2-9 defines the Resource Set as all resources available from IP address 123.123.123.123.
    <wdr:ResourceSet>
      <wdr:includeIPs>123.123.123.123</wdr:includeIPs>
    </wdr:ResourceSet>
The includeIpRanges RDF property allows the definition of a resource set based on a range of IP addresses, specified in a CIDR block. A CIDR block has the form <IP address>/x, where the CIDR prefix x is a number ranging from 1 to 32, denoting the leftmost x bits which a set of IP addresses shares. For instance, the CIDR block 123.234.245.254/8 denotes the range of IP addresses sharing the leftmost 8 bits, i.e., starting with 123.
As an example, suppose that a Resource Set definition should denote all the resources hosted by the machines with IP addresses 123.234.245.254 and 123.234.245.255. This can be expressed by the following Resource Set definition:
    <wdr:ResourceSet>
      <wdr:includeIpRanges>123.234.245.254/31</wdr:includeIpRanges>
    </wdr:ResourceSet>
includeIpRanges Property
In order to use CIDR blocks correctly, it must be taken into account that a CIDR prefix refers to the binary representation of an IP address. For instance, the binary representation of IP address 123.234.245.254 corresponds to
    01111011 11101010 11110101 11111110
A CIDR block 123.234.245.254/31 denotes a range of IP addresses
    01111011 11101010 11110101 1111111b
i.e., the range of IP addresses sharing the leftmost 31 bits with b either 1 or 0 (formally b ∈ {0,1}). Consequently, the CIDR block 123.234.245.254/31 denotes the following IP addresses:
    01111011 11101010 11110101 11111110 = 123.234.245.254
    01111011 11101010 11110101 11111111 = 123.234.245.255
This also means that the CIDR block 123.234.245.255/31 is equivalent to 123.234.245.254/31.
It is important to note that the number N of IP addresses denoted by a CIDR block corresponds to $2^{32-x}$. Therefore, if x = 32, $N = 2^{0} = 1$, if x = 31, $N = 2^{1} = 2$, etc. Therefore, it is possible to denote a range of IP addresses using wdr:includeIpRanges only when the number N of IP addresses is a power of 2. Otherwise, it is necessary to provide a white space separated list of CIDR blocks or, alternatively, individual IP addresses. For instance, the resources hosted by the machines with IP addresses 123.234.245.253, 123.234.245.254, and 123.234.245.255 can be expressed as shown in Example 2-11.
    <wdr:ResourceSet>
      <wdr:includeIpRanges>123.234.245.253/32 123.234.245.254/31</wdr:includeIpRanges>
    </wdr:ResourceSet>
OR
    <wdr:ResourceSet>
      <wdr:includeIPs>123.234.245.253 123.234.245.254 123.234.245.255</wdr:includeIPs>
    </wdr:ResourceSet>
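The CIDR arithmetic above can be verified with Python's standard `ipaddress` module. This is a sketch for illustration only; the helper name `in_ranges` is hypothetical, and `strict=False` is passed so that a block written with host bits set (such as 123.234.245.255/31) is accepted and normalized rather than rejected.

```python
import ipaddress

# The two spellings of the /31 block normalize to the same network.
block = ipaddress.ip_network("123.234.245.254/31", strict=False)
same_block = ipaddress.ip_network("123.234.245.255/31", strict=False)

# N = 2 ** (32 - x) addresses for a prefix of length x.
count = block.num_addresses
members = [str(address) for address in block]

def in_ranges(ip, include_ip_ranges):
    # includeIpRanges takes a white space-separated list of CIDR blocks;
    # the candidate's IP address must fall inside at least one of them.
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in include_ip_ranges.split()
    )
```

Using the first form of Example 2-11, `in_ranges("123.234.245.253", "123.234.245.253/32 123.234.245.254/31")` is true, while 123.234.245.252 falls outside both blocks.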
Incidentally, as already noted, includeIPs and includeIpRanges are mutually exclusive. It is perhaps tempting to create a Resource Set definition like that shown in Example 2-12, however, this would require a candidate resource to be available from both 123.234.245.253 AND either 123.234.245.254 OR 123.234.245.255 which is impossible so that Example 2-12 is tantamount to the empty set.
    <wdr:ResourceSet>
      <wdr:includeIpRanges>123.234.245.254/31</wdr:includeIpRanges>
      <wdr:includeIPs>123.234.245.253</wdr:includeIPs>
    </wdr:ResourceSet>
Defining Resource Sets by IP address puts a burden on the processor since it will often have to perform a DNS lookup to determine whether a candidate resource is, or is not, a member of the Resource Set. Furthermore, it is particularly easy to include resources in the set by accident using such a broad-sweep approach. If a Web site is hosted on a shared server, for example, it is very likely that the set will include resources by mistake.
Defining a Resource Set by IP address would, however, be appropriate where a content provider operates a large network of servers, or where particular types of content to be described are hosted on servers that can easily be identified by their IP address.
includeResources and excludeResources properties
It is useful to be able to include or exclude resources from sets by simple listing. The includeResources and excludeResources RDF properties support this, both of which take white space separated lists of IRIs and/or URIs. To give a simple example, the set of all resources on example.org except its stylesheet and JavaScript library can be encoded as shown in Example 2-13.
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:excludeResources>http://www.example.org/stylesheet.css http://www.example.org/jslib.js</wdr:excludeResources>
    </wdr:ResourceSet>
As emphasized throughout this document, each RDF property and its value creates a set definition of its own and the full Resource Set is the intersection of those sets. Thus an alternative way of looking at Example 2-13 is to say that a candidate resource is a member of the Resource Set if it is on example.org AND does not have the URI http://www.example.org/stylesheet.css AND does not have the URI http://www.example.org/jslib.js.
It is tempting to use includeResources in a similar fashion as shown in Example 2-14.
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includeResources>http://www.w3.org/Icons/valid-xhtml10</wdr:includeResources>
    </wdr:ResourceSet>
The intention in this example is to include the W3C's valid XHTML 1.0 icon in the set of resources on example.org. However, a resource would have to be both on the example.org host AND have a URI that matched http://www.w3.org/Icons/valid-xhtml10 to be an element of the set. Since this is impossible, such a definition is, again, tantamount to the empty set.
The solution is to use the OWL set operator owl:unionOf as shown in Example 2-15.
    <wdr:ResourceSet>
      <owl:unionOf rdf:parseType="Collection">
        <wdr:ResourceSet>
          <wdr:includeHosts>example.org</wdr:includeHosts>
        </wdr:ResourceSet>
        <wdr:ResourceSet>
          <wdr:includeResources>http://www.w3.org/Icons/valid-xhtml10</wdr:includeResources>
        </wdr:ResourceSet>
      </owl:unionOf>
    </wdr:ResourceSet>
Here we have two discrete Resource Sets, each of which is made up of, in this case, a single RDF property and its value; and the overall Resource Set comprises the union of those two sets. The use of the OWL set operators is discussed in detail in Section 4.
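The difference between Examples 2-14 and 2-15 comes down to AND versus OR, as the following Python sketch shows. The predicate names are hypothetical and the host test is a simplified stand-in for includeHosts; the sketch is meant only to illustrate why the first definition is empty and the second is not.

```python
from urllib.parse import urlsplit

LISTED = {"http://www.w3.org/Icons/valid-xhtml10"}

def on_example_org(uri):
    # Simplified stand-in for includeHosts with the value example.org.
    host = urlsplit(uri).hostname or ""
    return host.endswith("example.org")

def is_listed(uri):
    # Stand-in for includeResources: the URI appears in the supplied list.
    return uri in LISTED

def in_example_2_14(uri):
    # Properties of one Resource Set are intersected (AND): always False here.
    return on_example_org(uri) and is_listed(uri)

def in_example_2_15(uri):
    # owl:unionOf combines the two Resource Sets with OR.
    return on_example_org(uri) or is_listed(uri)
```

No URI can satisfy both predicates at once, so `in_example_2_14` never returns true, whereas `in_example_2_15` accepts both the icon and any resource on example.org.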
includeRedirection property
If a Resource Set is defined in terms of the URIs of the resources that are elements of the set then resolving the URIs may lead to redirection through 3xx HTTP status codes [HTTPCODE]. By default, such redirection MUST lead to the 'new' resource itself being compared with the Resource Set definition. That is, if the resource identified by URI1 is an element of the Resource Set but, when resolving it, the user agent is redirected via a 3xx HTTP response code to URI2, then the resource identified by URI2 MUST itself be compared with the Resource Set definition to determine whether or not it is an element of the set.
Recognizing that there may be circumstances where this default behavior may cause unnecessary latency, redirected resources MAY be included by use of the includeRedirection property. The range of this RDF property allows for any of HttpAnyRedirect, HttpPermRedirect or HttpTempRedirect to be given as its value. These classes are all based on those defined in the HTTP in RDF vocabulary [HTTPRDF]. See the POWDER Vocabulary [VOC] for details.
As their names suggest, the HTTP redirection classes let a Resource Set definition accept any redirection, only permanent redirection (i.e. HTTP response code 301), or only the temporary redirection HTTP response codes (302, 303 and 307).
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <wdr:includeRedirection rdf:resource="http://www.w3.org/2007/05/powder#HttpPermRedirect" />
    </wdr:ResourceSet>
Example 2-16 encodes that if, when resolving any URI on the example.org domain (or its sub-domains), the user agent is redirected through a 301 (permanent) HTTP response code then the target resources are elements of the Resource Set, even if those resources are on a different domain. Resources resolved following other redirects would not be included unless they were also on the example.org domain.
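The relationship between the redirection classes and HTTP status codes can be written down as a small lookup, sketched here in Python. The mapping for HttpAnyRedirect is an assumption: it is taken to be the union of the permanent and temporary codes named above, although the vocabulary may define it more broadly. The function name `target_included` is hypothetical.

```python
REDIRECT_CODES = {
    "HttpPermRedirect": {301},
    "HttpTempRedirect": {302, 303, 307},
}
# Assumption: "any" redirection means any of the codes listed above.
REDIRECT_CODES["HttpAnyRedirect"] = (
    REDIRECT_CODES["HttpPermRedirect"] | REDIRECT_CODES["HttpTempRedirect"]
)

def target_included(include_redirection, status_code):
    # True if a redirect with this status code brings the target resource
    # into the Resource Set without a further comparison against it.
    return status_code in REDIRECT_CODES.get(include_redirection, set())
```

For the definition in Example 2-16, `target_included("HttpPermRedirect", 301)` is true, while a 302 redirect falls back to the default behaviour of comparing the target with the Resource Set definition.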
The definition of a Resource Set by reference to the addresses of its elements is not always practical or relevant. For example, numeric URIs generated by a content management system may not reveal any information about a given resource and there are other situations where knowledge of one property itself allows the inference of further properties. For example, if the title of a document includes the word 'draft' it may be possible to infer that different terms of use apply than if the word is absent.
We therefore provide two RDF properties, includeConditional and excludeConditional, the object of which is the base RDFS Class 'Resource' that represents the resources that are, or are not, elements of the set respectively. Any characteristic of those resources can be defined in the usual way to confer membership of the Resource Set or exclude resources from it. For instance, Example 3-1 defines the set of resources on the example.org domain whose language is French (the prefix ex denoting any vocabulary). Although, in common with most other set definition terms, includeConditional and excludeConditional may each only occur once in a Resource Set definition, any number of predicates from any vocabularies may be defined as RDF properties of the RDFS Class to which they link.
<wdr:ResourceSet>
  <wdr:includeHosts>example.org</wdr:includeHosts>
  <wdr:includeConditional rdf:parseType="Resource">
    <ex:lang>fr</ex:lang>
  </wdr:includeConditional>
</wdr:ResourceSet>
Importantly, it is for the processor to define a suitable method for determining whether a candidate resource has the stated property and therefore whether or not it is an element of the Resource Set. It follows that using includeConditional and excludeConditional breaks the design goals since a generic POWDER processor will not be able to determine with certainty whether a given candidate resource is or is not an element of the Resource Set. Referring to Example 3-1, two different outputs are possible:
However, a Resource Set definition may offer a hint as to the best method to take in making such a determination. The RDF property lookUpService links to a description of a service through which resource properties MAY be discovered (if the processor has another method available, it is acceptable to use it). Such a description may be a natural language document, a WSDL file or be in any other format. Such a description would allow a POWDER processor to be extended to give a definitive answer as to whether a candidate resource was or was not an element of the Resource Set. The following example explores this further.
The trustmark.example organization wants to define a Resource Set as everything on the Web sites to which it has granted its seal of approval. It can then publish a Description Resource [DR] that provides a semantically-rich, machine-processable version of that seal, effectively automating its 'click to verify' system. Since the organization already publishes the list of approved Web sites, both in an HTML document and as an ATOM feed, determining whether a candidate resource is or is not an element of the Resource Set is straightforward, albeit outside the POWDER processing model.
In Example 3-2, the lookUpService property points to a natural language document (at http://trustmark.example/doc.htm) that gives the URI of the list of approved Web sites, an ATOM feed of those same sites, and additional details of how the data is presented. This allows a developer to extend a POWDER processor to identify with certainty whether a candidate resource is or is not in the set of resources that carry the trustmark by referring to the data in whichever format he/she finds easiest.
<wdr:ResourceSet>
  <wdr:includeConditional>
    <rdfs:Resource>
      <ex:lang>fr</ex:lang>
    </rdfs:Resource>
  </wdr:includeConditional>
  <wdr:lookUpService rdf:resource="http://trustmark.example/doc.htm" />
</wdr:ResourceSet>
The model here is similar to that used for HTML Profile [HTMLPROF] — the look up service description will not be parsed every time the Resource Set is queried. Rather, the expectation is that the descriptive document should remain stable and its contents become well established. Citing the document should therefore be sufficient for established look up services to be identified and used.
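The effect of a look up method on a conditional definition such as Example 3-1 can be sketched as follows. The sketch is illustrative only: the `lookup` callable, the `KNOWN` table and the three-valued result (True, False, or None for "cannot be determined") are assumptions made here, not part of POWDER.

```python
from urllib.parse import urlsplit


def host_in(host, include_hosts):
    """True if host equals, or is a sub-domain of, any listed host."""
    host = host.lower()
    return any(host == h or host.endswith("." + h) for h in include_hosts)


def is_member(uri, include_hosts, include_conditional, lookup=None):
    """Return True, False, or None when membership cannot be determined."""
    host = urlsplit(uri).hostname or ""
    if not host_in(host, include_hosts):
        return False          # fails the address-based part outright
    if lookup is None:
        return None           # generic processor: no way to check properties
    properties = lookup(uri)  # hypothetical client for a look up service
    if properties is None:
        return None           # the service knows nothing about this URI
    return all(properties.get(name) == value
               for name, value in include_conditional.items())


# A stand-in for a real look up service: a fixed table of known properties.
KNOWN = {
    "http://example.org/accueil": {"ex:lang": "fr"},
    "http://example.org/home": {"ex:lang": "en"},
}
```

Calling `is_member("http://example.org/accueil", ["example.org"], {"ex:lang": "fr"})` without a look up method yields None, whereas passing `KNOWN.get` as `lookup` yields True: the same definition becomes decidable once the processor is extended.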
As set out briefly in Section 2.1 and referred to throughout this document, Resource Sets are defined using RDF properties whose values are white space separated lists of possible values. The exceptions to this are the includeRegEx and excludeRegEx properties which take a single Regular Expression. Taken from the point of view of determining whether a candidate resource is or is not an element of the Resource Set, the values of the include RDF properties are combined with logical OR. In Example 4-1, the candidate resource is an element of the Resource Set if it is on example.org OR example.com.
<wdr:ResourceSet>
  <wdr:includeHosts>example.org example.com</wdr:includeHosts>
</wdr:ResourceSet>
This is the only way to encode the set of resources on these two hosts (excepting the possibility of doing so using a Regular Expression). A validation error SHOULD be raised if any set definition RDF property, other than includePathContains or includeQueryContains, appears more than once in a given Resource Set. Example 4-2 is therefore invalid.
<wdr:ResourceSet>
  <wdr:includeHosts>example.org</wdr:includeHosts>
  <wdr:includeHosts>example.com</wdr:includeHosts>
</wdr:ResourceSet>
A candidate resource MUST satisfy ALL definitions in a given Resource Set. Therefore the set of all resources on example.org or example.com that have a path starting with foo or bar is defined as shown in Example 4-3.
<wdr:ResourceSet>
  <wdr:includeHosts>example.org example.com</wdr:includeHosts>
  <wdr:includePathStartsWith>/foo /bar</wdr:includePathStartsWith>
</wdr:ResourceSet>
Expressed using set theory, each RDF property is a resource set definition intensionally denoting a set of resources. Thus, given the following two resource set definitions:
D1 = includeHosts(?x, {example.com, example.org})
D2 = includePathStartsWith(?x, {foo, bar})
the Resource Set is the intersection of the extensions of these resource set definitions (the superscript I denoting the extension of a definition):
$RS = D_1^{I} \cap D_2^{I}$
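The combination rule for Example 4-3 (OR between the values of one property, AND between properties) can be illustrated with a short Python sketch. The function names are hypothetical, and the sub-domain handling follows the description of includeHosts given for Example 2-16.

```python
from urllib.parse import urlsplit


def d1_include_hosts(uri, hosts="example.org example.com"):
    """D1: the host equals, or is a sub-domain of, ANY listed host."""
    host = (urlsplit(uri).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in hosts.split())


def d2_include_path_starts_with(uri, prefixes="/foo /bar"):
    """D2: the path starts with ANY listed value."""
    path = urlsplit(uri).path
    return any(path.startswith(p) for p in prefixes.split())


def in_resource_set(uri):
    """RS: the candidate must satisfy ALL definitions (set intersection)."""
    return d1_include_hosts(uri) and d2_include_path_starts_with(uri)
```

For instance, `in_resource_set("http://example.org/foo/a.html")` holds because both definitions are satisfied, while `in_resource_set("http://example.org/baz")` fails on D2 alone.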
In natural language, the same is true for the exclude properties. That is, Example 4-4 says that a resource is a member of the set if it is on example.org and does not have a path beginning with foo or bar.
<wdr:ResourceSet>
  <wdr:includeHosts>example.org</wdr:includeHosts>
  <wdr:excludePathStartsWith>/foo /bar</wdr:excludePathStartsWith>
</wdr:ResourceSet>
However, when converting from natural language into Boolean logic, we actually need to combine the listed values for the exclude properties with AND. Example 4-4 can be written as
if (host = example.org) AND (path ≠ foo) AND (path ≠ bar)
This is an application of De Morgan's Theorem which states that if P and Q are Boolean statements then the expression NOT(P OR Q) is equivalent to NOT(P) AND NOT(Q). More formally:
¬(P ∨ Q) = ¬(P) ∧ ¬(Q)
It is therefore consistent to state that POWDER processors MUST:
This is made explicit in the POWDER Vocabulary [VOC].
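As an informal check of the reasoning above, the following Python sketch verifies De Morgan's equivalence over all truth assignments and evaluates Example 4-4 with the exclude values combined using AND. The function name is hypothetical and the sketch is not drawn from the POWDER Vocabulary.

```python
from itertools import product
from urllib.parse import urlsplit

# De Morgan: NOT(P OR Q) equals NOT(P) AND NOT(Q) for every truth assignment.
DE_MORGAN_HOLDS = all((not (p or q)) == ((not p) and (not q))
                      for p, q in product([False, True], repeat=2))


def in_example_4_4(uri):
    """On example.org AND path starts with neither /foo nor /bar."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    on_host = host == "example.org" or host.endswith(".example.org")
    # Exclude values are combined with AND: the path must avoid every prefix.
    avoids_all = all(not parts.path.startswith(p) for p in "/foo /bar".split())
    return on_host and avoids_all
```

Writing the exclusion as `not any(...)` over the same values would give identical results, which is exactly the equivalence the theorem states.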
It is believed that the RDF properties described in sections 2 and 3 provide sufficient flexibility to cover the majority of uses for the grouping of resources. However, there is a clear limit on expressivity which needs to be addressed. For example, it is impossible using the system described so far to express the set of resources on example.org with a path beginning with foo and the resources on example.com that have a path beginning with bar (again, that is, it is impossible without using the includeRegEx property and a regular expression). To define such a Resource Set requires the union of two discrete sets and this can be achieved using the OWL set operators [OWLSO], as shown in Example 4-5.
 1 <wdr:ResourceSet>
 2   <owl:unionOf rdf:parseType="Collection">
 3     <wdr:ResourceSet>
 4       <wdr:includeHosts>example.org</wdr:includeHosts>
 5       <wdr:includePathStartsWith>/foo</wdr:includePathStartsWith>
 6     </wdr:ResourceSet>
 7     <wdr:ResourceSet>
 8       <wdr:includeHosts>example.com</wdr:includeHosts>
 9       <wdr:includePathStartsWith>/bar</wdr:includePathStartsWith>
10     </wdr:ResourceSet>
11   </owl:unionOf>
12 </wdr:ResourceSet>
Lines 3 - 6 and 7 - 10 of Example 4-5 are Resource Set definitions in their own right and the overall Resource Set is the union of these two. Formally we can write:
D1 = includeHosts(?x, {example.org})
D2 = includePathStartsWith(?x, {foo})
D3 = includeHosts(?x, {example.com})
D4 = includePathStartsWith(?x, {bar})
$RS_1 = D_1^{I} \cap D_2^{I}$
$RS_2 = D_3^{I} \cap D_4^{I}$
$RS = RS_1 \cup RS_2$
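A recursive evaluator makes the interplay of intersection and union concrete. In this sketch a Resource Set is modeled as a Python dictionary keyed by property name, which is purely an assumption for illustration; only includeHosts, includePathStartsWith and unionOf are handled, and any other key is treated as matching nothing, in line with the rule that unrecognized terms yield the empty set.

```python
from urllib.parse import urlsplit


def matches(uri, definition):
    """True if uri satisfies EVERY property in the definition."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    for prop, value in definition.items():
        if prop == "unionOf":
            # Union: at least one nested Resource Set must match.
            ok = any(matches(uri, member) for member in value)
        elif prop == "includeHosts":
            ok = any(host == h or host.endswith("." + h)
                     for h in value.split())
        elif prop == "includePathStartsWith":
            ok = any(parts.path.startswith(p) for p in value.split())
        else:
            ok = False  # unrecognized term: treated as the empty set
        if not ok:
            return False
    return True


# Example 4-5 expressed in the assumed dictionary form.
EXAMPLE_4_5 = {
    "unionOf": [
        {"includeHosts": "example.org", "includePathStartsWith": "/foo"},
        {"includeHosts": "example.com", "includePathStartsWith": "/bar"},
    ]
}
```

With this structure, `matches("http://example.org/foo/x", EXAMPLE_4_5)` is true through the first member, while `matches("http://example.org/bar", EXAMPLE_4_5)` is false because neither member is satisfied in full.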
OWL's intersectionOf set operator can also be used although it is anticipated that this will be rare since a Resource Set is the intersection of the various sets defined within it. One scenario where it is appropriate to use owl:intersectionOf is where Resource Sets are defined by reference to multiple external data sources using the property look up method described in Section 3.2.
In theory, the OWL complementOf property can also be used. However, this can readily lead to significant logic problems since it is an 'open world' definition. To give an example, in order to determine the elements of the set of movies that have not received bad reviews, one would have to collect all movie reviews ever published and note the ones that were not bad. Since it is a critical design goal that a processor MUST be able to determine with certainty whether a candidate resource is or is not an element of a Resource Set, the OWL complementOf property SHOULD NOT be used.
A combination of the exclude RDF properties described in sections 2 and 3 and OWL's unionOf operator can be used to create precise, that is, closed world, Resource Set definitions that exclude particular resources. For example, at the end of Section 1.2 we claimed that it is possible to define the set of "all resources on example.com except those on video.example.com shot in widescreen format." Example 4-6 shows how this can be done in relatively few lines.
<wdr:ResourceSet>
  <owl:unionOf rdf:parseType="Collection">
    <wdr:ResourceSet>
      <wdr:includeHosts>example.com</wdr:includeHosts>
      <wdr:excludeHosts>video.example.com</wdr:excludeHosts>
    </wdr:ResourceSet>
    <wdr:ResourceSet>
      <wdr:includeHosts>example.com</wdr:includeHosts>
      <wdr:excludeConditional rdf:parseType="Resource">
        <ex:format>widescreen</ex:format>
      </wdr:excludeConditional>
    </wdr:ResourceSet>
  </owl:unionOf>
</wdr:ResourceSet>
The owl:unionOf operator may be used to create highly complex nested Resource Set definitions such as that shown in Example 4-7.
<wdr:ResourceSet>
  <owl:unionOf rdf:parseType="Collection">
    <wdr:ResourceSet>
      <wdr:includeHosts>example.org</wdr:includeHosts>
      <owl:unionOf rdf:parseType="Collection">
        <wdr:ResourceSet>
          <wdr:includePathStartsWith>/foo</wdr:includePathStartsWith>
        </wdr:ResourceSet>
        <wdr:ResourceSet>
          <owl:unionOf rdf:parseType="Collection">
            <wdr:ResourceSet>
              <wdr:includePathEndsWith>bar</wdr:includePathEndsWith>
            </wdr:ResourceSet>
            <wdr:ResourceSet>
              <wdr:includePathEndsWith>foo</wdr:includePathEndsWith>
            </wdr:ResourceSet>
          </owl:unionOf>
        </wdr:ResourceSet>
      </owl:unionOf>
    </wdr:ResourceSet>
    <wdr:ResourceSet>
      <wdr:includeHosts>example.com</wdr:includeHosts>
      <wdr:includePathStartsWith>/bar</wdr:includePathStartsWith>
    </wdr:ResourceSet>
  </owl:unionOf>
</wdr:ResourceSet>
Whilst Resource Set definitions like Example 4-7 are possible, their use will place a substantial burden on the processor and SHOULD be avoided. The Resource Set it defines is the set of resources on example.org with a URI path starting with foo or ending with either foo or bar, plus the resources on example.com that have a URI path starting with bar.
It is important to note that, when a set definition denotes resources by their addresses, we can obtain the same result by using the includeRegEx property, which would usually provide a more efficient solution. Example 4-7 can be rewritten as shown in Example 4-8.
<wdr:ResourceSet> <wdr:includeRegEx>(example.org\/(foo)|(.*(foo|bar)$))|(example.com\/bar)</wdr:includeRegEx> </wdr:ResourceSet>
It is recognized that a number of the design goals and constraints set out in Section 1.1 are in tension with each other, notably that Resource Set definitions must be easy to write, be comprehensible by humans and, as far as is possible, should avoid including or excluding resources unintentionally.
To answer the call to make it easy to write Resource Set definitions, a wide variety of RDF properties have been defined that are, it is hoped, easy to use and comprehend by humans. It is anticipated that Example 5-1 will be typical.
<wdr:ResourceSet>
  <wdr:includeHosts>example.mobi</wdr:includeHosts>
  <wdr:excludePathStartsWith>/cgi-bin /test /private</wdr:excludePathStartsWith>
</wdr:ResourceSet>
This is analogous to the sort of resource grouping in a robots.txt file [ROBOTS] that invites crawlers to probe all parts of a Web site except the cgi-bin, the testing and private areas.
Now suppose that the content provider responsible for example.mobi sets up a service called 'Test Your IQ'. Realizing that the Resource Set definition will exclude the testyouriq section of the Web site (as it begins with test), he/she adds a new line to the Resource Set definition in an attempt specifically to include the new section thus:
<wdr:ResourceSet>
  <wdr:includeHosts>example.mobi</wdr:includeHosts>
  <wdr:excludePathStartsWith>/cgi-bin /test /private</wdr:excludePathStartsWith>
  <wdr:includePathStartsWith>/testyouriq</wdr:includePathStartsWith>
</wdr:ResourceSet>
This would not have the desired effect! The critical part of this definition now says that a candidate resource is a member of the Resource Set if it has a path that begins with testyouriq AND does NOT have a path that begins with test. This can never be true and therefore Example 5-2 is equivalent to the empty set.
This example serves to highlight an important point: that it is perfectly possible to create a set definition that includes logical inconsistencies. A POWDER processor MUST, indeed can only, treat such Resource Set definitions as the Empty Set.
The correct solution to the problem is not to specify a further property in the original Resource Set, but to create an additional Resource Set definition and combine the two with an owl:unionOf operator thus:
<wdr:ResourceSet>
  <owl:unionOf rdf:parseType="Collection">
    <wdr:ResourceSet>
      <wdr:includeHosts>example.mobi</wdr:includeHosts>
      <wdr:excludePathStartsWith>/cgi-bin /test /private</wdr:excludePathStartsWith>
    </wdr:ResourceSet>
    <wdr:ResourceSet>
      <wdr:includeHosts>example.mobi</wdr:includeHosts>
      <wdr:includePathStartsWith>/testyouriq</wdr:includePathStartsWith>
    </wdr:ResourceSet>
  </owl:unionOf>
</wdr:ResourceSet>
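The difference between adding a property to the original definition and taking a union can be shown with a small Python sketch. The function names are hypothetical and the sample paths are chosen only for illustration.

```python
from urllib.parse import urlsplit


def on_host(uri, host="example.mobi"):
    """True if the URI is on the host or one of its sub-domains."""
    name = (urlsplit(uri).hostname or "").lower()
    return name == host or name.endswith("." + host)


def starts_with_any(uri, prefixes):
    """True if the path starts with any white space separated value."""
    path = urlsplit(uri).path
    return any(path.startswith(p) for p in prefixes.split())


def conjunctive_definition(uri):
    """All three properties in one Resource Set, as in Example 5-2."""
    return (on_host(uri)
            and not starts_with_any(uri, "/cgi-bin /test /private")
            and starts_with_any(uri, "/testyouriq"))


def union_definition(uri):
    """Two Resource Sets combined with a union, as in the corrected form."""
    first = on_host(uri) and not starts_with_any(
        uri, "/cgi-bin /test /private")
    second = on_host(uri) and starts_with_any(uri, "/testyouriq")
    return first or second
```

Because every path that starts with /testyouriq also starts with /test, `conjunctive_definition` rejects every candidate, whereas `union_definition("http://example.mobi/testyouriq/q1")` is satisfied through the second member.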
In this document we have laid out just two methods to define a set of resources: one referring to resource addresses and the other to resource properties. The address-based methods are clearly designed to be used with information resources available on the Web that can be identified by matching things like host names, paths and IP addresses. There is no limit on the distinguishing characteristics that can be used to define a set of resources, however, and so there should not be unnecessary constraints on how the protocol works.
The POWDER Vocabulary [VOC] uses pre-defined data types from XML Schema as well as other atomic data types, and then derives list data types from them. As the following examples show, an analogous approach can be taken with any system used for identifying resources so that little augmentation would be needed for a POWDER processor to be able to handle the data.
Importantly, if a Resource Set is defined using any term that the processor does not recognize then it MUST treat it as the empty set.
The International Standard Audiovisual Number [ISAN1] is a voluntary numbering system for the identification of audiovisual works. Following ISO 15706, the numbers are written as 24 hexadecimal digits in the following format [ISAN2].
|      | root           |   | episode |     | version   |    |
| ISAN | 1881-66C7-3420 | - | 0000    | -7- | 9F3A-0245 | -U |
The root of an ISAN number is assigned to a core work with the other numbers being used for things like episodes, different language versions, promotional trailers and so on.
A vocabulary can readily be defined to allow Resource Sets to be defined based on ISAN numbers. The terms might be along the lines of:
includeRoots: the value of which would be a white space separated list of hexadecimal digits and hyphens that would be matched against the first three blocks in the ISAN number.
includeEpisodes: a white space separated list of hexadecimal digits and hyphens that would be matched against the 4th block of 4 digits in the ISAN number.
includeVersions: a white space separated list of hexadecimal digits and hyphens that would be matched against the 5th and 6th blocks of 4 digits in the ISAN number.
includeIsanPattern: a regular expression that should be matched against the entire ISAN number.
The set of all audio visual resources that relate to two particular works might then be defined as shown in Example 6-1.
<wdr:ResourceSet>
  <ex_isan:includeRoots>1881-66C7-3420 1881-66C7-3421</ex_isan:includeRoots>
</wdr:ResourceSet>
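Matching against includeRoots might be implemented along the following lines. This is a sketch under assumptions: the function names are hypothetical, ISAN strings are assumed to be written with hyphens as in the table above, and comparison is assumed to be case-insensitive.

```python
def isan_root(isan):
    """Return the root: the first three hyphen-separated blocks."""
    text = isan.strip().upper()
    if text.startswith("ISAN"):
        text = text[4:].strip()   # drop the optional "ISAN" label
    return "-".join(text.split("-")[:3])


def in_roots(isan, include_roots):
    """True if the ISAN's root is one of the white space separated values."""
    return isan_root(isan) in include_roots.upper().split()
```

For the number shown in the table, `isan_root("ISAN 1881-66C7-3420-0000-7-9F3A-0245-U")` gives `"1881-66C7-3420"`, which is one of the two roots listed in Example 6-1.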
Developers may create their own URL patterns for use in specific services. For example, Google Custom Search Engine [Google] uses wildcards so that www.example.org/* means "all the resources on www.example.org." Such a system is easily used within a Resource Set, only requiring the definition of a single RDF property myPattern as shown below.
<wdr:ResourceSet>
  <ex:myPattern>www.example.org/*</ex:myPattern>
</wdr:ResourceSet>
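A processor extended to understand such a property could approximate the wildcard with shell-style matching. The sketch below uses Python's fnmatch module as a stand-in; it does not reproduce the actual matching rules of any search service, and the function name and the choice to match against host plus path are assumptions.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def matches_my_pattern(uri, pattern):
    """Match host + path against a shell-style wildcard pattern."""
    parts = urlsplit(uri)
    # An empty path is treated as "/" so that the site root can match.
    candidate = (parts.hostname or "").lower() + (parts.path or "/")
    return fnmatchcase(candidate, pattern)
```

Note that fnmatch also gives special meaning to `?` and `[...]`, so a production implementation would need to define precisely which wildcard characters the pattern vocabulary supports.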
[ROBOTS] robotstxt.org. This document is at http://www.robotstxt.org/.
The editors duly acknowledge the earlier work in this area carried out by Jo Rabin and the contributions made by all members of the POWDER Working Group.
includeUserInfo and includeFragments properties since these are not strictly part of HTTP, the former can cause security issues, especially when written as username:password, and grouping by fragments is very vague since there is no sure way to define the end of a fragment.