W3C

Protocol for Web Description Resources (POWDER): Grouping of Resources

W3C Working Draft — 7 January 2009

This version
http://www.w3.org/TR/2008/WD-powder-grouping-20090107/
Latest version
http://www.w3.org/TR/powder-grouping/
Previous version
http://www.w3.org/TR/2008/WD-powder-grouping-20081114/
Editors:
Phil Archer, Institute of Informatics & Telecommunications (IIT), NCSR "Demokritos" (formerly at FOSI)
Andrea Perego, Università degli Studi dell'Insubria
Kevin Smith, Vodafone Group R & D

Abstract

The Protocol for Web Description Resources (POWDER) facilitates the publication of descriptions of multiple resources such as all those available from a Web site. This document describes how sets of IRIs can be defined such that descriptions or other data can be applied to the resources obtained by dereferencing IRIs that are elements of the set. IRI sets are defined as XML elements with relatively loose operational semantics. This is underpinned by the formal semantics of POWDER which include a semantic extension, defined separately. A GRDDL transform is associated with the POWDER namespace that maps the operational to the formal semantics.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This Working Draft reflects the comments received following the Last Call period, which ended on 5th December 2008. The Working Group intends to refer to this version when seeking transition to Proposed Recommendation. It is inappropriate to refer to or link to this version (please refer either to the Last Call document or to the Proposed Recommendation if and when it is available). Changes to this document since the previous version are recorded in the Change Log.

This document was developed by the POWDER Working Group. The Working Group expects to advance this Working Draft to Recommendation Status.

Please send comments about this document to public-powderwg@w3.org (with public archive); please include the text "comment" in the subject line.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
1.1 Design Goals and Constraints
1.2 Outline Methodology
1.3 Operational Semantics
1.4 Formal Semantics
2 Defining a Resource Set
2.1 Constraints on IRI Components
2.1.1 IRI Constraints Referring to Ports: includeports and excludeports
2.1.2 IRI Constraints Referring to Queries: includequerycontains and excludequerycontains
2.1.3 IRI/URI Canonicalization
2.1.4 Data encoding
2.2 Grouping using Wildcards: The includeiripattern and excludeiripattern Constraints
2.3 Grouping by Regular Expression: The includeregex and excluderegex Constraints
2.3.1 Safe Use of includeregex
2.4 Grouping by IP Address
2.5 Enumerating Elements of an IRI Set: the includeresources and excluderesources Constraints
2.6 Complex Sets: Negation, Conjunction and Disjunction
3 Extension Mechanism
3.1 Extension Example: Custom IRI Patterns
3.2 Extension Example: Custom Site Structure
3.3 Extension Example: ISAN
4 Conformance Criteria
5 References
6 Acknowledgments
7 Change Log
Appendix A Summary of POWDER Elements

1 Introduction

The Protocol for Web Description Resources (POWDER) facilitates the publication of descriptions of multiple resources such as all those available from a Web site. These descriptions are attributable to a named individual, organization or entity that may or may not be the creator of the described resources. This contrasts with more usual metadata that typically apply to a single resource, such as a specific document's title, which is usually provided by its author.

Description Resources (DRs) are described separately [DR]. This document sets out how groups (i.e. sets) of resources may be defined, either for use in DRs or in other contexts. Set theory has been used throughout as it provides a well-defined framework that leads to unambiguous definitions. However, it is used solely to provide a formal version of what is written in the natural language text.

POWDER uses a limited set of XML elements to define sets of resources and these have relatively loose semantics. However, a GRDDL [GRDDL] transform is associated with the POWDER root namespace through which formal semantics are accessible as RDF/OWL. This is known as Semantic POWDER or POWDER-S. The details of the GRDDL transform and the formal semantics are defined separately [FORMAL] and outlined in Section 1.4 below. The use cases, a primer, test suite and schema namespace documents complete the document set.

The POWDER schema namespace is http://www.w3.org/2007/05/powder# for which we use the prefix wdr. The POWDER-S namespace is http://www.w3.org/2007/05/powder-s# for which we use the prefix wdrs. All namespaces and prefixes used in this document are shown in the table below.

Table 1: Namespaces and prefixes used in this document
Prefix Namespace
wdr http://www.w3.org/2007/05/powder#
wdrs http://www.w3.org/2007/05/powder-s#
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs http://www.w3.org/2000/01/rdf-schema#
owl http://www.w3.org/2002/07/owl#
xsd http://www.w3.org/TR/xmlschema-2/
ex An arbitrary prefix used to denote an 'example vocabulary'

In this document, the words MUST, MUST NOT, SHOULD, SHOULD NOT and MAY are to be interpreted as described in RFC 2119 [RFC2119].

White space is any of U+0009, U+000A, U+000D and U+0020. A space-separated list is a string of which the items are separated by one or more space characters (in any order). The string may also be prefixed or suffixed with zero or more of those characters. To obtain the values from a space-separated list user agents MUST replace any sequence of space characters with a single U+0020 character, dropping any leading or trailing U+0020 character, and then chopping the resulting string at each occurrence of a U+0020 character, dropping that character in the process.
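
The list-splitting procedure above can be sketched in Python as follows. This is a minimal illustration rather than a normative implementation; the function name split_space_separated is hypothetical and not part of POWDER.

```python
import re

# The four white space characters named in the text:
# U+0009 (tab), U+000A (line feed), U+000D (carriage return), U+0020 (space).
_WHITE_SPACE_RUN = re.compile("[\u0009\u000A\u000D\u0020]+")


def split_space_separated(value: str) -> list[str]:
    """Return the items of a space-separated list (hypothetical helper)."""
    # Step 1: replace every run of white space with a single U+0020.
    collapsed = _WHITE_SPACE_RUN.sub(" ", value)
    # Step 2: drop any leading or trailing U+0020.
    trimmed = collapsed.strip(" ")
    # Step 3: chop the string at each remaining U+0020.
    return trimmed.split(" ") if trimmed else []
```

A value that consists only of white space yields an empty list, which is a choice made in this sketch; the text does not address that case.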

The (unqualified) terms POWDER, POWDER Document and Description Resource (DR) refer to operational representations and semantics. The term POWDER-S refers to documents and data that express the formal semantics of POWDER. Unqualified XML element names are in the POWDER (wdr) namespace.

1.1 Design Goals and Constraints

In designing a system to define sets of resources we have drawn on earlier work [Rabin] carried out in the Web Content Label Incubator Activity [WCL-XG], and taken into account the following considerations.

  1. It must be possible to define a set of resources, either by describing the characteristics of the IRIs of resources in the set, or by simply listing its elements.
  2. It must be possible to determine with certainty whether a given resource is or is not an element of the Resource Set, as long as the resource's IRI is known.
  3. The ease of creation of accurate and useful Resource Sets is important.
  4. It should be possible to write concise Resource Set definitions.
  5. Resource Set definitions must be easy to write, be comprehensible by humans and, as far as is possible, should avoid including or excluding resources unintentionally.
  6. It must be possible to create software that implements Resource Set definitions primarily using standard and commonly available components and specifically must not require the creation of custom parsing components.
  7. So far as is possible, use of processing resources should be minimized, especially by early detection of a match or failure to match.

1.2 Outline Methodology

Operationally, POWDER does not define resource sets; rather, it facilitates the definition of sets of IRIs (Internationalized Resource Identifiers) [IRIS], which can be used to denote resources in terms of their identifiers. We use the notion of IRIs instead of URIs [URIS] since IRIs are a superset of URIs. Therefore, an IRI set definition may denote a set of IRIs as well as a set of URIs.

Defining a resource set by specifying the characteristics that the identifiers of resources in the set share is clearly an indirect approach, albeit a very useful one in the real world. In a logical sense, the definition must be interpreted to arrive at the full set.

More formally, an IRI Set definition $D$ denotes a set of IRIs $IS = D^{I}$, where $D^{I}$ is the interpretation of $D$, i.e., the set of IRIs sharing the characteristics denoted by $D$.

We take this further and allow an IRI set definition to be built up in stages.

An IRI set $IS$ is denoted by an IRI set definition $D_{IS}$ in terms of one or more characteristics that the elements of the set have in common. Each characteristic is expressed by an IRI constraint $C$, and IRI constraints $C_1, C_2, \dots, C_n$ give rise to IRI set definitions $D_1, D_2, \dots, D_n$, so that the complete IRI set definition $D_{IS}$ comprises $D_1, D_2, \dots, D_n$.

The IRI set $IS$ is the intersection of the IRI sets denoted by the IRI set definitions in $D_{IS}$.

Formally:

$$IS = D_{IS}^{I} = D_1^{I} \cap D_2^{I} \cap \dots \cap D_n^{I} = (D_1 \wedge D_2 \wedge \dots \wedge D_n)^{I}$$

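For example, suppose that an IRI set $IS$ is denoted by the following definitions:

  1. $D_1$: the top level components of the host component of the IRI exactly match example.org
  2. $D_2$: the path component of the IRI begins with /foo

Then, $D_{IS}$ will be defined as follows: “the top level components of the host component of the IRI exactly match example.org” AND “the path component of the IRI begins with /foo.”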

Whether the IRI of a specific resource $R$, known as the candidate resource, is a member of IRI Set $IS$ or not is determined by comparing its characteristics with those denoted by the set definitions used in $D_{IS}$. It must be an element of the intersection of the sets defined by the interpretation of $D_1, D_2, \dots, D_n$ to be an element of $IS$.

If an IRI set definition contains no constraints, then its interpretation is by definition the empty set ∅. Formally:

Let $IS$ be an IRI Set, and let $D_{IS}$ be the set of IRI Set definitions denoting the IRIs in $IS$: if $D_{IS} = \emptyset$, then $IS = \emptyset$.
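
The intersection semantics, including the empty-set rule, can be sketched in Python by treating each IRI set definition as a predicate over IRIs. This is an illustrative sketch only: the names is_member, d1 and d2 are hypothetical, and the host test assumes that matching is done on whole domain labels.

```python
from typing import Callable, Sequence
from urllib.parse import urlsplit

# Each IRI set definition is modelled as a predicate on the candidate IRI.
Definition = Callable[[str], bool]


def is_member(iri: str, definitions: Sequence[Definition]) -> bool:
    """True if the IRI lies in the intersection of all the definitions."""
    if not definitions:
        # No constraints: the interpretation is the empty set. Python's
        # all([]) is True, so this case must be handled explicitly.
        return False
    return all(definition(iri) for definition in definitions)


def d1(iri: str) -> bool:
    """Host is example.org or one of its sub-domains (label boundary assumed)."""
    host = urlsplit(iri).hostname or ""
    return host == "example.org" or host.endswith(".example.org")


def d2(iri: str) -> bool:
    """Path component begins with /foo."""
    return urlsplit(iri).path.startswith("/foo")
```

Modelling definitions as predicates means the full set never has to be enumerated: membership of a candidate IRI is decided one definition at a time, and evaluation stops at the first definition that fails.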

1.3 Operational Semantics

The POWDER XML schema [WDR] defines the set of XML elements and attributes to be used for enforcing the operational semantics of an IRI set definition.

More precisely, we define an XML element iriset to take the place of the IRI set, and its child elements denote the set of IRI constraints C1, C2, …, Cn. The example reported in the previous section can therefore be written as follows:

Example 1-1: A simple IRI Set definition

<iriset>
  <includehosts>example.org</includehosts>
  <includepathstartswith>/foo</includepathstartswith>
</iriset>
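
As an illustration of how a processor might read such a definition, the following Python sketch parses an iriset element with the standard library and returns each constraint with its list of values. The function names are hypothetical. The example string omits the namespace declaration, as Example 1-1 does; in a complete POWDER document the elements would be in the wdr namespace, which the helper local_name strips.

```python
import xml.etree.ElementTree as ET

# Example 1-1, reproduced without a namespace declaration.
EXAMPLE_1_1 = """
<iriset>
  <includehosts>example.org</includehosts>
  <includepathstartswith>/foo</includepathstartswith>
</iriset>
"""


def local_name(tag: str) -> str:
    """Strip an ElementTree '{namespace}' prefix from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def read_constraints(xml_text: str) -> list[tuple[str, list[str]]]:
    """Return (constraint name, values) pairs in document order."""
    root = ET.fromstring(xml_text.strip())
    if local_name(root.tag) != "iriset":
        raise ValueError("expected an iriset element")
    # str.split() with no argument splits on any run of white space.
    return [(local_name(child.tag), (child.text or "").split()) for child in root]
```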

1.4 Formal Semantics

The operational semantics described above are underpinned by formal semantics. A GRDDL [GRDDL] transform is associated with the POWDER namespace that allows the XML data to be rendered and processed as RDF/OWL with one important proviso — that a semantic extension is understood. Defined fully in the Formal Semantics document [FORMAL], this allows a candidate resource's IRI to be matched against regular expressions that are values of an OWL data type property wdrs:matchesregex (or wdrs:notmatchesregex in the case of patterns that are to be excluded). An OWL class takes the place of the IRI set and resources whose IRIs match all the property restrictions defined using wdrs:matchesregex and wdrs:notmatchesregex are instances of that class. The regular expression syntax used is defined by XML schema as modified by XQuery 1.0 and XPath 2.0 Functions and Operators [XQXP].

As shown in Example 1-1 above, the POWDER XML elements generally take strings as values. These are converted into regular expressions as a first step in the GRDDL transform which renders POWDER documents in an intermediate format known as POWDER-BASE. It is POWDER-BASE that is then transformed into POWDER-S. For clarity, this two-stage process is not referred to in the main section of this document on defining a resource set which only presents POWDER and POWDER-S examples. POWDER-BASE is, however, an important part of the extension mechanism of POWDER Resource Grouping. The Formal Semantics document gives full details of the transformation of all elements of POWDER documents to POWDER-BASE and POWDER-S.

The result of the GRDDL transformation on Example 1-1 above is shown below.

Example 1-2: The POWDER-S encoding of Example 1-1

<owl:Class rdf:nodeID="iriset_1">
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?(example\.org)(:([0-9]+))?\/</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?(\/foo)</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>

2 Defining a Resource Set

A Resource Set is defined in terms of the IRIs of resources that are its members. Determining whether a candidate resource is, or is not, a member of the set can therefore be done by comparing its IRI with the data in the set definition. Importantly, defining the Resource Set in terms of IRIs allows us to verify whether the candidate resource is in the set without having to fetch and parse it, or perform a DNS lookup, thus maximizing processing efficiency in many environments.

We define a range of methods to support set definition by IRI, and provide support for methods defined in other Recommendations.

2.1 Constraints on IRI components

The syntax of an IRI, as defined in RFC 3987 [IRIS], provides a generic framework for identification schemes that goes beyond what is demanded by the POWDER use cases [USECASES]. We therefore limit our work to IRIs with the syntax: scheme://iuserinfo@ihost:port/ipath?iquery#ifragment, as shown below:

http://jdoe@www.example.com:1234/example1/example2?query=help#fragment
\  /   \  / \             / \  /\                / \        / \      /
 --     --   -------------   --  ----------------   --------   ------ 
  |      |         |          |         |              |        |
scheme iuser    ihost        port     ipath         iquery  ifragment
       info 

The following Regular Expression, elaborated from that offered in RFC 3986 [Rabin], provides a means of splitting both URIs and IRIs of this type into their component parts.

^(([^:/?#]+):)?(//((([^/?#]*)@)?([^/?#:]*)(:([^/?#]*))?))?([^?#]*)(\?([^#]*))?(#(.*))?

If the IRI of the candidate resource is valid, this yields the components as shown in Table 2 (strings that are not valid IRIs will inevitably lead to unpredictable results).

Table 2: Mapping between regular expression variables and IRI components
Component RE variable
scheme $2
iuserinfo $6
ihost $7
port $9
ipath $10
iquery $12
ifragment $14
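
The mapping in Table 2 can be tried out with a short Python sketch that applies the regular expression given above and reads off the numbered groups. The function name split_iri and the dictionary layout are illustrative choices, not part of POWDER.

```python
import re

# The component-splitting regular expression given in the text.
IRI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//((([^/?#]*)@)?([^/?#:]*)(:([^/?#]*))?))?"
    r"([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Group numbers from Table 2.
GROUPS = {
    "scheme": 2,
    "iuserinfo": 6,
    "ihost": 7,
    "port": 9,
    "ipath": 10,
    "iquery": 12,
    "ifragment": 14,
}


def split_iri(iri: str) -> dict:
    """Return the IRI components; absent components are None."""
    match = IRI_SPLIT.match(iri)
    # Every part of the expression is optional, so a match is always found.
    return {name: match.group(number) for name, number in GROUPS.items()}
```

Components that do not occur in the candidate IRI, such as the user info in http://example.org/, are reported as None rather than as an empty string.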

For the scheme, ihost, port, ipath, and iquery IRI components we define corresponding IRI constraints, the value of most of which is a white space-separated list of strings, any one of which must match the relevant portion of the IRI of the candidate resource. The exception is the constraint relating to query strings which is discussed in Section 2.1.2.

The iuserinfo and ifragment components are not used in POWDER IRI set definitions directly as it is felt that these may add a layer of unnecessary complexity with few practical applications. That said, it is important not to discard these components when processing the candidate resource's IRI. Furthermore, IRI sets may be defined using additional vocabularies as set out in Section 3. That extension method, or the use of the includeregex and excluderegex properties (see Section 2.3 below), means that user info and fragments can be used in IRI set definitions if required.

Formally, an IRI set definition D is expressed by one or more IRI constraints of the form C = IRI_component_matches(?x, {string1 | string2 | … | stringn}), where ?x is a variable denoting the IRI component under consideration, and {string1 | string2 | … | stringn} denotes a set consisting either of string string1 OR string2 OR … OR stringn.

Any number of IRI constraints C1, C2, …, Cn can be declared, and, as stated in Section 1.2, the overall IRI set is the intersection of the sets that can be interpreted from the IRI set definitions corresponding to C1, C2, …, Cn. With some exceptions, each particular IRI constraint can only appear 0 or 1 times.

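Strings are matched according to one of four rules:

  1. exact: the string must match the whole of the IRI component.
  2. startsWith: the IRI component must begin with the string.
  3. endsWith: the IRI component must end with the string.
  4. contains: the string must occur somewhere within the IRI component.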

Recognizing the great diversity of potential uses and set definition requirements, multiple IRI constraints are defined relating to the path component. Furthermore, for each constraint there is a ‘negative’ constraint, that is, a constraint whose value is a list of strings that must not be present in the relevant IRI component.

Table 3: Basic IRI constraints used to define IRI sets. These and other elements introduced in subsequent sections are summarized in the Appendix.
IRI constraint          IRI component   Matching rule   Negative constraint
includeschemes          scheme          exact           excludeschemes
includehosts            ihost           endsWith        excludehosts
includeexactpaths       ipath           exact           excludeexactpaths
includepathcontains     ipath           contains        excludepathcontains
includepathstartswith   ipath           startsWith      excludepathstartswith
includepathendswith     ipath           endsWith        excludepathendswith
includeports            port            exact           excludeports

includepathcontains may appear any number of times within an IRI set definition, so that it is easy to create one in which multiple strings must be present in paths. This is in contrast to all other terms in Table 3 which can only occur 0 or 1 times, since the IRI of a candidate resource can only have one scheme, one host etc.

As a quick example, the set of all resources on example.org, fetched using either http or https, where the path component of their IRIs starts with /foo and does not end with .png or .jpg, is defined thus:

Example 2-1: An IRI Set definition using four IRI constraints

<iriset>
  <includeschemes>http https</includeschemes>
  <includehosts>example.org</includehosts>
  <includepathstartswith>/foo</includepathstartswith>
  <excludepathendswith>.png .jpg</excludepathendswith>
</iriset>
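
The four constraints of Example 2-1 can also be evaluated with ordinary string operations instead of regular expressions. The Python sketch below is one such approach under stated assumptions: the function name is hypothetical, the candidate IRI is assumed to be canonicalized already, and the endsWith rule for hosts is assumed to apply on whole domain labels.

```python
from urllib.parse import urlsplit


def matches_example_2_1(iri: str) -> bool:
    """Test a candidate IRI against the four constraints of Example 2-1."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    path = parts.path
    return (
        # includeschemes: exact match against any item in the list
        parts.scheme in ("http", "https")
        # includehosts: endsWith, here taken on a domain label boundary
        and (host == "example.org" or host.endswith(".example.org"))
        # includepathstartswith
        and path.startswith("/foo")
        # excludepathendswith: none of the listed strings may end the path
        and not path.endswith((".png", ".jpg"))
    )
```

Because the constraints are combined with a logical AND, evaluation stops at the first one that fails, which fits the design goal of early detection of a failure to match.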

As outlined in Section 1.4, the POWDER GRDDL transform maps the IRI constraints in Table 3 to regular expressions against which the candidate IRI can be matched. These are shown in Table 4 below where var means the value of the XML element following processing as set out in the formal semantics document [FORMAL]. In brief this turns white space separated lists of strings into alternative values within the regular expression such that:

<includehosts>example.org 
              example.com
</includehosts>

becomes

(example\.org|example\.com).

Table 4: Template regular expressions for IRI constraints that take a white space separated list of values. See Section 2.3 for details of the meta character escaping used in these regular expressions.
IRI Constraint (include… / exclude…) Regular Expression
schemes ^var\:\/\/
hosts \:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?var(\:([0-9]+))?\/
ports \:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)*[^\:\/\?\#\@]+\:var\/
exactpaths \:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?var($|\?|\#)
pathcontains \:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?\/[^\?\#]*var[^\?\#]*[\?\#]?
pathstartswith \:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?var
pathendswith \:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?\/[^\?\#]*var($|\?|\#)

These template regular expressions may be useful in processing POWDER documents directly but other methods of determining whether a candidate IRI does or does not match a particular constraint are equally valid.
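
As a rough illustration of how a template from Table 4 might be instantiated, the Python sketch below turns a white space separated list into an alternation and substitutes it for var in the hosts template. The function names are hypothetical, and re.escape is used here as a stand-in for the escaping rules defined in the Formal Semantics document [FORMAL], which may differ in detail.

```python
import re

# The "hosts" template from Table 4, with var as the placeholder.
HOSTS_TEMPLATE = r"\:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?var(\:([0-9]+))?\/"


def list_to_alternation(value: str) -> str:
    """Turn 'a b c' into the regular expression group '(a|b|c)'."""
    items = value.split()  # splits on any run of white space
    return "(" + "|".join(re.escape(item) for item in items) + ")"


def hosts_regex(value: str) -> str:
    """Instantiate the hosts template for an includehosts/excludehosts value."""
    # The literal text 'var' occurs exactly once in the template.
    return HOSTS_TEMPLATE.replace("var", list_to_alternation(value))


def host_matches(iri: str, value: str) -> bool:
    """True if the candidate IRI matches the instantiated template."""
    return re.search(hosts_regex(value), iri) is not None
```

For an includehosts value the result of host_matches is used as is; for excludehosts it would be negated.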

Example 2-2 below uses the regular expressions from Table 4 in the POWDER-S version of Example 2-1.

Example 2-2: The IRI Set defined in Example 2-1 encoded in POWDER-S

1  <owl:Class rdf:nodeID="iriset_1">
2    <owl:equivalentClass>
3      <owl:Class>
4        <owl:intersectionOf rdf:parseType="Collection">
5          <owl:Restriction>
6            <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
7            <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">^(http|https)\:\/\/</owl:hasValue>
8          </owl:Restriction>
9          <owl:Restriction>
10           <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
11           <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?(example\.org)(:([0-9]+))?\/</owl:hasValue>
12         </owl:Restriction>
13         <owl:Restriction>
14           <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
15           <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?(\/foo)</owl:hasValue>
16         </owl:Restriction>
17         <owl:Restriction>
18           <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#notmatchesregex" />
19           <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?\/[^\?\#]*(\.png|\.jpg)($|\?|\#)</owl:hasValue>
20         </owl:Restriction>
21       </owl:intersectionOf>
22     </owl:Class>
23   </owl:equivalentClass>
24 </owl:Class>

Note the use of notmatchesregex in line 18 to encode the excludepathendswith element.

2.1.1 IRI Constraints Referring to Ports: includeports and excludeports

Although ports are clearly integers, POWDER treats them as a string in the same way as the other constraints in Table 3. Port ranges (such as 80-100) are not supported but note that the value of includeports and excludeports is a white space separated list so that multiple ports may be enumerated.

2.1.2 IRI Constraints Referring to Queries: includequerycontains and excludequerycontains

Query strings typically contain a series of name-value pairs separated by ampersands thus:

?name1=value1&name2=value2

These are usually acted on by the server to generate content in real time and the order of the name-value pairs is unimportant. For practical purposes ?name1=value1&name2=value2 is equivalent to ?name2=value2&name1=value1. As a result, a significant amount of processing must be done to determine whether a candidate IRI is an element of an IRI set whose definition includes either the includequerycontains or excludequerycontains IRI Constraint.

To keep such processing manageable, the includequerycontains and excludequerycontains IRI Constraints take a single value not a white space separated list of values.

Section 2.6 includes a further discussion on creating unions of multiple IRI sets which would allow multiple query strings to be parsed.

By default, the POWDER GRDDL transform assumes that the delimiting character in a query string is the ampersand (&). However, an alternative delimiter can be specified as the value for the delimiter attribute on includequerycontains and excludequerycontains constraints. Example 2-3 below shows this.

Example 2-3: An IRI Set definition using includequerycontains

<iriset>
  <includehosts>socialnetwork.example.com</includehosts>
  <includequerycontains delimiter=",">id=abcdef,group=12345</includequerycontains>
</iriset>

The GRDDL transform splits the value provided for the includequerycontains or excludequerycontains IRI Constraints into its constituent pairs at the delimiting character and the presence of each name-value pair within the candidate IRI is then tested for independently. The template regular expression for such a test is:

\:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?\/[^\?\#]*\?([^\#]*d)?q(d|$)

Where d is the delimiting character and q is the name-value pair. The Formal Semantics document [FORMAL] sets this out in more detail.

An important consequence of this processing model is that within the query string, only complete name-value pairs or value-less parameters are matched. More precisely, only complete query conjuncts in the query string are matched. A complete query conjunct is any minimal substring of the query string that has ? or d before its first character and d or $ after its last character, where, as in the template regular expression, d is the query delimiter and $ is the end of the string.

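If the value of includequerycontains in Example 2-3 were changed to simply abcdef (rather than id=abcdef,group=12345) then only query strings in which abcdef is a complete conjunct would match. A query string such as ?abcdef or ?abcdef,group=12345 would match, whereas ?id=abcdef would not, because there abcdef is only part of the conjunct id=abcdef.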

Again, a POWDER processor may use alternative methods to determine whether a given name-value pair is present in a candidate IRI but the template regular expression is used in the GRDDL transform to generate the POWDER-S shown in Example 2-4. Notice that the pre-processing described here allows POWDER-S to use the same restriction on the wdrs:matchesregex data property as the other elements in Table 3.

Example 2-4: The IRI Set defined in Example 2-3 encoded in POWDER-S

<owl:Class rdf:nodeID="iriset_1">
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?(socialnetwork\.example\.com)(:([0-9]+))?\/</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?\/[^\?\#]*\?([^\#]*,)?id=abcdef(,|$)</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?\/[^\?\#]*\?([^\#]*,)?group=12345(,|$)</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>
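
The splitting and per-conjunct testing described in this section can be sketched in Python as follows. The sketch instantiates the template regular expression given above for each conjunct; the function names are hypothetical and re.escape stands in for the escaping rules of the Formal Semantics document [FORMAL].

```python
import re

# The part of the template that precedes the query string test.
PREFIX = r"\:\/\/(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?\/[^\?\#]*\?"


def conjunct_regex(conjunct: str, delimiter: str) -> str:
    """Instantiate the template for one conjunct q and delimiter d."""
    d = re.escape(delimiter)
    q = re.escape(conjunct)
    return PREFIX + r"([^\#]*" + d + ")?" + q + "(" + d + "|$)"


def query_contains(iri: str, value: str, delimiter: str = "&") -> bool:
    """True if every conjunct of value occurs in the IRI's query string."""
    conjuncts = [c for c in value.split(delimiter) if c]
    # An empty value matches nothing in this sketch.
    return bool(conjuncts) and all(
        re.search(conjunct_regex(c, delimiter), iri) for c in conjuncts
    )
```

Since each conjunct is tested independently, the order in which the name-value pairs appear in the candidate IRI does not affect the result.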

2.1.3 IRI/URI Canonicalization

Before any IRI matching can take place the candidate resource's IRI should be normalized to Form C, as defined in Character Model for the World Wide Web 1.0: Normalization [CHARMOD-NORM]. The following further steps should then be carried out, which are consistent with RFC 3986 [URIS], RFC 3987 [IRIS], URISpace [URISpace] and XForms [XFORMS].

2.1.3.1 Default values

The following table gives some examples.

Table 5: Examples of canonicalized URIs using defaults
Input IRI/URI                   Canonical form
www.example.com                 http://www.example.com/
http://www.example.com          http://www.example.com/
HTTPS://WWW.EXAMPLE.COM/FOO     https://www.example.com/FOO
http://www.example.com./foo     http://www.example.com/foo
http://www.example.com:80/foo   http://www.example.com/foo
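
The behaviour shown in Table 5 can be approximated with the Python sketch below. The rules are inferred from the table's examples only (assume http when no scheme is given, lower-case the scheme and host, drop a trailing dot from the host, drop the default port, use / for an empty path), so they should be read as assumptions. The function name and the default port for https are likewise assumptions, and IPv6 literal hosts are not handled.

```python
from urllib.parse import urlsplit, urlunsplit

# Port 80 for http appears in Table 5; 443 for https is an assumption.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def canonicalize_defaults(iri: str) -> str:
    """Apply default values to a candidate IRI (illustrative sketch)."""
    if "://" not in iri:
        iri = "http://" + iri  # assume http when no scheme is present
    parts = urlsplit(iri)
    scheme = parts.scheme.lower()
    # Split the authority by hand so that user info is preserved untouched.
    userinfo, _, hostport = parts.netloc.rpartition("@")
    host, _, port = hostport.partition(":")
    host = host.lower().rstrip(".")
    if port == DEFAULT_PORTS.get(scheme):
        port = ""
    netloc = (userinfo + "@" if userinfo else "") + host + (":" + port if port else "")
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```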

2.1.3.2 Percent-encoding conversion

Table 6: Examples of percent-encoding conversion
Input IRI/URI                            Canonical form
http://example.com/staff/Fran%c3%a7ois   http://example.com/staff/François
http://example.com/my%20doc.doc          http://example.com/my doc.doc
In this next example the %2F is a literal slash, not a path separator, and so is left as %2F
http://www.example.com/foo/his%2Fhers    http://www.example.com/foo/his%2Fhers
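
The conversion illustrated in Table 6 might be sketched in Python as follows. The sketch decodes each run of percent-encoded octets as UTF-8 but leaves encoded the RFC 3986 reserved characters and the percent sign itself, so that a literal %2F stays distinct from a path separator. The function name and the exact set of characters kept encoded are assumptions made for illustration.

```python
import re

# RFC 3986 gen-delims and sub-delims, plus "%" so that decoding is reversible.
KEEP_ENCODED = set(":/?#[]@!$&'()*+,;=%")

_ENCODED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_percent(iri: str) -> str:
    """Convert percent-encoded octets to Unicode, except reserved characters."""

    def replace(match: re.Match) -> str:
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not valid UTF-8: leave untouched
        return "".join(
            "%{:02X}".format(ord(ch)) if ch in KEEP_ENCODED else ch
            for ch in text
        )

    return _ENCODED_RUN.sub(replace, iri)
```

This step changes only the encoded octets; the scheme, host and port of the candidate IRI are left exactly as they were.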

2.1.3.3 Further Steps

There are some situations in which it is not possible to define a single canonicalization process. For example, where the IRI of the candidate resource has been generated from form input, + signs should be replaced with a single white space, in addition to converting %-encoded characters (including the RFC 3986 [URIS] reserved characters in the query string) into the Unicode characters they represent. Such a statement assumes that it is knowable whether or not the IRI was generated from form input. Similarly, Internationalized Domain Names (IDNs), as defined in RFC 3490 [RFC3490], should be converted from Punycode [RFC3492] into their Unicode string representations so that, for example:

http://www.xn--exmple-jua.org/

becomes

http://www.exåmple.org/

Again, this is the correct course of action if it is known that the candidate resource's IRI is an IDN. If a DR author is aware that conversion to Unicode may lead to ambiguity such that an IRI is included unintentionally, then he/she should specifically exclude such possibilities using the appropriate IRI constraint. Finally, relative URIs/IRIs should be supported as per Section 5.1, 'Establishing a base URI', of RFC 3986 [URIS]; namely: A base URI must be established by the parser prior to parsing URI references that might be relative.

Such factors may well be known. A real-world IRI set will be defined to include a real-world set of resources (such as an actual Web site) and a processor will exist in a known environment, such as at network level or user-interface level, where the encoding of an IRI will be known. Bearing these factors in mind, the processor SHOULD make a best effort to canonicalize the IRI of a candidate resource and SHOULD tend towards false negatives rather than false positives. In other words, if it cannot be determined whether a candidate resource's IRI of http://www.xn--exmple-jua.org/ is an IDN or just an IRI with some unusual character sequences, and the IRI set definition comprises <includehosts>exåmple.org</includehosts>, the candidate should not be considered as a member of the IRI set.
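
Where the processor does know that the host is an IDN, the Punycode conversion discussed above could be sketched in Python with the standard library's punycode codec. The function names are hypothetical, and the sketch deliberately omits the Nameprep checks that a full RFC 3490 ToUnicode operation performs.

```python
def idn_label_to_unicode(label: str) -> str:
    """Decode one domain label if it carries the 'xn--' ACE prefix."""
    if not label.lower().startswith("xn--"):
        return label
    try:
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        # Not valid Punycode: leave the label as it was.
        return label


def idn_host_to_unicode(host: str) -> str:
    """Convert each label of a host name from Punycode to Unicode."""
    return ".".join(idn_label_to_unicode(label) for label in host.split("."))
```

A label that cannot be decoded is returned unchanged, so a later comparison with a Unicode value in the IRI set definition simply fails. That keeps the sketch on the side of false negatives, in line with the guidance above.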

2.1.4 Data encoding

To complement the IRI canonicalization steps described in the previous section, related processing steps must also be carried out on the strings supplied as set defining data.

Bear in mind that as the data is serialized in XML, strings specified in the IRI set definition will be escaped according to the XML syntax using entity references for specific characters (escaping < with &lt; and & with &amp; is mandatory, others may also be used). Moreover, since many IRI set definition properties take a white space-separated list of strings as their value, whenever a string contains an unescaped white space (i.e., a white space not encoded as %20), it will be substituted by %20.

The following steps should therefore be applied to each item in the list separately.

If the IRI set definition includes values related to the port then matching of the data against the candidate resource's IRI must be carried out as follows:

2.2 Grouping using Wildcards: The includeiripattern and excludeiripattern IRI constraints

Enabling Read Access for Web Resources [WAF] defines a method for encoding the domains and sub-domains from which access to resources on a given Web site should be granted or denied. The includeiripattern and excludeiripattern properties support this syntax directly. Domains and sub-domains may be substituted by a wildcard character (*) according to the following EBNF:

access-item    ::= (scheme "://")? domain-pattern (":" port)? | "*" domain-pattern ::= domain | "*." domain

It is anticipated that resource groups will typically be defined in terms of the domains and sub domains from which they are available. In order to provide as much flexibility as possible in this regard, the includeiripattern and excludeiripattern properties allow domains and sub-domains to be substituted by a wildcard character (*) according to the following EBNFABNF (originally developed by the Web Application Formats Working Group [WAF]):

iri-pattern    = [scheme "://"] domain-pattern [":" port-pattern] | "*"
domain-pattern = domain | "*." domain
port-pattern   = port | "*"

scheme and port are used as defined in RFC 3986 [URIS]. domain is an internationalized domain name as defined in RFC 3490 [RFC3490].

It follows that:

<includehosts>example.com</includehosts>

and

<includeiripattern>example.com</includeiripattern>

are equivalent. However, *.example.com, meaning resources on sub-domains of example.com but not on example.com itself, is not a valid value for includehosts.

In contrast to the IRI constraints shown in Table 3, includeiripattern and excludeiripattern take a single pattern, not a white space separated list of values. Note that paths and query strings MUST NOT be included in the pattern. If these are required in an IRI set definition, the relevant IRI constraints from Table 3 can be used.

Any processing method that accurately tests a candidate IRI against the value of an includeiripattern or excludeiripattern element is valid but the POWDER GRDDL transform does it in the same way as the other IRI constraints, namely by creating a restriction on the wdrs:matchesregex and wdrs:notmatchesregex properties as shown in the example below. Full details of the transformation are provided in the Formal Semantics document [FORMAL].

Example 2-5: An IRI Set defined using the includeiripattern and excludeiripattern constraints

POWDER

<iriset>
  <includeiripattern>http://example.org</includeiripattern>
  <excludeiripattern>search.example.com:81</excludeiripattern>
</iriset>

POWDER-S

<owl:Class rdf:nodeID="iriset_1">
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">^http\:\/\/([^\:\/\?\#\@]+\.)+example.org(\:[0-9]+)?</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#notmatchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">^[A-Za-z]+\:\/\/([^\:\/\?\#\@]+\.)*search.example.com\:81</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>
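
Since any processing method that accurately tests a candidate IRI is valid, a processor need not go through regular expressions at all. The following Python sketch tests a candidate IRI directly against a pattern parsed according to the ABNF in this section. The function name matches_iri_pattern is hypothetical, internationalized domain names are not handled, and the treatment of ports (a pattern port only matches an IRI that states that port explicitly) is an assumption rather than something taken from this document.

```python
import re
from urllib.parse import urlsplit

# Parses an iri-pattern following the ABNF in this section.
_IRI_PATTERN = re.compile(
    r"^(?:(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://)?"   # [scheme "://"]
    r"(?P<wild>\*\.)?(?P<domain>[^:/?#@*]+)"          # domain | "*." domain
    r"(?::(?P<port>[0-9]+|\*))?$"                     # [":" port-pattern]
)

def matches_iri_pattern(pattern, iri):
    """Hypothetical helper: test a candidate IRI against one iri-pattern."""
    if pattern == "*":
        return True
    parsed = _IRI_PATTERN.match(pattern)
    if parsed is None:
        raise ValueError("not a valid iri-pattern: " + pattern)
    parts = urlsplit(iri)
    host = parts.hostname or ""          # urlsplit lower-cases the host
    domain = parsed.group("domain").lower()
    scheme = parsed.group("scheme")
    if scheme is not None and parts.scheme != scheme.lower():
        return False
    is_subdomain = host.endswith("." + domain)
    if parsed.group("wild"):
        if not is_subdomain:             # "*.domain" excludes the domain itself
            return False
    elif host != domain and not is_subdomain:
        return False
    port = parsed.group("port")
    if port not in (None, "*"):
        # Assumption: the port must appear explicitly in the candidate IRI.
        return parts.port is not None and str(parts.port) == port
    return True

print(matches_iri_pattern("example.com", "http://www.example.com/x"))
print(matches_iri_pattern("*.example.com", "http://example.com/"))
```

Under these assumptions the first call reports a match and the second does not, which mirrors the distinction drawn above between example.com and *.example.com.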

2.3 Grouping by Regular Expression: The includeregex and excluderegex IRI constraints

It is believed that the IRI constraints discussed above will be easy to use and cover the overwhelming majority of POWDER use cases. However, the use of strings with fixed matching rules clearly presents a restriction on flexibility. To support fully flexible set definition by IRI, the includeregex and excluderegex properties take a Regular Expression and should be applied to the candidate resource's complete IRI (after following the canonicalization steps above). For POWDER-S, the regular expressions are copied verbatim as values for the wdrs:matchesregex and wdrs:notmatchesregex properties.

As noted in Section 1.4, the syntax used is defined by XML schema as modified by XQuery 1.0 and XPath 2.0 Functions and Operators [XQXP].

N.B. The value of the includeregex and excluderegex properties MUST be a single Regular Expression, not a white space-separated list.

As an example, the set of all the resources hosted either by example.org or example.net, where the path component of their IRIs starts either with foo or bar, can be defined thus:

Example 2-6: IRI set definition by regular expression (not including character escaping)

POWDER:

<iriset>
  <includeregex>^(([^:/?#]+):)//([^:/?#]+.)?example.(org|net)/(foo|bar)</includeregex>
</iriset> 

POWDER-S:

<owl:Class rdf:nodeID="iriset_1">
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">^(([^:/?#]+):)//([^:/?#]+.)?example.(org|net)/(foo|bar)</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>

It is important to note that Example 2-6 does not take account of the need to escape certain characters.

The following characters are used as meta characters in regular expressions and MUST therefore be escaped if used in a pattern given as the value of the includeregex property:

. \ ? * + { } ( ) [ ]

In addition, the < (less than) character MUST always be escaped since it could be mistaken for the beginning of the closing </includeregex> tag.

As a safeguard against unintended consequences, other characters that always or typically have special meaning within IRI strings and/or XML SHOULD also be escaped, namely:

! " # % & ' , - / : ; = > @ [ ] _ ` ~

As a result, Example 2-6 should properly be written as shown in Example 2-7 below.

Example 2-7: Set definition by regular expression, including character escaping

POWDER:

<iriset>
  <includeregex>^(([^\:\/\?\#]+)\:)//([^\:\/\?\#]+\.)?example\.(org|net)/(foo|bar)</includeregex>
</iriset> 

POWDER-S:

<owl:Class rdf:nodeID="iriset_1">
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">^(([^\:\/\?\#]+)\:)//([^\:\/\?\#]+\.)?example\.(org|net)/(foo|bar)</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>
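
The practical difference between Examples 2-6 and 2-7 can be checked with a short script. The sketch below uses Python's re module as a stand-in for the XML Schema/XPath regular expression syntax that POWDER actually specifies, which is an assumption that holds for these two expressions but not in general. The candidate IRIs are invented for illustration.

```python
import re

# Regular expressions copied from Example 2-6 and Example 2-7.
UNESCAPED = r"^(([^:/?#]+):)//([^:/?#]+.)?example.(org|net)/(foo|bar)"
ESCAPED = r"^(([^\:\/\?\#]+)\:)//([^\:\/\?\#]+\.)?example\.(org|net)/(foo|bar)"

def in_set(regex, iri):
    """True if the regular expression matches somewhere in the IRI."""
    return re.search(regex, iri) is not None

intended = "http://www.example.org/foo/index.html"
unintended = "http://www.exampleXorg/foo"   # hypothetical host, not example.org

print(in_set(UNESCAPED, intended), in_set(ESCAPED, intended))
print(in_set(UNESCAPED, unintended), in_set(ESCAPED, unintended))
```

Both expressions accept the intended IRI, but only the unescaped one also accepts the second candidate, because an unescaped dot matches any character.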

2.3.1 Safe Use of includeregex

Example 2-7 uses a modified version of the regular expression given in Section 2.1. This is the safest method but is not, perhaps, the most natural way to proceed. If a less rigorous approach is taken it is easy to make mistakes when specifying regular expressions, and incorrect regular expressions in set definitions will have one of two possible (and obvious) consequences:

  1. the corresponding set does not include the intended resources;
  2. the corresponding set includes resources not intended to be included.

Example 2-8 below shows how this can happen.

Example 2-8: An example of a bad set definition by regular expression

<iriset>
  <includehosts>example.org</includehosts>
  <includeregex>https</includeregex>
</iriset> 

The intention of the regular expression given in Example 2-8 is probably to say "all resources on example.org with a URI beginning with https." However, as the regular expression is not anchored at either end, what this actually means is "all resources on example.org where the URI includes https". Thus this IRI set includes both of:

Adding in anchors at the beginning and end of the regular expression can have equally undesirable consequences.

Example 2-9: A second example of a bad set definition by regular expression

<iriset>
  <includehosts>example.org</includehosts>
  <includeregex>^https$</includeregex>
</iriset> 

In Example 2-9, the intention is probably, again, to define the set of "all resources on example.org fetched using https only". However, adding both the ^ and $ anchors at the beginning and end of the regular expression means that the whole IRI must be https from start to finish, which can never be true, so this IRI set is equivalent to the empty set.

Example 2-10 shows one possible way to encode the intended set definition.

Example 2-10: An example of a correct set definition by regular expression

<iriset>
  <includehosts>example.org</includehosts>
  <includeregex>^https</includeregex>
</iriset> 

Whilst Example 2-10 'works', the potential dangers of using regular expressions mean that it is generally better to use component strings where possible. Example 2-10 is therefore better written as shown in Example 2-11 below.

Example 2-11: A re-write of Example 2-10 without using a regular expression

<iriset>
  <includehosts>example.org</includehosts>
  <includeschemes>https</includeschemes>
</iriset> 
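
The effect of anchoring in Examples 2-8 to 2-10 can be reproduced with a few lines of Python. This is only a sketch: Python's re module stands in for the XML Schema/XPath syntax, the two candidate IRIs are invented, and both are assumed to satisfy the includehosts constraint already.

```python
import re

# Hypothetical candidates, both on example.org.
candidates = [
    "https://example.org/",
    "http://example.org/https-migration.html",
]

def members(regex):
    """Return the candidates whose IRI the regular expression matches."""
    return [iri for iri in candidates if re.search(regex, iri)]

unanchored = members("https")     # Example 2-8: matches https anywhere
both_ends = members("^https$")    # Example 2-9: the whole IRI must be "https"
start_only = members("^https")    # Example 2-10: IRI must begin with https

print(unanchored)
print(both_ends)
print(start_only)
```

The unanchored expression admits both candidates, the doubly anchored one admits none, and only the expression anchored at the start selects the https resource alone.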

2.4 Grouping by IP Address

It is noteworthy that POWDER does not define any special procedures where the host component of an IRI is expressed as an IP address. These are treated as strings, not as a sequence of digits. If the intention is to define an IRI set that encompasses a particular group of resources however they are accessed, then it may be appropriate to include both the domain name and the associated IP address as two space-separated values in an includehosts element, for example. However, this assumes that there is a one-to-one relationship between the domain name and the IP address which, of course, is often not the case.

As noted in Section 1.2, POWDER defines sets of IRIs, not of the resources that they identify. IRI sets must therefore be defined with care. For operational reasons, a user agent MAY perform a DNS or reverse DNS lookup to match domain names and IP addresses but this is very much application-specific.

2.5 Enumerating Elements of an IRI Set: the includeresources and excluderesources Constraints

It is useful to be able to include or exclude IRIs from sets by simple listing. The includeresources and excluderesources constraints support this, both of which take white space separated lists of IRIs. To give a simple example, the set of all resources on example.org except its stylesheet and JavaScript library can be encoded as shown in Example 2-12 below.

Example 2-12: IRI Set definition using the excluderesources constraint

<iriset>
  <includehosts>example.org</includehosts>
  <excluderesources>http://www.example.org/stylesheet.css http://www.example.org/jslib.js</excluderesources>
</iriset>

The white space separated list of values is processed as set out in the Formal Semantics document [FORMAL] to create a pattern var that can be inserted into the simple template regular expression:

^var$

Thus Example 2-12 is transformed into the following POWDER-S.

Example 2-13: The IRI Set defined in Example 2-12 encoded in POWDER-S

<owl:Class rdf:nodeID="iriset_1">
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#matchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?(example\.org)(:([0-9]+))?\/</owl:hasValue>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://www.w3.org/2007/05/powder-s#notmatchesregex" />
          <owl:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema-datatypes#string">^(http\:\/\/www\.example\.org\/stylesheet\.css|http\:\/\/www\.example\.org\/jslib\.js)$</owl:hasValue>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>
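
The construction of the notmatchesregex value in Example 2-13 can be sketched as follows. The normative processing is the one set out in the Formal Semantics document [FORMAL]; the Python function below, whose name is hypothetical, simply assumes that every character other than an ASCII letter or digit is escaped with a backslash, which happens to reproduce the value shown above.

```python
import re

def resources_to_regex(value):
    """Turn a white space-separated list of IRIs into a ^(a|b)$ pattern."""
    escaped = [
        # Assumption: escape everything except ASCII letters and digits.
        re.sub(r"[^A-Za-z0-9]", lambda m: "\\" + m.group(0), iri)
        for iri in value.split()
    ]
    return "^(" + "|".join(escaped) + ")$"

pattern = resources_to_regex(
    "http://www.example.org/stylesheet.css http://www.example.org/jslib.js"
)
print(pattern)
```

Because the pattern is anchored at both ends, it matches only the listed IRIs themselves and not longer IRIs that merely begin with one of them.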

As emphasized throughout this document, each constraint and its value creates a set definition of its own and the full IRI set is the intersection of those sets. Thus an alternative way of looking at Example 2-12 is to say that a candidate IRI is a member of the IRI set IF it is on example.org AND does not have the IRI http://www.example.org/stylesheet.css AND does not have the IRI http://www.example.org/jslib.js.

2.6 Complex Sets: Negation, Conjunction and Disjunction

POWDER allows a DR to express any grouping of resources whatsoever, no matter how complex.

Atomic negation is achieved by complementing each IRI constraint that includes certain IRI components by one that excludes them, and vice versa; furthermore, all includeX and excludeX constraints are mutually exclusive. The analogous matchesregex and notmatchesregex properties are used in POWDER-S. Negation of complex constraints is not supported.

Conjunction of atomic propositions (both positive and negative) is inherent in the basic model - an IRI must match all the constraints if it is to be an element of the set. The GRDDL transform uses owl:intersectionOf to render in POWDER-S iriset elements with multiple constraints.

The disjunction of conjunctions of atomic propositions (both positive and negative) is also possible, as a DR may contain multiple iriset elements, and if any of them holds, then the DR holds. The GRDDL transform encodes multiple iriset elements as multiple clauses in POWDER-S.

It follows from the above that POWDER allows the expression of Disjunctive Normal Form propositions. Since arbitrarily complex propositions can be brought into DNF (DNF Theorem), it follows that POWDER allows the expression of any proposition.

Example 2-14 shows a Description Resource defining the set of IRIs on example.com with a path beginning with /foo and those on example.org where the path starts with /bar.

Example 2-14: A Description Resource with its scope defined by the union of two IRI sets [XML]

<?xml version="1.0"?>
<powder xmlns="http://www.w3.org/2007/05/powder#" 
        xmlns:ex="http://example.org/vocab#">

  <attribution>
    <issuedby src="http://authority.example.org/company.rdf#me" />
    <issued>2007-12-14T00:00:00</issued>
  </attribution>

  <dr>
    <iriset>
      <includehosts>example.com</includehosts>
      <includepathstartswith>/foo</includepathstartswith>
    </iriset>

    <iriset>
      <includehosts>example.org</includehosts>
      <includepathstartswith>/bar</includepathstartswith>
    </iriset>

    <descriptorset>
      <ex:color>red</ex:color>
      <ex:shape>square</ex:shape>
      <displaytext>Everything on example.com where the path starts with /foo
       and everything on example.org where the path starts with /bar is red and square</displaytext>
      <displayicon>http://example.org/icon.png</displayicon>
    </descriptorset>
  </dr>

</powder>
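
The scope of Example 2-14 can be read as a disjunction of conjunctions, which the following Python sketch evaluates directly. It is an illustration under stated assumptions: the constraint helpers on_host and path_starts_with are hypothetical simplifications of includehosts and includepathstartswith, and no canonicalization is performed.

```python
from urllib.parse import urlsplit

def on_host(host):
    """Constraint: the IRI's host is `host` or one of its sub-domains."""
    def test(iri):
        candidate = urlsplit(iri).hostname or ""
        return candidate == host or candidate.endswith("." + host)
    return test

def path_starts_with(prefix):
    """Constraint: the IRI's path begins with `prefix`."""
    return lambda iri: urlsplit(iri).path.startswith(prefix)

# One inner list per iriset element of Example 2-14.
IRISETS = [
    [on_host("example.com"), path_starts_with("/foo")],
    [on_host("example.org"), path_starts_with("/bar")],
]

def in_scope(iri, irisets=IRISETS):
    # Disjunction over irisets, conjunction over each iriset's constraints.
    return any(all(test(iri) for test in iriset) for iriset in irisets)

print(in_scope("http://example.com/foo/page.html"))
print(in_scope("http://example.com/bar/page.html"))
```

A candidate is in scope as soon as it satisfies every constraint of at least one iriset; satisfying the host of one iriset and the path of the other is not enough.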

3 Extension Mechanism

In this document we have specified various methods for defining sets of resource identifiers. The elements are clearly designed to be used with information resources available on the Web, identified by IRIs containing host names, directory paths, port numbers, and so on. The POWDER grouping vocabulary can be easily extended by new elements, defined via GRDDL transformation, which build upon the elements defined by POWDER. As examples, in Sections 3.1 and 3.2 we show how other methods of defining IRI sets that may suit particular situations can be transformed into POWDER-BASE.

Furthermore, there is no fundamental reason to constrain the domain of POWDER descriptions to HTTP IRIs, so there should not be unnecessary constraints on how the protocol works. In other words, the domain of grouping extensions does not need to be HTTP IRIs, but may be any kind of IRIs. As an example, in Section 3.3 we show such an extension for ISAN numbers.

It should be noted that the treatment of non-HTTP IRIs is one of the basic motivations behind the two-step GRDDL transform from POWDER to POWDER-BASE to POWDER-S, outlined in Section 1.4 and fully specified in the Formal Semantics document [FORMAL]. If POWDER were rendered into POWDER-S in a single direct transform, the only XML language from which to derive extensions would be POWDER, which would oblige POWDER extensions to include HTTP-specific IRI restrictions such as includehosts, even if they are meaningless for the domain of the extension.

In the intermediate POWDER-BASE language, on the other hand, all HTTP-specific elements have been rendered as regular expressions, using the includeregex and excluderegex IRI restrictions, as POWDER-BASE only requires that these two restrictions are supported. Developers of non-HTTP extensions and tools are advised to use POWDER-BASE to derive their extension from, instead of POWDER, as this relieves them of the obligation to also implement the HTTP-specific IRI restrictions in their tools.

XML elements suitable for defining sets of URIs or IRIs from schemes other than HTTP may be created and a GRDDL transform defined that renders such IRI sets in POWDER-BASE. This is a generic extension mechanism since a conformant POWDER Processor, as defined in the Description Resources document [DR], MUST be able to process POWDER-BASE. For clarity: POWDER-BASE is not a separate encoding of POWDER (it is all done in the wdr namespace), merely a restricted form of POWDER that just has the two possible child elements of iriset.

Developers of POWDER tools MAY directly implement extensions they know about, and MAY include support for transformation technologies such as XSLT so that unknown extensions can be processed.

3.1 Extension Example: Custom IRI Patterns

As an example of a service-specific extension, consider a service which uses Unix shell wildcards instead of regular expressions, so that www.example.org/* means "all the resources on www.example.org fetched using HTTP." Such a system is easily used within an IRI set, only requiring the definition of a near copy of the POWDER schema [WDR] with a single IRI constraint shell:includepattern as a child element of its IRI set element (good practice when defining shell:includepattern would be to also define shell:excludepattern).

A publisher of a document using shell:includepattern SHOULD define a GRDDL transform that will generate a POWDER-BASE document as shown in the example below.

Example 3-1 An IRI set definition using a custom IRI pattern and the corresponding POWDER-BASE definition.

Custom IRI pattern:

<shell:iriset>
  <shell:includepattern>www.example.org/*</shell:includepattern>
</shell:iriset>

POWDER-BASE:

<iriset>
  <includeregex>http\:\/\/www\.example\.org\/.*</includeregex>
</iriset>

Note that the custom IRI pattern SHOULD NOT be used in a document with its root element in the POWDER namespace since the only valid child elements of the iriset element within a POWDER document are those defined in this document.
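
The mapping that such a transform has to perform can be sketched in a few lines. A real GRDDL transform would typically be written in a transformation language such as XSLT; the Python below only illustrates the string rewriting involved. The function name is hypothetical, and the handling of ? as "any single character" is an assumption borrowed from common shell conventions rather than from this document.

```python
def shell_pattern_to_regex(pattern):
    """Rewrite a shell-style pattern as an includeregex value for HTTP IRIs."""
    pieces = []
    for ch in "http://" + pattern:       # the service implies the HTTP scheme
        if ch == "*":
            pieces.append(".*")          # any run of characters
        elif ch == "?":
            pieces.append(".")           # assumption: any single character
        elif ch.isalnum() or ord(ch) > 127:
            pieces.append(ch)            # letters and digits pass through
        else:
            pieces.append("\\" + ch)     # escape remaining ASCII punctuation
    return "".join(pieces)

regex = shell_pattern_to_regex("www.example.org/*")
print(regex)
```

For the pattern in Example 3-1 this yields the same regular expression as the POWDER-BASE rendering shown there.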

3.2 Extension Example: Custom Site Structure

Many content providers serve dynamic content stored in a database, so that IRIs express queries to that database. This kind of IRI will have a certain structure but this is typically neither obvious nor easily human-interpreted.

As an example, consider sport.example.com, a sports news site, where IRIs look like the one shown in Example 3-2. The adopted scheme is systematic so that sport=2&countryID=16 provides a front page with news about Greek basketball and links to various Greek basketball leagues, sport=3&countryID=16 a front page about Greek volleyball, etc.

Example 3-2 Sample IRI from site serving dynamic content. sport=1 stands for football and countryID=16 stands for Greece.

http://sport.example.com/matches.asp?sport=1&countryID=16&champID=2

A POWDER document providing metadata about this Web site would have to use regular expression matching with explicit reference to the numerical values in the country and sport fields of the query. This process is error-prone, and requires extensive changes if the underlying database schema is modified or extended.

As an alternative, the site developer may provide a POWDER-like scheme that abstracts away from the specific database fields to allow reference to sports and countries, as shown in Example 3-3. Description Resource authors can then use the properties in this extension to generate POWDER-BASE documents that are valid even if the site schema is modified, as long as the site developer updates the relevant transformations.

Example 3-3 An IRI set definition using site-specific extensions and the equivalent definition using standard POWDER-BASE vocabulary.

Custom IRI constraint:

<sport:iriset>
  <wdr:includehosts>sport.example.com</wdr:includehosts>
  <sport:countries>Greece</sport:countries>
  <sport:sports>Football Basketball</sport:sports>
</sport:iriset>

Corresponding POWDER-BASE IRI set:

<iriset>
  <includeregex>(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?(sport\.example\.com)(:([0-9]+))?\/</includeregex>
  <includeregex>countryID=16</includeregex>
  <includeregex>sport=(1|2)</includeregex>
</iriset>
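
The abstraction offered by such an extension amounts to a lookup from names to database identifiers. The Python sketch below shows one way the site developer's transform might derive regular expressions from the names used in the custom constraint. The lookup tables contain only the identifiers mentioned in the text, the function name is hypothetical, and the [?&] and (&|$) delimiters are an added assumption intended to stop sport=1 from also matching sport=12.

```python
import re

# Hypothetical lookup tables; only the IDs mentioned in the text are listed.
SPORT_IDS = {"Football": 1, "Basketball": 2, "Volleyball": 3}
COUNTRY_IDS = {"Greece": 16}

def query_regex(field, names, ids):
    """Build a regex matching `field` set to the ID of any listed name."""
    alternatives = "|".join(str(ids[name]) for name in names.split())
    return "[?&]" + field + "=(" + alternatives + ")(&|$)"

sports = query_regex("sport", "Football Basketball", SPORT_IDS)
countries = query_regex("countryID", "Greece", COUNTRY_IDS)

iri = "http://sport.example.com/matches.asp?sport=1&countryID=16&champID=2"
print(bool(re.search(sports, iri)) and bool(re.search(countries, iri)))
```

If the site schema changes, only the lookup tables need updating. Note that when such an expression is serialized in XML, the & character has to be written as &amp;.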

3.3 Extension Example: ISAN

The International Standard Audiovisual Number [ISAN1] is a globally-unique, centrally managed and permanent numbering system for the identification of audiovisual works and versions. Following ISO 15706 [ISAN3], [ISAN3-2], ISAN numbers are written as 24 hexadecimal digits in the following format [ISAN2]:

     -----root----- episode    -version-
ISAN 1881-66C7-3420 - 0000 -7- 9F3A-0245 -U

The root segment of an ISAN number is assigned to a core work. When the core work is a serial, episodes are identified with a non null episode segment. Versions are assigned in the version segment and refer to changes in the audiovisual content, being a different language or soundtrack, subtitles, editions, promotional trailers, and so on.

Since ISAN numbers are URNs [URN], and hence IRIs of the urn: scheme [URIS], a vocabulary can readily be defined to allow IRI Sets to be defined based on ISAN numbers. The terms might be along the lines of:

includeRoots: a white space separated list of strings of hexadecimal digits and hyphens that would be matched against the first three blocks in the ISAN number.

includeEpisodes: a white space separated list of strings of hexadecimal digits and hyphens that would be matched against the 4th block of 4 digits in the ISAN number.

includeVersions: a white space separated list of strings of hexadecimal digits and hyphens that would be matched against the 5th and 6th blocks of 4 digits in the ISAN number.

The set of all audio visual resources that relate to two particular works might then be defined as shown in Example 3-4.

Example 3-4: An IRI set definition using an ISAN number pattern and the corresponding definition using standard POWDER vocabulary

Custom ISAN pattern:

<ex_isan:iriset>
  <ex_isan:includeRoots>1881-66C7-3420 1881-66C7-3421</ex_isan:includeRoots>
</ex_isan:iriset>

Corresponding POWDER-BASE IRI Set:

<iriset>
 <includeregex>^urn:isan:(1881-66C7-3420|1881-66C7-3421)</includeregex>
</iriset>
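
A transform for the hypothetical includeRoots term might build its regular expression along the following lines. This Python sketch is illustrative only: the function name and the candidate URNs are assumptions, and the alternatives are kept inside a single group so that the ^urn:isan: prefix applies to every root.

```python
import re

def isan_roots_to_regex(value):
    """Build a regex matching ISAN URNs whose root is one of the listed values."""
    roots = [re.escape(root) for root in value.split()]
    return "^urn:isan:(" + "|".join(roots) + ")"

regex = isan_roots_to_regex("1881-66C7-3420 1881-66C7-3421")

# Hypothetical candidate IRIs.
listed = "urn:isan:1881-66C7-3421-0000-7-9F3A-0245-U"
other_root = "urn:isan:1881-66C7-9999-0000-7-9F3A-0245-U"
other_namespace = "urn:example:1881-66C7-3421"

print(bool(re.search(regex, listed)))
print(bool(re.search(regex, other_root)), bool(re.search(regex, other_namespace)))
```

Only the first candidate is matched; the other two fail on the root and on the URN namespace respectively.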

4 Conformance Criteria

An IRI set definition is a Conformant IRI set definition if it adheres to the specification described in this document.

More precisely:

5 References

5.1 Normative References

[GRDDL]
Gleaning Resource Descriptions from Dialects of Languages (GRDDL), D. Connolly. W3C Recommendation, 11 September 2007. This document is at http://www.w3.org/TR/grddl/
[HTTPCODE]
Part of Hypertext Transfer Protocol – HTTP/1.1, RFC 2616 Fielding, et al. This document is http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html .
[HTTPRDF]
HTTP Vocabulary in RDF J Koch, C Velasco, S Abou-Zahra. This document is at http://www.w3.org/TR/HTTP-in-RDF/
[IRIS]
RFC 3987 — Internationalized Resource Identifiers (IRIs), M. Dürst and M. Suignard, IETF, January 2005. This document is at http://www.ietf.org/rfc/rfc3987.txt
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. IETF, March 1997. This document is at http://www.ietf.org/rfc/rfc2119.
[RFC3490]
RFC 3490 — Internationalizing Domain Names in Applications (IDNA) P. Faltstrom, P. Hoffman, A. Costello. This document is at http://www.ietf.org/rfc/rfc3490.txt
[URIS]
RFC 3986 — Uniform Resource Identifiers (URI): Generic Syntax, T. Berners-Lee, R. Fielding and L. Masinter, IETF, January 2005. This document is http://tools.ietf.org/html/rfc3986.
[URN]
Official IANA Registry of URN Namespaces. This document is http://www.iana.org/assignments/urn-namespaces.
[Unicode]
The Unicode Consortium. The Unicode Standard, Version 4. ISBN 0-321-18578-1, as updated from time to time by the publication of new versions. The latest version of Unicode and additional information on versions of the standard and of the Unicode Character Database is available at http://www.unicode.org/unicode/standard/versions/.
[XQXP]
XQuery 1.0 and XPath 2.0 Functions and Operators, A. Malhotra, J. Melton, N. Walsh. W3C Recommendation, 23 January 2007. This document is at http://www.w3.org/TR/xpath-functions/

5.2 Sources

[CHARMOD-NORM]
Character Model for the World Wide Web 1.0: Fundamentals, F. Yergeau, M. J. Dürst, A. Phillips, M. Wolf, T. Texin. W3C Working Draft, 27 October 2005. This document is at http://www.w3.org/TR/charmod-norm/
[DR]
Protocol for Web Description Resources (POWDER): Description Resources, P Archer, K. Smith, A Perego. W3C Working Draft, 15 August 2008. This document is at http://www.w3.org/TR/powder-dr/
[FORMAL]
Protocol for Web Description Resources (POWDER): Formal Semantics, S. Konstantopoulos, P. Archer. W3C Working Draft, 15 August 2008. This document is http://www.w3.org/TR/2008/WD-powder-formal-20080815/
[ISAN1]
International Standard Audiovisual Number
[ISAN2]
ISAN FAQs: What is the ISAN? This document is at http://www.isan.org/portal/page?_pageid=166,41960&_dad=portal&_schema=PORTAL.
[ISAN3]
ISO 15706:2002, Information and Documentation – International Standard Audiovisual Number (ISAN).
[ISAN3-2]
ISO 15706-2:2007, Information and Documentation – International Standard Audiovisual Number (ISAN) – Part 2: Version identifier.
[PRIMER]
Protocol for Web Description Resources (POWDER): Primer, K. Scheppe, D. Pentecost. W3C Working Draft, 15 August 2008. This document is at http://www.w3.org/TR/powder-primer/
[Rabin]
URI Pattern Matching for Groups of Resources, J. Rabin, Draft 0.1, 17 June 2006. This document is at http://www.w3.org/2005/Incubator/wcl/matching.html
[RFC3492]
Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA) A. Costello. This document is at http://www.ietf.org/rfc/rfc3492.txt
[TESTS]
Protocol for Web Description Resources (POWDER): Test Suite, A. Kukurikos. W3C Working Draft, 15 August 2008. This document is http://www.w3.org/TR/powder-test/
[URISpace]
URISpace 1.0, M. Nottingham, W3C Note, 15 February 2001. This document is http://www.w3.org/TR/urispace
[USECASES]
POWDER: Use Cases and Requirements, P. Archer. W3C Working Group Note, 31 October 2007. This document is at http://www.w3.org/TR/powder-use-cases/
[WCL-XG]
W3C Content Label Incubator Group February 2006 - February 2007
[WDR]
Protocol for Web Description Resources (POWDER): Web Description Resources XML Schema (WDR), K. Smith, A. Perego. This document is at http://www.w3.org/2007/05/powder
[WDRS]
Protocol for Web Description Resources (POWDER): POWDER-S Vocabulary (WDRS), A. Perego, P. Archer, S. Konstantopoulos. This document is at http://www.w3.org/2007/05/powder-s
[XFORMS]
XForms 1.0 (Third Edition), J.M. Boyer. W3C Recommendation, 29 October 2007. The relevant section of this document is at http://www.w3.org/TR/xforms/#serialize-urlencode
[WAF]
Access Control for Cross-site Requests A. van Kesteren. W3C Working Draft 14 February 2008. This document is at http://www.w3.org/TR/2008/WD-access-control-20080214/#access.

6 Acknowledgments

The editors duly acknowledge the earlier work in this area carried out by Jo Rabin. Jeremy Carroll and David Booth developed the operational and formal semantics model which was further developed by Stasinos Konstantopoulos. The editors gratefully acknowledge the further contributions made by Régis Flad of ISANIA and members of the POWDER Working Group.

7 Change Log

Changes since First Public Working Draft

  1. Updated introduction to refer to vocabulary and XML data types documents. Corrected erroneous use of 'QNames'.
  2. Small addition to the introduction to Grouping by address paragraph.
  3. Update status section
  4. Renumbering of sections previous 2.2 - 2.5
  5. Insertion of Grouping using Wildcards following discussion with Web Application Formats Working Group
  6. Resolution of open question on choice of Regular Expression syntax. Now use XML Schema REs as modified by XPath/XQuery for consistency with other W3C work - the syntax more than meets POWDER's requirements. Data type to be defined in POWDER's own XML Schema
  7. Added hyperlinks to the first mention of each Class and property, pointing to its entry in the vocabulary document
  8. Removed includeUserInfo and includeFragments properties since these are not strictly part of HTTP, the former can cause security issues, especially when written as username:password, and grouping by fragments is very vague since there is no sure way to define the end of a fragment.
  9. Section 3 completely rewritten. Feature at Risk marker removed.

Changes since 31 October 2007 draft

  1. Status section updated to reflect substantial change since previous version
  2. Intro extended to include mention of primer and test suites, plus added namespace table etc.
  3. Section 1.2 amended and sections 1.3 and 1.4 added to explain XML to RDF/OWL model via GRDDL, with Semantic Extension defined in section 1.4
  4. Resource Set changed to IRI set, and all mention of URI changed to IRI throughout.
  5. Regular Expressions in examples 2-3 and 2-4 corrected
  6. The section on grouping resources by the properties of those resources has been removed completely - we now only support grouping by IRI constraint
  7. As noted in the text above, the section on conjunction and disjunction needs to be rewritten to work in the POWDER/POWDER-S model. The section on logical inconsistency has been removed for now too.
  8. The extension mechanism section has been re-written
  9. Several sections have been renumbered.
  10. Acknowledgements section extended to cite Jeremy Carroll, David Booth and Stasinos Konstantopoulos

Changes since 24 March 2008 draft

  1. The document has been updated throughout to reflect the introduction of the two-stage GRDDL transform from POWDER to POWDER-BASE to POWDER-S.
  2. Ports are now handled as strings (not as numbers) so that port ranges are no longer supported by POWDER
  3. includeexactqueries and excludeexactqueries deleted
  4. Canonicalization and data encoding sections updated following comments from Thomas Roeseller, Eric Prud'hommeaux. Text now advocates a 'Best Effort' approach tending to false negatives, rather than attempting to define a comprehensive approach to IRI canonicalization.
  5. Grouping by IP address and CIDR block - this has been dropped, largely to maintain the simplicity of POWDER-BASE and POWDER-S. Deriving a regular expression or some other processing rules from a CIDR block is very cumbersome. Section replaced with short text explaining that IP addresses will be treated as strings.
  6. Section on redirection deleted. DR doc will mention issue of redirection in the context of the POWDER Processor. E-mail exchange with Alan Ruttenberg and others provided important insight
  7. Section 3 (extension mechanism) tidied up.
  8. Slight change to the regular expression given in section 2.1 (the \/\/ moved to before the first ?)

Changes since 30 June 2008 draft

  1. Updated status section
  2. ID of 'candidateResource' added to first mention of this term.
  3. Words 'of resources' removed from introduction of OWL Class in POWDER-S taking the place of an IRI set
  4. Wording added to make it explicit that the processing works only for valid IRIs following comments made on the member-only list
  5. Section 1.2 amended to ensure that any user info or fragment in a candidate resource's IRI is not lost during processing. Regular expressions in Table 4 and quoted elsewhere in the document updated accordingly. This follows an e-mail exchange concerning a semantic application.
  6. Line added to Section 2.1.3.3 to advise IRI set authors to specifically exclude any IRIs that they know may lead to false positives.
  7. Cardinality of in/excludequerycontains brought into line with other IRI constraints so they can only occur 0 or 1 time.

Changes since 15 August 2008 draft

  1. Added a “Conformance Criteria” section, before the references, referring to the conformance criteria defined in the Description Resources document. The “References”, “Acknowledgements”, and “Change Log” sections, as well as the related subsections, have been re-numbered accordingly.
  2. Revised Section 2.1.3 (“IRI/URI Canonicalization”) based on feedback from the Internationalization Core WG.
  3. Revised Section 3 to emphasize that it is a GRDDL transform that creates POWDER-S from POWDER and that XSLT is one option for doing this [1, 2, 3].
  4. Revised Section 3.3 (“Extension Example: ISAN”) based on feedback from ISAN.
  5. Corrected minor typos and fixed formatting (of text, tables, examples) throughout the whole document.
  6. Added internal links for all the references to sections, tables, and examples occurring in the running text.
  7. Changes made throughout to reflect introduction of wdrs:notmatchesregex [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].

Changes since 14 November 2008 draft

  1. Slight correction to the IRI syntax diagram in Section 2.1 following comment by Tony Hammond.
  2. Minor correction to one of the template regular expressions used for processing in/excludeiripattern (following testing).
  3. Reference to the work of the Web Application Formats WG related to in/excludeiripattern updated following exchange with Anne van Kesteren.

Appendix A: Summary of POWDER Elements

| Element Name | Content | Attributes | Cardinality | Introduced |
|---|---|---|---|---|
| iriset | Any of includeschemes, excludeschemes, includehosts, excludehosts, includeexactpaths, excludeexactpaths, includepathcontains, excludepathcontains, includepathstartswith, excludepathstartswith, includepathendswith, excludepathendswith, includeports, excludeports | | At least 1 must be a child element of a dr | Section 1.3 |
| includeschemes, excludeschemes, includehosts, excludehosts, includeexactpaths, excludeexactpaths | Token list | | 0 or 1 | Section 2.1 |
| includepathcontains, excludepathcontains | Token list | | any number | Section 2.1 |
| includepathstartswith, excludepathstartswith, includepathendswith, excludepathendswith, includeports, excludeports | Token list | | 0 or 1 | Section 2.1 |
| includequerycontains, excludequerycontains | Single value | delimiter (any single character). Default value is: & (“ampersand”) | any number | Section 2.1.2 |
| includeiripattern, excludeiripattern | Single value | | 0 or 1 | Section 2.2 |
| includeregex, excluderegex | Single value | | 0 or 1 | Section 2.3 |
| includeresources, excluderesources | Token list | | 0 or 1 | Section 2.5 |
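
The cardinalities summarized in this appendix lend themselves to a simple mechanical check. The following Python sketch is illustrative only and is not part of POWDER: the function names check_iriset and local_name, the MAX_OCCURS mapping, and the sample iriset are assumptions made for this example. The cardinality values are transcribed from the table in this appendix and should be adjusted if the normative sections differ. The sample omits the POWDER namespace for brevity, and child elements not listed in the table (for instance, extension elements) are ignored rather than reported.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Cardinalities transcribed from the Appendix A table (illustrative, not normative).
ZERO_OR_ONE = [
    "includeschemes", "excludeschemes", "includehosts", "excludehosts",
    "includeexactpaths", "excludeexactpaths",
    "includepathstartswith", "excludepathstartswith",
    "includepathendswith", "excludepathendswith",
    "includeports", "excludeports",
    "includeiripattern", "excludeiripattern",
    "includeregex", "excluderegex",
    "includeresources", "excluderesources",
]
ANY_NUMBER = [
    "includepathcontains", "excludepathcontains",
    "includequerycontains", "excludequerycontains",
]

# Maximum occurrences per element name; None means "any number".
MAX_OCCURS = {name: 1 for name in ZERO_OR_ONE}
MAX_OCCURS.update({name: None for name in ANY_NUMBER})


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def check_iriset(iriset):
    """Return a list of cardinality problems found in one iriset element."""
    counts = Counter(local_name(child.tag) for child in iriset)
    problems = []
    for name, seen in counts.items():
        limit = MAX_OCCURS.get(name)  # unknown names yield None and are skipped
        if limit is not None and seen > limit:
            problems.append(f"{name}: occurs {seen} times, at most {limit} allowed")
    return problems


# Hypothetical iriset: includeschemes is repeated, which the table does not allow.
SAMPLE = """<iriset>
  <includehosts>example.com example.org</includehosts>
  <includepathcontains>foo</includepathcontains>
  <includepathcontains>bar</includepathcontains>
  <includeschemes>http</includeschemes>
  <includeschemes>https</includeschemes>
</iriset>"""

if __name__ == "__main__":
    for problem in check_iriset(ET.fromstring(SAMPLE)):
        print(problem)
```

In the sample, the repeated includepathcontains passes because the table lists its cardinality as "any number", while the second includeschemes is reported. A check of this kind covers cardinality only; it says nothing about whether the element content is a valid token list or single value.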