URISpace provides a flexible framework for describing how metadata is to be assigned to resources, based on URI characteristics.
This document identifies problems in the application of metadata across groups of resources, and proposes one method of addressing them. It is intended to bring such problems to the attention of the community, and to foster discussion on the appropriate approach to them.
This document is a proposal only; it may contain significant errors or omissions. These issues may be addressed by future work, depending on interest by other parties.
Please send comments to the author, and the RDF Interest mailing list.
As the Web becomes more complex, there is increasing need to manage metadata about resources. The Resource Description Framework [RDF] provides a foundation for expression and processing of such metadata.
One of the primary tasks of such a framework is the nomination of resources to which metadata properties will be assigned. In RDF, the subject can be directly named by URI (or in a Bag of URIs), inferred by an identifier or context in the container document, or nominated by use of a URI prefix.
Prefix mechanisms ("aboutEachPrefix" in RDF) allow assignment of metadata to all resources that share a prefix. This accommodates situations where it is necessary or desirable to assign metadata to a number of resources from a central location. For example, in applications where the metadata must be known before the resource is fetched (such as P3P [P3P]), metadata can be placed at a central location, in a document that describes how it should be assigned based on the URI prefix.
This type of solution is a convention in some formats that predate RDF, such as the robots.txt mechanism [ROBOTS] and some Web server configuration files. Unfortunately, the aboutEachPrefix mechanism shares many of their limitations, as well as introducing some of its own:
Rough granularity - Prefix mechanisms can only discriminate between resources based on a single, inflexible criterion. As a result, they demand that resources be arranged in a manner which allows effective assignment of metadata. This causes problems when technical constraints or pre-existing structure in the resource namespace prevent such an arrangement, or when more than one kind of prefix-assigned metadata contests the namespace.
Difficulty of extension - By their nature, prefix mechanisms cannot be extended to consider new criteria in the selection of resources, without changing their semantics.
Ambiguous relationships - Every prefix-based resource selection mechanism must define how overlap in declared prefixes is handled. For example, if metadata were assigned to the prefix '/foo' and then the more specific '/foo/bar/', a mechanism may assign only that which is most specific, assign first the more general and then the more specific, or assign them in the order in which they are declared.
Assignment to multiple prefixes - aboutEachPrefix's value is a simple string, which is used as the prefix. As a result, separate elements containing aboutEachPrefix must be used for multiple prefixes, even if they share metadata.
URISpace is designed to be a standard, clear and extendable mechanism for assigning RDF metadata to resources based on the namespace that URIs describe, as well as optional external criteria.
Besides those mentioned, possible applications include an alternate form [WREC-KP] of proxy auto-configuration [PROXYCONF] that uses XML instead of a scripting language; a standard Web server configuration file format; configuration files for HTTP surrogates and content delivery networks; assignment of RDF metadata; and any other circumstance where there is a need to declare an arbitrary collection of resources. See the end of this document for examples.
Major design goals for URISpace are to:
Issues explicitly out-of-scope in this document are:
URISpace provides a framework that an application can use to assign arbitrary metadata to an entity based on its URI namespace, optionally using additional external selection criteria. This is done by building a tree of XML [XML] elements, called selectors, to describe the URI namespace and contain the metadata.
This document describes the elements that allow selection into the namespace, demonstrates how the tree should be structured and used, illustrates how metadata might be represented, and defines ways to extend and optimize functionality. It is intended as a framework for other applications to adapt as their needs require.
A tree is typically rooted by the urispace element, in the URISpace XML namespace [NAMESPACES]. For example:
<?xml version="1.0"?> <urispace xmlns='http://www.w3.org/2000/urispace'> ... </urispace>
The urispace element may contain any number of selector elements, and metadata to be assigned based on the structure of those selectors. Selector elements are URI-related (such as scheme, path, query, etc.), but non-URI and externally defined selectors can be used as well.
Depending on the nature of an application, the tree may be implicitly rooted in another, application-specific element.
The tree is structured so that each selector element defines a context for assignment of metadata; together these selectors restrict the resources to which contained metadata will be assigned. For example, an element pair such as
<query match="foo"> ... </query>
restricts the context to those URIs that have a query argument "foo". If further selectors are contained within, they only match when both the parent element (in this case, the above query element) and the child element match. For example,
<query match="foo"> ... <fragment match="bar"> ... </fragment> </query>
allows selection at two granularities: those URIs that contain a query argument "foo", and those URIs that contain both a query argument "foo" and a URI fragment "bar".
To determine what metadata is applied to a particular URI, the tree is traversed, starting at the root, assigning metadata as selector elements are matched.
In general, metadata assigned from the most specific selector takes precedence. That is, a metadata property in a matching child context always overrides that assigned in the parent context. So, in the example above, if the same property type were set in the query and fragment contexts, that which was set in the fragment context would be assigned after the values set in the query context, possibly overwriting them (depending on the nature of the metadata).
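This ensures that the metadata most specific to a resource overrides more general metadata. In particular, it should be noted that the relative order of metadata elements and selector elements within a context is not significant; metadata declared directly in a context is always applied before that of its child selectors.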
For example,
<foo>1</foo> <path match="images"> <foo>2</foo> </path>
is equivalent to
<path match="images"> <foo>2</foo> </path> <foo>1</foo>
In both cases, the top-level metadata ("foo" set to "1") will be applied first, and then the metadata in the path selector will be applied, if it matches.
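The traversal described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the `Selector` class and `assign` function are hypothetical names, matching is reduced to an arbitrary predicate on the URI, only the default 'replace' behaviour is modelled, and wildcard precedence between siblings is ignored.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Selector:
    """One context in the tree: a match test, its own metadata, and child selectors."""
    matches: Callable[[str], bool]
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List["Selector"] = field(default_factory=list)


def assign(node: Selector, uri: str, result: Dict[str, str]) -> Dict[str, str]:
    """Apply a context's own metadata, then each matching child in document order."""
    result.update(node.metadata)  # default 'replace': later values overwrite earlier ones
    for child in node.children:
        if child.matches(uri):
            assign(child, uri, result)  # a child is applied as a whole before the next sibling
    return result


# The root context sets foo=1; the 'images' path context overrides it with foo=2.
# (Hypothetical simplification: the path selector is modelled as a prefix test.)
root = Selector(
    matches=lambda uri: True,
    metadata={"foo": "1"},
    children=[
        Selector(matches=lambda uri: uri.startswith("/images/"), metadata={"foo": "2"}),
    ],
)

print(assign(root, "/images/blank.gif", {}))
print(assign(root, "/index.html", {}))
```

Because a context's own metadata is applied before its children are visited, the position of the metadata element relative to its sibling selectors does not affect the outcome.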
There may be situations where two matching selectors affect the same metadata in a single context. Given the example
<path match="images"> <query match="foo"> ... </query> <fragment match="bar"> ... </fragment> </path>
it is possible that a URI (such as "/images/blank.gif?foo#bar") will match both selectors in the 'images' context. This means that it is necessary to define sibling selector precedence.
If more than one sibling selector matches, precedence will be determined by ordering. That is, the first matching child selector will be applied, then the next (possibly overriding or interacting with metadata set in the first), and so forth.
If two sibling selectors of a particular type match because one (or both) specify multiple values to match, both will be applied, in order. However, if more than one sibling selector of a particular type matches because one (or more) contains a wildcard, only the most specific match will be applied, regardless of multiple values. Each wildcard mechanism should define a method of determining the most specific match; generally, the match in which the wildcard consumes the fewest characters takes precedence.
For example,
<scheme match="http"> <md:test>1</md:test> <host match="www.foo.com"> <md:test>2</md:test> </host> <host match="*.foo.com"> <md:test>3</md:test> </host> </scheme>
Here, the URI 'http://www.foo.com/' will have md:test set to 2, because the most specific match is 'www.foo.com'. 'http://example.foo.com/' will have md:test set to 3, as '*.foo.com' is the most specific match. 'http://foo.com/' will have md:test set to 1, because no specific host selector matches.
Complex sibling contexts are applied as a whole. That is, each sibling will be applied in its entirety (including any matching child selectors), and then subsequent siblings will be applied in turn, in the same fashion. For example,
<path match="a"> ... <fragment match="b"> ... </fragment> </path> <fragment match="b"> ... </fragment>
will (if all selectors match) have first the path metadata applied, then the first fragment's metadata, and finally the second fragment's.
If it is necessary to specify metadata in the case where no other instance of a particular selector matches, this can be achieved with the 'nomatch' attribute on the selector. For example,
<path match="foo"> ... </path> <path match="bar"> ... </path> <path nomatch="any"> ... </path>
allows specification of metadata in three cases: if the path segment is 'foo', if it is 'bar', and in any other case, when neither of these matches.
The value of the 'nomatch' attribute determines whether the selector element must be present; if it is 'any', any selector, or none at all, will match. If it is 'some', there must be something available for the selector to match.
URIs are normalized before being processed by URISpace. That is,
The URISpace tree is described by URI selector elements. Each element limits the context of contained elements to those resources matching its value, which is described in an XML attribute. Multiple values can be specified by separating them with whitespace. For example:
<fragment match="foo bar"> ... </fragment>
This element would match any URI having a fragment of "foo" or "bar", such as
http://www.example.com/image.gif#foo
http://www.example.com/index.html#bar
Multiple values can also be specified by using a reference to a RDF Alternates container as the selector value. In such cases, the reference can be discerned by the presence of unescaped, reserved characters; for instance:
<fragment match="#foolist">
Optionally, selectors may specify wildcard mechanisms to allow selection based on partial matches.
The 'scheme' element allows selection of absolute URIs based on the URI scheme. The 'match' attribute lists the scheme(s) to be matched. For example:
<scheme match="http https"> ... </scheme>
would match URIs beginning with 'http://' or 'https://'. The value is case-insensitive.
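A minimal sketch of this test, assuming the whitespace-separated value list described earlier; `scheme_matches` is a hypothetical helper name, not part of URISpace.

```python
from urllib.parse import urlsplit


def scheme_matches(match_attr: str, uri: str) -> bool:
    """True if the URI's scheme is one of the whitespace-separated values, ignoring case."""
    wanted = {value.lower() for value in match_attr.split()}
    return urlsplit(uri).scheme.lower() in wanted


print(scheme_matches("http https", "HTTPS://www.example.com/"))
print(scheme_matches("http https", "ftp://ftp.example.com/"))
```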
To accommodate the different kinds of naming authorities defined in [RFC2396], three related elements are defined. Generally, either an Authority Identifier or a host is used, optionally with additional Userinfo.
Authority identifiers are specified with the 'authority' element. The 'match' attribute defines the naming authority (or authorities) to match. For example,
<authority match="foo"> ... </authority>
Otherwise, the 'host' element is used, with the 'match' attribute used to specify a full or partial hostname.
<host match="www.example.com"> ... </host>
Partial hostnames can be denoted by use of a wildcard. The '?' wildcard matches zero or more hostname-legal characters, but not the period (i.e., it will match exactly one segment of the hostname). The '*' wildcard will match one or more contiguous period-delimited segments.
A hostname may contain at most one wildcard, and it must be the first character (i.e., it must represent the first segment(s)).
For example,
?.example.com
matches
www.example.com
foo.example.com
but not
example.com
www.foo.example.com
while
*.example.com
will match
foo.example.com
www.foo.example.com
but still not
example.com
When determining precedence, the longest specified host that matches is considered most specific. If a '?' wildcard value and a '*' wildcard value of the same length both match, the '?' value is considered more specific.
For example, if '*.foo.com' and '*.com' match, the first will take precedence. Similarly, if '*.foo.com' and '?.foo.com' both match, the latter will be considered more specific.
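The host wildcard and precedence rules above might be implemented along the following lines. This is a sketch, not a normative algorithm: the function names are hypothetical, hostnames are compared case-insensitively as an assumption, ports are ignored, and ranking an exact hostname above a '?' pattern of the same length is an assumption that the text does not state.

```python
from typing import Iterable, Optional, Tuple


def host_matches(pattern: str, host: str) -> bool:
    """Match a hostname against a pattern whose first character may be '?' or '*'."""
    pattern, host = pattern.lower(), host.lower()
    if pattern[:1] not in ("?", "*"):
        return pattern == host
    suffix = pattern[1:]  # e.g. ".example.com"
    if not host.endswith(suffix):
        return False
    prefix = host[: len(host) - len(suffix)]
    if not prefix:
        return False  # the wildcard must stand for at least one segment
    if pattern[0] == "?":
        return "." not in prefix  # exactly one segment
    return all(prefix.split("."))  # '*': one or more non-empty segments


def specificity(pattern: str) -> Tuple[int, int]:
    """Longer patterns are more specific; at equal length '?' outranks '*'."""
    rank = {"*": 0, "?": 1}.get(pattern[:1], 2)
    return (len(pattern), rank)


def most_specific(patterns: Iterable[str], host: str) -> Optional[str]:
    """Return the most specific matching pattern, or None if nothing matches."""
    matching = [p for p in patterns if host_matches(p, host)]
    return max(matching, key=specificity, default=None)


print(most_specific(["*.foo.com", "*.com"], "www.foo.com"))
print(most_specific(["*.foo.com", "?.foo.com"], "www.foo.com"))
```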
Optionally, a port can be specified by appending it, after a colon, to the hostname. If no port is specified, the default port for the scheme in use will be matched; for example, if the scheme is 'http', port 80 is implied. Wildcards do not match into the port portion of the value.
The port must be a single integer, and may not contain wildcards. For example:
<host match="highport.example.com:8000"> ... </host>
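One way to separate the optional port from a host value is sketched below. The helper name and the `DEFAULT_PORTS` table are illustrative assumptions; only the 'http' default of 80 comes from the text above, and IPv6 literals are not handled.

```python
from typing import Optional, Tuple

# Assumed defaults; a real application would list the schemes it supports.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_host_port(value: str, scheme: str) -> Tuple[str, Optional[int]]:
    """Split 'host[:port]', falling back to the scheme's default port."""
    host, sep, port = value.partition(":")
    if sep:
        return host, int(port)  # raises ValueError if the port is not an integer
    return host, DEFAULT_PORTS.get(scheme)


print(split_host_port("highport.example.com:8000", "http"))
print(split_host_port("www.example.com", "http"))
```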
Whether an authority identifier or host is specified, userinfo can further refine the match, using a separate 'user' element. The 'match' attribute communicates the user name to match.
<user match="bob mary"> ... </user>
The 'path' element allows specification of a path segment to match, as set by the 'match' attribute. Note that it does not indicate a complete path, but represents exactly one segment. This allows a tree of path elements to mimic the hierarchical structure that URIs often convey, using a special relationship between the path element and its parent path elements. For example,
<path match="foo"> ... </path>
selects resources based on a path segment of 'foo' in the current context. If that context were the root of the path space, it would match URIs such as
/foo/
A path element contained by another path element further constrains matching, in the context of the container. For instance,
<path match="foo"> ... <path match="bar"> ... </path> </path>
allows selection of resources in two contexts, "/foo/" and "/foo/bar/".
The relationship between paths implies that the top-most path encountered is in the root of the path space for the resource being matched.
Segments may contain a wildcard character, '*', which will match zero or more segment-legal characters. It will not match any reserved characters, or match beyond the segment it is declared in.
Because this character is legal in URIs, it should be escape-encoded if it is wished for the literal character to be matched in a path segment. For example, if it were desirable to match
http://www.example.com/*foo/bar/
with the '*' as a literal character in the URI, that segment should be represented in URISpace as
<path match="%2afoo"> ... </path>
because if it were not escaped, the asterisk would be considered a wildcard, and would match all path segments ending in 'foo' in that position.
Only one wildcard may be specified in a particular path segment. For purposes of precedence, the most specific path segment that contains a wildcard character is that which consumes the fewest characters.
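The segment wildcard and its escaping rule can be sketched as follows. The `segment_match` helper is hypothetical; it returns the number of characters the wildcard consumed so that a caller can prefer the smallest value when several sibling segments match. Comparing percent-decoded forms is an assumption made for this sketch.

```python
from typing import Optional
from urllib.parse import unquote


def segment_match(pattern: str, segment: str) -> Optional[int]:
    """Return the characters consumed by the wildcard (0 if none), or None on no match."""
    if pattern.count("*") > 1:
        raise ValueError("only one wildcard is allowed per path segment")
    segment = unquote(segment)
    if "*" not in pattern:
        # An escaped '%2a' in the pattern decodes to a literal asterisk here.
        return 0 if unquote(pattern) == segment else None
    head, tail = (unquote(part) for part in pattern.split("*"))
    long_enough = len(segment) >= len(head) + len(tail)
    if long_enough and segment.startswith(head) and segment.endswith(tail):
        return len(segment) - len(head) - len(tail)
    return None


print(segment_match("%2afoo", "*foo"))
print(segment_match("*foo", "barfoo"))
print(segment_match("%2afoo", "barfoo"))
```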
Note that the definition of path segments includes the last segment, which is often the filename of the resource. Therefore,
<path match="index.html"> ... </path>
will match the segment 'index.html' in that context. An empty segment declaration (match="") will match the case where there is an empty final segment, that is
http://www.example.com/
http://www.example.com/foo/
but not
http://www.example.com/index.html
http://www.example.com/foo
If present, URI parameters should be specified in the 'match' attribute along with the path information.
Query elements match when the query portion of the URI contains the specified argument/value combination. For example,
<query match="foo=bar"> ... </query>
will match:
http://www.example.com/index.html?foo=bar
http://www.example.com/cgi-bin/example.cgi?1=2&foo=bar&3=4
but not
http://www.example.com/index.html?foo=boo
http://www.example.com/index.html?1=2&3=4
If URIs are to be matched based on the presence of an argument, rather than its value, the equality and the value should be omitted. For example,
<query match="foo"> ... </query>
will match:
http://www.example.com/index.html?foo=bar
http://www.example.com/index.html?foo
http://www.example.com/index.html?foo&boo
http://www.example.com/index.html?foo=1&boo=2
but not:
http://www.example.com/index.html
http://www.example.com/index.html?boo=foo
http://www.example.com/index.html?
Queries that match based on both an argument and a value are considered more specific than those which match only on an argument, for purposes of selector wildcard precedence.
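The two forms of query matching can be illustrated with a short sketch. `query_matches` is a hypothetical helper; it assumes that arguments are separated by '&' and handles a single match value rather than a whitespace-separated list.

```python
from urllib.parse import urlsplit


def query_matches(match: str, uri: str) -> bool:
    """Match 'name=value' exactly, or a bare 'name' regardless of its value."""
    query = urlsplit(uri).query
    if not query:
        return False
    pairs = query.split("&")
    if "=" in match:
        return match in pairs
    return any(pair.split("=", 1)[0] == match for pair in pairs)


print(query_matches("foo=bar", "http://www.example.com/cgi-bin/example.cgi?1=2&foo=bar&3=4"))
print(query_matches("foo", "http://www.example.com/index.html?boo=foo"))
```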
The 'fragment' element allows selection of URIs based on their fragment. The value expressed is a whole fragment to be matched. For example:
<fragment match="1"> ... </fragment>
will match any of:
http://www.example.com/foo/bar.html#1
http://www.example.com/#1
but not:
http://www.example.com/foo/bar.html
http://www.example.com/foo/bar.html#123
Some applications may need to further refine assignment based on criteria outside of the URI namespace. In such conditions, arbitrary selectors using syntax similar to the selectors introduced here may be added. One possibility is to introduce (into a new namespace) a 'select' element:
<extra:select type="foo" match="bar"> ... </extra:select>
External selectors should allow values to contain either a literal to match, or a URI Reference to an RDF Bag of literals to match. They should also support the 'nomatch' attribute, as defined in 'URI Selectors'. If they allow wildcard values, appropriate precedence rules should be defined.
Because URISpace applies metadata hierarchically, metadata operators are needed, in order to allow metadata sourced from different selectors to interact, and to allow use of certain constructs that would otherwise be impossible to express.
This is achieved through use of attributes in the URISpace namespace that modify the metadata elements. For example,
<color urispace:op="clear" />
Here, the color element is modified by the urispace:op attribute, which clears (unsets) the metadata in the current context.
The default operator is 'replace'; that is, if no metadata operator is present, more specific metadata completely replaces any pre-existing metadata. The 'clear' operator completely removes metadata from the current context.
Operators can also be used to manipulate RDF Containers, when they are used. Metadata authors should define appropriate containers for their metadata (or lack thereof), thus suggesting the appropriate operators.
If a container operator is applied to non-contained and contained metadata, the result will inherit the container type present.
Bag and Alternate containers may be modified with a group of set operations, while Sequence containers can be modified with 'append' and 'prepend' operators.
Multiple operators can be specified by using multiple metadata elements. The operators will be applied in the order that they are specified. The left operand of every operator is the current value of the metadata, incorporating all previous operators. For example,
<foo><rdf:Bag>
  <rdf:li>1</rdf:li>
  <rdf:li>2</rdf:li>
</rdf:Bag></foo>
<foo urispace:op="difference"><rdf:Bag>
  <rdf:li>2</rdf:li>
  <rdf:li>3</rdf:li>
</rdf:Bag></foo>
<foo urispace:op="intersection"><rdf:Bag>
  <rdf:li>6</rdf:li>
</rdf:Bag></foo>
will take the difference of the first and second Bags, and then intersect the result with the third.
To summarize (here, 'a' is the more general set, 'b' the more specific, overriding set):
Operator | Applies to | Result |
---|---|---|
replace | All metadata | Completely replace a with b. Default operation. |
clear | All metadata | Clear the set; unset. b should not be specified. |
union | Bag and Alternate containers | a OR b |
intersection | Bag and Alternate containers | a AND b |
rev-intersection | Bag and Alternate containers | a XOR b |
difference | Bag and Alternate containers | a AND NOT b |
rev-difference | Bag and Alternate containers | b AND NOT a |
prepend | Sequence containers | Place all members of b before a |
append | Sequence containers | Place all members of b after a |
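The operators summarized above can be modelled in a few lines. This is a sketch under stated assumptions: `apply_operator` is a hypothetical function, Bag and Alternate containers are approximated by Python sets (so duplicate members are not preserved), and Sequence containers by lists.

```python
def apply_operator(op, a, b=None):
    """Combine the more general value `a` with the more specific value `b`."""
    if op == "replace":
        return b
    if op == "clear":
        return None
    set_ops = {
        "union": lambda x, y: x | y,
        "intersection": lambda x, y: x & y,
        "rev-intersection": lambda x, y: x ^ y,
        "difference": lambda x, y: x - y,
        "rev-difference": lambda x, y: y - x,
    }
    if op in set_ops:
        return set_ops[op](set(a), set(b))
    if op == "prepend":
        return list(b) + list(a)
    if op == "append":
        return list(a) + list(b)
    raise ValueError(f"unknown operator: {op}")


# Chained operators: the left operand is always the current value.
step1 = apply_operator("difference", {"1", "2"}, {"2", "3"})
step2 = apply_operator("intersection", step1, {"6"})
print(step1, step2)
```

With the Bags {1, 2}, {2, 3} and {6}, the difference leaves {1}, and intersecting that with {6} leaves an empty set.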
An application that uses URISpace will not necessarily allow each selector available, depending on which are relevant in its problem domain.
For example, the robots.txt example is implicitly in the relative namespace of the server that it resides upon, and therefore has no use for absolute selectors. Additionally, since fragments are normally consumed by clients, they are irrelevant for most server-side applications, and those (such as robots.txt) that operate on the granularity of a page.
It may also be useful to restrict the allowed relationships between selectors to simplify precedence issues. For example, precedence issues may be made easier to understand if a pre-set ordering is enforced, if the flexibility of ordered precedence is not needed by the application.
An example P3P Policy Reference File, using URISpace.
<?xml version="1.0"?>
<POLICY-REFERENCES xmlns='http://www.w3.org/2000/P3Pv1'
    xmlns:uri='http://www.w3.org/2000/URISpace'
    xmlns:web='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <POLICY-REF web:about="/P3P/Policy1.xml" />
  <uri:path match="catalog"><POLICY-REF web:about="/P3P/Policy2.xml" /></uri:path>
  <uri:path match="cgi-bin"><POLICY-REF web:about="/P3P/Policy3.xml" /></uri:path>
  <uri:path match="servlet">
    <POLICY-REF web:about="/P3P/Policy3.xml" />
    <uri:path match="unknown"><POLICY-REF uri:op="clear" /></uri:path>
  </uri:path>
</POLICY-REFERENCES>
Example Web server configuration file. Notice that the scheme selector is used inside the path selector for the 'images' subdirectory, to authenticate only when SSL is not used.
<?xml version="1.0"?>
<urispace xmlns='http://www.w3.org/2000/urispace/'
    xmlns:urispace='http://www.w3.org/2000/urispace/'
    xmlns:conf='http://www.example.org/server-config'
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <conf:docroot>/usr/httpd/html</conf:docroot> <!-- default doc root -->
  <path match="images">
    <conf:ttl>3600</conf:ttl> <!-- set a cache-control: max-age -->
    <conf:auth>basic</conf:auth> <!-- authenticate users -->
    <conf:authDB>/usr/httpd/users.db</conf:authDB>
    <scheme match="https">
      <conf:authenticate>off</conf:authenticate> <!-- unless they're using SSL -->
    </scheme>
  </path>
  <path match="cgi-bin"> <!-- cgi directory -->
    <conf:docroot>/usr/httpd/cgi-bin</conf:docroot>
    <conf:execCGI />
  </path>
  <path match="local"> <!-- deny non-local address -->
    <conf:denyAccess />
    <conf:match type="clientaddress" value="#allowedList"> <!-- point to a container -->
      <conf:denyAccess urispace:op="clear" />
    </conf:match>
  </path>
  <rdf:Bag ID="allowedList"> <!-- define a Bag for access ctrl -->
    <rdf:li>192.168.1.0/24</rdf:li>
    <rdf:li>172.16.0.0/16</rdf:li>
    <rdf:li>10.0.0.0/12</rdf:li>
  </rdf:Bag>
</urispace>
Example proxy auto-configuration file in URISpace format.
<?xml version="1.0"?>
<urispace xmlns='http://www.w3.org/2000/urispace/'
    xmlns:urispace='http://www.w3.org/2000/urispace/'
    xmlns:proxyconf='http://www.example.org/proxy-config'
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <scheme match="https">
    <proxyconf:proxy>DIRECT</proxyconf:proxy>
  </scheme>
  <scheme match="*">
    <host match="*.foo.com foo.com">
      <proxyconf:proxy>DIRECT</proxyconf:proxy>
    </host>
    <host match="*.au">
      <proxyconf:proxy>auproxy.foo.com:8080</proxyconf:proxy>
    </host>
    <rdf:Alt>
      <rdf:li><proxyconf:proxy>worldproxy1.foo.com:8080</proxyconf:proxy></rdf:li>
      <rdf:li><proxyconf:proxy>worldproxy2.foo.com:8080</proxyconf:proxy></rdf:li>
    </rdf:Alt>
  </scheme>
</urispace>