Harvesting RDF Statements from XLinks

1 Introduction

The XML Linking specification [XLink], better known as XLink, is a recently-published Candidate Recommendation that defines ways for XML documents to establish hyperlinks between resources. The Resource Description Framework (RDF) [RDF] is a W3C Recommendation for providing machine-understandable information about web resources.

Both XLink and RDF provide a way of asserting relations between resources. RDF is primarily for describing resources and their relations, while XLink is primarily for specifying and traversing hyperlinks. However, the overlap between the two is sufficient that a mapping from XLinks to statements in an RDF Model can be defined. Such a mapping allows XML Linking elements in XML documents to be harvested as a source of RDF statements. XLinks thus provide an alternate syntax for RDF information that may be useful in some situations.

This NOTE specifies such a mapping, so that XLinks can be 'harvested' and RDF statements generated. The purpose of this harvesting is to create RDF models that, in some sense, represent the intent of the XML document. The purpose is NOT to represent the XLink structure in enough detail that a set of XLinks could be round-tripped through an RDF model.

Readers of this NOTE are assumed to be familiar with the XML Linking specification and the RDF Model and Syntax Recommendation. Terms, such as 'starting Resource', that are defined in those specifications will not be defined here. Readers should also be familiar with the XML Base Note [XMLBase]. Familiarity with the RDF Schema Recommendation [RDFSchema] is required only for the optional mappings that use RDF Schema Classes.

Uses of the terms 'MUST', 'SHALL', 'SHALL NOT', 'SHOULD', and 'MAY' have the meanings defined in RFC 2119 [IETF RFC 2119].

In the remainder of this Note, the term 'resource' will be used in two ways. Those two senses of the term will be distinguished by the presence or absence of an initial capital 'R'. With the initial capital 'R', Resource denotes anything identified by a URI. Without an initial capital 'R', resource denotes an XLink resource. That is, an XML element bearing an xlink:type attribute whose value is "resource".

Several places in this note make statements such as "...the value shall be 'rdf:type'" or "When the value is 'xlink:title'...". The use of specific namespace prefixes, such as 'xlink:' or 'rdf:', is an editorial convenience. As required by the namespace specification [Namespace], any prefix may be used as long as the URI it maps to is the correct one.

2 Principles of the Mapping

Simple RDF statements consist of a subject, a predicate, and an object. The subject and predicate are identified by URIs; the object may be a URI or a literal string. To map an XLink into an RDF statement we need to be able to determine the URIs of the subject and predicate. We must also be able to determine the object, be it a URI or a literal.

The general principle behind the mapping specified in this document is that each arc in an XLink gives rise to one RDF statement. The starting Resource of the arc is mapped to the subject of the RDF statement. The ending Resource of the arc is mapped to the object of the RDF statement. The arcrole is mapped to the predicate of the RDF statement. However, a number of corner cases arise; these are covered by the detailed mappings in section 4 Mapping Specification.

RDF statements are typically collected together into 'models'. The details of how models are structured will be implementation dependent. This NOTE assumes that harvested statements are added to 'the current model', which is the model being constructed when the statement was harvested. But this NOTE, like the RDF Model & Syntax specification, does not specify exactly how models must be structured.

3 Terminology

Harvesting - The process of generating RDF statements from XML Linking elements.

Resource - With an initial capital, Resource denotes anything identified by a URI.

resource - Without an initial capital, resource denotes an XLink resource. That is, an XML element bearing an xlink:type attribute whose value is "resource".

4 Mapping Specification

4.1 Synthesizing XPointers

RDF is based on the use of URIs for identifying Resources. XLinks will frequently make a linking element into one of the participating Resources of a link. This requires that we be able to define URIs that identify those linking elements. In order that different implementations harvest equivalent RDF statements from an XLink, the procedure in this section SHOULD be used when synthesizing XPointers for such linking elements.

The general approach used is for the synthesized XPointer to do element-wise navigation down the tree to reach the linking element. The navigation begins at the nearest identified point in the tree.

More formally, the base of the synthesized URI reference SHALL be specified as defined in the NOTE on the xml:base attribute [XMLBase].

Note:

Feedback on whether the synthesized URI references should be required to be absolute, or may be relative, is particularly sought.

The fragment identifier of the synthesized URI reference SHALL be delimited from the URI by the '#' character.

The fragment identifier of the synthesized URI reference SHALL be an XPointer [XPointer]. The initial locator term of the XPointer SHALL be an ID reference to the nearest ancestor of the linking element, including the linking element itself, that bears an attribute of type ID. If no such attribute exists on the linking element or any of its ancestors, the '/' character SHALL be the first locator term, indicating that navigation SHALL be from the document element.

Subsequent locator terms SHALL provide the element type and index of the navigation path down the tree of XML elements to reach the desired element.

As an example, consider a document that contained the following simple link:

In heavy trading, <org xlink:type='simple' xlink:href="http://www.foo.com/" xml:base="http://www.bar.com/report1" ID="com231">Foo Manufacturing</org> closed sharply ...

The synthesized XPointer for this linking element is http://www.bar.com/report1#xpointer(id('com231')).
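The synthesis procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative algorithm: the function names are hypothetical, the base URI is passed in by the caller rather than computed from xml:base, an attribute literally named 'ID' is treated as being of type ID (a real implementation would consult the DTD), and each navigation step is written as 'name[index]' counted among same-named siblings, which is one reading of "element type and index".

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip ElementTree's '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def synthesize_xpointer(root, target, base, id_attr="ID"):
    """Build base#xpointer(...) identifying `target`, an element under `root`."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while True:
        if id_attr in node.attrib:
            # Nearest element (self or ancestor) bearing an ID starts the path.
            start = "id('%s')" % node.attrib[id_attr]
            break
        parent = parent_of.get(node)
        if parent is None:
            # Reached the document element without finding an ID.
            start = ""
            break
        # Index of this element among siblings of the same element type.
        same_type = [c for c in parent if c.tag == node.tag]
        index = 1 + next(i for i, c in enumerate(same_type) if c is node)
        steps.append("%s[%d]" % (local_name(node.tag), index))
        node = parent
    path = "".join("/" + step for step in reversed(steps))
    if not start and not path:
        path = "/"  # the target is the document element itself
    return "%s#xpointer(%s%s)" % (base, start, path)


# Usage: the second <extRef> in the second <chap>, no ID attributes present.
doc = ET.fromstring("<report><chap/><chap><extRef/><extRef/></chap></report>")
pointer = synthesize_xpointer(doc, doc[1][1], "http://www.bar.com/report1")
```

When an ID is present on the linking element itself, no navigation steps are emitted and the result reduces to the id(...) form shown in the example above.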

4.2 Simple linking elements

If a simple link's arcrole attribute has the value "http://www.w3.org/1999/xlink/properties/linkbase", the link SHALL be harvested according to the procedure described in section 4.4 Linkbases. Otherwise the mapping defined in this section SHALL be used.

All simple links define zero or one traversal arcs. No traversal arc is specified if the xlink:href attribute is not specified. Therefore, harvesting software SHALL generate zero or one RDF statements, depending on whether the xlink:href attribute is specified. If the xlink:href attribute is specified, the single traversal arc SHALL be harvested to form an RDF statement. The starting Resource of the simple link SHALL be mapped to the subject of the RDF statement. Note that the starting Resource of a simple link is the linking element itself. Therefore, the harvesting software MUST synthesize a URI reference that identifies the linking element. The harvesting software SHOULD use the XPointer synthesis procedure specified in section 4.1 Synthesizing XPointers.

The ending Resource of the simple link SHALL be mapped to the object of the RDF statement. Note that the ending Resource of a simple link is always a URI reference, provided as the value of the xlink:href attribute.

The value of the arcrole attribute, if one is given, SHALL be mapped to the predicate of the RDF statement. Note that the value of the arcrole attribute is already required, by the XLink specification, to be a URI reference.

If no arcrole attribute is specified, harvesting software MAY generate no RDF statement, or it MAY map the element type of the linking element to the predicate of the RDF statement. This SHALL only be done if the element type is namespace qualified, so that an absolute URI reference may be constructed from the namespace URI and the local part. In this case the namespace name and the local part are concatenated using the approach documented in the RDF M&S specification [RDF] in order to synthesize the absolute URI reference for the predicate.

If an xlink:role attribute is specified on the simple link, it SHALL result in at least one additional statement being added to the model. The subject of that statement is the ending Resource of the simple link, its predicate is "rdf:type", and its object is the Resource identified by the role attribute. Harvesting software MAY also generate a statement whose subject is the Resource identified by the role attribute, whose predicate is "rdf:type" and whose object is the Resource "rdfs:Class". This statement SHALL only be added to the model if an equivalent statement is not already part of the model.

An example of such an element is

... In a <x:extRef xlink:type="simple" xlink:href="http://www.foo.com/papers/crops.txt" xlink:arcrole="http://links.org/namespace/cite" xlink:role="http://links.org/namespace/screed" >recent paper</x:extRef>, Dr. Taylor assumes that ...

Mapping that link according to this specification (and assuming it was the fourth <extRef> element within the third <chap> element) results in the RDF model shown below in figure 1:

Figure 1: Sample RDF Model constructed with arcrole

If the arcrole had not been specified, then the result would have been the RDF model shown in figure 2.

Figure 2: Sample RDF Model not using arcrole attribute
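The simple-link rules of this section can be sketched in Python as below. It is a minimal illustration, not a reference implementation. It follows the general principle of section 2 (starting Resource as subject, ending Resource as object) and types the ending Resource by the role value, the same direction used for locators in section 4.3.2. The function name, the tuple representation of a statement, the namespace URI bound to the 'x' prefix, and the subject URI in the usage lines are all assumptions made for the example; relative xlink:href values are not resolved against a base URI here.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
LINKBASE = "http://www.w3.org/1999/xlink/properties/linkbase"


def harvest_simple(elem, subject_uri):
    """Return (subject, predicate, object) tuples for one simple link.

    `subject_uri` is the synthesized URI reference for the linking element,
    supplied by the caller.
    """
    href = elem.get("{%s}href" % XLINK)
    arcrole = elem.get("{%s}arcrole" % XLINK)
    role = elem.get("{%s}role" % XLINK)
    if href is None or arcrole == LINKBASE:
        # No traversal arc, or a linkbase arc (handled per section 4.4).
        return []
    statements = []
    predicate = arcrole
    if predicate is None and elem.tag.startswith("{"):
        # Optional fallback: namespace name + local part of the element type.
        namespace, local = elem.tag[1:].split("}", 1)
        predicate = namespace + local
    if predicate is not None:
        statements.append((subject_uri, predicate, href))
    if role is not None:
        statements.append((href, RDF_TYPE, role))
    return statements


# Usage with the example link; the 'x' namespace and subject URI are invented.
link = ET.fromstring(
    '<x:extRef xmlns:x="http://example.org/x#" xmlns:xlink="%s" '
    'xlink:type="simple" xlink:href="http://www.foo.com/papers/crops.txt" '
    'xlink:arcrole="http://links.org/namespace/cite" '
    'xlink:role="http://links.org/namespace/screed">recent paper</x:extRef>' % XLINK
)
subject = "http://example.org/doc#xpointer(/chap[3]/extRef[4])"
statements = harvest_simple(link, subject)
```

The fallback branch exercises the MAY clause for links without an arcrole; an implementation that prefers to generate no statement in that case would simply drop it.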

4.3 Extended XML Links

If an extended link contains an arcrole attribute whose value is "http://www.w3.org/1999/xlink/properties/linkbase", it SHALL be harvested according to the procedure in section 4.4 Linkbases. Otherwise the procedures in this section SHALL be used.

We first describe the rules for harvesting the components of an extended link (arcs, locators, and resources). Then we describe the rules for the extended link as a whole.

4.3.1 xlink:type="arc"

XML elements with an xlink:type attribute whose value is "arc" are known as arcs. Recall that arcs use the 'to' and 'from' attributes to specify the endpoints of 0 or more possible traversals. Also recall that the 'from' and 'to' attributes do not provide URIs; they provide labels which may appear on one or more locator or resource elements.

The number of RDF statements harvested from a single arc element is equal to the number of possible traversals specified by the arc element. That quantity is the number of resource and/or locator elements identified by the 'from' attribute multiplied by the number identified by the 'to' attribute. Each RDF statement will correspond to one and only one of the traversals.

The starting Resources of the traversals SHALL be mapped to the subject of the RDF statement(s). The ending Resources of the traversals SHALL be mapped to the object of the RDF statement(s). The value of the arcrole attribute, if one is specified, SHALL be mapped to the predicate of each RDF statement.

If no arcrole attribute is specified, harvesting software MAY generate no RDF statement, or it MAY map the element type of the linking element to the predicate of the RDF statement. This SHALL only be done if the element type is namespace qualified, so that an absolute URI reference may be constructed from the namespace URI and the local part. In this case the namespace name and the local part are concatenated using the approach documented in the RDF M&S specification [RDF] in order to synthesize the absolute URI reference for the predicate.

Note that any element content of an arc is not harvested.
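The expansion of one arc element into statements can be sketched in Python as a cross product over the labelled participants. This is an illustrative sketch only: the function name and the 'labelled' dictionary (mapping each xlink:label value to the URIs of the locator or resource elements carrying it) are assumptions, and arcs that omit the 'from' or 'to' attribute are not handled.

```python
from itertools import product


def harvest_arc(from_label, to_label, arcrole, labelled):
    """Return one (subject, predicate, object) tuple per traversal of an arc.

    `labelled` maps a label to the list of Resource URIs bearing that label.
    """
    starts = labelled.get(from_label, [])
    ends = labelled.get(to_label, [])
    # Every starting Resource is paired with every ending Resource.
    return [(start, arcrole, end) for start, end in product(starts, ends)]


# Usage: two Resources labelled 'author' and three labelled 'paper'
# (all URIs are invented for the example).
labelled = {
    "author": ["http://example.org/a1", "http://example.org/a2"],
    "paper": ["http://example.org/p1", "http://example.org/p2", "http://example.org/p3"],
}
arc_statements = harvest_arc("author", "paper", "http://example.org/wrote", labelled)
```

With two starting and three ending Resources the arc yields six statements; a label that matches no element yields none.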

4.3.2 xlink:type="locator"

Each XML element whose xlink:type attribute has a value of "locator" is known as a locator. Each locator gives rise to 0 or more statements in the RDF model. The subject of all of those statements is the value of the xlink:href attribute of the locator, except as noted below.

If the locator element provides a 'role' attribute, one additional statement SHALL be added to the model. The value of the locator's xlink:href attribute SHALL be mapped to the subject of the statement. The value of the role attribute SHALL be mapped to the object, and the predicate SHALL be 'rdf:type'. Harvesting software MAY generate an additional statement whose subject is the value of the role attribute, whose predicate is 'rdf:type' and whose object is 'rdfs:Class'. The second statement SHALL NOT be added to the RDF model if an equivalent statement already exists in the model.

If the locator element provides an 'xlink:label' attribute, an RDF statement SHALL be added to the model. The value of the xlink:href attribute SHALL be mapped to the subject of the statement. The predicate of the statement SHALL be "xlink:label". The object of the statement SHALL be the value of the label attribute.

If the locator element provides an 'xlink:title' attribute, an RDF statement SHALL be added to the model. The value of the xlink:href attribute SHALL be mapped to the subject of the statement. The predicate of the statement SHALL be "xlink:title". The object of the statement SHALL be the value of the title attribute.

If the locator element contains one or more title elements, they are harvested as described in section 4.3.4 xlink:type="title".
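The attribute rules for locators can be sketched in Python as follows. The sketch is illustrative rather than definitive: the function name, the list-of-tuples model, and the 'with_schema' switch for the optional rdfs:Class statement are assumptions, and predicates are kept as the prefixed names used in this Note instead of being expanded to full URIs. Child title elements are not handled here.

```python
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}"


def harvest_locator(elem, model, with_schema=False):
    """Append the statements for one locator element to `model` (a list)."""
    subject = elem.get(XLINK + "href")
    role = elem.get(XLINK + "role")
    if role is not None:
        model.append((subject, "rdf:type", role))
        class_statement = (role, "rdf:type", "rdfs:Class")
        # Optional statement, added only if not already in the model.
        if with_schema and class_statement not in model:
            model.append(class_statement)
    for name in ("label", "title"):
        value = elem.get(XLINK + name)
        if value is not None:
            model.append((subject, "xlink:" + name, value))
    return model


# Usage: a locator with role, label and title (URIs invented for the example).
locator = ET.fromstring(
    '<loc xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="locator" '
    'xlink:href="http://example.org/alice" xlink:role="http://example.org/Person" '
    'xlink:label="p1" xlink:title="Alice"/>'
)
model = harvest_locator(locator, [], with_schema=True)
```

The same shape applies to resource elements, with the synthesized URI reference of the element taking the place of the xlink:href value as subject.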

4.3.3 xlink:type="resource"

Each XML element whose xlink:type attribute has a value of "resource" is known as a resource. (Recall that this specification uses 'Resource' with the initial capital to mean anything identified by a URI. The lowercase 'resource' has this more restricted meaning.)

Each resource gives rise to 0 or more statements in the RDF model. Unless noted otherwise, the subject of all of those statements is the resource element itself, identified by an XPointer synthesized according to the procedure described in section 4.1 Synthesizing XPointers.

If the resource element provides an 'xlink:role' attribute, one RDF statement SHALL be added to the model, and a second RDF statement MAY be added to the model. The subject of the first, required, statement is the synthesized URI reference for the resource. The value of the role attribute is mapped to the object of the statement. The predicate of the statement is 'rdf:type'. A second statement MAY be added to the model if the software supports the RDF Schema specification [RDFSchema]. The value of the role attribute is mapped to the subject of the optional statement. The predicate of the statement is 'rdf:type' and the object is 'rdfs:Class'. The second statement SHALL NOT be added to the model if an identical statement already exists in the model.

If the resource element provides an xlink:label attribute, another RDF statement SHALL be added to the model. The subject of the statement is the synthesized URI reference for the resource. The predicate of the statement is "xlink:label". The object of the statement is the value of the label attribute.

If the resource element provides an 'xlink:title' attribute, another RDF statement SHALL be added to the model. The subject of the statement is the synthesized URI reference for the resource. The predicate of the statement is "xlink:title". The object of the statement is the value of the title attribute.

If the resource element contains one or more title elements, they are harvested as described in section 4.3.4 xlink:type="title".

4.3.4 xlink:type="title"

XML elements with an xlink:type attribute whose value is "title" are known as title elements. They only have an XLink-defined meaning if they appear as a child element within an extended, locator, or resource element.

If an extended, locator, or resource element contains one or more title elements, one RDF statement SHALL be added to the model for each title element. The subject of each statement SHALL be either the value of the xlink:href attribute (in the case of a locator element) or a synthesized XPointer identifying the extended or resource element. The predicate of each statement SHALL be "xlink:title". For each RDF statement, the object of the statement SHALL be a synthesized XPointer identifying the title element. (Identifying the title element allows attributes such as xml:lang to accompany the title.)

As an example, consider the following fragment of an extended link:

<annotation xlink:type='extended' ID='genid22'> <caption xlink:type='title' ID='genid23'>Recent comments</caption> <link xlink:type='arc' ...

The RDF statement harvested from the title element is shown below in figure 3:

Figure 3: Sample RDF Model for title elements

4.4 Linkbases

A linkbase is, loosely speaking, a database of external links. Linkbases are used to provide sets of links for resources which, for a variety of reasons, do not directly contain the links.

More formally, a linkbase is an XML document which contains one or more extended links. A linkbase arc is a linking element (simple link or an arc) whose xlink:arcrole attribute takes the value "http://www.w3.org/1999/xlink/properties/linkbase". The ending Resource of a linkbase arc is a linkbase.

When harvesting software encounters a linkbase arc, it SHALL NOT generate an RDF statement for the arc. It SHOULD traverse the arc to retrieve the linkbase(s), and harvest the XLinks from the linkbase(s) to add to the current model using the methods specified in this NOTE.
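The linkbase behaviour can be sketched in Python as a recursive harvest. This is a sketch under assumptions, not a prescribed design: 'fetch_arcs' is a hypothetical callback that retrieves a document and returns its arcs as (starting Resource, arcrole, ending Resource) tuples, and the 'seen' set that prevents a linkbase from being harvested twice is an addition of this example that the Note does not require.

```python
LINKBASE = "http://www.w3.org/1999/xlink/properties/linkbase"


def harvest_document(url, fetch_arcs, model=None, seen=None):
    """Harvest `url` and any linkbases it points to into a single model."""
    model = [] if model is None else model
    seen = set() if seen is None else seen
    seen.add(url)
    for start, arcrole, end in fetch_arcs(url):
        if arcrole == LINKBASE:
            # No statement for the linkbase arc itself; traverse it instead.
            if end not in seen:
                harvest_document(end, fetch_arcs, model, seen)
        else:
            model.append((start, arcrole, end))
    return model


# Usage with an in-memory stand-in for document retrieval (invented data).
documents = {
    "doc.xml": [("doc.xml#a", "http://example.org/cites", "http://example.org/b"),
                ("doc.xml#l", LINKBASE, "links.xml")],
    "links.xml": [("http://example.org/x", "http://example.org/rel", "http://example.org/y"),
                  ("links.xml#l", LINKBASE, "doc.xml")],
}
current_model = harvest_document("doc.xml", documents.__getitem__)
```

Statements harvested from the linkbase land in the same list as those from the referring document, matching the 'current model' assumption of section 2.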