RDF in HTML: Approaches

Since there is no one standardized approach for associating RDF compatible metadata with HTML, and since this is one of the most frequently asked questions on the RDF mailing lists, this document is provided as an outline of some RDF-in-HTML approaches that the author is aware of.

Please direct feedback to the author, preferably CCing the publicly archived www-archive.

Introduction

Ever since RDF's inception, people have wanted to embed it in their HTML documents. In fact, ever since HTML was invented, people have wanted to embed some sort of metadata for extraction and processing by user agents and crawlers. So, theoretically, HTML and RDF are a match made in heaven (a.k.a. the halls of the W3C's offices at MIT).

However, after many raging discussions within the W3C's RDF Interest Group and elsewhere, there is still no one standard method for associating RDF with HTML. This is an important thing for the Semantic Web community to resolve: even the author has quite recently found himself wanting to associate RDF with HTML for certain applications, but has had to put the application aside due to the lack of a standard approach.

The original RDF FAQ contained a piece of advice telling people to simply embed the XML RDF into the XHTML (cf. Embed XML RDF Part I), but this was criticized since the resultant XHTML/RDF soup does not validate. This issue has been noted by the RDF Core Working Group (as faq-html-compliance), and is currently "for discussion". My hope is that this document will be valuable input into the issue.

All of the approaches given in this note will be suited to particular applications. What applications are there? In general, anything that combines human and machine readable data is fair game; for example: RDF Schemata/namespace documents, page accessibility evaluations, complex relationships between the document and related resources (for example, one could generate an SVG diagram showing how this document fits into the rest of the world), links to digital signatures, and possibly even advanced versioning data (cf. CVS). Only the first of the list is extant, to the best of the author's knowledge.

The Approaches

In no particular order...

>> Embed XML RDF Part I: Eschew Validation

In the "validator.w3.org be damned" approach, one would generally use the abbreviated XML RDF syntax so as to hide the contents from older browsers (which usually render the contents of any element, but not attribute values).

<head>
<title>Some Page</title>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.w3.org/" dc:title="W3C Homepage"/>
</rdf:RDF>
</head>

This approach is convenient for authors who know how to author XML RDF (or have had some generated for them). It's quite easy for agents to extract, too. However, it also has many disadvantages: it does not validate, may still choke some older browsers, and the fragment identifiers may conflict. As the TAG has put it:-

[...] despite widely adopted specifications for XHTML and RDF, there is no specification for the interpretation of the mixture. The TAG felt that this lack, falling between the scopes of two working groups, was within its scope to fill or ask to be filled. [...] A futher problem is that the question of how to define the meaning of a URIref with fragement id wihtin such a document.
- Embedding HTML in RDF, TimBL for TAG, 2002

Sidenote: Murray Altheim wrote an excellent summary of why validation is important. Also: Nick Kew's essay on the subject.
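
To illustrate the claim that embedded XML RDF is easy for agents to extract, here is a minimal sketch in Python using only the standard library. It assumes that the host document is well-formed XHTML; the function name extract_rdf_blocks and the sample document are invented for this illustration and do not come from any existing implementation.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

def extract_rdf_blocks(xhtml_text):
    """Return every rdf:RDF element found anywhere in a well-formed XHTML document."""
    root = ET.fromstring(xhtml_text)
    return list(root.iter("{%s}RDF" % RDF_NS))

# Hypothetical host document wrapping the example from this section.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Some Page</title>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.w3.org/" dc:title="W3C Homepage"/>
</rdf:RDF>
</head>
<body><p>Hello</p></body>
</html>"""

blocks = extract_rdf_blocks(SAMPLE)
description = blocks[0][0]  # the rdf:Description child of the first rdf:RDF
about = description.get("{%s}about" % RDF_NS)
title = description.get("{%s}title" % DC_NS)
print(about, title)
```

The extracted elements would still need to be handed to a real RDF parser; the sketch only shows that locating them requires nothing more than a namespace-aware XML parser.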

>> Embed XML RDF Part II: Embrace Validation

This "create a new XHTML family" approach basically involves hacking up a small DTD (document type definition) using XHTML Modularization for a variant of XHTML, putting it on the Web, and then referencing it from your document. The main drawback is that the DTDs are large and relatively complex; this is not a viable approach for typical HTML authors.

<!DOCTYPE html SYSTEM "http://infomesh.net/2002/m12n/test/rdf.txt" >

<html xmlns="http://www.w3.org/1999/xhtml" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
   xml:lang="en" >
<head>
<title>Embedded RDF Test</title>
<rdf:RDF>
<rdf:Property rdf:about="http://purl.org/net/swn#homepage">
</rdf:Property>
</rdf:RDF>
</head>

XHTML Modularization is essentially oriented towards companies and skilled Web users that want to provide regular extensions to XHTML. It is not so good when it comes to unique extensions that need to be created on a whim.

This method has the same "what is the meaning of the fragment identifiers within such a document?" issue as embed-and-don't-validate.

>> Utilize the Object or Script Elements

HTML has two elements for including non-HTML media: <object> and <script>. <object> is a generic element for including any external object, whereas <script> is available for embedding executable scripts.

<object>

The HTML 4.01 specification says that inline data may be supplied from a base64 encoded "data:" URI. For example:-

<head>
<title>My Document</title>
<object data="data:application/rdf+xml;base64,PHJkZjpSREYgeG1sbnM6cmRmPSJodHR
wOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjIgogICAgICAgICAgICB4bWxu
czpkYz0iaHR0cDovL3B1cmwub3JnL2RjL2VsZW1lbnRzLzEuMS8iPgogIDxyZGY6RGVzY3JpcHRpb
24gcmRmOmFib3V0PSJodHRwOi8vd3d3LnczLm9yZy8iPgogICAgPGRjOnRpdGxlPldvcmxkIFdpZG
UgV2ViIENvbnNvcnRpdW08L2RjOnRpdGxlPiAKICA8L3JkZjpEZXNjcmlwdGlvbj4KPC9yZGY6UkR
GPg=="></object>
</head>

congrats, you've found a syntax less workable thatn RDF/XML.
- Edd Dumbill, 24 seconds after this method was proposed on #rdfig.

Of course, one can also link to the RDF in an external file, although we shall be discussing using the <link> element for this a little later. Note that object allows one to cascade the referenced media, thereby offering a provision for alternate serializations: perhaps offering XML RDF, Notation3, and NTriples versions of your RDF metadata.
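
For anyone wanting to experiment with the "data:" URI form, the following Python sketch shows one way to build such a URI and to recover the RDF from it. The helper names to_data_uri and from_data_uri are invented for this illustration, and the parsing is deliberately naive: it assumes a header of exactly the form "data:<media type>;base64" with no other parameters.

```python
import base64

RDF_MEDIA_TYPE = "application/rdf+xml"

def to_data_uri(rdf_xml):
    """Encode a string of XML RDF as a base64 data: URI."""
    payload = base64.b64encode(rdf_xml.encode("utf-8")).decode("ascii")
    return "data:%s;base64,%s" % (RDF_MEDIA_TYPE, payload)

def from_data_uri(uri):
    """Return (media type, decoded text) for a base64 data: URI."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data: URI")
    media_type = header[len("data:"):-len(";base64")]
    # Line breaks inside the attribute value are dropped before decoding.
    payload = "".join(payload.split())
    return media_type, base64.b64decode(payload).decode("utf-8")

# Hypothetical sample; any XML RDF string would do.
sample_rdf = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
sample_uri = to_data_uri(sample_rdf)
print(from_data_uri(sample_uri))
```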

<script>

On the other hand, we have the script element with which to wrap some embedded XML RDF. For example:-

<head>
<title>My Document</title>
<script type="application/rdf+xml">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.w3.org/" dc:title="W3C Homepage"/>
</rdf:RDF>
</script>
</head>

All that the HTML 4.01 specification says about the contents of the script element is that:-

Scripts are evaluated by script engines that must be known to a user agent. [...] The syntax of script data depends on the scripting language.
- HTML 4.01: Definition of the script Element

This is suspiciously vague. Moreover, whilst I do not want to get engaged in an argument over the semantics of programming languages, I will note that people [such as Sandro Hawke? ask Sandro if he really said this] have estimated that the Notation3 superset of RDF has as much power as Prolog, a well-known highly-declarative programming language.

Since using the script element in this way is very similar to embedding the information (it's just embedding + giving the media type), one would presume that it has the same fragment-conflict problem looming over it.
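
As a rough sketch of how an agent might pull the RDF back out of a <script> wrapper, the Python below uses the standard library's HTMLParser, which hands the contents of a script element to handle_data as raw text. The class name RdfScriptCollector and the sample page are invented for this illustration; the media type check is the only signal used to decide that a script contains RDF.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class RdfScriptCollector(HTMLParser):
    """Collect the text of every <script type="application/rdf+xml"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # a list while inside a matching script, else None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/rdf+xml":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer).strip())
            self._buffer = None

# Hypothetical page built around the example from this section.
SAMPLE = """<html><head>
<title>My Document</title>
<script type="application/rdf+xml">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.w3.org/" dc:title="W3C Homepage"/>
</rdf:RDF>
</script>
</head><body></body></html>"""

collector = RdfScriptCollector()
collector.feed(SAMPLE)
collector.close()
rdf_root = ET.fromstring(collector.blocks[0])
print(rdf_root.tag)
```

Unlike bare embedding, this works even when the surrounding page is tag soup, since only the script contents have to be well-formed XML.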

>> <link> to the Metadata

Arguably the purest solution from an architectural point of view, making use of the <link> element has been the object of criticism since maintaining the metadata externally to the HTML is seen as an inconvenience. Proponents of the solution contend that CSS, JavaScript, and images are already maintained externally without fuss, and that retrieving external files does not take much more programming than extraction (in fact, possibly less).

Here's an example:-

<head>
<title>My Document</title>
<link rel="meta" type="application/rdf+xml" href="meta.rdf"/>
</head>

or, if you want to mention it in the document body...

<body><p><a rel="meta" type="application/rdf+xml" 
href="meta.rdf">blargh</a>[...]

Note that according to the HTML 4.01 specification:-

Authors may wish to define additional link types not described in this specification. If they do so, they should use a profile to cite the conventions used to define the link types.
- Link Types in HTML 4.01

Since this recommendation is a "should" and not a "must", and since the "meta" link relationship is not one where achieving a global consensus should be a difficulty, it is reasonable to use the link type without declaring a profile.

Another interesting point to note is that the link element does allow for a certain amount of cascading thanks to the "alternate" link relationship. For example:-

<link rel="meta" type="application/rdf+xml" href="meta.rdf"/>
<link rel="alternate meta" type="application/n3" href="meta.n3"/>
<link rel="alternate meta" type="application/ntriples" href="meta.nt"/>

This means that the XML RDF version is preferred, but that user agents may use the Notation3 and/or NTriples files as alternatives.
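
The cascade can be sketched in a few lines of Python. This is only one possible reading of the preference rule, under the assumption that a link without "alternate" is preferred over links with it, and that document order breaks ties; the names MetaLinkFinder and choose_metadata are invented for this illustration.

```python
from html.parser import HTMLParser

class MetaLinkFinder(HTMLParser):
    """Record every <link> whose rel value includes the "meta" link type."""

    def __init__(self):
        super().__init__()
        self.links = []  # (is_alternate, media_type, href) in document order

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "meta" in rels and attr.get("href"):
            self.links.append(("alternate" in rels, attr.get("type"), attr["href"]))

def choose_metadata(html_text, supported_types):
    """Return the href of the most preferred link in a supported media type."""
    finder = MetaLinkFinder()
    finder.feed(html_text)
    finder.close()
    candidates = [link for link in finder.links if link[1] in supported_types]
    # Non-alternate links sort first; the sort is stable, so document order
    # is kept among links of equal rank.
    candidates.sort(key=lambda link: link[0])
    return candidates[0][2] if candidates else None

SAMPLE = """<head>
<link rel="meta" type="application/rdf+xml" href="meta.rdf"/>
<link rel="alternate meta" type="application/n3" href="meta.n3"/>
<link rel="alternate meta" type="application/ntriples" href="meta.nt"/>
</head>"""

print(choose_metadata(SAMPLE, {"application/rdf+xml", "application/n3"}))
print(choose_metadata(SAMPLE, {"application/ntriples"}))
```

An agent that only understands NTriples would thus fall through to meta.nt, whereas one that understands XML RDF would never need to fetch the alternates.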

>> HyperRDF

Dan Connolly of the W3C published a note that outlined an ingenious method for marking up HTML in such a way as to make it relatively easy to transform via XSLT into RDF. The method relies upon binding URIs to link relationship QName prefixes via a special profile and the <link> element, and closely resembles the XML Names binding mechanism.

Here's the basic example from DanC's proposal:-

<html xmlns="http://www.w3.org/1999/xhtml">
  <head id="rel" profile="http://www.w3.org/2000/07/hs78#">
    <title>example</title>
    <link id="c" rel="rel:classes" href="http://www.w3.org/2000/07/hs78#" />
  </head>
  [...]
</html>

(Excerpted from HyperRDF: Using XHTML Authoring Tools with XSLT to produce RDF Schemas, Dan Connolly, 2000-08).

However, HyperRDF can never be valid XHTML 1.x since the head element does not allow an ID attribute. This can be "fixed" with modularization:-

<!-- XHTML HyperRDF 1.0 DTD -->

<!ENTITY % xhtml11.mod PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" >
%xhtml11.mod;

<!ATTLIST %head.qname; %id.attrib; >

>> Augmented Metadata for XHTML

This approach is described in Augmented Metadata in XHTML, Murray Altheim and Sean B. Palmer, eds. Here, the current metadata facilities of HTML are augmented; the content model is changed so that the <meta> element may appear within the body of the XHTML document. For example:-

  <html>
    <head>
      <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
    </head>
  <body>
  <p>
    [<a href="inverts/grasshoppers.html">
      <meta name="DC.type" scheme="HTML4" content="Prev" />
      <meta name="DC.title" content="Previous Chapter" />
      <meta name="DC.language" content="en" />
      <img src="images/prev-arrow.gif" alt="Previous Chapter" />
    </a>] 
    [<a href="inverts/scorpions.html">
      <meta name="DC.type" scheme="HTML4" content="Next" />
      <meta name="DC.title" content="Next Chapter" />
      <meta name="DC.language" content="en" />
      <img src="images/next-arrow.gif" alt="Next Chapter" />
    </a>]
  </p>

This is a powerful approach, and one that is different from the others in that it adapts current HTML elements—whilst preserving their basic semantics—so that they don't necessarily have to refer to the current document. Instead, they may refer to a linked file, or the source of some cited material. In other words, it is only predicate-object pairs that are being stated.

>> Use the Profile Attribute

This approach is outlined in my proposal on www-rdf-interest. The basic premise is that one can take the profile attribute to be a global namespace prefix for all of the rel/meta@name attributes throughout the document.

This approach is mainly for those authors that want to use a simple mechanism for producing RDF from their XHTML. It is ineffective from the point of view of anyone that wants to randomly extract RDF from XHTML, since one cannot tell whether the author wanted the assertions to be converted into the triples produced by the algorithm or not.

   <head profile="http://example.org/#">
   <meta name="myProp" content="My Object"/>
   <link rel="myOtherProp" href="http://myuri.net/"/>
   </head>
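
A rough sketch of the conversion, in Python, may make the premise clearer. It assumes that property URIs are formed by simply concatenating the profile value with the rel or name value, that the subject of every triple is the document itself, and that the object of a meta element is read from HTML's standard content attribute; the class name ProfileTripleExtractor and the document URI are invented for this illustration.

```python
from html.parser import HTMLParser

class ProfileTripleExtractor(HTMLParser):
    """Turn <meta name> and <link rel> into triples, using head@profile as prefix."""

    def __init__(self, document_uri):
        super().__init__()
        self.document_uri = document_uri  # assumed subject of every triple
        self.profile = None
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "head":
            self.profile = attr.get("profile")
        elif self.profile and tag == "meta" and attr.get("name"):
            predicate = self.profile + attr["name"]
            obj = ("literal", attr.get("content", ""))
            self.triples.append((self.document_uri, predicate, obj))
        elif self.profile and tag == "link" and attr.get("rel"):
            # rel may hold several space-separated link types.
            for rel in attr["rel"].split():
                obj = ("uri", attr.get("href"))
                self.triples.append((self.document_uri, self.profile + rel, obj))

SAMPLE = """<html><head profile="http://example.org/#">
<meta name="myProp" content="My Object"/>
<link rel="myOtherProp" href="http://myuri.net/"/>
</head><body></body></html>"""

extractor = ProfileTripleExtractor("http://example.org/doc")  # hypothetical URI
extractor.feed(SAMPLE)
extractor.close()
for triple in extractor.triples:
    print(triple)
```

Note that documents without a profile yield no triples at all in this sketch, which is one way of respecting the concern that an author may not have intended any assertions.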

>> Making use of XML Notations

This idea was propounded by Murray Altheim, almost in passing, on www-rdf-interest. The approach involves using XML notations (and hence CDATA sections) and a custom <metadata> element to wrap the metadata in. To quote Murray:-

In the DTD we'd have something akin to:

   <!NOTATION dc PUBLIC 
       "-//DCMI//NOTATION Dublin Core Metadata Element Set V1.0//EN" 
       "http://dublincore.org/">
   <!NOTATION rdf SYSTEM "http://www.w3.org/1999/02/22-rdf-syntax-ns#">  
   <!NOTATION blat PUBLIC "-//doctypes.org//NOTATION Blat 1.0//EN"
       "http://www.doctypes.org/blat/1.0/">
   ...
   <!ELEMENT  metadata  ( #PCDATA ) >  <!-- really, a CDATA section -->
   <!ATTLIST  metadata
       type  NOTATION  (dc|rdf|blat)
   >
   ]><!-- end of DTD -->
   ...
   <head>
   <metadata type="rdf">
   <![CDATA[
     {rdf content}
   ]]></metadata>

This means that there would be one (or a small set of) centralized and customized DTDs which could be referenced by authors all over the world. It's fairly language-independent, although it does mean updating the DTD every time a new language comes along.

Issues With Embedding

Here we consider the three main issues of the (rather issue-prone) embedding approach: how current implementations deal with it, whether to embed in the head or body sections, and whether fragment identifier conflicts are a problem.

How Current Implementations Deal With Embedding

Embedding is a popular approach and has already been implemented in numerous applications, including:-

Note that all of these implementations simply extract the RDF from the XHTML, parse it, and then add it to a store: only RDFAuthor actually does anything with the triples that are returned. There are also a handful of HTML pages on the Web which have the XML RDF directly embedded within them, of which Dan Brickley's FOAF namespace/schema is a notable example.

The latest RDF Syntax Working Draft includes some text that gives implementations of the embedding approach the basis of a solid algorithm for extracting RDF from arbitrary XML:-

If the content is known to be RDF/XML by context, such as when RDF/XML is embedded inside other XML content, then the grammar can either start at Element Node RDF (only when an element is legal at that point in the XML) or at production nodeElementList (only when element content is legal, since this is a list of elements).
- RDF/XML Syntax Specification (Revised), W3C Working Draft 25 March 2002, Dave Beckett

This is especially important in light of the fact that the SVG Recommendation allows one to embed external XML dialects within a particular element allocated as the metadata construct of SVG:-

The contents of the 'metadata' [element] should be elements from other XML namespaces, with these elements from these namespaces expressed in a manner conforming with the "Namespaces in XML" Recommendation
- SVG 1.0, 21.2 The 'metadata' element

Embedding in the head vs. body

The <head> of an HTML document is a reserved space to hold metadata about the document which contains it. However, in RDF, the subject of each triple is unlimited (except that it must be denoted with a URI-reference), so the RDF is independent of where it is placed within an HTML document.

In TimBL's Strawman Syntax and Altheim et al.'s augmeta proposals, however, the approach is different since only data which can be interpreted as predicate-object pairs are embedded within parts of the tree, and therefore are context sensitive. TimBL suggests using the current document as the subject in html:head, and the value of the href/cite attributes in any elements which have them.
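
The context sensitivity described here can be sketched as a simple tree walk. The Python below is one possible interpretation only, not the algorithm of either proposal: it assumes well-formed XHTML, takes the nearest enclosing href or cite value as the subject, and falls back to the document URI otherwise. The function name contextual_pairs and the example URIs are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML = "{http://www.w3.org/1999/xhtml}"

def contextual_pairs(xhtml_text, document_uri):
    """Pair each <meta> with the subject implied by its position in the tree."""
    statements = []

    def walk(element, subject):
        if element.tag == XHTML + "meta" and element.get("name"):
            statements.append((subject, element.get("name"), element.get("content")))
        # An href or cite attribute changes the subject for everything inside.
        target = element.get("href") or element.get("cite")
        inner = urljoin(document_uri, target) if target else subject
        for child in element:
            walk(child, inner)

    walk(ET.fromstring(xhtml_text), document_uri)
    return statements

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta name="DC.language" content="en" /></head>
<body><p>
<a href="inverts/scorpions.html">
<meta name="DC.title" content="Next Chapter" />
Next
</a>
</p></body>
</html>"""

statements = contextual_pairs(SAMPLE, "http://example.org/book/index.html")
for statement in statements:
    print(statement)
```

The meta element in the head is attributed to the document itself, whereas the one inside the link is attributed to the linked chapter.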

MIME vs. MIME

This section is slightly controversial, and consists of more of the author's opinion than the rest of this note.

The semantics of a URI-reference with fragment identifier are defined by the specification of the media-type of the representation returned by a network retrieval action of the base URI. The text/html media type specification (RFC 2854) states:-

For documents labeled as text/html, the fragment identifier designates the correspondingly named element; any element may be named with the "id" attribute [...]

The language here is not strict: since it applies to the SGML version, we do not know to which namespace(s) the words "any element" apply. Moreover, the HTML 4.0 specification defines some elements which do not have id attributes, e.g. <head>.

Notwithstanding (or perhaps because of) this ambiguity in the media-type specification for HTML, popular thinking amongst Web architecture experts is that IDs in XML RDF embedded in HTML have an unknown semantics.

XHTML is a language whose extensibility has been a major selling point: the enormous Modularization of XHTML specification is devoted to making it easier for people to create their own customized XHTML derivatives. Because of this, it would be sensible to defer the interpretation of XML IDs (and their synonyms, such as rdf:about in RDF) to the specification of the namespace of the embedded material.

TimBL has said that he thinks this solution "means that you can't use fragids to point to a generic bit of XML when just doing XML text processing". Substituting "can't necessarily" for "can't", I agree with this sentiment, but feel that it is unimportant. For example, XLink-aware applications can still move to an element with an XML ID declared; whether or not they understand the semantics of the thing denoted by the ID is irrelevant since the position is still marked with the XML ID; i.e. it does not matter whether the element is the actual thing denoted by the ID, as in HTML, or whether it describes the thing denoted by the ID, as in RDF.

There have also been concerns raised about languages such as the W3C ERT's EARL, a generic RDF-based evaluation language for which being able to identify explicit parts of XML trees is very important, and therefore for which the nature of the denotation of an XML ID must be known. However, EARL has already had to cope with this for a year or more now, and has room to overcome such problems. For example, ERT could decide to define a predicate that uses an XPath/XPointer style notation to point into the document tree.

Note that this exegesis only necessarily implies that the HTML media types be updated to make the semantics of an ID'd element depend upon the namespace of that element; it does not mean that this has to apply to every XML language, although that may be an option.

Conclusion: Which Approach Is Best?

This question is actually inappropriate: more appropriate may be "which approach, if any, is suitable for all applications of RDF associated with HTML?" and "which approach has the best ratio of implementability and architectural purity?". In other words, in order to resolve the issue, one has to look at the applications for associating RDF with HTML, and scope the approach around that.

Each of the approaches listed above has its advantages and disadvantages. Zealous pragmatists will always be around to argue that embedding the RDF straight into the XHTML is the best approach; otherwise, what is the point of having XML and namespaces, constructs that are there to enable language mixing?

Since it is not viable for the average HTML author to create a new variant of XHTML every time they want to embed some RDF, we can discount this approach immediately. Since embedding (and embedding within a <script> element) is an approach that does not validate, one can obviously not include a doctype declaration with the file. However, one may be able to specify an XSLT transformation which can be applied to the XHTML such that the result is validatable XHTML 1.x:-

<stylesheet xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
   xmlns="http://www.w3.org/1999/XSL/Transform" version="1.0" >
<template match="node()|@*">
   <copy><apply-templates select="node()|@*"/></copy>
</template>
<template match="rdf:RDF"/>
</stylesheet>

On the other hand, RDF is not only serializable as XML RDF; languages such as Notation3 and NTriples are popular. Given this situation, a language-independent metadata association mechanism would be preferable, especially if it allowed one to cascade. The obvious counterargument is that having a single canonical format for associating with HTML makes sense since it minimizes diversity and therefore increases the chances for interoperability.

Neither the linking ("<link> to the Metadata") nor the embedding ("Embed XML RDF Part I: Eschew Validation", and possibly "Embedding using <script>") methods can be ruled out, in the author's opinion. Linking has the substantial advantages that it is serialization independent, may reduce file sizes when a single set of triples is often referenced (such as contact information), and provides a cascade. Embedding is useful because it is direct, there are existing implementations to deal with it, and people will be embedding XML RDF and other languages like it into XHTML for a long time to come.

The HyperRDF, Augmeta, and generic profile attribute approaches are still valid. However, I recommend that authors of such documents combine them with the <link> element method, for example by pointing to the URI of an XSLT Web service that converts the current document into XML RDF.

In conclusion (and with the strong caveat that this is the author's opinion only), both the linking and embedding options should be supported by any new implementations that have to deal with extracting RDF from HTML. This is the path of least resistance since no one can ban anyone from linking or embedding, although it does make more work for the parser developers. It is important that the precise semantics of XML RDF embedded in HTML are made clear and published by the W3C, preferably as part of a generic language mixing note.

Further Reading

For anyone that's wondering what to do next.

Peripherally Related

Acknowledgements

Many thanks to Dave Beckett, Dan Brickley, and Dan Connolly for their early reviews and feedback. Many thanks also to Murray Altheim for his discussion of many important principles and for the augmeta approach, and William Loughborough and Dan Brickley for providing the inspiration to write this note up. Credit is also due to the many contributors to the RDF-in-XHTML threads on the RDF mailing lists: Joshua Allen, Danny Ayers, Seth Russell, Aaron Swartz, Jonathan Borden, et al.

This note was first published on: 2002-05-31; most recent update: 2002-06-02.

Sean B. Palmer