ROBUST METADATA FOR WEB CONTENT
===============================

As the web grows, so too does the volume of metadata describing it, from
offline references through established online services such as search
engines to emerging technologies such as EARL and Annotea.

For metadata to be useful other than within a closed system, we need
standards.  These are beginning to emerge, but have yet to be widely
supported.  Furthermore, there are problems with metadata exchange that
are not adequately addressed in any standard.  The purpose of this note
is to consider such issues, and propose the outline of a standard.


THE SCENARIO
============

(****Figure 1 - a producer, a consumer and a database all deal with metadata)

A producer generates metadata.  A database stores them.  A consumer uses them.

For this to be useful, the consumer needs satisfactory answers to two
fundamental questions:
  * The addressing problem: what do the metadata refer to?
  * Dealing with content change: are the metadata still valid now?


To deal first with the addressing problem, we have several existing
mechanisms for it:

(****Figure 2 - ways to address web content)

* Domain
	Example(1): in a newspaper advert, "to take advantage of these great
	offers and many more, visit our website at www.example.com".

* URL
	Example(2): in a search engine result, we can find a list of pages
	dealing with a subject of interest such as "hotels in strelsau",
	"highland malt", or "sheet music".

* Simple Pointer
	Example(3): tools such as the W3C validator and some of the Site
	Valet tools identify validation errors in a page by pointing to
	a line and character within the page.

* XPointer
	Example(4): tools such as Annotea and Page Valet use a mechanism
	similar to XPointer to address content within a page.

In examples (1) and (2), the addressing mechanisms used are entirely
appropriate to the usage, and any shortcomings in actual implementations
fall outside the scope of this note.

Example (3) is also relatively simple, because it is presenting the
information directly to a human agent.  Nevertheless, it is not
always adequate: for example, there is an issue with line numbers
in the presence of different line endings that can cause the
validator to report entirely bogus results, and there is an
issue of byte vs. character offsets when using a tool that
deals with them in a manner different to the validator.
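
Both issues are easy to reproduce.  The following Python sketch is
illustrative only: the function names and the sample page are ours, not
taken from any of the tools mentioned.  It computes a position for the
same point in a document under two line-ending conventions, and then as
a byte offset.

```python
def line_col_lf_only(text, offset):
    """1-based (line, column) of a character offset, counting only LF as a line break."""
    line = text.count("\n", 0, offset) + 1
    col = offset - (text.rfind("\n", 0, offset) + 1) + 1
    return line, col


def line_col_any_ending(text, offset):
    """As above, but counting CR, LF and CRLF each as a single line break."""
    before = text[:offset].replace("\r\n", "\n").replace("\r", "\n")
    return before.count("\n") + 1, len(before) - (before.rfind("\n") + 1) + 1


def byte_offset(text, char_offset, encoding="utf-8"):
    """Byte offset corresponding to a character offset in the given encoding."""
    return len(text[:char_offset].encode(encoding))


# A hypothetical page using bare CR line endings and two non-ASCII characters.
page = "<p>caf\u00e9</p>\r<p>na\u00efve</p>\r<img>"
offset = page.index("<img>")               # character offset 25

print(line_col_lf_only(page, offset))      # (1, 26)
print(line_col_any_ending(page, offset))   # (3, 1)
print(byte_offset(page, offset))           # 27
```

Two tools that disagree on either convention will report different
positions for the same error, which is why an offset-based pointer needs
to state the convention it uses.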

Example (4) is altogether more problematic.  XPointer is not a new technology,
yet unlike some of its XML cousins (XPath, XSLT) it is neither widely
supported nor widely deployed.  In the real world, there are serious obstacles to XPointer
realising its full potential.  Prominent amongst these problems is the
fact that whereas XPointer applies to well-formed XML, most of the web today
is neither XML nor well-formed.

A second problem with addressing arises from HTTP Content Negotiation.
A URL by definition (and in practice, if we set aside abuses of the
protocol) identifies a unique resource, but the resource may itself have
more than one manifestation.  It is a reasonable (though by no means guaranteed)
premise that content negotiation will not affect the validity of the
metadata in examples (1) and (2).  But the metadata in examples (3) and (4)
will not apply across the differences between, say, the English,
Russian and Arabic versions of a multilingual page.


A PROPOSAL FOR ROBUST POINTERS
==============================

(*** reference to "Metrics for Markup Change Detection")

XPointer gives us a fully specified means of addressing content within
an XML document.  HTML, and even the malformed tag-soup routinely
served as text/html on the Web today, are sufficiently similar to XML
that we may reasonably hope to apply a similar model.  This is indeed
what Annotea and Valet are doing, and it forms the basis for our proposal.

Generalising XPointers
======================

We will not seek to generalise XPointer to work directly with SGML or HTML.
Even if we can satisfactorily do so, this does not help us with the problem
of tag-soup.  Instead, we propose the following partial definition:

	A Generalised Pointer is an XPointer into an XML normalisation
	of a Web document.

This definition splits the problem into two parts: normalisation, and
computation of the XPointer.  The second part is already fully specified,
but the normalisation remains to be defined.  The HTML Working Group has
declined to consider specifying a canonical normalisation, so we adopt
an alternative approach.

In practice, this is straightforward.  Normalisation of both HTML and tag-soup
to XML is routinely performed by software, including the parsers of widely-used
web browsers.  Annotea relies implicitly on Amaya's normalisation.  The
original ER approach was to work by trial-and-error to a normalisation
supported both by Valet and relevant Client tools such as FillyJonk and Snufkin.
However, to be more widely useful, any such normalisation must not only
exist, but must be fully specified and available to any other agent that 
needs to generate or use the metadata.  This gives us a provisional definition:

	A Specified Pointer is a Generalised Pointer, together with a
	specification of the normalisation.

What is considered an adequate specification for the purposes of this
document remains open for discussion.  **** We need to enumerate cases.
The basic requirement for producing a Specified Pointer will be to
publish a reference implementation in full, either as a webservice
or as source code that relies only on standard tools (eg ANSI C,
with no reliance on code that isn't open-source).
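
To give a feel for what a small source-code normalisation might look
like, here is a toy sketch in Python using only the standard library.
It is not the normalisation used by Amaya, Valet or any browser: the
class name, the list of void elements and the two recovery rules (a new
<p> or <li> closes an open one; stray end tags are ignored) are our own
simplifying assumptions.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: elements that never have content, so are never left open.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}
# Assumption: a start tag for one of these closes an open element of the same name.
IMPLIED_CLOSE = {"p", "li"}


class ToyNormaliser(HTMLParser):
    """Builds a well-formed element tree from HTML or tag-soup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = None
        self.stack = []          # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        if self.stack and tag in IMPLIED_CLOSE and self.stack[-1].tag == tag:
            self.stack.pop()
        # Valueless attributes (e.g. "checked") get their own name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        if self.stack:
            elem = ET.SubElement(self.stack[-1], tag, attrib)
        elif self.root is None:
            elem = self.root = ET.Element(tag, attrib)
        else:                    # markup after the root was closed
            elem = ET.SubElement(self.root, tag, attrib)
        if tag not in VOID_ELEMENTS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def normalise(html_text):
    """Return the root element of the normalised document."""
    parser = ToyNormaliser()
    parser.feed(html_text)
    parser.close()
    return parser.root


soup = "<HTML><BODY><P>one<P>two<BR><P>three</BODY></HTML>"
root = normalise(soup)
third = root.find("body[1]/p[3]")   # the pointer html[1]/body[1]/p[3], relative to the root
```

A different set of recovery rules would yield a different tree from the
same input, and hence a different pointer to the same paragraph.  That
is the reason the normalisation has to be published along with the
pointer.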

It must be up to a producer to specify the normalisation used.  In the
spirit of content negotiation, any agent (producer, consumer or other)
may publish a list of normalisations (or classes of normalisation) supported.
A normalisation might be specified in any of the following ways:

	* A normalisation webservice.  Unambiguous and universally available,
	  this is my preference.  I can offer to publish such a service.
	* Source code for normalisation, based on or including a widely-
	  supported parser.
	* "The normalisation performed by such-and-such parser under
	  such-and-such conditions".  

I would suggest that regardless of how it is specified, a producer of
Specified Pointers should be required to publish at least a reference
implementation in full, either as a universally-available webservice or
as source in a widely-supported standardised language such as ANSI C.
Multiple specifications should be permitted where applicable, so
a pointer might, for example, be represented by:

<spointer>
  <uri>http://foo.bar/xyz.html</uri>
  <pointer representation="norm">html[1]/body[1]/p[3]</pointer>

  <normalisation id="norm">
    <description>http://example.org/html-norm.html</description>
    <sourcecode>http://example.org/html-norm.tar.gz</sourcecode>
    <webservice>http://example.org/html-norm-svc</webservice>
    <webservice>http://other.example.com/other-html-norm-svc</webservice>
  </normalisation>
</spointer>
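
A consumer would need to read such a record and pick a normalisation it
supports.  The following Python sketch shows one way this might be done
with the standard library; the function name and the shape of the
returned dictionary are our own assumptions, and the element names are
those of the provisional example above.

```python
import xml.etree.ElementTree as ET

SPOINTER = """<spointer>
  <uri>http://foo.bar/xyz.html</uri>
  <pointer representation="norm">html[1]/body[1]/p[3]</pointer>

  <normalisation id="norm">
    <description>http://example.org/html-norm.html</description>
    <sourcecode>http://example.org/html-norm.tar.gz</sourcecode>
    <webservice>http://example.org/html-norm-svc</webservice>
    <webservice>http://other.example.com/other-html-norm-svc</webservice>
  </normalisation>
</spointer>"""


def read_spointer(xml_text):
    """Extract the URI, the pointer and the ways its normalisation is specified."""
    sp = ET.fromstring(xml_text)
    pointer = sp.find("pointer")
    norm_id = pointer.get("representation")
    # The pointer names its normalisation by id.
    norm = next((n for n in sp.findall("normalisation")
                 if n.get("id") == norm_id), None)
    if norm is None:
        raise ValueError("pointer refers to an unspecified normalisation")
    return {
        "uri": sp.findtext("uri"),
        "pointer": pointer.text,
        "description": norm.findtext("description"),
        "sourcecode": [e.text for e in norm.findall("sourcecode")],
        "webservices": [e.text for e in norm.findall("webservice")],
    }


record = read_spointer(SPOINTER)
```

The consumer can then try the listed webservices in turn, or fall back
to the published source, before evaluating the pointer.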


Resolving Ambiguity
===================

The above argument implicitly relies on a URI identifying a single-valued
resource.  This is not the case in the presence of content negotiation.
For a pointer to be fully specified requires that we are dealing with
a single-valued resource.  To deal with this, we should consider
specifying a single-valued resource identifier, comprising a URI
together with sufficient HTTP data to resolve content negotiation
unambiguously.  For example, we might replace <uri> in the above with:

  <svri>
    <uri>http://foo.bar/xyz.html</uri>
    <negotiation>Accept-Language,Accept-Encoding</negotiation>
    <http-request version="1.1">
	<Accept-Language>en-gb,it,de,se,en</Accept-Language>
	<Accept-Encoding />	<!-- something we don't do :-) -->
    </http-request>
  </svri>

**** This leaves a question where security is concerned - eg
content is dependent on password.  We can note the fact, but we
may not wish to store sufficient data to specify it fully.
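
As an illustration, a consumer might recover the request headers from
such a record as follows.  This Python sketch assumes the provisional
<svri> layout shown above; the function name is ours, and treating an
empty element as an empty header value is an assumption rather than part
of the proposal.

```python
import xml.etree.ElementTree as ET

SVRI = """<svri>
  <uri>http://foo.bar/xyz.html</uri>
  <negotiation>Accept-Language,Accept-Encoding</negotiation>
  <http-request version="1.1">
    <Accept-Language>en-gb,it,de,se,en</Accept-Language>
    <Accept-Encoding />
  </http-request>
</svri>"""


def request_for(svri_xml):
    """Return (uri, headers) needed to retrieve the single-valued resource."""
    svri = ET.fromstring(svri_xml)
    uri = svri.findtext("uri")
    negotiated = [name.strip()
                  for name in svri.findtext("negotiation", "").split(",")
                  if name.strip()]
    request = svri.find("http-request")
    headers = {}
    for name in negotiated:
        elem = request.find(name)
        if elem is None:
            # The record claims negotiation on a header it does not store.
            raise ValueError("no stored value for negotiated header " + name)
        headers[name] = (elem.text or "").strip()
    return uri, headers


uri, headers = request_for(SVRI)
```

Two agents are then addressing the same single-valued resource only if
both the URI and every negotiated header agree.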


Other Identifiers
=================

We have moved towards a provisional definition in terms of XPointer.
But this is not the only means of referencing content within a document:
many tools - such as the W3C validator - may use simpler references
such as byte, character or line/column offsets.  A general-purpose
Specified Pointer can and should encompass this kind of reference.
Where such references do not rely on a normalisation, we can omit this
element from the pointer:

<spointer>
  <uri>http://foo.bar/xyz.html</uri>
  <line>11</line>
  <column>31</column>
</spointer>

Likewise, a Specified Pointer should permit references to a
whole document (URI or SVRI without an XPointer).

<spointer>
  <uri>http://foo.bar/xyz.html</uri>
</spointer>

At this point, we can revisit our definition:
	A Specified Pointer is an identifier guaranteed to be
	sufficient to identify the subject of a metadatum.

The structure we have identified for this is:

spointer =
	(URI|SVRI) ,
	Locator?

Locator =	whole document (default)|
		Generalised Pointer |
		ByteOffset |
		CharOffset |
		LineColumnOffset

SVRI = 	URI ,
	Negotiation ,
	HTTP Request
	(**** optionally also store HTTP response?)

Generalised Pointer =
	XPointer ,
	Normalisation

Normalisation =
	(reference implementation)+ ,
	(implementation)*


ByteOffset = number
CharOffset = encoding, number
LineColumnOffset = encoding, number, number
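
One way to check that this structure hangs together is to write it down
as types.  The Python sketch below mirrors the outline above; the class
and field names are our own, and the vocabulary itself remains open.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Normalisation:
    reference_implementations: Tuple[str, ...]   # (reference implementation)+
    implementations: Tuple[str, ...] = ()        # (implementation)*

    def __post_init__(self):
        if not self.reference_implementations:
            raise ValueError("at least one reference implementation is required")


@dataclass(frozen=True)
class GeneralisedPointer:
    xpointer: str
    normalisation: Normalisation


@dataclass(frozen=True)
class ByteOffset:
    offset: int


@dataclass(frozen=True)
class CharOffset:
    encoding: str
    offset: int


@dataclass(frozen=True)
class LineColumnOffset:
    encoding: str
    line: int
    column: int


Locator = Union[GeneralisedPointer, ByteOffset, CharOffset, LineColumnOffset]


@dataclass(frozen=True)
class SVRI:
    uri: str
    negotiation: Tuple[str, ...]                 # names of the negotiated headers
    http_request: Tuple[Tuple[str, str], ...]    # (header, value) pairs


@dataclass(frozen=True)
class SPointer:
    resource: Union[str, SVRI]                   # URI | SVRI
    locator: Optional[Locator] = None            # None means the whole document


whole = SPointer("http://foo.bar/xyz.html")
# The encoding here is an assumed value; the earlier line/column example gives none.
position = SPointer("http://foo.bar/xyz.html", LineColumnOffset("utf-8", 11, 31))
```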

**** This calls for a vocabulary, as well as structure.
**** Should we perhaps drop Normalisation and use Representation instead?

MARKUP METRICS AND CHANGE DETECTION
===================================

When dealing with stored metadata, we face the additional problem of
dealing with change:
  * Has the document been changed since the metadata were generated?
  * If so, are the metadata still valid?
  * If both the above, do we also have a valid pointer?

In the absence of date information (including valid Last-Modified headers)
to tell us when a resource has changed, we need to look at document
contents to detect changes.  The simplest measure is a checksum.
 
However, we can do better than that.  A checksum tells us nothing about
the magnitude of a change: a document containing today's date might be
updated daily without affecting the validity of metadata assertions, yet
its checksum changes every time.
 
Since markup implies structure, we can improve on a simple checksum
by computing hashes not on the document itself, but on a suitable
representation of it.  We can then refine our measure by considering
only certain structural elements of interest, so that a mere date
change is ignored or, conversely, a change to the text is detected as
distinct from a structural change, as when we are waiting for a
spelling mistake to be fixed.
 
A first experiment in this is described at
<URL:http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2001Dec/0029.html>
and implemented at
<URL:http://valet.webthing.com/misc/dochash.html>
with source code at
<URL:http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Jan/0019.html>
This was found to be successful at tracking change at different levels of
significance, and successfully detected structural similarity over
changes in rapidly-changing news sites such as CNN.

ROBUST METADATA
===============

If we can detect whether a document change affects the validity of a
metadatum, we can improve the robustness of metadata by ignoring
irrelevant changes.  We can express this in terms of equivalence
measures on the document: the metadata are still valid if and only
if the document is unchanged modulo some equivalence relation.
The hashing experiment referenced above demonstrates the
feasibility of using equivalence relations to deal with change.

Examples of equivalence classes include:
  (1) Equivalent structure, in terms of having identical element trees.
  (2) Documents having the same sequence of HTML Headings, regardless
      of content.
  (3) The result of applying any specified XSLT transform to a specified
      normalisation gives us an equivalence class.
  (4) Documents having the same linearised text content are equivalent,
      regardless of markup.  This could be applicable to an assertion
      about the clarity of the language used.
  (5) Documents containing a table having caption "rainfall by month
      in ruritania" with axes labelled "month" and "city" are considered
      equivalent provided the axes don't change.  If a new city is added,
      change will be detected, but any change outside the table or to
      its data will be ignored.  This would be appropriate to an assertion
      about the table, or about the page contents.
  (6) Documents having an element <div id="main"> and being identical
      *except for* the contents of this div are considered equivalent.
      This kind of measure helps confirm that pages have a consistent
      presentation.
  (7) As (6), but in addition to ignoring the content of the <div>, we
      may apply some further normalisation to the remaining content - e.g.
      to ignore differences in the document title, meta elements, date,
      and an advertising banner.
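
Several of these relations can be expressed as a key computed from the
document, with two documents equivalent when their keys match.  The
Python sketch below does this for examples (2), (4) and (6).  It assumes
the input has already been normalised to well-formed XML; the function
names and the choice of SHA-256 are ours.

```python
import copy
import hashlib
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def heading_sequence_key(root):
    """Example (2): the sequence of heading elements, regardless of content."""
    return digest("/".join(e.tag for e in root.iter() if e.tag in HEADINGS))


def text_key(root):
    """Example (4): the linearised text with whitespace collapsed, regardless of markup."""
    return digest(" ".join("".join(root.itertext()).split()))


def outside_main_key(root):
    """Example (6): everything except the contents of <div id="main">."""
    clone = copy.deepcopy(root)
    for div in list(clone.iter("div")):
        if div.get("id") == "main":
            for child in list(div):
                div.remove(child)
            div.text = None
    return digest(ET.tostring(clone, encoding="unicode"))


monday = ET.fromstring(
    "<html><body><h1>News</h1><div id='main'><p>Monday</p></div>"
    "<h2>Links</h2></body></html>")
tuesday = ET.fromstring(
    "<html><body><h1>News</h1><div id='main'><p>Tuesday</p><p>More</p></div>"
    "<h2>Links</h2></body></html>")

same_headings = heading_sequence_key(monday) == heading_sequence_key(tuesday)   # True
same_frame = outside_main_key(monday) == outside_main_key(tuesday)              # True
same_text = text_key(monday) == text_key(tuesday)                               # False
```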

We can express equivalence by defining arbitrary equivalence classes
of markup.  We should specify a relation we can apply programmatically
and which others can replicate: basically, the same rules discussed
for normalisation apply.

For example, taking a metadatum that is invalidated if the document
element structure changes but which ignores attributes, text content,
CDATA, etc, we might:
  (1) Normalise to XML DOM.
  (2) Discard attribute nodes and text nodes.
  (3) Compute a hash of the result and store it Base64-encoded.
  (4) Trust our metadatum so long as the hash is unchanged.

<checksum value="AbCd1234">
  <sequence>
    <normalisation> <!-- as above --> </normalisation>
    <!-- now we apply some reduction to the normalised markup -->
    <reduction>
	<xslt>http://example.org/elements.xsl</xslt>
	<webservice>http://example.org/dom-elements-svc</webservice>
    </reduction>
    <hash method="base64"/>
  </sequence>
</checksum>
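
The four steps might be sketched in Python as follows.  The input is
assumed to be already normalised to well-formed XML (step 1 is
represented only by parsing it); SHA-256 is our choice of digest, since
this note does not name one, and the function names are hypothetical.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def element_skeleton(elem):
    """Serialise the element tree alone: attributes and text are discarded."""
    inner = "".join(element_skeleton(child) for child in elem)
    return "<%s>%s</%s>" % (elem.tag, inner, elem.tag)


def structure_checksum(normalised_xml):
    """Base64-encoded digest of the document's element structure."""
    root = ET.fromstring(normalised_xml)                       # step 1 (already normalised)
    skeleton = element_skeleton(root)                          # step 2
    raw = hashlib.sha256(skeleton.encode("utf-8")).digest()    # step 3
    return base64.b64encode(raw).decode("ascii")


def still_trusted(stored_checksum, normalised_xml):
    """Step 4: trust the metadatum while the checksum is unchanged."""
    return structure_checksum(normalised_xml) == stored_checksum


original = '<html><body><p class="lead">Monday</p><p>News</p></body></html>'
reworded = '<html><body><p>Tuesday</p><p id="n">Other news</p></body></html>'
restructured = '<html><body><p>Monday</p><ul><li>News</li></ul></body></html>'

stored = structure_checksum(original)
```

With these inputs, still_trusted(stored, reworded) holds, because only
text and attributes differ, whereas still_trusted(stored, restructured)
does not.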

If we have a webservice that combines the entirety of the above, we
might reference that instead, though we should preferably also specify
the full method:

<checksum value="AbCd1234">
  <choice>
    <webservice>http://example.org/norm_elements_hash-svc</webservice>
    <description>http://example.org/norm_elements_hash.html</description>
    <sequence>
	(as above)
    </sequence>
  </choice>
</checksum>

Storing such checksums with the metadata offers us a means of ascertaining
whether the metadata are still valid after document change.  In cases
where metadata validity is not a simple binary property, we might
reference it to multiple different checksums, and regard different
combinations of pass/fail as different outcomes such as "partially
invalidated".
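
A minimal Python sketch of such a multi-valued outcome follows.  The
three outcome labels, the checksum names and the rule for combining
results are illustrative assumptions; a real application would choose
its own mapping from pass/fail combinations to outcomes.

```python
def assess(stored, current):
    """Compare named checksums stored with a metadatum against current values.

    Returns (outcome, results), where results maps each checksum name to
    True (unchanged) or False (changed or missing).
    """
    results = {name: current.get(name) == value
               for name, value in stored.items()}
    if all(results.values()):
        outcome = "valid"
    elif not any(results.values()):
        outcome = "invalidated"
    else:
        outcome = "partially invalidated"
    return outcome, results


# Hypothetical checksum names and values.
stored = {"structure": "AbCd1234", "headings": "EfGh5678"}
current = {"structure": "ZzZz0000", "headings": "EfGh5678"}

outcome, results = assess(stored, current)   # "partially invalidated"
```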

Experience with Site Valet's problem reporting and tracking database
is that a wide range of metadata can usefully be referenced to a
smaller number of checksums such as those above, as the same
equivalence relations serve a range of different metadata.