ROBUST METADATA FOR WEB CONTENT
===============================

As the web grows, so too does the volume of metadata describing it, from offline references through established online services such as search engines to emerging technologies such as EARL and Annotea. For metadata to be useful other than within a closed system, we need standards. These are beginning to emerge, but have yet to be widely supported. Furthermore, there are problems with metadata exchange that are not adequately addressed in any standard. The purpose of this note is to consider such issues, and propose the outline of a standard.

THE SCENARIO
============

(****Figure 1 - a producer, a consumer and a database all deal with metadata)

A producer generates metadata. A database stores them. A consumer uses them. For this to be really useful, the consumer must have satisfactory answers to two fundamental questions:

* The addressing problem: what do the metadata refer to?
* Dealing with content change: are the metadata still valid now?

To deal first with the addressing problem, we have several existing mechanisms for it:

(****Figure 2 - ways to address web content)

* Domain
  Example (1): in a newspaper advert, "to take advantage of these great offers and many more, visit our website at www.example.com".
* URL
  Example (2): in a search engine result, we can find a list of pages dealing with a subject of interest such as "hotels in strelsau", "highland malt", or "sheet music".
* Simple Pointer
  Example (3): tools such as the W3C validator and some of the Site Valet tools identify validation errors in a page by pointing to a line and character within the page.
* XPointer
  Example (4): tools such as Annotea and Page Valet use a mechanism similar to XPointer to address content within a page.

In examples (1) and (2), the addressing mechanisms used are entirely appropriate to the usage, and any shortcomings in actual implementations fall outside the scope of this note.
Example (3) is also relatively simple, because it is presenting the information directly to a human agent. Nevertheless, it is not always adequate: for example, differing line-ending conventions can cause the validator to report entirely bogus line numbers, and byte and character offsets disagree whenever a tool counts them in a manner different to the validator.

Example (4) is altogether more problematic. XPointer is not a new technology, yet unlike some of its XML cousins (XPath, XSLT) it is neither widely supported nor widely deployed. In the real world, there are serious obstacles to XPointer realising its full potential. Prominent amongst these problems is the fact that whereas XPointer applies to well-formed XML, most of the web today is neither XML nor well-formed.

A second problem with addressing arises from HTTP Content Negotiation. A URL by definition (and in practice if we set aside abuses of the protocol) identifies a unique resource, but the resource may itself have more than one manifestation. It is a reasonable (though by no means guaranteed) premise that content negotiation will not affect the validity of the metadata in examples (1) and (2). But the metadata in examples (3) and (4) will not apply across the differences between, say, the English, Russian and Arabic versions of a multilingual page.

A PROPOSAL FOR ROBUST POINTERS
==============================

(*** reference to "Metrics for Markup Change Detection")

XPointer gives us a fully specified means of expressing a pointer into an XML document. HTML, and even the malformed tag-soup routinely served as text/html on the Web today, are sufficiently similar to XML that we may reasonably hope to apply a similar model. This is indeed what Annotea and Valet are doing, and it forms the basis for our proposal.

Generalising XPointers
======================

We will not seek to generalise XPointer to work directly with SGML or HTML.
Even if we can satisfactorily do so, this does not help us with the problem of tag-soup. Instead, we propose the following partial definition:

    A Generalised Pointer is an XPointer into an XML normalisation of a Web document.

This definition splits the problem into two parts: normalisation, and computation of the XPointer. The second part is already fully specified, but the normalisation remains to be defined. The HTML Working Group has declined to consider specifying a canonical normalisation, so we adopt an alternative approach.

In practice, this is straightforward. Normalisation of both HTML and tag-soup to XML is routinely performed by software, including the parsers of widely-used web browsers. Annotea relies implicitly on Amaya's normalisation. The original ER approach was to work by trial-and-error to a normalisation supported both by Valet and relevant Client tools such as FillyJonk and Snufkin.

However, to be more widely useful, any such normalisation must not only exist, but must be fully specified and available to any other agent that needs to generate or use the metadata. This gives us a provisional definition:

    A Specified Pointer is a Generalised Pointer, together with a specification of the normalisation.

What is considered an adequate specification for the purposes of this document remains open for discussion.

**** We need to enumerate cases.

The basic requirement for producing a Specified Pointer will be to publish a reference implementation in full, either as a webservice or as source code that relies only on standard tools (e.g. ANSI C, with no reliance on code that isn't open-source).

It must be up to a producer to specify the normalisation used. In the spirit of content negotiation, any agent (producer, consumer or other) may publish a list of normalisations (or classes of normalisation) supported.

* A normalisation webservice. Unambiguous and universally available, this is my preference. I can offer to publish such a service.
* Source code for normalisation, based on or including a widely-supported parser.
* "The normalisation performed by such-and-such parser under such-and-such conditions".

I would suggest that regardless of how it is specified, a producer of Specified Pointers should be required to publish at least a reference implementation in full, either as a universally-available webservice or as source in a widely-supported standardised language such as ANSI C.

Multiple specifications should be permitted where applicable, so a pointer might, for example, be represented by:

    http://foo.bar/xyz.html
    html[1]/body[1]/p[3]
    http://example.org/html-norm.html
    http://example.org/html-norm.tar.gz
    http://example.org/html-norm-svc
    http://other.example.com/other-html-norm-svc

Resolving Ambiguity
===================

The above argument implicitly relies on a URI identifying a single-valued resource. This is not the case in the presence of content negotiation, yet a fully specified pointer requires a single-valued resource.

To deal with this, we should consider specifying a single-valued resource identifier, comprising a URI together with sufficient HTTP data to resolve content negotiation unambiguously. For example, we might replace the URI in the above with:

    http://foo.bar/xyz.html
    Accept-Language,Accept-Encoding
    en-gb,it,de,se,en

**** This leaves a question where security is concerned - e.g. content is dependent on a password. We can note the fact, but we may not wish to store sufficient data to specify it fully.

Other Identifiers
=================

We have moved towards a provisional definition in terms of XPointer. But this is not the only means of referencing content within a document: many tools - such as the W3C validator - may use simpler references such as byte, character or line/column offsets. A general-purpose Specified Pointer can and should encompass this kind of reference.
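
The offset kinds mentioned above are not interchangeable: the same line and column can correspond to different character and byte offsets depending on line endings and encoding. The following is a minimal Python sketch of that effect, not part of any existing tool. The function name `offsets`, the sample text, and the choice to count columns in characters are all assumptions made for illustration.

```python
import re

# Matches one line including its terminator ("\r\n", "\r" or "\n"),
# or a final line that has no terminator.
_LINE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def offsets(data, encoding, line, column):
    """Return (char_offset, byte_offset), both 0-based, for a 1-based
    line and column. Assumption: the column counts characters."""
    text = data.decode(encoding)
    lines = _LINE.findall(text)
    # Characters in all preceding lines (terminators included),
    # plus the position within the target line.
    char_offset = sum(len(l) for l in lines[:line - 1]) + (column - 1)
    # Re-encode the prefix to find how many bytes precede that character.
    byte_offset = len(text[:char_offset].encode(encoding))
    return char_offset, byte_offset


# The same two-line text, served with CRLF and with LF line endings.
crlf = "na\u00efve\r\ncaf\u00e9 au lait\n".encode("utf-8")
lf = crlf.replace(b"\r\n", b"\n")

print(offsets(crlf, "utf-8", 2, 6))  # (12, 14)
print(offsets(lf, "utf-8", 2, 6))    # (11, 13)
```

Line 2, column 6 names the same character in both copies, yet all four numbers differ. A pointer that stores an offset therefore needs to record which kind of offset it is and, for anything other than a byte offset, the encoding it was computed under.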
Where such references do not rely on a normalisation, we can omit this element from the pointer:

    http://foo.bar/xyz.html
    11
    31

Likewise, a generalised pointer should permit references to a whole document (URI or SVRI without an XPointer):

    http://foo.bar/xyz.html

At this point, we can revisit our definition:

    A Specified Pointer is an identifier guaranteed to be sufficient to identify the subject of a metadatum.

The structure we have identified for this is:

    spointer            = (URI|SVRI) , Locator?
    Locator             = whole document (default) | Generalised Pointer | ByteOffset | CharOffset | LineColumnOffset
    SVRI                = URI , Negotiation , HTTP Request (**** optionally also store HTTP response?)
    Generalised Pointer = XPointer , Normalisation Spec
    Normalisation       = (reference implementation)+ , (implementation)*
    ByteOffset          = number
    CharOffset          = encoding, number
    LineColumnOffset    = encoding, number, number

**** This calls for a vocabulary, as well as structure.

**** Should we perhaps drop Normalisation and use Representation instead?

MARKUP METRICS AND CHANGE DETECTION
===================================

When dealing with stored metadata, we face the additional problem of dealing with change:

* Has the document been changed since the metadata were generated?
* If so, are the metadata still valid?
* If both the above, do we also have a valid pointer?

In the absence of date information (including valid Last-Modified headers) to tell us when a resource has changed, we need to look at document contents to detect changes. The simplest measure is a checksum.

However, we can do better than that. A checksum tells us nothing about the magnitude of a change: a document containing today's date, for example, might be updated daily without affecting the validity of metadata assertions, yet its checksum changes every time. Since markup implies structure, we can improve on a simple checksum by computing hashes not on the document itself, but on a suitable representation of it.
We can then refine our measure by considering only certain structural elements of interest, so that a mere date change is ignored, or (conversely) detected as distinct from a structural change, for instance if we are waiting for a spelling mistake to be fixed.

A first experiment in this has been described and implemented, with source code available. It was found to be successful at tracking change at different levels of significance, and successfully detected structural similarity over changes in rapidly-changing news sites such as CNN.

ROBUST METADATA
===============

If we can detect whether a document change affects the validity of a metadatum, we can improve the robustness of metadata by ignoring irrelevant changes. We can express this in terms of equivalence measures on the document: the metadata are still valid if and only if the document is unchanged modulo some equivalence relation. The hashing experiment referenced above demonstrates the feasibility of using equivalence relations to deal with change.

Examples of equivalence classes include:

(1) Equivalent structure, in terms of having identical element trees.
(2) Documents having the same sequence of HTML Headings, regardless of content.
(3) The result of applying any specified XSLT transform to a specified normalisation gives us an equivalence class.
(4) Documents having the same linearised text content are equivalent, regardless of markup. This could be applicable to an assertion about the clarity of the language used.
(5) Documents containing a table having caption "rainfall by month in ruritania" with axes labelled "month" and "city" are considered equivalent provided the axes don't change. If a new city is added, change will be detected, but any change outside the table or to its data will be ignored. This would be appropriate to an assertion about the table, or about the page contents.
(6) Documents having a designated div element
and being identical *except for* the contents of this div are considered equivalent. This kind of measure helps confirm that pages have a consistent presentation.
(7) As (6), but in addition to ignoring the content of that div
, we may apply some further normalisation to the remaining content - e.g. to ignore differences to a document title, meta elements, date, and an advertising banner.

We can express equivalence by defining arbitrary equivalence classes of markup. We should specify a relation we can apply programmatically and which others can replicate: basically, the same rules discussed for normalisation apply.

For example, taking a metadatum that is invalidated if the document element structure changes but which ignores attributes, text content, CDATA, etc, we might:

(1) Normalise to XML DOM.
(2) Discard attribute nodes and text nodes.
(3) Compute and store a Base64-encoded hash of the result.
(4) Trust our metadatum so long as the hash is unchanged.

    http://example.org/elements.xsl
    http://example.org/dom-elements-svc

If we have a webservice that combines the entirety of the above, we might reference that instead, though we should preferably also specify the full method:

    http://example.org/norm_elements_hash-svc
    http://example.org/norm_elements_hash.html
    (as above)

Storing such checksums with the metadata offers us a means of ascertaining whether the metadata are still valid after document change. In cases where metadata validity is not a simple binary property, we might reference it to multiple different checksums, and regard different combinations of pass/fail as different outcomes such as "partially invalidated".

Experience with Site Valet's problem reporting and tracking database is that a wide range of metadata can be usefully referenced to a smaller number of such checksums, as the same equivalence relations serve a range of different metadata.
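
The four-step element-structure example given earlier in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation behind any of the services named above. It assumes step (1) has already been done, so the input is well-formed XML produced by whatever normalisation the pointer specifies. The names `skeleton` and `structure_hash` are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def skeleton(element):
    """Step (2): serialise the tree keeping element names only.
    Attributes and text are simply never read, so they are discarded."""
    children = "".join(skeleton(child) for child in element)
    return "<%s>%s</%s>" % (element.tag, children, element.tag)


def structure_hash(normalised_xml):
    """Step (3): hash the element skeleton and Base64-encode the digest.
    Assumption: SHA-256; the note does not prescribe a hash function."""
    root = ET.fromstring(normalised_xml)
    digest = hashlib.sha256(skeleton(root).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


monday = "<html><body><h1>News</h1><p class='date'>1 May</p></body></html>"
tuesday = "<html><body><h1>News</h1><p class='date'>2 May</p></body></html>"
redesign = "<html><body><h1>News</h1><p>2 May</p><p>Extra</p></body></html>"

stored = structure_hash(monday)

# Step (4): trust the metadatum so long as the hash is unchanged.
print(structure_hash(tuesday) == stored)   # True: only text changed
print(structure_hash(redesign) == stored)  # False: element tree changed
```

The daily date change leaves the hash untouched, while the added paragraph is detected. Other equivalence relations from the list above would differ only in what `skeleton` keeps.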