Re: Preparing for metadata architecture discussion at the F2F from Noah Mendelsohn on 2010-10-15 (www-tag@w3.org from October 2010)

From: Noah Mendelsohn <nrm@arcanedomain.com>
Date: Fri, 15 Oct 2010 13:28:50 -0400
To: Jonathan Rees <jar@creativecommons.org>
CC: Larry Masinter <LMM@acm.org>, "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <4CB88F52.60900@arcanedomain.com>
A few notes on the attached.

 > 1. sadly, representations and metadata subjects do not generally have
 >     their own URIs,


On purely architectural grounds, this always seemed a strange asymmetry to 
me, given that the general Web philosophy is: identify everything of 
interest with URIs.  I understand why in practice this might typically be 
overkill, but it's interesting to see it emerge as a shortcoming here too.

> However, any metadata assertion (author, title, etc.) stated using a
> URI should be approached with caution, as the metadata subject you
> would see now might not be the one to which that metadata originally
> applied.

A few years ago I toyed with the thought that there might be some way of 
explicitly indicating, probably in HTTP headers, representations that were 
guaranteed to be invariant for all time, in the sense that subsequent 
retrievals would in some well-specified ways be "the same" (though clearly 
not the same in all properties, such as "time of last retrieval").  Anyway, 
the idea of offering such an HTTP header seemed to land with a pretty big 
thud, so I won't pursue it unless there is new interest on the part of 
others.  I do think it makes the Web a bit more rigorously applicable in 
situations like this.
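
As a rough illustration of what "the same in some well-specified ways" could mean, here is a minimal Python sketch. It is only an assumption about one possible definition: two retrievals count as the same if their bodies and their non-volatile headers match. The names VOLATILE_HEADERS, fingerprint and same_representation are hypothetical, and no such HTTP header or check is defined in this thread.

```python
import hashlib

# Headers assumed to differ legitimately between retrievals of an
# unchanged representation (an illustrative, non-exhaustive list).
VOLATILE_HEADERS = {"date", "age", "expires", "connection"}


def fingerprint(headers, body):
    """Return a SHA-256 digest over the stable headers and the body."""
    stable = {
        name.lower(): value.strip()
        for name, value in headers.items()
        if name.lower() not in VOLATILE_HEADERS
    }
    digest = hashlib.sha256()
    # Sort header names so the digest does not depend on header order.
    for name in sorted(stable):
        digest.update(f"{name}:{stable[name]}\n".encode("utf-8"))
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()


def same_representation(first, second):
    """Compare two (headers, body) pairs under the definition above."""
    return fingerprint(*first) == fingerprint(*second)
```

Under this definition a changed Date header does not matter, while any change to the body or to, say, Content-Type does.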

Noah



On 10/6/2010 11:01 AM, Jonathan Rees wrote:
> On Tue, Oct 5, 2010 at 2:43 PM, Noah Mendelsohn <nrm@arcanedomain.com> wrote:
>> Larry:
>>
>> For some time, the TAG has had open an ACTION-282 on Jonathan:
>>
>> ACTION-282 : on - Jonathan Rees - Draft a finding on metadata architecture.
>> - Due: 2010-10-21 - OPEN
>>
>> On our call of 16 Sept. 2010, it was agreed that we would discuss at the
>> upcoming F2F, and you generously offered [1]:
>>
>> "Larry: May be good to have a reading list... I will send mail"
>>
>> Accordingly, I was assigned:
>>
>> ACTION-465 : on - Noah Mendelsohn - Schedule F2F discussion of ACTION-282,
>> "which metadata mechanisms to use when". Get reading list from Larry and
>> www-tag. - Due: 2010-10-05 - OPEN
>>
>> So, Larry, it would be very helpful if you would prepare the reading list,
>> for me to include in the set of required readings for the F2F.  Can you give
>> me an ETA?  Thank you!
>>
>> Noah
>>
>>
>> [1] http://www.w3.org/2001/tag/2010/09/16-minutes.html#item05
>
> Larry, here are some of my notes on the subject. These are off the
> cuff and in a full treatment would have to be combined with other
> material on the subject.   -Jonathan
>
> -------
>
> Because this is the TAG list I'll use "resource", "representation",
> and "identification" per AWWW in spite of my dislike of its
> definitions of those words.  Ordinary people should substitute "thing"
> for "resource", "bag of bits" for "representation", and "naming" or
> "designation" for "identification".
>
> There is confusion about what "metadata" is.  In the wider world, and
> the library community specifically, it means "data about data" or
> "data about documents".  Unfortunately there is a second sense
> circulating; on occasion "metadata" is applied to
> information pertaining to just about any kind of entity.  For example,
> a person's date of birth is sometimes called "metadata" about the
> person.  To avoid confusion, and to help preserve the meaningfulness
> of the word "metadata", I advise restricting "metadata" to the former
> use, and applying a more general term such as "data" or "descriptive data" in
> the latter situation.
>
> The word "document" suffers from overuse so I will say "metadata
> subject" for something that metadata can be about.  For me these are
> things that you might put in a library or other document repository.
> Their identity is preserved through acts of reproduction.
> They don't change in significant ways - any significant change leads
> to a different metadata subject, not to a change in the original one.
> Whether a change is "significant" is always a matter of judgment but
> mainly what's meant is that reformatting (DOC to PDF, etc.) is not
> usually significant; if a library has to reformat its holdings to make
> obsolete formats accessible to current readers that's not considered a
> threat to the identity of a metadata subject.
>
> In the context of web architecture we are concerned with both metadata
> and (other) descriptive data, because not all "resources" are metadata
> subjects.
>
> To understand metadata on the web you need to distinguish resources
> from representations, and concomitantly descriptive data for resources
> from metadata for metadata subjects.
>
> For example, consider the resource <http://news.google.com/>.
> Properly speaking this is not a metadata subject.  Descriptive data
> for this resource might include that it is currently provided by
> Google Inc., or that the information it yields is updated frequently,
> or that on 6 October 2010 it linked to an article entitled "Scientists
> Win Chemistry Nobel for Carbon Atom Link".
>
> However, any particular "representation" of this resource would be a
> perfectly good metadata subject, with metadata such as publication
> date, language, word count, and subject matter.
>
> Metadata that properly belongs to a representation is often asserted
> instead on a resource that has such a representation.  There are
> several reasons for this:
>
> 1. sadly, representations and metadata subjects do not generally have
>     their own URIs, so specifying the subject of metadata assertions
>     is hard, and we just pick the nearest plausible URI  (cf. duri:)
>
> 2. the metadata might be sufficiently invariant across representations
>     (varying through conneg, session, time, etc.) to justify overloading
>     the resource's URI to mean either the resource or "any representation
>     of the resource"
>
> 3. because writing it is so concise, the base URI provides a tempting
>      subject for use in assertions about the representation
>
> Thus, one might say that Roy Fielding is an author of the resource
> <http://www.w3.org/TR/webarch/>, even though what's really meant is
> that he is an author of the (current) representation(s) of
> <http://www.w3.org/TR/webarch/>.
>
> We might even take the URI as a name
> not for a potentially changing resource, but for a particular metadata
> subject (with "representations" varying only in inconsequential ways).
> Example: based on known site policy, we might take
>
>      http://www.w3.org/TR/2004/REC-webarch-20041215/
>
> to refer to the 15 December 2004 version of the webarch
> recommendation, and use this URI to name it in, say, a scholarly
> references list or bibliographic database.
>
> However, any metadata assertion (author, title, etc.) stated using a
> URI should be approached with caution, as the metadata subject you
> would see now might not be the one to which that metadata originally
> applied.  Expectations in this regard need to be set through some out-of-band
> mechanism such as application architecture or articulated site
> stability policy.
>
> Where does one find metadata on the web?
>
> We currently have a number of options, among which are:
>
> - bibliographic databases and "landing pages"
>        examples: openurl, OAI-ORE, pubmed
> - embedded in a "representation" in various ways
>        examples: XHTML+RDFa, <title>, <meta>, <link>, XMP
> - HTTP entity-headers such as Content-Language:
> - following a link provided by a Link: header
>        (see "new opportunities" blog post)
> - .well-known/host-meta + link-template
>        (see "new opportunities" blog post)
>
> In principle metadata can be given directly in a <link> element, Link:
> header, or host-meta template, but I think we're recommending that
> there be a single Link: (etc) that directs you to a second document
> whose purpose is to describe the resource (a "resource description").
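
A minimal Python sketch of the "follow a single Link: to a description document" step might look like the following. It assumes the relation type is "describedby", which the text above does not name, and the function name find_description_links is hypothetical. The parsing is deliberately simplified: it does not cope with "<" inside a parameter value, or with "; rel=" appearing inside another quoted parameter.

```python
import re

# One link-value: <target> followed by everything up to the next "<".
_LINK = re.compile(r'<([^>]*)>([^<]*)')
# The rel parameter, either quoted ("a b") or a bare token.
_REL = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;,"]+))', re.IGNORECASE)


def find_description_links(link_header, rel="describedby"):
    """Return the targets in a Link: header value that carry `rel`."""
    targets = []
    for target, params in _LINK.findall(link_header):
        match = _REL.search(params)
        if match is None:
            continue
        # A rel value may list several space-separated relation types.
        rels = (match.group(1) or match.group(2) or "").lower().split()
        if rel in rels:
            targets.append(target)
    return targets
```

A client would then dereference the returned target to obtain the resource description, rather than reading metadata out of the header itself.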
>
> Like any metadata source, when a resource URI is available, a resource
> description could contain descriptive data for the resource, or
> invariant metadata for its representations, or both.
>
> Related to this are linked data practices around GET/303, fragid +
> RDF.  The RDF context is more general than metadata subjects; a set of
> axioms with a shared subject could be metadata but only if that shared
> subject is a metadata subject.
>
> If two sources of metadata conflict, which one gets priority?
>
> The cynical answer is that every chunk of metadata has its own
> provenance.  You have to just know the characteristics of the metadata
> source, and figure out for yourself which source is more likely to
> give you the right answer. The question is similar to: If two web pages
> disagree, which one is right?
>
> An answer given by the LRDD draft is that it's an error if Link: metadata
> conflicts with link-template metadata.  What you get from
> the two sources must be the same.
> The motivation for this is to allow clients to stop looking for
> metadata as soon as it is found at one location.  The requirement that
> the metadata be identical frees the client from any need to (at
> considerable cost in network bandwidth) examine the other source.
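
The "stop as soon as metadata is found" behaviour can be sketched in Python as below. This is an illustration under assumptions, not the LRDD algorithm itself: each source is modelled as a hypothetical function that takes a URI and returns either a metadata dictionary or None, and the name discover is made up for this sketch.

```python
from typing import Callable, Dict, List, Optional

Metadata = Dict[str, str]
# A source maps a URI to metadata, or to None if it has nothing to say.
Source = Callable[[str], Optional[Metadata]]


def discover(uri: str, sources: List[Source]) -> Optional[Metadata]:
    """Return metadata from the first source that has any.

    Later sources are never consulted once one answers, which is only
    safe if all sources are required to agree.
    """
    for source in sources:
        found = source(uri)
        if found is not None:
            return found
    return None
```

The ordering of the sources list (for instance a Link: header lookup before a host-meta lookup) is the caller's choice here; the agreement requirement is what makes that choice harmless.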
>
> (Link: and link-template are not yet deployed for metadata discovery
> as far as I know. They may be in use for other purposes such as OpenID.)
>
> Larry has suggested that any particular metadata source could
> communicate - perhaps through choice between two different Link:
> relations - whether it intends the provided metadata to override some
> other source (such as embedded metadata) or not.  For example, if a
> "representation" has embedded metadata asserting that the author is
> Roy Fielding, but the server (via Link:) asserts that the author is
> Larry Masinter, there could be two cases: either the server would say
> that the embedded metadata is more likely to be accurate than what it
> is providing (i.e. Link: is giving a sort of default), or it might
> believe that the information it's providing is more likely to be right
> than embedded metadata (maybe the server's metadata was subject to
> better QC than the embedded metadata).
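
The two cases in Larry's suggestion can be sketched in Python as a merge with a precedence flag. This is only a sketch under assumptions: the flag link_overrides stands in for whatever signal (such as a choice between two Link: relations) the server would actually use, and merge_metadata is a hypothetical name.

```python
def merge_metadata(embedded, linked, link_overrides):
    """Combine embedded and server-provided (Link:) metadata.

    link_overrides=True:  the server's values win on conflict.
    link_overrides=False: the server's values act as defaults and the
                          embedded values win on conflict.
    Keys present in only one source are kept in both cases.
    """
    if link_overrides:
        return {**embedded, **linked}
    return {**linked, **embedded}
```

With embedded metadata naming Roy Fielding as author and Link: metadata naming Larry Masinter, the override case yields Larry Masinter and the default case yields Roy Fielding.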
>
Received on Friday, 15 October 2010 18:28:24 UTC