Re: Memo on persistent reference - TAG please read before F2F discussion from Jonathan Rees on 2010-10-16 (www-tag@w3.org from October 2010)

From: Jonathan Rees <jar@creativecommons.org>
Date: Fri, 15 Oct 2010 22:10:25 -0400
To: Noah Mendelsohn <nrm@arcanedomain.com>
Cc: www-tag@w3.org
Message-ID: <AANLkTi=hGJW6=zkqfoth9nwpnq_VsWJ6Hu3nbBH-GFhc@mail.gmail.com>
On Fri, Oct 15, 2010 at 12:40 PM, Noah Mendelsohn <nrm@arcanedomain.com> wrote:
> These are comments on my readthrough of [1].  Overall, I find it to be very
> helpful, well written, and a very useful foundation for our work in this
> area.  So, here I'll concentrate on areas where I have quibbles or concerns.
>  All quoted text is from the draft.
>
>> The scenario under discussion is that of a general user or robot with a
>> reference in hand, using a well-known apparatus or method to "chase" the
>> reference and obtain the target document. The reference was not created with
>> that particular user in mind; the reference might be seen by anyone (or
>> anyone in some substantial community), so no special knowledge can be
>> assumed. Of course some knowledge must be assumed, such as how to read the
>> Latin alphabet or ASCII or use a browser; but not special knowledge peculiar
>> to the user or reference.
>
> I think we can do a bit better on this.  In general, the person using the
> reference must know, or at least successfully infer, the specifications
> (written or informal) that apply to resolution of the identifier.

Really? I bet the fraction of users who even know the specifications
exist is vanishingly small. The normal person just turns on their
computer and starts using a browser.

> In the
> case of the Web, the TAG has written with some care on this in its finding
> The Self-Describing Web.  That finding points out that everything one needs,
> not just to dereference a URI, but in the case of HTTP, to properly
> interpret information retrieved from the resulting resource, can be found
> directly or indirectly from RFC 3986.

Ditto. What matters is whether someone can get the information they
need. Specifications are only a part of a means to that end - a part
of today's "apparatus" - and their survivability is unclear, since
technologies change.

For example, if I try to GET a particular URI, and get a 404, I may be
able to find the document I need in the Internet Archive or the Google
cache.  That's not following the finding or any particular
specification; it's just doing what has to be done.
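
A minimal sketch of that "do what has to be done" chase, in Python. The
function name `chase`, the `fetch` callback, and the archive prefix used
in the usage note are hypothetical, chosen only for illustration; a real
archive or cache has its own lookup conventions.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher returns the document bytes, or None on failure (e.g. a 404).
Fetcher = Callable[[str], Optional[bytes]]

def chase(uri: str, fetch: Fetcher,
          fallback_prefixes: List[str]) -> Optional[Tuple[str, bytes]]:
    """Try the URI itself, then each fallback location, in order.

    Returns (location that answered, document bytes), or None if every
    location failed.
    """
    candidates = [uri] + [prefix + uri for prefix in fallback_prefixes]
    for location in candidates:
        body = fetch(location)
        if body is not None:
            return location, body
    return None
```

Called as `chase("http://example.org/somedoc.xml", fetch,
["https://archive.example/web/"])`, this tries the original location
first and the (hypothetical) archive copy second. Passing `fetch` in as
a parameter keeps the retrieval mechanism replaceable, which is the
point: the fallback is part of the apparatus, not of any specification.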

> Also, we should acknowledge that in many cases, a priori knowledge of the
> identification mechanism is required to interpret an identifier.  Someone
> finding the identifier:
>
>        http://example.org/somedoc.xml
>
> on the side of a bus could guess with at least moderate reliability
> that RFC 3986 applies, but if someone found the identifier
>
>        somedoc.xml
>
> on a piece of paper on a programmer's desk, then it might be a (relative)
> URI, or it might be a filename, perhaps some other identifier, or perhaps
> not an identifier at all.

Sorry, I thought I covered that in the intro.  Will review to make
sure this is clear.
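
The ambiguity in Noah's two examples can be sketched with Python's
standard `urllib.parse` module. The helper names `classify` and
`resolve`, and the base URI in the usage note, are assumptions made for
illustration; the only point is that a bare string carries no scheme, so
its interpretation has to come from context.

```python
from urllib.parse import urljoin, urlsplit

def classify(identifier: str) -> str:
    """Report whether the string itself says how to interpret it."""
    if urlsplit(identifier).scheme:
        # A scheme such as "http" signals that RFC 3986 applies.
        return "absolute URI"
    # No scheme: could be a relative reference, a filename, or neither.
    return "ambiguous"

def resolve(identifier: str, base: str) -> str:
    """Resolve a relative reference, given a base URI known from context."""
    return urljoin(base, identifier)
```

Here `classify("http://example.org/somedoc.xml")` gives "absolute URI",
while `classify("somedoc.xml")` gives "ambiguous". Only when the reader
already knows a base, as in `resolve("somedoc.xml",
"http://example.org/docs/")`, does the bare name become chaseable.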

>> I'll define persistence as survival beyond events that you would expect
>> would imply extinction,
>
> (editorial) this seems a slightly odd way to put it.  If I'm smart enough,
> then I know that the events don't imply extinction after all;  does that
> mean there's no persistence? Also, you don't say survival of what?  Might it
> be better to try something along the lines of:

I took the definition from one of the online dictionaries, and it made
sense to me.

> "I'll define persistence as survival of the reference and the information
> needed for its interpretation, for a very long time, typically tens of years
> or centuries [I would have thought millennia?], and in the face of a broad
> range of potential threats (technical failures of systems; organizational
> failures or death of responsible individuals; natural disasters; malicious
> attempts to hijack "ownership" of the reference; etc.)"

Seems too specific, but I'll see what I can do.

>> The ideal reference is both fast
>
> Really?  For the range of contexts you're talking about?  We'd in all cases
> prefer millisecond access from NVRAM to tablets carved in stone?  I would
> have thought it depended on the use case.

What use case requires a reference that *can't* be chased quickly?

Maybe I can be clearer about what I mean by "ideal". Will look into
expanding the section intro.

>> Failure modes
>
> Shouldn't there be failure modes for specifications being unclear, lost over
> time, etc?  That seems the common case even with materials written on, say,
> magnetic media from the 1950s and 1960s, where nobody can find or properly
> interpret the specifications used to encode them.  In principle,
> specifications and implementations could be hijacked over time:  e.g. someone
> could rewrite the specs for DNS resolution to insert a government agency
> into the lookup process, or the deployed infrastructure could do that, in
> violation of the (unmodified) specification.  In both cases, references are
> no longer correctly associated with targets over time.

Yes, this is a major concern in the digital archiving world, and
libraries are burned by it all the time. This isn't a problem I really
care to deal with, but I'll include it in a future version.

>> Placing bets
>
> I like the list, but I think there might be a fifth characteristic along
> with Ubiquity, etc., and that's "Self-checking".  What I have in mind is the
> possibility of using things like digital signatures to enforce end-to-end
> checking.

I covered this in the remediations section. Self-checking-ness is a
means to an end, so ought to be a way that a system shows that it
meets some other goal, not an evaluation criterion in itself. Will
review this.

> To invent an example, let's say I had a normative convention for putting a
> digital signature of the resource into the URI (or other reference) itself.
>  Now, no matter which of the other problems hit us, I know with very high
> confidence when I have or have not successfully found the intended target.
>  That seems a useful axis to explore. Maybe or maybe not that fits under
> your "safety net", but it feels different.

I considered adding signatures and PKI as a possible remediation, but
I have no idea how to analyze their suitability in archival contexts.
I think I would want to do more research (i.e. someone must have
already thought about this) before even mentioning it.
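
For concreteness, Noah's invented example might look something like the
Python sketch below. It is only an illustration under stated
assumptions: it embeds a plain SHA-256 digest rather than a true digital
signature (so no keys or PKI are involved), and the `#sha256=` fragment
convention and the function names are hypothetical, not taken from any
specification. It says nothing about suitability in archival contexts.

```python
import hashlib
from urllib.parse import urlsplit

def make_reference(uri: str, content: bytes) -> str:
    """Append a digest of the target document to the reference."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{uri}#sha256={digest}"

def verify(reference: str, retrieved: bytes) -> bool:
    """Check retrieved bytes against the digest carried in the reference."""
    fragment = urlsplit(reference).fragment
    name, _, expected = fragment.partition("=")
    if name != "sha256" or not expected:
        raise ValueError("reference carries no sha256 digest")
    return hashlib.sha256(retrieved).hexdigest() == expected
```

With a reference built this way, `verify(reference, retrieved_bytes)`
returns True only when the bytes match what the reference's author saw,
wherever they were retrieved from. The check breaks if the document is
legitimately revised or if the hash function is later weakened.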

Thanks for your careful reading.

Jonathan
Received on Saturday, 16 October 2010 02:10:53 UTC