RE: Centralization, redundancy, and survivability from Young,Jeff (OR) on 2011-03-24 (public-lld@w3.org from March 2011)

From: Young,Jeff (OR) <jyoung@oclc.org>
Date: Wed, 23 Mar 2011 23:33:23 -0400
To: "Simon Spero" <ses@unc.edu>, "public-lld" <public-lld@w3.org>
Message-ID: <52E301F960B30049ADEFBCCF1CCAEF590BE3B5E4@OAEXCH4SERVER.oa.oclc.org>
Sorry I haven't been more involved for the past week or so. I blame the W3C for standardizing XML Schemas and thus dooming old programmers like me to be maintenance slaves. :-(

Simon Spero wrote:
> A few brief notes:
> 
> 1.  Assigning identifiers that are guaranteed not to have been assigned
>  by another agent does not require a centralized repository of all
> identifiers; it merely requires a partitioning of the name space such
> that no part of the name space is delegated to more than one agent.

I agree, but would cut to the chase and tell people to use the http URI scheme to identify everything. Ed blogged about this recently: <http://inkdroid.org/journal/2011/03/22/geeks-bearing-gifts/>.
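
A minimal sketch of the partitioning idea, assuming each agent is delegated one http URI prefix and mints identifiers only beneath it. The domains and the Agent class are made up for illustration; nothing here is an actual OCLC or W3C mechanism.

```python
from itertools import count


class Agent:
    """Mints identifiers only inside the URI prefix delegated to it."""

    def __init__(self, prefix):
        self.prefix = prefix          # hypothetical delegated part of the name space
        self._counter = count(1)      # local counter; no central registry needed

    def mint(self):
        return f"{self.prefix}{next(self._counter)}"


def prefixes_are_disjoint(prefixes):
    """True if no prefix equals, or is nested inside, another prefix."""
    ordered = sorted(prefixes)
    # After sorting, a prefix is immediately followed by any string it prefixes.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


library_a = Agent("http://id.example.org/")   # assumed domain
library_b = Agent("http://id.example.net/")   # assumed domain
minted = [library_a.mint(), library_a.mint(), library_b.mint(), library_b.mint()]
```

As long as the prefixes are disjoint, the two agents never need to consult each other, which is what DNS already provides for http URIs.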

> 
> 2.  Worldcat contains a great deal of duplication, where multiple
> records exist for the same manifestation.  Usually these records will
> be brought into the same work-set, but sometimes a record is so badly
> constructed that a work-set will be split such that an ISBN corresponds
> to items in multiple work-sets.  I first encountered this the first
> time I tried using the experimental XISBN service;  the item I scanned
> the ISBN from was a paperback edition of... the AACR2.    Worldcat
> record numbers do not satisfy the Unique Name Assumption.

Unfortunately, OCLC numbers identify bibliographic records, not manifestations. The primary difference is "language of cataloging": a single manifestation can have a separate record for each language it was cataloged in, so the relationship from records to manifestations is presumably many-to-one. If xml:lang had been around back in the day we presumably could have avoided splitting this hair.

It's easier to believe that ISBNs identify manifestations. My recollection, though, is that ISBNs have been reassigned often enough that they can't be automatically trusted. This means that other information in the record needs to be taken into account. It sounds like the additional information in this case is so badly mangled that our "works" algorithm was forced to make the more conservative assumption that they are different. Such weightings are an art and tuning them needs to be done carefully to avoid breaking ten to fix one. 

Also note that OWL doesn't make the Unique Name Assumption, but it provides a solution in the form of owl:sameAs and owl:differentFrom. I agree it would be best if the records were fixed, though: <http://www.oclc.org/worldcat/support/bibins.htm>.
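
To illustrate how owl:sameAs and owl:differentFrom statements could be used, here is a rough sketch in plain Python (no RDF library) that groups URIs into equivalence classes from sameAs pairs and flags any differentFrom pair that lands in the same class. The record URIs are invented, and a real OWL reasoner does far more than this.

```python
def same_as_classes(same_as_pairs):
    """Group URIs linked by sameAs into equivalence classes (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for uri in list(parent):
        classes.setdefault(find(uri), set()).add(uri)
    return list(classes.values())


def contradictions(same_as_pairs, different_from_pairs):
    """Return differentFrom pairs that sameAs statements also declare equal."""
    classes = same_as_classes(same_as_pairs)
    return [(a, b) for a, b in different_from_pairs
            if any(a in c and b in c for c in classes)]


# Hypothetical record URIs, for illustration only.
R1, R2, R3, R4 = (f"http://example.org/record/{n}" for n in (1, 2, 3, 4))
same_as = [(R1, R2), (R2, R3)]
different_from = [(R1, R4), (R1, R3)]
groups = same_as_classes(same_as)
conflicts = contradictions(same_as, different_from)
```

Here R1, R2 and R3 collapse into one group, and the (R1, R3) differentFrom statement is reported as a conflict for a human to review.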

> 
> 3.  The economic model used for Worldcat has the unfortunate side
> effect of encouraging the proliferation of records that are
> sufficiently bad that they avoid de-duplication.  Usually the best
> record dominates the holdings count, but there are still enough
> careless people who download the incorrect records to make purging them
> (and merging them in with the correct record) problematic.

WorldCat does have a processing/reporting model for merging records (called XREF), but recognizing duplicates in the MARC data model isn't easy for humans or machines. In principle, the FRBR Work/Expression/Manifestation (WEM) entities should help funnel human/machine attention. Switching properties from owl:DatatypeProperty (i.e. literals) to owl:ObjectProperty (i.e. things) should also help.

> 
> 4. Because there is no automatic way of  synchronizing and pushing
> corrections from OCLC to member libraries with incorrect holdings,
> these errors can make it harder to make optimal choices when selecting
> targets for ILL requests.

I wonder if OCLC is really the bottleneck here. In principle, transporting MARC XML from OCLC to member libraries via HTTP in bulk should be easy. Integrating those records into local systems may not be, though. Another issue is keeping OCLC holdings current so we know which members hold what. I actually did my MLIS research paper on this (thanks Marcia!), but I forget what I found. ;-)

> 
> 5. The amount  of bibliographic information is relatively small
> compared to that processed in data intensive scientific disciplines.
>  ~7M LC bibliographic records on average take up less than 180 bytes of
> disk space each when bzip2 compressed.  200M records could fit
> comfortably on a smartphone flash card.  This number of bits is easy to
> massively replicate.
> 
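
The arithmetic behind point 5 can be checked quickly. This sketch takes the quoted 180 bytes per record as an upper bound and assumes the same average would hold for a 200M-record set, which the original message does not establish.

```python
BYTES_PER_RECORD = 180  # upper bound quoted above (bzip2-compressed average)


def compressed_size_gb(record_count, bytes_per_record=BYTES_PER_RECORD):
    """Estimated size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return record_count * bytes_per_record / 10**9


lc_gb = compressed_size_gb(7_000_000)       # the ~7M LC records
big_gb = compressed_size_gb(200_000_000)    # the hypothetical 200M records

print(f"{lc_gb:.2f} GB")   # 1.26 GB
print(f"{big_gb:.1f} GB")  # 36.0 GB
```
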
> 6. For most normal people, the reason for wanting a bibliographic
> record to add to a catalog is because they have a copy of the item that
> they wish to process and put on the shelf.

This sentence gives my Linked Data senses the willies. :-)

>  For non-unique items, the
> probability of being the first to acquire a copy of a new manifestation
> is rather low (unless you're LC).  However, the first records
> produced, especially if they're ONIX derived, are typically of rather
> poor quality.  A model of sharing that requests (but does not require) one
> to re-enter a value for a field chosen based on how likely it is that
> there may be an error in that field may be more likely to
> asymptotically approach awesome than a model that treats entire records
> as valuable objects in their own right, to be bought and sold as
> chattels.

It sounds like part of the problem with ONIX (as it is with MARC) is that "a value for a field" is typically a literal. Some of these literals may be codes (identifiers), but I'm guessing they typically aren't http identifiers and thus aren't easy to verify. Other literals are presumably names of reusable things that SHOULD be given http URIs but probably aren't because ONIX/MARC-based systems aren't automatically expected to coin and support http URIs. Trusting humans to enter literals consistently is bound to be disappointing. Identifying the pieces with http URIs would help break up the record, making things more reusable and consistent.
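
A toy illustration of the literal-versus-URI point, using invented records and a made-up example.org URI. Two catalogers enter the same creator as different strings, so a literal comparison fails, while a shared http URI compares equal trivially.

```python
# Hypothetical records: the same creator keyed in as a literal by two people.
literal_a = {"title": "Adventures of Huckleberry Finn",
             "creator": "Twain, Mark, 1835-1910"}
literal_b = {"title": "Adventures of Huckleberry Finn",
             "creator": "Mark Twain"}

# The same records with the creator identified by an assumed http URI.
TWAIN = "http://example.org/person/mark-twain"
linked_a = {"title": "Adventures of Huckleberry Finn", "creator": TWAIN}
linked_b = {"title": "Adventures of Huckleberry Finn", "creator": TWAIN}


def same_creator(record_1, record_2):
    """Naive exact-match comparison on the creator field."""
    return record_1["creator"] == record_2["creator"]


literal_match = same_creator(literal_a, literal_b)   # False: strings differ
linked_match = same_creator(linked_a, linked_b)      # True: same URI
```

Fuzzy matching can paper over some literal differences, but it brings back the weighting problem described earlier in this message.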

> 
> 7.  The advantage of a distributed system with local replicas as
> opposed to a centralized system with one-time downloads becomes
> especially apparent as records undergo continuous improvement.   This
> is where systems like worldcat local and its successors can shine.

It would be nice if centralized systems used 303 URIs to identify more of the reusable THINGS that are lurking in the shadows of their web service APIs and web documents. Speaking of XML, it's time for me to crawl back to my XML Schema hellhole.
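
For readers unfamiliar with 303 URIs, here is a small sketch using only the Python standard library. The /id/ and /doc/ path prefixes are assumptions chosen for the example, not any actual service's layout: a request for the thing gets a 303 See Other pointing at a document that describes it.

```python
from http.server import BaseHTTPRequestHandler

THING_PREFIX = "/id/"    # hypothetical: URIs that name real-world things
DOC_PREFIX = "/doc/"     # hypothetical: URIs of documents describing them


def respond(path):
    """Return (status, headers) for a GET on the given path."""
    if path.startswith(THING_PREFIX):
        # A thing cannot be sent over the wire, so redirect to its description.
        return 303, {"Location": DOC_PREFIX + path[len(THING_PREFIX):]}
    if path.startswith(DOC_PREFIX):
        return 200, {"Content-Type": "text/plain; charset=utf-8"}
    return 404, {}


class Handler(BaseHTTPRequestHandler):
    """Serves the redirect pattern above; bodies are placeholders."""

    def do_GET(self):
        status, headers = respond(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if status == 200:
            self.wfile.write(f"Description of {self.path}\n".encode("utf-8"))

# To try it locally:
#   from http.server import HTTPServer
#   HTTPServer(("localhost", 8000), Handler).serve_forever()
```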

Jeff
Received on Thursday, 24 March 2011 03:39:58 UTC