Centralization, redundancy, and survivability from Simon Spero on 2011-03-23 (public-lld@w3.org from March 2011)

From: Simon Spero <ses@unc.edu>
Date: Wed, 23 Mar 2011 18:39:07 -0400
To: public-lld <public-lld@w3.org>
Message-ID: <AANLkTinZJAhEVuK6VCd41fy_NQeRwA950euoiLMUPoB6@mail.gmail.com>

A few brief notes:

1.  Assigning identifiers that are guaranteed not to have been assigned by
another agent does not require a centralized repository of all identifiers;
it merely requires a partitioning of the name space such that no part of the
name space is delegated to more than one agent.

2.  Worldcat contains a great deal of duplication, where multiple records
exist for the same manifestation.  Usually these records will be brought
into the same work-set, but sometimes a record is so badly constructed that
a work-set will be split such that an ISBN corresponds to items in multiple
work-sets.  I first encountered this the first time I tried using the
experimental XISBN service; the item I scanned the ISBN from was a
paperback edition of... the AACR2.  Because distinct record numbers can
denote the same manifestation, Worldcat record numbers do not satisfy
the Unique Name Assumption.

3.  The economic model used for Worldcat has the unfortunate side effect of
encouraging the proliferation of records that are sufficiently bad that they
avoid de-duplication.  Usually the best record dominates the holdings count,
but there are still enough careless people who download the incorrect
records to make purging them (and merging them in with the correct record)
problematic.

4. Because there is no automatic way of synchronizing and pushing
corrections from OCLC to member libraries with incorrect holdings, these
errors can make it harder to make optimal choices when selecting targets for
ILL requests.

5. The amount of bibliographic information is relatively small compared to
that processed in data intensive scientific disciplines.  ~7M LC
bibliographic records on average take up less than 180 bytes of disk space
each when bzip2 compressed.  At that rate, 200M records would come to under
36 GB, which could fit comfortably on a smartphone flash card.  This number
of bits is easy to massively replicate.

6. For most normal people, the reason for wanting a bibliographic record to
add to a catalog is because they have a copy of the item that they wish to
process and put on the shelf.  For non-unique items, the probability of
being the first to acquire a copy of a new manifestation is rather low
(unless you're LC).  However, the first records produced, especially if
they're ONIX derived, are typically of rather poor quality.  A model of
sharing that requests (but does not require) one to re-enter a value for a field
chosen based on how likely it is that there may be an error in that field
may be more likely to asymptotically approach awesome than a model that
treats entire records as valuable objects in their own right, to be bought
and sold as chattels.

7.  The advantage of a distributed system with local replicas as opposed to
a centralized system with one-time downloads becomes especially apparent as
records undergo continuous improvement.  This is where systems like
Worldcat Local and its successors can shine.
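
The partitioning described in note 1 can be illustrated with a rough sketch.
Everything named here (Registry, Minter, the "unc/" and "lc/" prefixes) is
hypothetical and chosen only for illustration; it is not any existing
system's API.  The registry records only which prefix has been delegated to
which agent, and each agent then mints identifiers locally without
consulting anyone.

```python
import itertools


class Minter:
    """Mints identifiers inside one delegated prefix, with no central lookup."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._counter = itertools.count(1)

    def mint(self):
        # Local counter only; uniqueness across agents comes from the prefix.
        return f"{self._prefix}{next(self._counter)}"


class Registry:
    """Tracks delegated prefixes; it never sees individual identifiers."""

    def __init__(self):
        self._delegated = {}  # prefix -> agent name (hypothetical bookkeeping)

    def delegate(self, prefix, agent):
        # Refuse any prefix that overlaps an existing delegation, so that no
        # part of the name space is delegated to more than one agent.
        for existing in self._delegated:
            if existing.startswith(prefix) or prefix.startswith(existing):
                raise ValueError(f"{prefix!r} overlaps {existing!r}")
        self._delegated[prefix] = agent
        return Minter(prefix)


registry = Registry()
unc = registry.delegate("unc/", "UNC")
lc = registry.delegate("lc/", "LC")
minted = [unc.mint(), unc.mint(), lc.mint()]
print(minted)
```

Because no delegated prefix is a prefix of another, two agents can never
produce the same identifier, even though neither knows what the other has
minted.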

Received on Wednesday, 23 March 2011 22:39:40 UTC