Re: Open Library and RDF

On Mon, Aug 16, 2010 at 4:21 AM, Thomas Baker <tbaker@tbaker.de> wrote:
> On Sun, Aug 15, 2010 at 05:10:29PM -0700, Karen Coyle wrote:
>>               The maximal ontological commitment definitely furthers
>> the sharing, and may be essential for it. In fact, libraries today are
>> organized in highly complex networks of data sharing that they depend
>> on for their survival in this difficult economic climate. Although it
>> is a bit of an exaggeration, I often say that libraries have achieved
>> an amazing efficiency -- that a book is published, then cataloged and
>> the data keyed once (usually by the relevant national library), and
>> every other library downloads that data into their catalog. There is
>> much more metadata re-use than metadata creation.
>
> (I sometimes wonder if it is still optimally efficient, in
> 2010, to create lots of redundant copies of catalog records
> in lots of local databases instead of just linking to a
> central record, but that would be a different discussion...)
>
>> I think we must start with that as our reality, and look at ways to
>> integrate library metadata into a wider universe without losing the
>> underlying precision of that data as it exists in library systems.
>
> Agreed. I'm not questioning the need for precise metadata.
> My point is that precision can be attained in different ways.
> One way is by defining a strongly specified ontology that
> enforces the logical consistency of data.  It will enforce
> that not just for data producers but also for data consumers.

We should be careful to manage expectations here - even the most
'strongly specified' ontology can only express certain kinds of
constraint, in RDFS/OWL at least. High quality, precise metadata
requires additional discipline that comes in at quite another level. I
think you know this very well, but just to spell it out for the record!

Ontologies are more like legal reference literature than police; they
don't directly enforce anything. This is partly why, in the Dublin
Core community, we found a need for 'application profiles': not just
to combine others' work rather than always having to define new terms,
but also to be able to talk explicitly about the structure of
descriptions, as well as about the structure of the world those
descriptions describe.

Regarding the 'Semantic Web', we call it 'semantics' because the rules
in our ontologies are generalisations about the world; sometimes,
however, we want rules that talk more directly about the data: 'when
you mention a person and they are no longer alive, mention their date
of death'; 'if you mention a group and you know its founder, mention
their name and give an identifying description of them', etc. [top of
my head examples]. These sorts of rules aren't directly about people
or groups, but about the structure of certain kinds of description. It
is possible to bend and twist semantic technology to work like this
(some of us have used/abused SPARQL, others OWL, e.g. some nice work
from clarkparsia recently), but the main thing to emphasise is that -
fresh out of the box - even 'strong' world-describing ontologies don't
express these kinds of rules. They'll tell you about types of things,
types of property and relationship, alongside patterns of agreed
meaning for talking and reasoning about them, including bundles of
facts that can't [or that must] be simultaneously true. So they'll
tell you about the world, but they won't tell you how to talk about
the world. If we leave things there, the data might be logically
consistent, expressing no contradictions, but it can still be low
quality, because there are a thousand ways to screw up data beyond
contradicting yourself.
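
To make that concrete: a rule like the second one above is naturally
expressed as a query over the data, rather than as an axiom about
groups. A rough sketch in Python with rdflib (the ex:founder property
and the input filename are made up purely for illustration):

from rdflib import Graph

g = Graph()
g.parse("groups.rdf")  # some FOAF-ish data; filename made up

# Flag groups that mention a founder but never give that founder a name.
# This is a rule about the shape of the description, not about groups
# themselves; under OWL's open-world reading a missing name is merely
# unknown, not wrong.
incomplete = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX ex:   <http://example.org/terms#>
    SELECT ?group ?founder WHERE {
        ?group a foaf:Group ;
               ex:founder ?founder .
        OPTIONAL { ?founder foaf:name ?name }
        FILTER (!bound(?name))
    }
""")

for group, founder in incomplete:
    print("Incomplete description:", group, "mentions", founder, "with no name")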


> Another way is by strongly controlling the consistency of
> data when it is created -- e.g., with application profiles,
> using criteria that can form the basis of syntactic validation,
> quality control, and consistency checks (and of course with
> training of the catalogers in the proper application of the
> conceptual system).

Yup. These kinds of checks can be conducted by consumers and syndicators too.

>           However, for the data to be good and
> consistent, it does not follow that the underlying vocabularies
> themselves must necessarily carry heavy ontological baggage.

Measuring complexity is hard; it's like a lump in the carpet that pops
up somewhere else when you try to flatten it away. For many years,
FOAF contained the rule "nothing can be both a document and a person".
If that rule is removed, have we made things simpler or more complex?
And 'things' is a vague metaphor; *what* exactly became simpler or
more complex? In the old version you know that any document, if it is
true, can only be describing things that are each either a person or a
document, never both, because it is simply modelled as 'impossible'
for a single entity to be considered both. If that disjointness rule
is dropped, what exactly gets more complex or simpler? Switching back
to Tom's metaphor - who has more 'baggage' to handle?
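
For what it's worth, acting on that old rule takes work on the
consumer side either way; the axiom doesn't run itself. Roughly
(rdflib again, assuming a recent version that bundles the FOAF
namespace, with a made-up filename):

from rdflib import Graph
from rdflib.namespace import RDF, FOAF

g = Graph()
g.parse("data.rdf")  # made-up filename

# The disjointness axiom only bites if somebody actually checks for
# entities typed as both; here is that check done by hand.
for s in g.subjects(RDF.type, FOAF.Person):
    if (s, RDF.type, FOAF.Document) in g:
        print("Typed as both Person and Document:", s)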

>> A second goal is to increase the interaction of library data with
>> other resources on the Web. This is one of the reasons why the
>> Metadata Registry created a hierarchy of properties, the highest level
>> of which are not bound by the FRBR entities. This allows data to be
>> exchanged without regard to strict FRBR definitions. The resulting
>> metadata, however, will still be more detailed than is desired (or
>> even understood) by non-library communities. Therefore I think we need
>> to work on defining classes and properties that can be used to
>> integrate library data to non-library, non-specialist resources and
>> services. FRBR and RDA jump right into the level of detail that the
>> library community relates to without looking at how those details
>> might fit into a larger picture. We need to work on developing that
>> picture.
>
> I agree that this is the challenge, and a layered approach
> sounds reasonable.  Is this the approach currently being followed
> by the FR and RDA committees?
>
>> What this all comes down to is that if we take the view that library
>> metadata must embrace different principles than it does today in order
>> for libraries to interact on the Web, then we've got a non-starter.
>> Library data is precise, meets particular needs, is principles based,
>> and is anything but arbitrary.
>
> As I see it, record sharing in the library world has been
> based on the sort of validation that one might express in
> an application profile, and the consistency of intellectual
> approach embodied in those records has been ensured by
> the training of experts in the proper application of
> internationally recognized standards.  I do not see this
> changing.

There's a huge middle ground between 'nothing must change' and
'everything must change'. The library world is a fine castle built on
shifting sands, since the publishing industry, literature and reading
itself are all changing. It's happening slowly enough that we don't
need to panic, but fast enough that I'm wary about asserting that
anything will remain unchanged.

> My point is that it is not necessarily strongly specified
> ontologies that will buy that precision, whereas strongly
> specified ontologies _will_ impose their ontological baggage
> on any downstream consumers of that data.

Depends what you mean by 'strongly specified ontologies'. Sorry to
keep blabbing about FOAF, but I can share some experience maybe. Lots
of consumers of FOAF data don't even read the human-facing spec, let
alone parse the RDF schema to discover the 'strong' OWL claims
(disjointness, inverses) or the weaker RDFS information (domain/range,
subclass). They often don't even use a 'proper' XML parser, let alone
get triples via RDF/XML parsing. So I don't buy the claim that rich
ontological modelling upstream makes work for those downstream.
Usually the underlying ontologies are ignored, and people take the
data at some kind of face value. There is a variant issue though: if
your model, in terms of entities and relationships, is
rich/strong/complex/powerful, it probably makes a pile of distinctions
that show up in your data even if consumers aren't using RDF/OWL, in
that there will be more terms, identifiers etc. (class and property
names, however they manifest themselves syntactically in XML, HTML,
JSON, CSV, etc.). But that's much more about levels of detail than
about the 'strength' of some ontological content.
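
To caricature that face-value consumption (rdflib once more, with a
made-up URL): the foaf:name property gets used directly, and the FOAF
schema with its axioms is never fetched at all:

from rdflib import Graph
from rdflib.namespace import FOAF

g = Graph()
g.parse("http://example.org/someones-foaf.rdf")  # made-up URL

# Use foaf:name at face value; never look at the schema, its
# domains/ranges or its OWL axioms.
for person, name in g.subject_objects(FOAF.name):
    print(person, name)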

> Where should the precision get defined and enforced -- in the process
> of creating and validating data, or does it get hard-wired
> into the underlying vocabularies themselves?  Designing an
> RDF vocabulary need not be like designing an XML schema --
> the RDF approach offers different ways to separate underlying
> semantics from specific constraints.

Yup

> My question is whether the FR and RDA process is considering
> that some of the desired precision might be defined not in
> the underlying vocabularies, but in application profiles that
> use those vocabularies.  An approach which pushes some of the
> precision into application profiles could provide flexibility
> without sacrificing rigor.  Are application profiles (possibly
> under a different name) an important part of the discussion?

This seems an important thing to discuss. I've seen things previously
ascribed to FRBR which sound like data integrity / discipline rules,
but which end up rather awkwardly manifesting as deeper ontological
claims (eg. about persons, subjects, and which can be what...). As
rules about what you might find in a certain kind of FRBR-approved
description, those rules are very valuable; considered as observations
about a world that will be further described by other independent
parties, they can seem quirky, since they assume a kind of closed
world in which FRBR is the only party who gets to make ontological
rules. Sorry not to back this up with a detailed example - I think I'm
thinking of some of the issues Karen has previously blogged on.

cheers,

Dan

Received on Monday, 16 August 2010 07:36:03 UTC