Good grammar and proper footnotes for data

Dear all,

Some thoughts on my own motivation for pursuing the cause
of linked data -- "unexamined assumptions" expressed here in
strong terms to encourage discussion :-)

I'm wondering how many of you agree that RDF is a language of
data -- the only such language we have with any traction --
and that URIs are the footnotes for data in the Web age?

Science and scholarship are founded on footnotes, and in
a sense, libraries were built to support the integrity and
longevity of footnotes.  Good grammar and proper footnotes --
what's not to like?  Can we agree on enough of the principles
here to work them into the case for library linked data?

Tom



RDF is the grammar for a language of data.  URIs are the words
of that language.  As in natural language, these words (i.e.,
the URIs) belong to grammatical categories.  RDF properties
(such as "isReferencedBy") function a bit like verbs, RDF
classes like nouns.

As in natural languages, where utterances are meaningful only
if they follow a sentence grammar, RDF statements follow a
simple and consistent three-part grammar of subject, predicate,
and object.  Analogously to paragraphs, RDF statements are
aggregated into RDF graphs.
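
To make the three-part grammar concrete, here is a minimal sketch
in plain Python.  It is an illustration only, not an RDF library:
the Triple type, the example.org URIs, and the objects() helper
are assumptions made up for this note, while the two property URIs
come from the DCMI Metadata Terms namespace.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Namespace of DCMI Metadata Terms; the example.org URIs are hypothetical.
DCT = "http://purl.org/dc/terms/"

# A graph is modelled here as a plain set of statements.
graph = {
    Triple("http://example.org/article/1",
           DCT + "isReferencedBy",
           "http://example.org/article/2"),
    Triple("http://example.org/article/1",
           DCT + "title",
           "Good grammar and proper footnotes for data"),
}


def objects(graph, subject, predicate):
    """Return the sorted objects of all statements matching subject and predicate."""
    return sorted(t.obj for t in graph
                  if t.subject == subject and t.predicate == predicate)
```

Asking the graph which resources reference article 1, via
objects(graph, "http://example.org/article/1", DCT + "isReferencedBy"),
returns the single URI of article 2.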

Aside from being words in the language of data, URIs double
as footnotes.  As footnotes they indicate the maintenance
responsibility for words by way of ownership of the domain
names under which the URIs were coined, as recorded in the
globally managed Domain Name System (DNS).  Inasmuch as the URIs
of words lead to documentation of official definitions, the
Web itself provides the language of data with its dictionary.
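
As a rough sketch of the footnote idea, the domain name under
which a URI was coined can be read straight off the URI with
Python's standard library.  The function name naming_authority
is an assumption for illustration, and the sketch stops at the
host name: it does not look up who actually owns the domain.

```python
from urllib.parse import urlsplit


def naming_authority(uri: str) -> str:
    """Return the host name under which a URI was coined.

    Only the host is extracted; finding the responsible party
    would need a separate registry lookup, which is not done here.
    """
    host = urlsplit(uri).hostname
    if host is None:
        raise ValueError(f"no host in URI: {uri!r}")
    return host
```

For the Dublin Core property
http://purl.org/dc/terms/isReferencedBy this yields "purl.org".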

The fifteen elements of Dublin Core have been likened to a
"pidgin" -- a lexicon of generic predicates good enough for the
sort of rudimentary but serviceable communication that occurs
between speakers of different languages.  Just as pidgins
are inadequate for more subtle or differentiated expression,
a healthy ecosystem of RDF vocabularies needs to include
more specialized vocabularies for use by social or scholarly
communities of discourse among themselves.

RDF is a language designed by humans for processing
by machines.  The RDF language -- the grammar together
with available RDF vocabularies -- does not itself solve
the difficulties of human communication any more than
the prevalence of English guarantees world understanding.
However, RDF does support the process of connecting dots --
of creating "knowledge" -- by providing a linguistic basis for
expressing and linking data.  

Just as English as a second language provides a basis for
communication among non-native English speakers, RDF provides
a common second language into which local data formats can be
translated and exposed.  Just as English is useful without
being the best of all possible grammars, RDF happens to be
what we currently have -- the only general-purpose language
for data with any traction.  But just as English grammar
follows deep linguistic structures determined by the human
capacity for language, it is likely that RDF, if re-invented,
would end up strongly resembling what we currently have.
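
A hedged sketch of what "translating a local format" might look
like in practice: a record with home-grown field names is mapped
onto statements using Dublin Core properties.  The field names
("ti", "au", "yr", "shelf"), the example.org URI, and the function
record_to_triples are all hypothetical; only the three property
URIs are taken from DCMI Metadata Terms.

```python
DCT = "http://purl.org/dc/terms/"

# Hypothetical local field names mapped to published properties.
FIELD_TO_PROPERTY = {
    "ti": DCT + "title",
    "au": DCT + "creator",
    "yr": DCT + "date",
}


def record_to_triples(record_uri, record):
    """Translate a local record (a dict) into (subject, predicate, object) tuples.

    Fields with no agreed mapping are skipped rather than guessed at.
    """
    triples = []
    for field, value in record.items():
        prop = FIELD_TO_PROPERTY.get(field)
        if prop is None:
            continue  # no shared vocabulary term for this local field
        triples.append((record_uri, prop, value))
    return triples


local_record = {"ti": "Moby-Dick", "au": "Herman Melville", "shelf": "A-12"}
triples = record_to_triples("http://example.org/book/42", local_record)
```

The purely local "shelf" field has no counterpart in the mapping
and is left out, so two statements result.  A more specialized
vocabulary would be needed to carry it across.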

Aside from supporting data interchange in the here and now, RDF
provides a response to the ongoing and inevitable obsolescence
of computer applications and customized data formats by
expressing knowledge using a well-understood grammar and citing
publicly documented vocabularies and resource URIs.  In this
sense, it supports data that does not require additional
out-of-band information for its interpretation, i.e., data
that "speaks for itself".  This assumes, of course, that
our cultural memory institutions will deploy robust methods
for preserving the parts of the Web where the underlying RDF
vocabularies and resource identifiers are documented.

We are in the midst of a rapid shift from a world in which
information was predominantly print-based to one in which it is
predominantly digital.  The scale and speed of transformation
virtually guarantee that any computer applications and user
interfaces we use today will at some point, probably soon,
be superseded.  Data that cannot speak for itself will be more
vulnerable to becoming irrelevant.

Not only is data expected to be linkable in the present,
but we hope it will remain intelligible in the future.
In 2010, to put information into ad-hoc data formats in
the absence of well-defined interpretations as RDF triples
is like making statements without grammar.  Creating data
without URIs is like writing without proper footnotes.
This is okay for information with a short shelf life --
i.e., most information -- but information of lasting cultural
significance deserves better.  Cultural memory institutions
live by the ethos of scholarship, by which things like good
grammar and proper footnotes should really matter. The language
of RDF represents the application of that ethos to data itself.


-- 
Tom Baker <tbaker@tbaker.de>

Received on Monday, 18 October 2010 02:41:52 UTC