An RDF wishlist

(rejigged subject line)

On Thu, Jul 1, 2010 at 4:35 AM, Pat Hayes <phayes@ihmc.us> wrote:
>> Pat, I wish you had been there.  ;)
>
> I have very mixed views on this, I have to say. Part of me wanted badly to
> be present. But after reading the results of the straw poll, part of me
> wants to completely forget about RDF,  never think about an ontology or a
> logic ever again, and go off and do something completely different, like art
> or philosophy.

I have mixed feelings about missing the workshop too. Having been
pushing this wheelbarrow uphill for far too long, it does seem a shame
to have missed such an event. On the other hand, it is hard to know
what to make of the workshop outcomes since the participants form an
unusually specialist subset of humanity, and the problem of what W3C
does next with its RDF standard is such a small part of the larger
problem.

It's clear that many workshop participants were aware of the risk of
destabilizing the core technologies just as we are gaining some very
promising real-world traction. That was a relief to read. For those
who have invested time and money in helping us get this far, and who
had the resources to participate, this concern was probably enough to
motivate participation. It's clear also that participants were aware
of many of the little annoyances that bring friction and frustration
to those working with RDF. What I'm less sure of is how to represent
the perspective of those who have explored RDF and walked away. Over
the years, many bright people have investigated RDF enthusiastically,
and left disappointed. Those folk didn't come to the workshop, they
didn't write a position paper, and they probably don't particularly
care about its outcomes. But they're just the kind of people who will
need to enjoy using RDF if we are to succeed.

Is RDF hard to work with? I think the answer remains 'yes', but we
lack consensus on why. And it seems even somehow disloyal to admit it.
If I had to list reasons, I'd leave nits like 'subjects as literals'
pretty low down. Many of the reasons I think are unavoidable, and
intrinsic to the kind of technology and problems we're dealing with.
But there are also lots of areas for improvement. Most of these are
nothing to do with fixups to W3C standards documentation. And finally,
we can lessen the perception of pain by improving the other side:
getting more decent linked data out there, so the suffering people go
through is "worth it".

Some reasons why RDF is annoying and hard (a mildly ordered list):

* RDF data is gappy, chaotic, full of unexpected extensions and
omissions - BY DESIGN
* RDF toolkits each offer different items from a large menu (syntaxes,
storage, inference facilities), so even when you're getting a lot, you
probably don't appreciate what you're getting, and we have no common
checklist that helps non-guru developers understand this.
* RDF toolkit / library immaturity; eg1. I wasted half a weekend
recently trying to find a decent Javascript system. eg2. I work in
Python using the popular rdflib library, whose half-finished SPARQL
support was recently removed and put into an 'extras' package; nobody
seems quite sure how well it works. The Ruby landscape remains messy
although the public-rdf-ruby list have recently been collaborating
actively to improve things. Broken old and abandoned code litters the
Web; good stuff remains on the bleeding edge and unpackaged. Great
ideas, code and algorithms remain trapped in a single implementation
language rather than transliterated to other widely deployed
languages. Almost every toolkit's SQL backend is represented
differently. Only a few serializers bother to prettify RDF/XML nicely,
despite there being opensource code out there that could easily be
copied.
* RDF is good for aggregation of externally managed data; managing
data *as* RDF comes with certain complexities since edit/delete
operations on a connected graph aren't as intuitive as on a closed
tree structure. If I delete a certain node from the graph, which
others should be cleaned up too? Named graphs help somewhat there but
good habits aren't yet understood, much less documented.
* As a community, we have some standards for documenting the atomic
terms in our vocabularies (ie. RDFS/OWL) but we tend to stop there,
and not to document the larger graph patterns that are needed to
really communicate using these structures, or the underlying use cases
that motivated them in the first place. We also don't do nearly enough
analytics and stats over the actual data out there to make it easier
to consume, and for publishers to gravitate towards existing idioms
rather than make up similar-but-different graph patterns that'll
confuse the landscape further.
* Our small community (we are outnumbered by Visual Basic enthusiasts,
let alone Javascripters) is fragmented and grumpy. OWL and Linked Data
enthusiasts too often talk and think disparagingly about each other's
work, or not-so-secretly wish the others would just go away and stop
messing things up. And all this foolish posturing despite the fact
that Linked Data is a massive deployment of OWL-documented
vocabularies, and that the essential but annoying gappy chaotic nature
of RDF can at least partly be patched up by techniques that help us
figure out when two different RDF expressions are saying the same
thing, aka inference.
* Enthusiasm sometimes borders on a religious zeal that would be
better spent on toolkit polish than on overloading mailing lists; or
on prolonging petty wars ('x is not semantic enough', 'y isn't really
Linked Data'...) with other folk who prefer for whatever reason to use
different technologies to publish, share and link data.
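
To make the edit/delete point in the list above a bit more concrete,
here is a minimal sketch in plain Python (no RDF toolkit). Everything
in it is an assumption made up for illustration: the ex: identifiers,
the blank node _:addr1, the graph names, and the helper functions
delete_node and drop_graph. Statements are held as
(subject, predicate, object, graph) tuples. It is one way to look at
the problem, not a recommended practice.

```python
# Hypothetical data: four statements spread over two named graphs.
QUADS = {
    ("ex:alice", "ex:knows", "ex:bob", "ex:graph1"),
    ("ex:bob", "ex:address", "_:addr1", "ex:graph1"),
    ("_:addr1", "ex:city", "Bristol", "ex:graph1"),
    ("ex:bob", "ex:name", "Bob", "ex:graph2"),
}


def delete_node(quads, node):
    """Remove every statement that mentions node as subject or object."""
    return {q for q in quads if node not in (q[0], q[2])}


def drop_graph(quads, graph):
    """Remove every statement recorded in the given named graph."""
    return {q for q in quads if q[3] != graph}


# Deleting ex:bob node-by-node leaves the address description behind,
# even though nothing points at it any more.
after_node_delete = delete_node(QUADS, "ex:bob")
print(sorted(after_node_delete))

# Dropping a whole named graph removes the address along with the rest
# of that graph, but keeps what ex:graph2 says about ex:bob.
after_graph_drop = drop_graph(QUADS, "ex:graph1")
print(sorted(after_graph_drop))
```

Neither result is obviously "right": the first leaves an orphaned
address, the second keeps a name for someone we may have meant to
remove entirely. That ambiguity is the kind of thing good habits
around named graphs would need to settle.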

So, what do we do?

A few years ago, Edd Dumbill turned the XML Europe conference into the
XTech conference, transforming it from a nose-too-close-to-the-screen
event for markup nerds, into an event that brought together browser
people, XML markup experts, open data advocates (creative commons
etc.), and forward-thinking creative technologists of every kind.
XTech is no longer with us, although I expect Edd's work at OSCON
(http://www.oscon.com/oscon2010 which I'd happily have attended over
any RDF/SemWeb event) shows similar insight. XTech was important as it
provided a meeting place for technologists with different technical
favourites, while also tapping into the larger themes that motivate
much of the passion in the first place. It helped people identify
themselves with a larger effort, rather than with some specific
technology tool.  I think we can learn a lot from XTech.

RDF enthusiasts share 99.9% of their geek DNA with the microformats
community, with XML experts, with OWL people, ... but time and again
end up nitpicking on embarrassing details. Someone "isn't really"
publishing Linked Data because their RDF doesn't have enough URIs in
it, or they use unfashionable URI schemes. Or their Apache Web server
isn't sending 303 redirects. Or they've used a plain XML language or
other standard instead. This kind of partisan hectoring can shrink a
community passionate about sharing data in the Web, just at a time
when this effort should be growing more inclusive and taking a broader
view of what we're trying to achieve.

The formats and protocols are a detail. They'll evolve over time. If
people do stuff that doesn't work, they'll find out and do other
things instead. The thing that keeps me involved is the common passion
for sharing information in the Web. If we keep that as an anchor point
rather than some flavour of some version of RDF, I think a lot of the
rest falls into place. I love
http://www.w3.org/Illustrations/LetsShare.ai.gif "Let's Share What We
Know" - an ancient slogan of the early Web project. If we take "Let's
share what we know" as a central anchor, rather than triples, we can
evaluate different technical strategies in terms of whether they help
by making it easier to "share what we know" using the Web.

Going back to my list, I think the reason to use RDF will simply be
that others have also chosen to use it. Nothing more, really: it's
about the data, above all. Sure, the reason we can all choose to use it
and gain value from each other's parallel decisions is the emphasis on
linking, sharing, mixing, decentralisation. But when choosing whether
to bother with RDF, I think for future decision makers it'll all be
about the data not the implementation techniques.

The reason is *not* the tooling, the fabulous parsers, awe-inspiring
inference engines, expressive query languages or cleverly designed
syntaxes. Those are all means-to-an-end, which is sharing information
about the world. Or getting hold of cheap/free and bulky background
datasets, if you prefer to couch it in less idealistic terms.

And why would anyone care to get all this semi-related, messy Web
data? Because problems don't come nicely scoped and packaged into
cleanly distinct domains. Whenever you try to solve one problem, it
borders on a dozen others that are a higher priority for people
elsewhere. You think you're working with 'events' data but find
yourself with information describing musicians; you think you're
describing musicians, but find yourself describing digital images; you
think you're describing digital images, but find yourself describing
geographic locations; you think you're building a database of
geographic locations, and find yourself modeling the opening hours of
the businesses based at those locations. To a poet or idealist, these
interconnections might be beautiful or inspiring; to a project manager
or product manager, they are as likely to be terrifying.

Any practical project at some point needs to be able to say "Enough
with all this interwingularity! this is our bit of the problem space,
and forget the rest for now". In those terms, a linked Web of RDF data
provides a kind of safety valve. By dropping in identifiers that link
to a big pile of other people's data, we can hopefully make it easier
to keep projects nicely scoped without needlessly restricting future
functionality. An events database can remain an events database, but
use identifiers for artists and performers, making it possible to
filter events by properties of those participants. A database of
places can be only a link or two away from records describing the
opening hours or business offerings of the things at those places.
Linked Data (and for that matter FOAF...) is fundamentally a story
about information sharing, rather than about triples. Some information
is in RDF triples; but lots more is in documents, videos,
spreadsheets, custom formats, or [hence FOAF] in people's heads.
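
As a rough illustration of the events example above, here is a small
sketch in plain Python, with triples held as simple tuples rather than
in any particular RDF toolkit. The ex: identifiers, the property names
(ex:performer, ex:genre) and the function name are all hypothetical,
invented for this sketch. The idea is only that the events data stays
small and scoped, while someone else's artist data supplies the
properties we filter on.

```python
# Our own, nicely scoped data: events and who performs at them.
EVENTS = [
    ("ex:event1", "ex:performer", "ex:artistA"),
    ("ex:event2", "ex:performer", "ex:artistB"),
]

# Somebody else's data about the same identifiers (assumed to have
# been fetched and parsed already).
ARTISTS = [
    ("ex:artistA", "ex:genre", "jazz"),
    ("ex:artistB", "ex:genre", "folk"),
]


def events_by_performer_property(events, external, prop, value):
    """Return events whose performer has prop == value in external data."""
    matching = {s for (s, p, o) in external if p == prop and o == value}
    return sorted(
        s for (s, p, o) in events if p == "ex:performer" and o in matching
    )


jazz_events = events_by_performer_property(EVENTS, ARTISTS, "ex:genre", "jazz")
print(jazz_events)
```

The events database never had to model genres itself; the shared
identifier is what makes the join possible.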

Looked at in these terms, my RDF wishlist would be based on looking at
things from the consumer side. Publishing RDF is fiddly, but do-able.
And it only takes a few lines of [perl|java|ruby|prolog|xslt...] to
expose massive amounts of information in the Web. The linked data
scene in recent years has started to do just this on an impressive
scale. But consuming RDF remains pretty annoying, a hurdle to be
crossed to get at the good stuff, the data. Even while RDF remains our
single best story for how such a Web of data can be broken down into a
largely uncoordinated global division of labour, RDF itself remains
... annoying. So my RDF wishlist would be about making RDF less
annoying or risky to consume. While a lot of that is about tool
maturity, there is a lot around data licensing and dealing with a
natural wariness of depending too much on others, and on making sure
RDF and the 'linked data' idea are presented in a more inclusive manner
that respects the fact that most of the world's information isn't
going to gain much from being put into URI-based triples.

The very nature of RDF makes it somewhat annoying to work with. RDF
data is always going to be a kind of Frankenstein's data monster,
patched together from bits and pieces that can just about be made to
fit together. Fortunately, we have at our fingertips a world wide Web
that lets us share an awful lot of these bits; the more we can get
re-usable RDF datasets out there, the less people will worry about the
pain of using it, and the more likely it'll be that there will be
genuinely useful, relevant data on hand when someone goes looking for
it.

All the time we run around evangelizing RDF while not admitting that
it is also kind of annoying, we raise expectations that will be dashed
when people actually try using it. All the time we spend ages writing
long emails when we could be fixing and improving RDF software or
datasets, we're probably also prolonging the problem. For my part in
that last one, and for this overlong mail, ... sorry :)

cheers,

Dan

Received on Thursday, 1 July 2010 08:47:03 UTC