Re: skos in billion-triple-challenge data

From: Dan Brickley <danbri@danbri.org>
Date: Sat, 11 Sep 2010 07:40:01 +0000
Message-ID: <AANLkTi=sFP=kt-q3p73=so3zNfMy9W39hK90JpErow5+@mail.gmail.com>
To: Ed Summers <ehs@pobox.com>
Cc: public-esw-thes@w3.org

On Sat, Sep 11, 2010 at 2:45 AM, Ed Summers <ehs@pobox.com> wrote:
> On a Friday whim (prompted by Dan Brickley) I downloaded the 2010
> Billion Triple Challenge dataset to look and see how many SKOS
> assertions there are in it, and from what domains. If you are
> interested the results can be found at:
>  http://gist.github.com/574700

This is great, thanks for doing this! I'm also having a similar
conversation with the Sindice team, and will be offering suggestions
for how they can map out the SemWeb vocabulary/data landscape. Now
would be a very good time for the SKOS community to figure out what
else they might want to learn about large scale SKOS deployment
patterns. SKOS is interesting in this regard, since it is a bit like a
domain vocabulary (dc, foaf, creative commons) and a bit like an
infrastructural vocabulary (rdfs, owl, ...). General RDF stats that
help dc, foaf, creative commons etc. understand their deployment
aren't so directly helpful for individual SKOS scheme creators, since
eg. 'UKAT in SKOS' or 'LCSH in SKOS' show up as very similar triples
in RDF.

So what would we like to know about SKOS?

For example -

(general questions)
- what non-skos properties most commonly point to things of type skos:Concept?
- what non-skos properties most commonly apply to skos:Concepts?
- which bits of SKOS are heavily used; are not used; are still used,
even though removed from the final spec?
- are people subclassing, superclassing SKOS classes eg. skos:Concept?
- are there sub/super-properties declared for SKOS properties?
- how are the internationalisation features of SKOS being used in practice?
- URI patterns: # vs / URIs, 303 redirects; are these being used?
- is SKOS for Web publication of 'traditional thesauri' used
differently (data patterns) from SKOS used to capture information from
users (tags, blog categories, wikipedia)?
- how long are prefLabel and other SKOS strings? (some graphs here
could help Web designers creating UI to display SKOS content)
- what common mistakes can we find in the data?
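
To make the first of these concrete, here is a rough sketch (not a
tested pipeline) of counting the non-SKOS properties that point at
things typed skos:Concept. It assumes the triples have already been
parsed into (subject, predicate, object) string tuples and fit in
memory, which will not hold for the full Billion Triple Challenge dump;
the function name and the example.org URIs are made up for illustration.

```python
from collections import Counter

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"


def properties_pointing_to_concepts(triples):
    """Count non-SKOS predicates whose object is typed skos:Concept.

    Hypothetical helper: `triples` is any iterable of (s, p, o) strings.
    """
    triples = list(triples)
    # First pass: collect everything explicitly typed as skos:Concept.
    concepts = {s for s, p, o in triples
                if p == RDF_TYPE and o == SKOS + "Concept"}
    # Second pass: tally predicates outside the SKOS namespace that
    # have one of those concepts as their object.
    counts = Counter()
    for s, p, o in triples:
        if o in concepts and not p.startswith(SKOS):
            counts[p] += 1
    return counts


# Tiny made-up sample: two documents tagged with one concept.
sample = [
    ("http://example.org/c1", RDF_TYPE, SKOS + "Concept"),
    ("http://example.org/doc1", DCT_SUBJECT, "http://example.org/c1"),
    ("http://example.org/doc2", DCT_SUBJECT, "http://example.org/c1"),
    ("http://example.org/c2", SKOS + "broader", "http://example.org/c1"),
]
result = properties_pointing_to_concepts(sample)
```

On a real crawl this would need a streaming, two-pass (or
database-backed) approach, and it only sees concepts that are
explicitly typed, not ones implied by the use of SKOS properties.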

(scheme-specific questions)
- given a SKOS scheme/dataset, eg. UKAT, we might re-ask some of the
above questions, eg. what properties point to concepts from that
scheme; or what domains are using it.
- what are the RDF types and most common properties found on objects
that have any property whose value is a link to an id.loc.gov LCSH skos
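
The scheme-specific variant might look something like the sketch
below: find every subject that links to a URI under a given prefix,
then tally the RDF types and other properties on those subjects. Again
this assumes in-memory (subject, predicate, object) string tuples; the
function name, the "http://id.loc.gov/" prefix and the sample URIs
(including the LCSH-style identifier) are assumptions for illustration,
and a literal that happens to start with the prefix would be miscounted.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT = "http://purl.org/dc/terms/"


def profile_subjects_linking_to(triples, prefix):
    """Tally rdf:type values and other predicates on subjects that
    have at least one object starting with `prefix`.

    Hypothetical helper: `triples` is any iterable of (s, p, o) strings.
    """
    triples = list(triples)
    # Subjects with any property whose value falls under the prefix.
    linkers = {s for s, p, o in triples if o.startswith(prefix)}
    types, props = Counter(), Counter()
    for s, p, o in triples:
        if s not in linkers:
            continue
        if p == RDF_TYPE:
            types[o] += 1
        else:
            props[p] += 1
    return types, props


# Made-up sample: one book linked to an invented id.loc.gov URI.
sample = [
    ("http://example.org/book1", RDF_TYPE,
     "http://purl.org/ontology/bibo/Book"),
    ("http://example.org/book1", DCT + "subject",
     "http://id.loc.gov/authorities/sh00000000#concept"),
    ("http://example.org/book1", DCT + "title", "A title"),
    ("http://example.org/other", DCT + "title", "Unrelated"),
]
types, props = profile_subjects_linking_to(sample, "http://id.loc.gov/")
```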

OK that's just off the top of my head. I'm sure others here must have
questions they'd be interested to see answers for. I'm emphasising
data questions that involve large aggregations of RDF data, not
analytics you'd do on your own local SKOS repository...

No promises that any of these questions can be answered, but finding
out what we want to know would be a useful first step.


