Re: Looking for pedagogically useful data sets

One of my favorite parts of this classic book

http://www.amazon.com/gp/product/0201517523/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0201517523&linkCode=as2&tag=honeymediasys-20&linkId=DA6EAFVQC6QUHS5F

is its explanation of why you hit a wall with human-readable terms in a
large-scale knowledge base, and of how human-readable terms mislead people
about what computers actually understand about them.

That said, I know I got a lot of resistance from people when the first
version of :BaseKB used mid identifiers for everything, even though that
gave me 100% referential integrity and also meant I could use 32-bit ints
for identifiers in my code.

For large-scale systems, then, you are going to have to deal with
predicates such as "V96" and "P1158", not to mention "human-readable" terms
that are sometimes as bad or worse, and ultimately you need tooling to
help with that.
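As a sketch of what I mean by tooling: a query along these lines (assuming the vocabulary actually publishes rdfs:label triples, as Wikidata does for its P-codes) at least surfaces whatever labels exist and flags the predicates that have none:

```sparql
# List every predicate used in the data alongside its label, if any.
# Assumes the vocabulary ships rdfs:label annotations; an unlabeled
# predicate shows up with an unbound ?label, marking it for review.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?predicate ?label
WHERE {
  ?s ?predicate ?o .
  OPTIONAL { ?predicate rdfs:label ?label }
}
ORDER BY ?predicate
```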

In terms of the problem in front of me, I have 10 minutes to say something
meaningful to people who (i) don't know anything about RDF and SPARQL and
(ii) aren't necessarily convinced of its value. This is not as bad as it
seems, because the talk will be pre-recorded, so it can be scripted,
edited, and so forth.

There is definitely not going to be any special tooling, visualization,
inference, or the like. The goal is to show that you can do the same
things you do with a relational database, and maybe *just* a little bit
more. To pull this off I need to have as many things "experience near" as
possible (see http://tbips.blogspot.com/2011/01/experience-near.html).

Certainly FOAF data would be good for this; it is really a matter of
finding a specific file that is easy to work with.
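For the demo itself, the sort of query I have in mind is something like this (the prefixes and terms are the standard FOAF ones; the data file is whatever I end up finding):

```sparql
# SQL-flavored SPARQL over a FOAF file: roughly
# "SELECT name, mbox FROM people ORDER BY name".
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?mbox
WHERE {
  ?person a foaf:Person ;
          foaf:name ?name .
  OPTIONAL { ?person foaf:mbox ?mbox }
}
ORDER BY ?name
```

A query in that shape should read as nearly self-explanatory to a relational audience, which is the point.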

On Thu, Mar 12, 2015 at 3:54 AM, Sarven Capadisli <info@csarven.ca> wrote:

> On 2015-03-12 00:13, Paul Houle wrote:
>
>> Hello all,
>>
>>        I am looking for some RDF data sets to use in a short presentation
>> on
>> RDF and SPARQL.  I want to do a short demo,  and since RDF and SPARQL will
>> be new to this audience,  I was hoping for something where the predicates
>> would be easy to understand.
>>
>>       I was hoping that the LOGD data from RPI/TWC would be suitable,  but
>> once I found the old web site (the new one is down) and manually fixed the
>> broken download link I found the predicates were like
>>
>> <http://data-gov.tw.rpi.edu/vocab/p/1525/v96>
>>
>> and the only documentation I could find for them (maybe I wasn't looking
>> in the right place) was that this predicate has an rdfs:label of "V96".
>>
>> Note that an alpha+numeric code is good enough for Wikidata and it is
>> certainly concise,  but I don't want :v96 to be the first thing that
>> these people see.
>>
>> Something I like about this particular data set is that it is about 1
>> million triples which is big enough to be interesting but also small
>> enough
>> that I can load it in a few seconds,  so that performance issues are not a
>> distraction.
>>
>> The vocabulary in DBpedia is closer to what I want (and if I write the
>> queries most of the distracting things about vocab are a non-issue) but
>> then data quality issues are the distraction.
>>
>> So what I am looking for is something around 1 m triples in size (in terms
>> of order-of-magnitude) and where there are no distractions due to obtuse
>> vocabulary or data quality issues.  It would be exceptionally cool if
>> there
>> were two data sets that fit the bill and I could load them into the triple
>> store together to demonstrate "mashability"
>>
>> Any suggestions?
>>
>>
>
> re: "predicates would be easy to understand", whether the label is V96 or
> some molecule, needless to say, it takes some level of familiarity with the
> data.
>
> Perhaps something that's familiar to most people is Social Web data. I
> suggest looking at whatever is around VCard, FOAF, SIOC for instance. The
> giant portion in the LOD Cloud with the StatusNet nodes (in cyan) use FOAF
> and SIOC. (IIRC, unless GnuSocial is up to something else these days.)
>
>
> If statistical LD is of interest, check out whatever is under
> http://270a.info/ (follow the VoIDs to respective dataspaces). You can
> reach close to 10k datasets there, with varying sizes. I think the best bet
> for something small enough is to pick one from the
> http://worldbank.270a.info/ dataspace e.g., GDP, mortality, education..
>
> Or take an observation from somewhere, e.g.:
>
> http://ecb.270a.info/dataset/EXR/Q/ARS/EUR/SP00/A/2000-Q2
>
> and follow-your-nose.
>
> You can also approach from a graph exploration POV, e.g.:
>
> http://en.lodlive.it/?http://worldbank.270a.info/classification/country/CA
>
> or a visualization, e.g., Sparkline (along the lines of how it was
> suggested by Edward Tufte):
>
> http://stats.270a.info/sparkline
>
> (JavaScript inside SVG building itself by poking at the SPARQL endpoint)
>
> If you want to demonstrate what other type of things you can do with this
> data, consider something like:
>
> http://stats.270a.info/analysis/worldbank:SP.DYN.IMRT.IN/transparency:CPI2011/year:2011
>
> See also "Oh Yeah?" and so on..
>
>
> Anyway... as a starting point, social data/vocabs may be easier to get
> across, but then you always have to (IMHO) show some applications or
> visualizations for the data to bring the ideas back home.
>
> -Sarven
> http://csarven.ca/#i
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com
http://legalentityidentifier.info/lei/lookup

Received on Thursday, 12 March 2015 21:39:29 UTC