<Janos> http://dl.dropbox.com/u/21690634/Quantifying%20RDF%20data%20sets.pdf
<HarryH> 412 623 is me - Harry Hochheiser Pittsburgh
<mscottm> Brian Lowe: Developer on VIVO project, Stella Mitchell also
<mscottm> Harry Hochheiser - University of Pittsburgh, interested in HCLS
<ram> Ram from Metaome - We have a life science search engine called
<scribe> scribe: Jun
<mscottm> Chimezie Ogbuji - Cleveland Clinic, Case Western, Recently
Scott: introduce Janos' talk: it's important to differentiate
<mscottm> VIVO - scientific research network ontology
Janos: one of the members of CTSA Connect graduate programme, to
<chimezie> yes, I do
Slide 1: a lot of further work. this just presents a start
slide 2
Janos: Semantic Web is based on RDF, a graph-based data model
<mscottm> CTSA Connect: http://www.ctsaconnect.org/about-us
Janos: more flexible than relational DBs by allowing parallel edges
slide 3
Janos: a paper submitted to the Triple Challenge 2010
... they did some quantification of datasets, looking into the
... drew some of the approaches of this paper
... took a look at the datasets of the challenge, and did some
slide 4
Janos: a basic Python library to parse N-Triples. it's a memory
... PyPy for just-in-time compiling. speed up the processing
<Amit> conference is full! cannot join by voice
Janos: just some basic statistical analysis, then started to do
... each file is treated as its own graph. didn't use Named Graphs
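[Editor's sketch, not Janos' library: a minimal line-by-line N-Triples reader in Python, to illustrate the kind of parsing described above. The regular expression is a simplification and does not cover the full N-Triples grammar; the names `TRIPLE_RE` and `parse_ntriples` and the example.org IRIs are assumptions made for illustration.]

```python
import re

# Simplified pattern (an assumption, not the full grammar):
# subject = IRI or blank node, predicate = IRI,
# object = IRI, blank node, or literal with optional datatype/language tag.
TRIPLE_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|".*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)\s*\.\s*$'
)

def parse_ntriples(lines):
    """Yield (subject, predicate, object) string tuples, one per valid line."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match:
            yield match.groups()

# Hypothetical sample data
sample = [
    '<http://example.org/a> <http://example.org/name> "Alice" .',
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .',
    '# a comment',
]
triples = list(parse_ntriples(sample))
```

[Because the function is a generator, a file object can be passed in directly and read one line at a time.]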
Q: on scalability
Janos: largest one is LinkedCT
... 28 million triples. took 30% of 64 GB of memory
... SPARQL 1.1 might promise better performance
slide 5
scribe: started with some basic counts
slide 6
Janos: do some simple fractions calculations
... e.g., how many literals are in your triples?
... how many literals are unique?
... how many objects are unique?
... structure measurement, by taking out the typing sort of
... subject/object coverage: are nodes more pointing or more pointed to?
... more concrete examples to follow
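[Editor's sketch of the fractions listed above, in Python. The exact definitions used in the talk are not recorded in these minutes, so the formulas here are assumptions: in particular, "coverage" is taken to mean the share of distinct non-literal nodes that appear as a subject or as an object. The function and key names are hypothetical.]

```python
def is_literal(term):
    """Treat any term that starts with a double quote as a literal."""
    return term.startswith('"')

def basic_fractions(triples):
    """Compute simple ratios over a list of (s, p, o) string tuples."""
    triples = list(triples)
    if not triples:
        return {}
    objects = [o for _, _, o in triples]
    literals = [o for o in objects if is_literal(o)]
    subjects = {s for s, _, _ in triples}
    resource_objects = {o for o in objects if not is_literal(o)}
    nodes = subjects | resource_objects
    return {
        # share of triples whose object is a literal
        "literal_fraction": len(literals) / len(triples),
        # share of literal values that are distinct
        "unique_literal_fraction": len(set(literals)) / len(literals) if literals else 0.0,
        # share of object values that are distinct
        "unique_object_fraction": len(set(objects)) / len(objects),
        # assumed definition: distinct nodes seen as subject / as object
        "subject_coverage": len(subjects) / len(nodes),
        "object_coverage": len(resource_objects) / len(nodes),
    }

# Hypothetical toy data
toy = [
    ("<a>", "<name>", '"Alice"'),
    ("<b>", "<name>", '"Alice"'),
    ("<a>", "<knows>", "<b>"),
    ("<b>", "<knows>", "<c>"),
]
stats = basic_fractions(toy)
```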
slide 7
<mscottm> scribenick: Jun
Janos: computed it against a couple of LOD datasets, 4 of the
... BioGrid database: an open access DB on Protein and Genetic
... BioPAX: pathways in BioPAX format
... BioGrid can be downloaded in OWL format
... VIVO: NIH funded project for scientific networking
... got n-triples for VIVO dataset
... go through them in descending order of number of triples
slide 8
Janos: top subjects, top classes, predicates, etc
... give you a good idea of how people use ontologies
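[Editor's sketch: one way to produce "top subjects, top classes, predicates" lists in Python with `collections.Counter`. Classes are counted as the objects of `rdf:type` triples, which is an assumption about how the talk derived them; `top_counts` and the toy terms are hypothetical.]

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def top_counts(triples, k=3):
    """Return the k most frequent subjects, predicates, and classes."""
    subjects, predicates, classes = Counter(), Counter(), Counter()
    for s, p, o in triples:
        subjects[s] += 1
        predicates[p] += 1
        if p == RDF_TYPE:
            classes[o] += 1  # the object of rdf:type is the class
    return {
        "subjects": subjects.most_common(k),
        "predicates": predicates.most_common(k),
        "classes": classes.most_common(k),
    }

# Hypothetical toy data
toy = [
    ("<a>", RDF_TYPE, "<Person>"),
    ("<b>", RDF_TYPE, "<Person>"),
    ("<a>", "<name>", '"Alice"'),
    ("<p1>", RDF_TYPE, "<Paper>"),
]
top = top_counts(toy, k=1)
```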
... LinkedCT: 40% are literals, objects have 80% repetition
... three dominant classes
Michael: have you done this analysis on the GO ontology?
Janos: not yet
Michael: expecting more diverse coverage
Janos: would be interesting to look at
slide 9
Janos: BioGrid in BioPAX
... 50 MB in OWL but 40 million triples in N-Triples format
... again, subject, object coverage, and top classes. they are not LOD yet
... get a good sense of what's actually in the content
slide 10
Janos: RxNorm
... only 6 classes. pretty small
... quite a lot of literals. structured data is higher than in other datasets
Q: do you see big structural differences between these datasets?
Janos: TBD
slide 11
Janos: 1.2 million triples
... data about publications, such as Authorship, Person ...
... publication is dominant data source there. pretty good
slide 12
Janos: it has a lot of links to outside datasets, have a much
slide 13
Janos: top predicate: owl:sameAs. again has a lot of links to
mscottm: any idea about how one type of metric could be more
slide 14
Janos: there are a lot of tools for graph vis and analysis, but
slide 15
Janos: the twist is to allow multiple paths between 2 nodes
slide 16
Janos: there are ways to collapse the parallel edges, or put RDF
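[Editor's sketch: one possible way to collapse parallel edges, not necessarily the approach shown on the slide. Every predicate between the same subject and object is merged into a single directed edge whose weight is the number of merged triples. Literal objects are skipped. `collapse_parallel_edges` is a hypothetical name.]

```python
from collections import Counter

def collapse_parallel_edges(triples):
    """Map each (subject, object) pair to the number of triples linking them."""
    weights = Counter()
    for s, _p, o in triples:
        if o.startswith('"'):
            continue  # literals are attribute values, not graph nodes here
        weights[(s, o)] += 1
    return weights

# Hypothetical toy data: two different predicates link <a> to <b>
toy = [
    ("<a>", "<knows>", "<b>"),
    ("<a>", "<coAuthorOf>", "<b>"),
    ("<b>", "<knows>", "<c>"),
    ("<a>", "<name>", '"Alice"'),
]
edges = collapse_parallel_edges(toy)
```

[The resulting weighted edge list is a simple graph, which is the form that tools without multigraph support expect.]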
slide 17
Janos: show some examples
... get co-authors that are only members of a site, to get a
slide 18
Janos: do some basic graph analysis using Mathematica
... basic in-degrees, out-degrees, histograms, one/two degree
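[Editor's sketch: the talk used Mathematica for this step; the Python below is a swapped-in illustration of in-degree and out-degree histograms over a directed edge list. `degree_histograms` and the toy edges are hypothetical. Nodes with degree zero do not appear in the corresponding histogram.]

```python
from collections import Counter

def degree_histograms(edges):
    """Return (out_hist, in_hist), each mapping a degree to a node count."""
    out_deg, in_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    # Count how many nodes share each degree value
    return Counter(out_deg.values()), Counter(in_deg.values())

# Hypothetical toy edge list
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
out_hist, in_hist = degree_histograms(toy_edges)
```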
<mscottm> Nice!
slide 19
Janos: Gephi doesn't support parallel edges. you have to do some
slide 20
Janos: some links
<michael> thanks, janos, i need to drop off
Eric: any further analysis on some of the results, like the
[mscottm, I have to leave for another meeting]
<mattgamble> First how do you work out which metrics are useful?
<egombocz> Our Knowledge Explorer also provides metrics for weighing
<mscottm> Chime - would you please jot your comment/question into IRC?
<chimezie> My question was whether he had considered using rdflib
<mscottm> CTSA Connect - ISF - Integrated Semantic Framework: core is
<HarryH> Thanks , Janos - very interesting!
<ram> Thanks Janos
<Stella> thanks all, bye
<mscottm> bye all