W3C

- DRAFT -

HCLS

22 May 2012

Attendees

Present
Tony, tlebo, Scott_Marshall, Chimezie, EricP, Amit, egombocz, HarryH, Janos, Jun, mattgamble, michael, ram, Stella
Regrets
Chair
Scott_Marshall
Scribe
Jun

Quantifying RDF data sets, Janos G. Hajagos

<Janos> http://dl.dropbox.com/u/21690634/Quantifying%20RDF%20data%20sets.pdf

<HarryH> 412 623 is me - Harry Hochheiser Pittsburgh

<mscottm> Brian Lowe: Developer on VIVO project, Stella Mitchell also

<mscottm> Harry Hochheiser - University of Pittsburgh, interested in HCLS

<ram> Ram from Metaome - We have a life science search engine called

<scribe> scribe: Jun

<mscottm> Chimezie Ogbuji - Cleveland Clinic, Case Western, Recently

Scott: introduce Janos' talk: it's important to differentiate

<mscottm> VIVO - scientific research network ontology

Janos: one of the members of CTSA Connect graduate programme, to

<chimezie> yes, I do

Slide 1: a lot of further work. this just presents a start

slide 2

Janos: Semantic Web is based on RDF, a graph-based data model

<mscottm> CTSA Connect: http://www.ctsaconnect.org/about-us

Janos: more flexible than relational DBs by allowing parallel edges

slide 3

Janos: a paper submitted to the Triple Challenge 2010
... they did some quantification of datasets, looking into the
... drew some of the approaches of this paper
... took a look at the datasets of the challenge, and did some

slide 4

Janos: a basic Python library to parse N-Triples. it's a memory
... PyPy for just-in-time compilation, to speed up the processing

<Amit> conference is full! cannot join by voice

Janos: just some basic statistical analysis, then started to do
... each file is treated as its own graph. didn't use Named Graphs

Q: on scalability

Janos: largest one is LinkedCT
... 28 million triples. took 30% of 64 GB of memory
... SPARQL 1.1 might offer better performance

slide 5

Janos: started with some basic counts

slide 6

Janos: do some simple fraction calculations
... e.g., how many literals are in your triples?
... how many literals are unique?
... how many objects are unique?
... structure measurement, by taking out the typing sort of
... subject/object coverage, more pointing or more pointed?
... more concrete examples to follow
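
The fraction calculations listed for slide 6 can be sketched as follows. This is a minimal illustration, not the library presented in the talk: the function names `parse_ntriples_line` and `literal_fractions` are hypothetical, the parser assumes one well-formed N-Triples statement per line with no trailing comment, and a term is treated as a literal when it starts with a double quote.

```python
def parse_ntriples_line(line):
    """Split one well-formed N-Triples line into (subject, predicate, object).

    Assumption: subject and predicate contain no whitespace, and the
    object is everything that follows, up to the terminating " .".
    Returns None for blank lines and comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        obj = obj[:-1].rstrip()  # drop the statement terminator
    return subject, predicate, obj


def literal_fractions(triples):
    """Compute the literal and uniqueness fractions over a list of triples."""
    objects = [o for _, _, o in triples]
    literals = [o for o in objects if o.startswith('"')]
    total = len(objects)
    if total == 0:
        return {"literal_fraction": 0.0,
                "unique_literal_fraction": 0.0,
                "unique_object_fraction": 0.0}
    return {
        # share of triples whose object is a literal
        "literal_fraction": len(literals) / total,
        # share of literal occurrences that are distinct values
        "unique_literal_fraction":
            len(set(literals)) / len(literals) if literals else 0.0,
        # share of object occurrences that are distinct values
        "unique_object_fraction": len(set(objects)) / total,
    }


if __name__ == "__main__":
    sample = [
        '<http://example.org/a> <http://example.org/name> "Alice" .',
        '<http://example.org/b> <http://example.org/name> "Alice" .',
        '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .',
    ]
    parsed = [t for t in map(parse_ntriples_line, sample) if t]
    print(literal_fractions(parsed))
```

Reading a file line by line with this parser keeps only the parsed tuples in memory; the example.org URIs are placeholders.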

slide 7

<mscottm> scribenick: Jun

Janos: computed it against a couple of LOD datasets, 4 of the
... BioGrid database: an open access DB on Protein and Genetic
... BioPAX: pathways in BioPAX format
... BioGrid can be downloaded in OWL format
... VIVO: NIH-funded project for scientific networking
... got N-Triples for the VIVO dataset
... go through them by number of triples, descending

slide 8

Janos: top subjects, top classes, predicates, etc
... give you a good idea of how people use ontologies
... LinkedCT: 40% are literals, objects have 80% repetition
... three dominant classes
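
A "top subjects, top classes, top predicates" report of the kind described for slide 8 might be computed as below. This is a hedged sketch: `top_terms` is a hypothetical name, triples are assumed to be already parsed into (subject, predicate, object) string tuples in N-Triples notation, and classes are taken to be the objects of `rdf:type` statements.

```python
from collections import Counter

# Full IRI of rdf:type as it appears in N-Triples notation.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def top_terms(triples, k=3):
    """Return the k most frequent subjects, predicates and classes.

    Each entry is a list of (term, count) pairs, most frequent first.
    """
    subjects = Counter(s for s, _, _ in triples)
    predicates = Counter(p for _, p, _ in triples)
    # A class is counted once for every instance typed with it.
    classes = Counter(o for _, p, o in triples if p == RDF_TYPE)
    return {
        "subjects": subjects.most_common(k),
        "predicates": predicates.most_common(k),
        "classes": classes.most_common(k),
    }
```

Dividing a class count by the total number of `rdf:type` statements gives the share of instances in that class, which is one way to see whether a few classes dominate a dataset.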

Michael: have you done this analysis on the GO ontology?

Janos: not yet

Michael: expecting more diverse coverage

Janos: would be interesting to look at

slide 9

Janos: BioGrid in BioPAX
... 50 MB in OWL but 40 million triples in N-Triples format
... again, subject, object coverage, and top classes. they are not LOD yet
... get a good sense of what's actually in the content

slide 10

Janos: RxNorm
... only 6 classes. pretty small
... quite a bit of literals. structure data is higher than other datasets

Q: do you see big structural differences between these datasets?

Janos: TBD

slide 11

Janos: 1.2 million triples
... data about publications, such as Authorship, Person ...
... publication is dominant data source there. pretty good

slide 12

Janos: it has a lot of links to outside datasets, have a much

slide 13

Janos: top predicate: owl:sameAs. again has a lot of links to

mscottm: any idea about how one type of metric could be more

slide 14

Janos: there are a lot of tools for graph vis and analysis, but

slide 15

Janos: the twist is to allow multiple paths between 2 nodes

slide 16

Janos: there are ways to collapse the parallel edges, or put RDF
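
One possible way to collapse parallel edges is to merge all triples that share the same subject and object into a single weighted edge. The sketch below assumes this approach; it is not necessarily the method shown on the slide. The name `collapse_parallel_edges` is hypothetical, and triples with literal objects are skipped on the assumption that only resource-to-resource links form graph edges.

```python
from collections import Counter


def collapse_parallel_edges(triples):
    """Merge parallel edges into weighted edges.

    Input: (subject, predicate, object) string tuples in N-Triples notation.
    Output: dict mapping (subject, object) to the number of triples
    linking that pair, regardless of predicate.
    """
    weights = Counter()
    for s, _, o in triples:
        if o.startswith('"'):
            continue  # literal objects are not treated as graph nodes here
        weights[(s, o)] += 1
    return dict(weights)
```

The resulting weighted edge list can be loaded into tools that expect at most one edge per node pair.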

slide 17

Janos: show some examples
... get co-authors that are only members of a site, to get a

slide 18

Janos: do some basic graph analysis using Mathematica
... basic in-degrees, out-degrees, histograms, one/two degree
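
The in-degree and out-degree counts and their histograms can also be reproduced outside Mathematica. The sketch below uses Python instead and assumes the graph is given as a list of (source, target) pairs; `degree_counts` and `degree_histogram` are hypothetical names.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list.

    Parallel edges are counted separately. Nodes with degree zero do
    not appear in the corresponding Counter.
    """
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def degree_histogram(degree):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(degree.values())


if __name__ == "__main__":
    edges = [("a", "b"), ("a", "c"), ("b", "c")]
    in_deg, out_deg = degree_counts(edges)
    print(dict(degree_histogram(in_deg)))
    print(dict(degree_histogram(out_deg)))
```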

<mscottm> Nice!

slide 19

Janos: Gephi doesn't support parallel edges. you have to do some

slide 20

Janos: some links

<michael> thanks, janos, i need to drop off

Eric: any further analysis on some of the results, like the

[mscottm, I have to leave for another meeting]

<mattgamble> First how do you work out which metrics are useful?

<egombocz> Our Knowledge Explorer also provides metrics for weighing

<mscottm> Chime - would you please jot your comment/question into IRC?

<chimezie> My question was whether he had considered using rdflib

<mscottm> CTSA Connect - ISF - Integrated Semantic Framework: core is

<HarryH> Thanks, Janos - very interesting!

<ram> Thanks Janos

<Stella> thanks all, bye

<mscottm> bye all

Summary of Action Items

[End of minutes]
