
Re: How do you explore a SPARQL Endpoint?

From: David Booth <david@dbooth.org>
Date: Wed, 04 Feb 2015 16:25:43 -0500
Message-ID: <54D28E57.2080707@dbooth.org>
To: Michael F Uschold <uschold@gmail.com>, Lushan Han <lushan1@umbc.edu>
CC: Pavel Klinov <pavel.klinov@uni-ulm.de>, Bernard Vatant <bernard.vatant@mondeca.com>, Juan Sequeda <juanfederico@gmail.com>, Semantic Web <semantic-web@w3.org>, public-lod public <public-lod@w3.org>
The RDF Pipeline Framework also has a perl script that reads RDF 
(Turtle), figures out the data's implied schema -- classes and 
predicates -- and outputs a summary.  The code is open source (Apache 
2.0 licensed) and resides on GitHub:
https://github.com/dbooth-boston/rdf-pipeline/blob/master/tools/summarize-rdf

It is *not* very efficient, so at present it is not suitable for large 
RDF datasets.  (It could be made more efficient, but no effort has been 
put into that yet.)  An opening comment in the code explains the output:
[[
# Runtime: ~30 minutes / 600k triples on a 2012 laptop (quad processor)
#
# EXAMPLE INPUT:
# 1. @prefix p: <http://purl.org/pipeline/ont#> .
# 2. @prefix : <http://localhost/node/> .
# 3. :max a p:FileNode . # No updater -- update manually.
# 4. :odds a p:FileNode ;
# 5. p:inputs ( :max ) ;
# 6. p:updater "odds-updater" .
# 7. :mult a p:FileNode ;
# 8. p:inputs ( :odds <http://localhost/node/multiplier.txt> ) ;
# 9. p:updater "mult-updater" .
# 10. :addone a p:FileNode ;
# 11. p:inputs ( :mult ) ;
# 12. p:updater "addone-updater" .
# 13. p:URI <http://www.w3.org/2000/01/rdf-schema#subClassOf> p:Node .
#
# EXAMPLE OUTPUT:
# 1. ===== Input Summary =====
# 2. Parsing turtle: /tmp/jin.ttl
# 3. Total triples: 19
# 4. Nodes by kind: BLANK 4 LITERAL 3 URI 7
# 5. Literals by datatype: UNTYPED 3
# 6.
# 7. ===== Predicates by Subject Class =====
# 8. p1:FileNode 4
# 9. p1:inputs 3 -> { rdf:List 3 } 3
# 10. p1:updater 3 -> { (UNTYPED) 3 } 3
# 11. rdf:type 4 -> { rdfs:Class 1 } 1
# 12.
# 13. rdf:List 4
# 14. rdf:first 4 -> { p1:FileNode 3 UNKNOWN 1 } 4
# 15. rdf:rest 4 -> { rdf:List 2 } 2
# 16.
# 17. rdfs:Class 1
# 18. rdfs:subClassOf 1 -> { rdfs:Class 1 } 1
# 19.
# 20. * Indicates a root class, whose instances are never objects.
# 21.
# 22. ===== Namespaces =====
# 23. PREFIX p1: <http://purl.org/pipeline/ont#>
# 24. PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# 25. PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
#
# EXPLANATION OF OUTPUT:
# Numbers are instance counts. Braces group a list of classes,
# because things can have more than one class.
#
# For brevity, class and predicate names have been shortened by
# stripping a presumed namespace (though not necessarily a namespace
# that you declared using @prefix). Namespace prefixes are listed
# at the end (on lines 23-25), but they are not necessarily the
# same as the prefixes that were used in the input turtle, because
# the original prefixes are lost in parsing.
#
# Line 4 shows the number of distinct blank nodes, literals and URIs.
#
# Line 5 breaks down the literals by datatype, showing the number
# of distinct instances for each datatype.
#
# Line 8 indicates that there were 4 distinct p1:FileNode instances
# in the subject position of a triple.
#
# Line 9 indicates that the domain of p1:inputs included p1:FileNode,
# range included rdf:List, there were 3 triples having a p1:FileNode
# instance in the subject position and p1:inputs as predicate, and
# there were 3 distinct rdf:List values in the object position of a triple.
#
# Line 10 indicates that the range of p1:updater was a set of
# untyped literal values, and there were 3 distinct literals.
# A datatype range (as opposed to a class range) is indicated
# in parentheses. It also indicates that there were 3 triples
# with subject class p1:FileNode and predicate p1:updater.
#
# Line 14 indicates that the rdf:first predicate has a range that
# includes both the p1:FileNode class (having 3 instances) and
# an unknown class (having 1 instance), for a total of 4
# distinct instances. In this case, the unknown class was due
# to <http://localhost/node/multiplier.txt> on input line 8,
# as it was not declared with any rdf:type . Remember that
# rdf:first and rdf:rest are auto-generated from Turtle
# list syntax ( ... ).
#
# Line 17 indicates that there was one distinct instance of class
# rdfs:Class as a subject.
#
# Line 20: In this example there were no root classes, i.e.,
# classes whose instances never appear in the object position,
# but if there had been, then each would have been marked
# with an asterisk.
]]
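
For those who want a similar summary directly from a SPARQL endpoint, a
rough analogue of the "Predicates by Subject Class" section is the query
below.  This is only a sketch: unlike the script, it counts triples
rather than distinct instances, and it silently skips subjects with no
rdf:type.

```sparql
# Count triples for each (subject class, predicate) pair.
SELECT ?class ?p (COUNT(*) AS ?triples)
WHERE {
  ?s ?p ?o .
  ?s a ?class .
}
GROUP BY ?class ?p
ORDER BY ?class DESC(?triples)
```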

David Booth

On 02/04/2015 03:49 PM, Michael F Uschold wrote:
> Sorry, ignore the prior email; it was sent prematurely.
>
> We had occasion to need the ability to explore a triple store in an
> application we were building for a client using a triple store (TS).
> Triples were being created using scripts and loaded into the
> TS; we also had an application that allowed users to enter information,
> which added more triples.  All of this was backed by an ontology that
> was evolving. It was pretty tricky knowing what parts of the ontology
> were being exercised and which were not.  So we wrote some SPARQL
> queries that produced a table where each row said something like this:
> There are 543 triples where the subject is of type Person, the
> predicate is employedBy, and the object is of type Organization.
> The table looked a bit like this:
>
> Subject         Predicate      Object          Count
> Person          hasEmployer    Organization     2344
> Organization    locatedIn      GeoRegion         432
>
> We found this to be extremely useful, not only to see exactly what was
> being used and how much, but also what was NOT being used, which made
> those parts candidates for removal from the ontology.  The SPARQL
> queries are not simple to write, but they are not too bad either. Some
> of the other responses spoke of similar things.
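>
> For illustration, a query of roughly this shape (a sketch only, not
> our exact query) yields such a table:
>
> ```sparql
> # Count triples grouped by subject type, predicate, and object type.
> SELECT ?sType ?pred ?oType (COUNT(*) AS ?count)
> WHERE {
>   ?s ?pred ?o .
>   ?s a ?sType .
>   ?o a ?oType .
> }
> GROUP BY ?sType ?pred ?oType
> ORDER BY DESC(?count)
> ```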
>
> This is more specialized than the original question, which was to find
> out what the ontology was.   Here we were more concerned about which
> parts of the ontology were being used.
>
> Michael
>
>
> On Wed, Feb 4, 2015 at 12:42 PM, Michael F Uschold
> <uschold@gmail.com> wrote:
>
>     We had occasion to need this ability on an application we were
>     building for a client using a triple store (TS). Triples were being
>     created using scripts and loaded into the TS; we also had an
>     application that allowed users to enter information, which added
>     more triples.  All of this was backed by an ontology that was
>     evolving.  It was pretty tricky knowing what parts of the ontology
>     were being exercised and which were not.  So we wrote some SPARQL
>     queries that produced a table where each row said something like this:
>     There are 543 triples where the subject is of type Person, the
>     predicate is employedBy, and the object is of type Organization.
>     A row looked like this:
>
>     Subject
>
>     On Wed, Feb 4, 2015 at 11:35 AM, Lushan Han
>     <lushan1@umbc.edu> wrote:
>
>         This work [1] might be helpful to some people. It automatically
>         learns a "schema" from a given RDF dataset, including most
>         probable classes and properties and most probable
>         relations/paths between given classes, etc. Next, it can
>         automatically translate a casual user's intuitive graph query or
>         schema-free query to a formal SPARQL query using the learned
>         schema and statistical NLP techniques, like textual semantic
>         similarity.
>
>         [1]
>         http://ebiquity.umbc.edu/paper/html/id/658/Schema-Free-Querying-of-Semantic-Data
>
>
>         Cheers,
>
>         Lushan
>
>         On Sun, Jan 25, 2015 at 11:32 PM, Pavel Klinov
>         <pavel.klinov@uni-ulm.de> wrote:
>
>             On Sun, Jan 25, 2015 at 11:44 PM, Bernard Vatant
>             <bernard.vatant@mondeca.com> wrote:
>             > Hi Pavel
>             >
>             > Very interesting discussion, thanks for the follow-up. Some quick answers
>
>             > below, but I'm currently writing a blog post which will go in more details
>             > on the notion of Data Patterns, a term I've been pushing last week on the DC
>             > Architecture list, where it seems to have gained some traction.
>             > See https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1501&L=dc-architecture
>             > for the discussion.
>
>             OK, thanks for the link, will check it out. I agree that
>             "patterns" is perhaps a better term than "schema", since by
>             the latter people typically mean an explicit specification.
>             I guess it's my use of the term "schema" which created some
>             confusion initially.
>
>             >> ... which reflects what the
>             >> data is all about. Knowing such structure is useful (and often
>             >> necessary) to be able to write meaningful queries and that's, I think,
>             >> what the initial question was.
>             >
>             >
>             > Certainly, and I would rewrite this question : How do you find out data
>             > patterns in a dataset?
>
>             I think it's a more general and tough question having to do
>             with data
>             mining. Not sure that anyone would venture into finding out data
>             patterns against a public endpoint just to be able to write
>             queries
>             for it.
>
>             >
>             >>
>             >> When such structure exists, I'd say
>             >> that the dataset has an *implicit* schema (or a conceptual model, if
>             >> you will).
>             >
>             >
>             > Well, that's where I don't follow. If data, as it happens more and more, is
>             > gathered from heterogeneous sources, the very notion of a conceptual model
>             > is jumping to conclusions.
>
>             A merger of structures is still a structure. But anyway,
>             I've already
>             agreed to say patterns =)
>
>             > In natural languages, patterns often precede the
>             > grammar describing them, even if the patterns described in the grammar at
>             > some point become prescriptive rules. Data should be looked at the same way.
>
>             Not sure. I won't immediately disagree since I don't have
>             statistics
>             regarding structured/unstructured datasets out there.
>
>             >>
>             >> What is absent is an explicit representation of the schema,
>             >> or the conceptual model, in terms of RDFS, OWL, or SKOS axioms.
>             >
>             >
>             > When the dataset gathers various sources and various vocabularies, such a
>             > schema does not exist, actually.
>
>             Not necessarily. Parts of it may exist. Take yago, for
>             example. It's
>             derived from a bunch of sources including Wikipedia and
>             GeoNames and
>             yet offers its schema for a separate download.
>
>             >> However, when the schema *is* represented explicitly, knowing it is a
>             >> huge help to users who otherwise know little about the data.
>             >
>             >
>             > OK, but the question is : which is a good format for exposing this
>             > structure?
>             > RDFS/OWL ontology/vocabulary, Application Profiles, RDF Shapes / whatever it
>             > will be named, or ... ?
>
>             I think this question is a bit secondary. If the need were
>             recognized,
>             this could be, at least in theory, agreed on.
>
>             >>
>             >> PPS. It'd also be correct to claim that even when a structure exists,
>             >> realistic data can be messy and not fit into it entirely. We've seen
>             >> stuff like literals in the range of object properties, etc. It's a
>             >> separate issue having to do with validation, for which there's an
>             >> ongoing effort at W3C. However, it doesn't generally hinder writing
>             >> queries which is what we're discussing here.
>             >
>             >
>             > Well I don't see it as a separate issue. All the raging debate around RDF
>             > Shapes is not (yet) about validation, but on the definition of what a
>             > shape/structure/schema can be.
>
>             OK, won't disagree on this.
>
>             Thanks,
>             Pavel
>
>              >
>              >
>              >>
>              >> > Since the very notion of schema for RDF data has no
>             meaning at all,
>              >> > and the absence of schema is a bit frightening, people
>             tend to give it a
>              >> > lot
>              >> > of possible meanings, depending on your closed world
>             or open world
>              >> > assumption, otherwise said if the "schema" will be
>             used for some kind of
>              >> > inference or validation. The use of "Schema" in RDFS
>             has done nothing to
>              >> > clarify this, and the use of "Ontology" in OWL added a
>             layer of
>              >> > confusion. I
>              >> > tend to say "vocabulary" to name the set of types and
>             predicates used by
>              >> > a
>              >> > dataset (like in Linked Open Vocabularies), which is a
>             minimal
>              >> > commitment to
>              >> > how it is considered by the dataset owner, bearing in
>             mind that this
>              >> > "vocabulary" is generally a mix of imported terms from
>             SKOS, FOAF,
>              >> > Dublin
>              >> > Core ... and home-made ones. Which is completely OK
>             with the spirit of
>              >> > RDF.
>              >> >
>              >> > The brand new LDOM [1] or whatever it ends up being
>             named at the end of
>              >> > the
>              >> > day might clarify the situation, or muddle those
>             waters a bit more :)
>              >> >
>              >> > [1] http://spinrdf.org/ldomprimer.html
>              >> >
>              >> > 2015-01-23 10:37 GMT+01:00 Pavel Klinov
>              >> > <pavel.klinov@uni-ulm.de>:
>              >> >>
>              >> >> Alright, so this isn't an answer and I might be
>             saying something
>              >> >> totally silly (since I'm not a Linked Data person,
>             really).
>              >> >>
>              >> >> If I re-phrase this question as the following: "how
>             do I extract a
>              >> >> schema from a SPARQL endpoint?", then it seems to pop
>             up quite often
>              >> >> (see, e.g., [1]). I understand that the original
>             question is a bit
>              >> >> more general but it's fair to say that knowing the
>             schema is a huge
>              >> >> help for writing meaningful queries.
>              >> >>
>              >> >> As an outsider, I'm quite surprised that there's
>             still no commonly
>              >> >> accepted (I'm avoiding "standard" here) way of doing
>             this. People
>              >> >> either hope that something like VoID or LOV
>             vocabularies are being
>              >> >> used, or use 3rd-party tools, or write all sorts of ad
>             hoc SPARQL
>              >> >> queries themselves, looking for types, object properties,
>              >> >> domains/ranges etc-etc. There are also papers written
>             on this subject.
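>              >> >>
>              >> >> For example, a typical ad hoc starting point (one
>              >> >> sketch among many) is simply to list the classes in
>              >> >> use:
>              >> >>
>              >> >> ```sparql
>              >> >> # Classes by number of instances, largest first.
>              >> >> SELECT ?class (COUNT(DISTINCT ?s) AS ?instances)
>              >> >> WHERE { ?s a ?class }
>              >> >> GROUP BY ?class
>              >> >> ORDER BY DESC(?instances)
>              >> >> LIMIT 100
>              >> >> ```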
>              >> >>
>              >> >> At the same time, the database engines which host
>             datasets often (not
>              >> >> always) manage the schema separately from the data.
>             There're good
>              >> >> reasons for that. One reason, for example, is to be
>             able to support
>              >> >> basic reasoning over the data, or integrity
>             validation. Just because
>              >> >> in RDF the schema language and the data language are
>             the same, so
>              >> >> schema and data triples can be interleaved, it need
>             not (and often
>              >> >> not) be managed that way.
>              >> >>
>              >> >> Yet, there's no standard way of requesting the schema
>             from the
>              >> >> endpoint, and I don't quite understand why. There's
>             the SPARQL 1.1
>              >> >> Service Description, which could, in theory, cover
>             it, but it doesn't.
>              >> >> Servicing such schema extraction requests doesn't
>             have to be mandatory
>              >> >> so the endpoints which don't have their schemas right
>             there don't have
>              >> >> to sift through the data. Also, schemas are typically
>             quite small.
>              >> >>
>              >> >> I guess there's some problem with this which I'm
>             missing...
>              >> >>
>              >> >> Thanks,
>              >> >> Pavel
>              >> >>
>              >> >> [1]
>              >> >>
>              >> >>
>             http://answers.semanticweb.com/questions/25696/extract-ontology-schema-for-a-given-sparql-endpoint-data-set
>              >> >>
>              >> >> On Thu, Jan 22, 2015 at 3:09 PM, Juan Sequeda
>              >> >> <juanfederico@gmail.com> wrote:
>              >> >> > Assume you are given a URL for a SPARQL endpoint.
>             You have no idea
>              >> >> > what
>              >> >> > data
>              >> >> > is being exposed.
>              >> >> >
>              >> >> > What do you do to explore that endpoint? What
>             queries do you write?
>              >> >> >
>              >> >> > Juan Sequeda
>              >> >> > +1-575-SEQ-UEDA
>              >> >> > www.juansequeda.com
>              >> >>
>              >> >
>              >> >
>              >> >
>              >> >
>              >
>              >
>              > --
>              > Bernard Vatant
>              > Vocabularies & Data Engineering
>              > Tel : + 33 (0)9 71 48 84 59
>              > Skype : bernard.vatant
>              > http://google.com/+BernardVatant
>              > --------------------------------------------------------
>              > Mondeca
>              > 35 boulevard de Strasbourg 75010 Paris
>              > www.mondeca.com
>              > Follow us on Twitter : @mondecanews
>              > ----------------------------------------------------------
>
>
>
>
>
>     --
>
>     Michael Uschold
>     Senior Ontology Consultant, Semantic Arts
>     http://www.semanticarts.com
>         LinkedIn: http://tr.im/limfu
>         Skype, Twitter: UscholdM
>
>
>
>
>
>
> --
>
> Michael Uschold
> Senior Ontology Consultant, Semantic Arts
> http://www.semanticarts.com
>     LinkedIn: http://tr.im/limfu
>     Skype, Twitter: UscholdM
>
>
>
Received on Wednesday, 4 February 2015 21:26:18 UTC
