Re: How do you explore a SPARQL Endpoint? from Pavel Klinov on 2015-01-26 (semantic-web@w3.org from January 2015)

From: Pavel Klinov <pavel.klinov@uni-ulm.de>
Date: Mon, 26 Jan 2015 08:32:09 +0100
To: Bernard Vatant <bernard.vatant@mondeca.com>
Cc: Pavel Klinov <pavel.klinov@uni-ulm.de>, Juan Sequeda <juanfederico@gmail.com>, Semantic Web <semantic-web@w3.org>, public-lod public <public-lod@w3.org>
Message-ID: <CAG5JQxXj-4c2Z2hgtQ6iaYPonbWCYT64ztX21qmW-ys6wX5nrA@mail.gmail.com>
On Sun, Jan 25, 2015 at 11:44 PM, Bernard Vatant
<bernard.vatant@mondeca.com> wrote:
> Hi Pavel
>
> Very interesting discussion, thanks for the follow-up. Some quick answers
> below, but I'm currently writing a blog post which will go into more detail
> on the notion of Data Patterns, a term I've been pushing last week on the DC
> Architecture list, where it seems to have gained some traction.
> See https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1501&L=dc-architecture
> for the discussion.

OK, thanks for the link, will check it out. I agree that "patterns"
is perhaps a better term than "schema", since by the latter people
typically mean an explicit specification. I guess it's my use of the
term "schema" which created some confusion initially.

>> ... which reflects what the
>> data is all about. Knowing such structure is useful (and often
>> necessary) to be able to write meaningful queries and that's, I think,
>> what the initial question was.
>
>
> Certainly, and I would rewrite this question : How do you find out data
> patterns in a dataset?

I think that's a more general and tougher question, closer to data
mining. I'm not sure anyone would go as far as mining data patterns
from a public endpoint just to be able to write queries for it.
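
To make the contrast concrete, here is a minimal sketch (Python, standard
library only) of the ad hoc probing mentioned in my earlier message quoted
below: counting instantiated types, counting predicates, and looking for
explicitly stated domains/ranges. The query texts, the LIMIT values, the
names EXPLORATION_QUERIES and build_query_request, and the example.org
endpoint are all my own illustrative assumptions, not an agreed-upon
method. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical starter queries; the selection and the LIMITs are illustrative.
EXPLORATION_QUERIES = {
    # Which classes are instantiated, and how often?
    "classes": """
        SELECT ?class (COUNT(?s) AS ?instances)
        WHERE { ?s a ?class }
        GROUP BY ?class
        ORDER BY DESC(?instances)
        LIMIT 50
    """,
    # Which predicates are used, and how often?
    "predicates": """
        SELECT ?p (COUNT(*) AS ?uses)
        WHERE { ?s ?p ?o }
        GROUP BY ?p
        ORDER BY DESC(?uses)
        LIMIT 50
    """,
    # Are any domains/ranges stated explicitly in the data?
    "domains_ranges": """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?p ?domain ?range
        WHERE {
            { ?p rdfs:domain ?domain } UNION { ?p rdfs:range ?range }
        }
        LIMIT 200
    """,
}


def build_query_request(endpoint_url: str, sparql: str) -> Request:
    """Build (but do not send) a SPARQL Protocol query-via-GET request."""
    # Append to an existing query string if the endpoint URL already has one.
    separator = "&" if "?" in endpoint_url else "?"
    url = endpoint_url + separator + urlencode({"query": sparql})
    # Ask for the standard JSON serialization of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Assumed endpoint URL, for illustration only.
request = build_query_request(
    "http://example.org/sparql", EXPLORATION_QUERIES["classes"]
)
```

Sending it would be a matter of passing the request to
urllib.request.urlopen. Note that the counting queries touch every triple,
so they may well time out on a large public endpoint.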

>
>>
>> When such structure exists, I'd say
>> that the dataset has an *implicit* schema (or a conceptual model, if
>> you will).
>
>
> Well, that's where I don't follow. If data, as it happens more and more, is
> gathered from heterogeneous sources, the very notion of a conceptual model
> is jumping to conclusions.

A merger of structures is still a structure. But anyway, I've already
agreed to say "patterns" =)

> In natural languages, patterns often precede the
> grammar describing them, even if the patterns described in the grammar at
> some point become prescriptive rules. Data should be looked at the same way.

Not sure. I won't immediately disagree since I don't have statistics
regarding structured/unstructured datasets out there.

>>
>> What is absent is an explicit representation of the schema,
>> or the conceptual model, in terms of RDFS, OWL, or SKOS axioms.
>
>
> When the dataset gathers various sources and various vocabularies, such a
> schema does not exist, actually.

Not necessarily. Parts of it may exist. Take YAGO, for example. It's
derived from a bunch of sources, including Wikipedia and GeoNames, and
yet offers its schema as a separate download.

>> However, when the schema *is* represented explicitly, knowing it is a
>> huge help to users who otherwise know little about the data.
>
>
> OK, but the question is : which is a good format for exposing this
> structure?
> RDFS/OWL ontology/vocabulary, Application Profiles, RDF Shapes / whatever it
> will be named, or ... ?

I think this question is a bit secondary. If the need were recognized,
a format could, at least in theory, be agreed on.

>>
>> PPS. It'd also be correct to claim that even when a structure exists,
>> realistic data can be messy and not fit into it entirely. We've seen
>> stuff like literals in the range of object properties, etc. It's a
>> separate issue having to do with validation, for which there's an
>> ongoing effort at W3C. However, it doesn't generally hinder writing
>> queries which is what we're discussing here.
>
>
> Well I don't see it as a separate issue. All the raging debate around RDF
> Shapes is not (yet) about validation, but on the definition of what a
> shape/structure/schema can be.

OK, won't disagree on this.

Thanks,
Pavel
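
PS. On the Service Description point from my first message (quoted below):
as far as I understand the SPARQL 1.1 Service Description spec, an endpoint
should return a description of itself when its URL is dereferenced with a
plain HTTP GET and no query string. A rough sketch, assuming Python's
standard library and a hypothetical example.org endpoint; the function
names are mine. Whatever comes back describes the service (supported
features, default dataset, and so on), not the schema of the data, which
is exactly the gap I was pointing at.

```python
from urllib.request import Request

# Namespace of the SPARQL 1.1 Service Description vocabulary.
SD_NS = "http://www.w3.org/ns/sparql-service-description#"


def build_service_description_request(endpoint_url: str) -> Request:
    """Build (but do not send) a GET on the bare endpoint URL.

    No query parameters are added; the Accept header asks for Turtle.
    """
    return Request(endpoint_url, headers={"Accept": "text/turtle"})


def mentions_service_description(rdf_text: str) -> bool:
    """Crude check: does the response refer to the sd: namespace at all?

    This is a plain substring test, not an RDF parse.
    """
    return SD_NS in rdf_text


# Assumed endpoint URL, for illustration only.
request = build_service_description_request("http://example.org/sparql")
```

Many endpoints answer such a request with an HTML query form instead, so
the substring check is only a first filter before attempting to parse the
response as RDF.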

>
>
>>
>> > Since the very notion of schema for RDF data has no meaning at all,
>> > and the absence of schema is a bit frightening, people tend to give it a
>> > lot of possible meanings, depending on your closed world or open world
>> > assumption, otherwise said if the "schema" will be used for some kind of
>> > inference or validation. The use of "Schema" in RDFS has done nothing to
>> > clarify this, and the use of "Ontology" in OWL added a layer of
>> > confusion. I tend to say "vocabulary" to name the set of types and
>> > predicates used by a dataset (like in Linked Open Vocabularies), which is
>> > a minimal commitment to how it is considered by the dataset owner, bearing
>> > in mind that this "vocabulary" is generally a mix of imported terms from
>> > SKOS, FOAF, Dublin Core ... and home-made ones. Which is completely OK
>> > with the spirit of RDF.
>> >
>> > The brand new LDOM [1] or whatever it ends up to be named at the end of
>> > the day might clarify the situation, or muddle those waters a bit more :)
>> >
>> > [1] http://spinrdf.org/ldomprimer.html
>> >
>> > 2015-01-23 10:37 GMT+01:00 Pavel Klinov <pavel.klinov@uni-ulm.de>:
>> >>
>> >> Alright, so this isn't an answer and I might be saying something
>> >> totally silly (since I'm not a Linked Data person, really).
>> >>
>> >> If I re-phrase this question as the following: "how do I extract a
>> >> schema from a SPARQL endpoint?", then it seems to pop up quite often
>> >> (see, e.g., [1]). I understand that the original question is a bit
>> >> more general but it's fair to say that knowing the schema is a huge
>> >> help for writing meaningful queries.
>> >>
>> >> As an outsider, I'm quite surprised that there's still no commonly
>> >> accepted (I'm avoiding "standard" here) way of doing this. People
>> >> either hope that something like VoID or LOV vocabularies are being
>> >> used, or use third-party tools, or write all sorts of ad hoc SPARQL
>> >> queries themselves, looking for types, object properties,
>> >> domains/ranges, etc. There are also papers written on this subject.
>> >>
>> >> At the same time, the database engines which host datasets often (not
>> >> always) manage the schema separately from the data. There're good
>> >> reasons for that. One reason, for example, is to be able to support
>> >> basic reasoning over the data, or integrity validation. Although
>> >> in RDF the schema language and the data language are the same, so
>> >> schema and data triples can be interleaved, they need not be (and
>> >> often are not) managed that way.
>> >>
>> >> Yet, there's no standard way of requesting the schema from the
>> >> endpoint, and I don't quite understand why. There's the SPARQL 1.1
>> >> Service Description, which could, in theory, cover it, but it doesn't.
>> >> Servicing such schema extraction requests doesn't have to be mandatory
>> >> so the endpoints which don't have their schemas right there don't have
>> >> to sift through the data. Also, schemas are typically quite small.
>> >>
>> >> I guess there's some problem with this which I'm missing...
>> >>
>> >> Thanks,
>> >> Pavel
>> >>
>> >> [1]
>> >> http://answers.semanticweb.com/questions/25696/extract-ontology-schema-for-a-given-sparql-endpoint-data-set
>> >>
>> >> On Thu, Jan 22, 2015 at 3:09 PM, Juan Sequeda <juanfederico@gmail.com>
>> >> wrote:
>> >> > Assume you are given a URL for a SPARQL endpoint. You have no idea
>> >> > what data is being exposed.
>> >> >
>> >> > What do you do to explore that endpoint? What queries do you write?
>> >> >
>> >> > Juan Sequeda
>> >> > +1-575-SEQ-UEDA
>> >> > www.juansequeda.com
>> >>
>> >
>> >
>> >
>> >
>
>
> --
> Bernard Vatant
> Vocabularies & Data Engineering
> Tel :  + 33 (0)9 71 48 84 59
> Skype : bernard.vatant
> http://google.com/+BernardVatant
> --------------------------------------------------------
> Mondeca
> 35 boulevard de Strasbourg 75010 Paris
> www.mondeca.com
> Follow us on Twitter : @mondecanews
> ----------------------------------------------------------
Received on Monday, 26 January 2015 07:32:42 UTC