Re: My best practices for Ontology versioning for http://nlp2rdf.org, was Re: Versioning system for ontologies

Hi Daniel,
well, this matter is quite simple: the .htaccess rules need to be
changed. At least from:

RewriteRule ^nif-core$ /nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl [R=303,L]

to

RewriteRule ^nif-core/ /nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl [R=303,L]

The behaviour might not be what you would normally expect, as you
would get a superset (i.e. all triples, not just those whose subject
starts with the requested URI). Would this pose a problem? A client
might expect to get only 10 triples, but would receive the whole
ontology.

Would everybody be fine with a result like this?
curl -IL http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/String
 > HTTP/1.1 303 See Other
 > Location: 
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl


From a practical perspective it wouldn't matter for small ontologies.
For large files with thousands of terms, using a store (e.g. Virtuoso)
and returning just the CBD [1] is probably the best practice. I
wouldn't expect naive and ad-hoc crawler implementations to cache in a
good way (i.e. to cache the Location of the 303 redirect). Chances are
better that a cache works for '#' URIs, since the fragment is stripped
and all terms resolve to the same document URL anyway.
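
A quick way to check this by hand (a sketch; 'String' is a real term,
the other names are placeholders I made up): if every term URI 303s to
the same Location, a crawler that caches by the final URL downloads
the file only once.

for term in String Context Word; do
  curl -sIL "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/$term" \
    | grep -i '^Location:'
done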

So as a guideline, I will add that everybody should calculate the
worst-case download traffic per client. My two files are:
nif-core.owl -> 15610 bytes
nif-core.ttl ->  4050 bytes

with 22 unique subjects:
rapper -g nif-core.ttl | cut -f1 -d '>' | sort -u | wc -l

so a badly implemented crawler can cause the following traffic damage:
22 * 15610 byte =~ 343.4 kB
22 *  4050 byte =~  89.1 kB
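
The same calculation as a script (a sketch; stat -c%s is the GNU
coreutils spelling, BSD/macOS would use stat -f%z):

subjects=$(rapper -g nif-core.ttl 2>/dev/null | cut -f1 -d'>' | sort -u | wc -l)
size=$(stat -c%s nif-core.owl)                  # file size in bytes
echo "worst case: $(( subjects * size )) bytes"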

which is still acceptable. For larger files, this becomes infeasible.
'/' with a triple store and CBD might be the best option for larger
files, but it is not as easy to set up (as it requires more than an
Apache web server).
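
With a store in place, each '/' term URI would typically be answered
by a DESCRIBE query, which a store like Virtuoso can answer with the
CBD. A sketch (endpoint and term URI are hypothetical, for
illustration only):

curl -G 'http://example.org/sparql' \
  -H 'Accept: text/turtle' \
  --data-urlencode 'query=DESCRIBE <http://example.org/nif-core/String>'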

As I see it, the Gene Ontology [2] is using '#' and is practically
unpublishable as linked data:
39292 unique subjects with over 97 MB per download =~ 3.8 TB of
worst-case traffic per client.

On my webserver I only have Apache and .htaccess, so a triple store is
not an option. Maybe I should write a script which splits the triples
into one file per subject (a sketch below)? But on the other hand my
university has unlimited traffic ;)
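
Something like this might do (a sketch; assumes rapper is installed
and writes one N-Triples file per subject):

mkdir -p split
rapper -g nif-core.ttl | awk '{
  f = $1                        # the subject, e.g. <http://...#String>
  gsub(/[^A-Za-z0-9]/, "_", f)  # sanitise it for use as a file name
  out = "split/" f ".nt"
  print $0 >> out; close(out)   # close to stay under the open-file limit
}'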

I will introduce the above rule of thumb in the README when I have time.

All the best,
Sebastian



[1] http://www.w3.org/Submission/CBD/
[2] http://archive.geneontology.org/latest-termdb/go_daily-termdb.owl.gz

On 22.04.2013 14:09, Daniel Garijo wrote:
> Hi Sebastian,
> I'm glad I could help you.
> However, I still don't get why the workflow wouldn't work for 
> ontologies with "/". Are any of the
> tools not appropriate?
> Thanks for sharing the VAD link. I was not aware of the tool.
> Best,
> Daniel
>
>
> 2013/4/22 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>
>     Ah, yes, you are right. Thanks for your help.
>     I was confused, because DBpedia also shows the instance data for
>     the classes in the HTML interface (i.e. inbound triples):
>     http://dbpedia.org/ontology/PopulatedPlace
>
>     But of course, this is just a nice add-on for the HTML view. There
>     is actually no "Get all instances of a class" via Linked Data,
>     only via SPARQL.
>
>     I have also updated
>     https://github.com/NLP2RDF/persistence.uni-leipzig.org#-vs--uris with:
>
>>     There has been an ongoing debate about '#' vs. '/'. We focus on
>>     ontologies with '#' here, with URIs like:
>>     http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#String
>>
>>     Note that ontologies with '/' URIs need to be published
>>     differently (description not included here).
>
>     By the way, DBpedia only uses something that looks like Pubby,
>     i.e. the DBpedia VAD, which is written in VSP [1].
>
>     Thanks again,
>     Sebastian
>
>     [1]https://github.com/dbpedia/dbpedia-vad-i18n
>
>
>     On 22.04.2013 12:17, Daniel Garijo wrote:
>>     Hi, I'm not sure I see the issue here.
>>
>>
>>     2013/4/22 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>>
>>         Hm, no, actually this issue is quite easy when it comes to
>>         large databases.
>>
>>         curl -H "Accept: text/turtle" "http://dbpedia.org/ontology#PopulatedPlace"
>>         is pretty much the same as:
>>         curl -H "Accept: text/turtle" "http://dbpedia.org/ontology"
>>
>>     But here you are not asking for any instance. You are asking for
>>     a document
>>     where the ontology is defined.
>>
>>
>>         So my questions are:
>>
>>         1. What do you think is the expected output of
>>         http://dbpedia.org/ontology ? 300 million triples as turtle?
>>
>>     No. You would see the description of the ontology. In DBpedia
>>     they haven't set up such a redirection because they are exposing
>>     both terms and classes with Pubby. But note that when you look up
>>     a term, no instances are returned.
>>
>>         2. How do you query all instances of type
>>         db-ont:PopulatedPlace via Linked Data ?
>>
>>     Via a SPARQL query:
>>
>>     SELECT ?instance WHERE {
>>       ?instance a db-ont:PopulatedPlace .
>>     }
>>
>>     If you don't want all the instances, then add a "LIMIT". That is
>>     why they have a public endpoint, right?
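>>
>>     Over HTTP this might look as follows (a sketch; the db-ont
>>     prefix is written out, and JSON results are one of the formats
>>     such an endpoint can return):
>>
>>     curl -G 'http://dbpedia.org/sparql' \
>>       -H 'Accept: application/sparql-results+json' \
>>       --data-urlencode 'query=SELECT ?instance WHERE { ?instance a <http://dbpedia.org/ontology/PopulatedPlace> } LIMIT 100'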
>>
>>     Another example: the recent PROV-O ontology (with namespace URI
>>     http://www.w3.org/ns/prov#).
>>     If I have an endpoint with many prov:Entities published and I
>>     want them, I can perform a query like the one above. If I want
>>     to see the documentation of the term, then I would ask for
>>     http://www.w3.org/ns/prov#Entity and I would be redirected to
>>     it. Doing a request with "Accept: text/turtle" on an ontology
>>     term would return the OWL file of the ontology, not the
>>     instances of that term.
>>
>>     Best,
>>     Daniel
>>
>>
>>         q.e.d. from my point of view, as you wouldn't get around
>>         these practical problems.
>>
>>         -- Sebastian
>>
>>         On 22.04.2013 11:50, Daniel Garijo wrote:
>>>         Dear Sebastian,
>>>         This statement:
>>>         "When you publish ontologies without data, you can use '#' .
>>>         However, if you want to query instances via Linked Data in a
>>>         database, you have to use '/' as DBpedia does for classes:
>>>         http://dbpedia.org/ontology/PopulatedPlace"
>>>
>>>         is not correct. You can use "#" to query instances via
>>>         Linked Data databases. That is just the URI of the type. In
>>>         fact if DBpedia had chosen
>>>
>>>         "http://dbpedia.org/ontology#PopulatedPlace
>>>         <http://dbpedia.org/ontology/PopulatedPlace>" instead of its
>>>         current URI it would still be fine. It doesn't affect the query.
>>>
>>>         I'm not going to enter the debate of "# vs /", but normally
>>>         it is a design decision that has more to do with the size
>>>         of the vocabulary than with the instances.
>>>
>>>         Best,
>>>         Daniel
>>>
>>>
>>>
>>>         2013/4/22 Sebastian Hellmann
>>>         <hellmann@informatik.uni-leipzig.de>
>>>
>>>             Dear all,
>>>
>>>             personally, I have been working on this for quite a
>>>             while, and for me the best and easiest way is as
>>>             documented here:
>>>             https://github.com/NLP2RDF/persistence.uni-leipzig.org#readme
>>>
>>>             The rules are simple and effective, and I couldn't
>>>             imagine anything simpler.
>>>
>>>             Note that I have also secured persistent hosting for the
>>>             URIs (also an important point).
>>>             Feedback welcome, of course.
>>>
>>>             All the best,
>>>             Sebastian
>>>
>>>
>>>                   Ontology:
>>>                   http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
>>>
>>>
>>>                   # vs /
>>>
>>>             When you publish ontologies without data, you can use
>>>             '#' . However, if you want to query instances via Linked
>>>             Data in a database, you have to use '/' as DBpedia does
>>>             for classes: http://dbpedia.org/ontology/PopulatedPlace
>>>
>>>
>>>                   Workflow
>>>
>>>              1. I edit the ontologies in Turtle syntax with the
>>>                 Geany text editor (or a Turtle editor,
>>>                 http://blog.aksw.org/2013/xturtle-turtle-editing-the-eclipse-way
>>>                 ). This allows me to make developer comments using
>>>                 "#" directly in the source, see e.g.
>>>                 nlp2rdf/ontologies/nif-core.ttl
>>>              2. When I am finished I use rapper
>>>                 (http://librdf.org/raptor/rapper.html) to convert
>>>                 it to RDF/XML ( nlp2rdf/ontologies/nif-core.owl );
>>>                 see the sketch after this list
>>>              3. I am versioning the ontologies in a folder with the
>>>                 version number, e.g. version-1.0. If somebody wants
>>>                 to find old ontologies, she can find them in the
>>>                 GitHub repository, which is linked from the
>>>                 ontology. I assume this is not often required, but
>>>                 it is nice to keep old versions. The old versions
>>>                 should be linked to in the comment of the ontology,
>>>                 see the header of nif-core.ttl
>>>              4. Then I use git push to push the changes to our server
>>>              5. (not yet) I use a simple OWL2HTML generator, e.g.
>>>                 https://github.com/specgen/specgen
>>>              6. I register the prefix at http://prefix.cc, see e.g.
>>>                 http://prefix.cc/nif
>>>              7. The versions are switched and published by these
>>>                 .htaccess rules, e.g.:
>>>
>>>                 RewriteRule \.(owl|rdf|html|ttl|nt|txt|md)$ - [L]
>>>                 # (in progress) RewriteCond %{HTTP_ACCEPT} text/html
>>>                 # (in progress) RewriteRule ^nif-core$ /nlp2rdf/ontologies/nif-core/version-1.0/nif-core.html [R=303,L]
>>>
>>>                 RewriteCond %{HTTP_ACCEPT} application/rdf+xml
>>>                 RewriteRule ^nif-core$ /nlp2rdf/ontologies/nif-core/version-1.0/nif-core.owl [R=303,L]
>>>
>>>                 RewriteRule ^nif-core$ /nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl [R=303,L]
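>>>
>>>                 A sketch of step 2 and a quick test of these rules
>>>                 (rapper's -i/-o flags select the input and output
>>>                 syntax; the Accept header decides which redirect
>>>                 fires):
>>>
>>>                 rapper -i turtle -o rdfxml nif-core.ttl > nif-core.owl
>>>                 curl -IL -H "Accept: application/rdf+xml" \
>>>                   http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core
>>>                 # expect: 303 with Location .../version-1.0/nif-core.owl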
>>>
>>>
>>>
>>>
>>>
>>>
>>>             On 19.04.2013 16:05, Prateek wrote:
>>>>             Hello all,
>>>>
>>>>             I am trying to identify a system which will provide
>>>>             versioning and revision control capabilities
>>>>             specifically for ontologies. Does anyone have any
>>>>             experience and idea about which systems can help out or
>>>>             if systems like SVN, CVS can do the job?
>>>>
>>>>             Regards
>>>>
>>>>             Prateek
>>>>
>>>>             -- 
>>>>
>>>>             - - - - - - - - - - - - - - - - - - -
>>>>             Prateek Jain, Ph. D.
>>>>             RSM
>>>>             IBM T.J. Watson Research Center
>>>>             1101 Kitchawan Road, 37-244
>>>>             Yorktown Heights, NY 10598
>>>>             Linkedin: http://www.linkedin.com/in/prateekj
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
