Re: Announcement: Bio2RDF 0.3 released from Peter Ansell on 2009-03-22 (public-semweb-lifesci@w3.org from March 2009)

From: Peter Ansell <ansell.peter@gmail.com>
Date: Sun, 22 Mar 2009 10:04:15 +1000
To: Michel Dumontier <michel.dumontier@gmail.com>
Cc: bio2rdf@googlegroups.com, w3c semweb hcls <public-semweb-lifesci@w3.org>, "public-lod@w3.org" <public-lod@w3.org>, Paul Roe <p.roe@qut.edu.au>, James Hogan <j.hogan@qut.edu.au>, Lawrence Buckingham <l.buckingham@qut.edu.au>
Message-ID: <a1be7e0e0903211704q7aeaea87u53b58d6abedf18e@mail.gmail.com>

2009/3/20 Michel Dumontier <michel.dumontier@gmail.com>:
> Hi Peter - Great work!
>   I have a question - why are there so many namespaces for these resources:

The namespaces are created based on the nature of the dataset.
DBpedia has one type of URI for resources, another for properties and
another for classes. Hence, to avoid having to do
dbpedia:class/ClassName, I made up dbpedia_class:ClassName.
Ideally, the specific identifier portion should be the only piece on
the right-hand side of the colon, so that the namespace can be
transferred to another place without having to munge the identifier,
which should ideally never change between the different URI encoding
syntaxes.

Similarly for LinkedCT, the data is very normalised, as is typical of
relational databases, so to avoid having linkedct:intervention/1, I
made up linkedct_intervention:1. Unlike some other datasets, which
expose their data in a way that ensures that each item has a unique
identifier regardless of its record type (i.e., relational table),
the identifiers in LinkedCT overlap. There is an intervention/1 and a
collabagency/1, so there is no choice but to qualify them in some way.
I would rather qualify them using an extended namespace, i.e.,
linkedct_TABLE, that can be directly referenced and recognised without
having to rely on a portion of the identifier having a certain fixed
content in order to recognise the context.

Interestingly, when I went through Ecocyc to convert the data to
RDF, I found that they use unique identifiers throughout the dataset,
so there is only one namespace for the items they describe.
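
As a rough illustration of the scheme described above (not the actual
Bio2RDF code), the Python sketch below builds and splits
namespace:identifier URIs. The function names are made up for this
example, and joining the dataset and table name with an underscore
simply mirrors names such as linkedct_intervention.

```python
BASE = "http://bio2rdf.org/"


def qualify(dataset: str, table: str) -> str:
    """Build an extended namespace such as 'linkedct_intervention'.

    Hypothetical helper: it only mirrors the dataset_TABLE naming
    convention described in the message.
    """
    return f"{dataset}_{table}"


def to_uri(namespace: str, identifier: str, base: str = BASE) -> str:
    """Join a namespace and an identifier as base + namespace:identifier."""
    return f"{base}{namespace}:{identifier}"


def split_uri(uri: str, base: str = BASE) -> tuple[str, str]:
    """Split a URI back into (namespace, identifier).

    Only the first colon after the base is treated as the separator,
    so the identifier is returned untouched even if it contains colons.
    """
    if not uri.startswith(base):
        raise ValueError(f"not under {base}: {uri}")
    namespace, sep, identifier = uri[len(base):].partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"expected namespace:identifier in {uri}")
    return namespace, identifier


# intervention/1 and collabagency/1 no longer collide once the table
# name is part of the namespace rather than part of the identifier.
intervention = to_uri(qualify("linkedct", "intervention"), "1")
collabagency = to_uri(qualify("linkedct", "collabagency"), "1")
```

Because the identifier stays on its own to the right of the colon, the
same "1" can be moved under a different namespace or base without
being rewritten.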

>> * DBpedia - dbpedia, dbpedia_property, dbpedia_class
>> * LinkedCT - linkedct_ontology, linkedct_intervention,
>> linkedct_trials, linkedct_collabagency, linkedct_condition,
>> linkedct_link, linkedct_location, linkedct_overall_official,
>> linkedct_oversight, linkedct_primary_outcomes, linkedct_reference,
>> linkedct_results_reference, linkedct_secondary_outcomes,
>> linkedct_arm_group
>> * Dailymed - dailymed_ontology, dailymed_drugs,
>> dailymed_inactiveingredient, dailymed_routeofadministration,
>> dailymed_organization
>> * DrugBank - drugbank_ontology, drugbank_druginteractions,
>> drugbank_drugs, drugbank_enzymes, drugbank_drugtype,
>> drugbank_drugcategory, drugbank_dosageforms, drugbank_targets
>> * Diseasome - diseasome_ontology, diseasome_diseases, diseasome_genes,
>> diseasome_chromosomallocation, diseasome_diseaseclass
>> * Neurocommons - Uses the equivalent Bio2RDF namespaces, with live
>> owl:sameAs links back to the relevant Neurocommons namespaces. Used
>> for pubmed, geneid, taxonomy, mesh, prosite and go so far
>> * Flyted/Flybase etc not converted yet, only direct access provided
>
>>
>> Provide live owl:sameAs references which match those used in SPARQL
>> queries to keep linkages to the original databases without leaving the
>> database:identifier paradigm, so if people know the DBpedia, etc.,
>> URIs, the link to their current knowledge is given
>>
>> * Some http://database.bio2rdf.org/database:identifier URIs are given
>> by this, but these aren't standard, and are only shown where there is
>> still at least one SPARQL endpoint available which uses them. People
>> should utilise the http://bio2rdf.org/database:identifier versions
>> when linking to Bio2RDF.
>>
>> Integrated Semantic Web Pipes (pipes.deri.org) (version 0.7) so the
>> pipes runtime engine can be utilised on the same server as bio2rdf.
>> The main servers have a limited number of pipes available so far, but
>> more can be included by people wishing to contribute their pipes. The
>> URL syntax is /pipes/PIPEID/parameter1=value1/parameter2=value2 . This
>> provides a method for people wanting to utilise complex mashup
>> scenarios and provide them back to the community, as by default the
>> bio2rdf engine only knows how to do simple integration of RDF sources
>> into a single output document
>>
>> The two currently available pipes are:
>> * /pipes/bio2rdf_basic/database=DATABASE/identifier=IDENTIFIER Mirrors
>> /database:identifier functionality
>> *
>> /pipes/bio2rdf_subject_object_slicing/database=DATABASE/identifier=IDENTIFIER
>> Combines /database:identifier and /links/database:identifier
>> functionality into one operation
>
> I didn't know about DERI pipes - looks fantastic! Thanks!

It is quite useful for doing complex operations, and it doesn't have
quite as many intimate data transformations as some workflow systems
have.
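
For anyone wanting to call the pipes from a script, here is a minimal
Python sketch that assembles a path following the
/pipes/PIPEID/parameter1=value1/parameter2=value2 syntax quoted above.
The helper name pipe_path and the percent-encoding of values are
assumptions of this example; the announcement does not say how special
characters should be escaped.

```python
from urllib.parse import quote


def pipe_path(pipe_id: str, **params: str) -> str:
    """Build /pipes/PIPEID/name=value/... from keyword arguments.

    Parameters are emitted in the order they are passed. Percent-encoding
    each part is an assumption, not documented server behaviour.
    """
    segments = [
        f"{quote(name, safe='')}={quote(value, safe='')}"
        for name, value in params.items()
    ]
    return "/".join(["/pipes", quote(pipe_id, safe="")] + segments)


# The identifier below is an arbitrary placeholder value.
basic = pipe_path("bio2rdf_basic", database="geneid", identifier="4129")
sliced = pipe_path(
    "bio2rdf_subject_object_slicing", database="geneid", identifier="4129"
)
```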

>>
>> Namespace synonyms can be implemented, with the first example that of
>> taxon and taxonomy for NCBI taxonomy as so far there hasn't been a
>> clear bias towards one or the other, and together with interlinked
>> owl:sameAs statements the synonyms will provide resolution to a
>> standard URI no matter which one is provided in the URI.
>>
>> * http://bio2rdf.org/taxon:identifier will return information in the
>> form http://bio2rdf.org/taxonomy:identifier currently, with an
>> owl:sameAs link back to the taxon version. This can be switched if
>> people in general prefer the taxon version as the default, although in
>> general this is an issue still as it is difficult to make up SPARQL
>> queries outside of the Bio2RDF server for these heterogeneous sources
>
> ok, which other sources are providing NCBI taxonomy info? and what namespace
> prefix do they use?

It is a tough choice, because the majority of datasets and LSRN use
taxon:ID from what I have seen, but we don't have a taxon.bio2rdf.org
server ;) Not that that couldn't be fixed quite easily. I guess it
seems a bit irrational, given the number of datasets that use taxon,
that the taxonomy.bio2rdf.org/sparql endpoint, Uniprot and
Neurocommons use taxonomy.

Anyone for a poll to decide this? Three big RDF sets use taxonomy and
a lot of small ones use taxon!
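
To make the synonym behaviour concrete, here is a small Python sketch
of how a request for either prefix could resolve to one standard URI
plus an owl:sameAs link back to the requested form. The table and
function names are hypothetical, and 9606 is only used as an example
identifier; this is not the server's implementation.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Synonym -> currently preferred namespace. Switching the default to
# "taxon" would only mean reversing this entry.
NAMESPACE_SYNONYMS = {"taxon": "taxonomy"}


def resolve(namespace: str, identifier: str) -> tuple[str, list[str]]:
    """Return the standard URI and any owl:sameAs triples (N-Triples).

    A request using the preferred namespace yields no extra triples;
    a request using a synonym yields one link back to the synonym URI.
    """
    preferred = NAMESPACE_SYNONYMS.get(namespace, namespace)
    standard = f"{BASE}{preferred}:{identifier}"
    requested = f"{BASE}{namespace}:{identifier}"
    if requested == standard:
        return standard, []
    return standard, [f"<{standard}> <{OWL_SAME_AS}> <{requested}> ."]


standard_uri, same_as = resolve("taxon", "9606")
```

Whichever prefix wins the poll, clients that follow the owl:sameAs
link end up at the same record.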

>>
>> Provide live statistics to diagnose some network issues without having
>> to look at log files. The URL is /admin/stats
>>
>> * Shows the last time the internal blacklist reset, indicating how
>> much activity is being displayed as the statistics are reset every time
>> the blacklist is reset.
>> * By default shows the IPs accessing the server, with an indication
>> of the number and duration of their queries. Can be configured in low
>> use and private situations to also show the queries being performed
>> * Shows the servers which have been unresponsive since the last
>> blacklist reset including a basic reason, such as an HTTP 503 or 400
>> error
>>
>> Implement true RDF handling in the background to provide consistency
>> of output and the potential to support multiple output formats such as
>> NTriples and Turtle, although the only output currently supported is
>> RDF/XML. The Sesame library is being used to provide this
>> functionality.
>>
>> Provide more RDFiser scripts as part of the source distribution,
>> including Chebi, GO, Homologene, NCBI Geneid, HGNC, OBO and Ecocyc
>>
>> Provide more links to HTML provider URLs for given databases to
>> provide the link between the Bio2RDF RDF interface and currently
>> available HTML interfaces. The URL syntax for this is
>> /html/database:identifier
>>
>> Provide links to license providers, so the applicable license for a
>> database may be available by following a URL. The URL syntax for this
>> is /license/database:identifier . It was easier to require the
>> identifier to be present than to not have it. So far, the identifier
>> portion is not being used, so it merely has to be present for the URL
>> resolution to occur, but in future there is the allowance to have
>> different licenses being given based on the identifier, which is
>> useful for databases which are not completely released under a single
>> license.
>>
>> Provide countlinks and countlinksns which count the number of reverse
>> links to a particular item from globally, or from within a given
>> database. Currently these only function on virtuoso endpoints due to
>> their use of aggregation extensions to SPARQL. The URL syntax is
>> /countlinks/database:identifier and
>> /countlinksns/targetdatabase/database:identifier
>>
>> Provide search and searchns, which attempt to search globally using
>> SPARQL (aren't currently linked to the rdfiser search pages which may
>> be accessed using searchns), or search within a particular database
>> for text searches. The searches are all performed using the Virtuoso
>> fulltext search paradigm, i.e., bif:contains; other SPARQL endpoints
>> haven't yet been implemented, even with regex, because it is reasonably
>> slow, but it would be simple to construct a query if people thought it
>> was necessary. The URL syntax is /search/searchTerm and
>> /searchns/targetdatabase:searchTerm
>>
>> If anyone has any SPARQL queries on biology related databases that
>> they regularly execute that can either be parameterised or turned into
>> Pipes then it would be great to include them in future distributions
>> for others to use.
>
> absolutely!
> -=Michel=-

Sounds good. I look forward to creating new query aliases to suit your
common operations (i.e., making up new /links/ or /search/ operations
based on other forms of querying).
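
As a sketch of what the existing aliases look like from the client
side, the Python below keeps the URL syntaxes listed in the
announcement in one table and fills them in. The table name, the
alias_path helper and the example values are assumptions of this
example, and no escaping is applied, so search terms containing spaces
or slashes would need encoding first.

```python
# URL templates copied from the syntaxes given in the announcement.
QUERY_ALIASES = {
    "links": "/links/{database}:{identifier}",
    "html": "/html/{database}:{identifier}",
    "license": "/license/{database}:{identifier}",
    "countlinks": "/countlinks/{database}:{identifier}",
    "countlinksns": "/countlinksns/{targetdatabase}/{database}:{identifier}",
    "search": "/search/{searchTerm}",
    "searchns": "/searchns/{targetdatabase}:{searchTerm}",
}


def alias_path(alias: str, **fields: str) -> str:
    """Fill in the template for a known alias.

    Raises ValueError for an unknown alias and KeyError if a field
    required by the template is missing.
    """
    try:
        template = QUERY_ALIASES[alias]
    except KeyError:
        raise ValueError(f"unknown query alias: {alias!r}") from None
    return template.format(**fields)


# Example values are placeholders only.
count = alias_path(
    "countlinksns", targetdatabase="go", database="geneid", identifier="4129"
)
found = alias_path("searchns", targetdatabase="geneid", searchTerm="kinase")
```

In this sketch, a new alias on the client side is just one more entry
in the table.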

Cheers,

Peter Ansell
Received on Sunday, 22 March 2009 00:04:55 UTC