Very rough prototype implementation of DataCatalog/Dataset schema.org markup in InterMine from Justin Clark-Casey on 2017-05-16 (public-bioschemas@w3.org from May 2017)

From: Justin Clark-Casey <jc955@cam.ac.uk>
Date: Tue, 16 May 2017 15:54:15 +0100
To: public-bioschemas@w3.org
Message-ID: <5475073a-bf16-281b-f47d-2b782c7b0117@cam.ac.uk>

Hi all.  In advance of the Bioschemas meeting next week, I've hacked up a very rough implementation of schema.org markup in InterMine [1].  Specifically, this 
is in an installation of InterMine called Synbiomine [2], a data warehouse for synthetic biology that I've been working on.  This compiles information from many 
sources (EBI, NCBI, etc.) into integrated biological object reports (genes, proteins, parts, etc.).

In lieu of 'proper' Bioschemas structures, I've put in DataCatalog and Dataset.  In fact, I'm abusing Dataset to represent integrated objects (e.g. protein 
Q816S6_BACCR) but I wanted to experiment with linking structures (in this case DataCatalog and Dataset).  The front page embeds the DataCatalog and individual 
report pages (e.g. [3]) embed Dataset.  You can see the Google Structured Data Testing Tool (GSDTT) analysis of the front page at [4] and of a particular report 
page at [5].
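
To make the linking concrete, here is a minimal Python sketch of the kind of JSON-LD the two page types could carry.  This is not the actual InterMine/Synbiomine code; the function names and the choice of properties (name, url) are assumptions for illustration only.  The URLs are the ones from [2] and [3].

```python
import json

CATALOG_URL = "http://beta.synbiomine.org/synbiomine/begin.do"
REPORT_URL = "http://beta.synbiomine.org/synbiomine/report.do?id=112968868"


def dataset_jsonld(name, url, catalog_url):
    """Dataset markup for one report page, pointing back at its catalog."""
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": name,
        "url": url,
        "includedInDataCatalog": {"@type": "DataCatalog", "url": catalog_url},
    }


def datacatalog_jsonld(name, url, dataset_urls):
    """DataCatalog markup for the front page, listing datasets by URL only."""
    return {
        "@context": "http://schema.org",
        "@type": "DataCatalog",
        "name": name,
        "url": url,
        "dataset": [{"@type": "Dataset", "url": u} for u in dataset_urls],
    }


def as_script_tag(doc):
    """Wrap a JSON-LD document in the script element embedded in the HTML."""
    # Escape "</" so the JSON cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    body = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    print(as_script_tag(datacatalog_jsonld("Synbiomine", CATALOG_URL, [REPORT_URL])))
    print(as_script_tag(dataset_jsonld("Q816S6_BACCR", REPORT_URL, CATALOG_URL)))
```

The front page would carry the first script element and each report page the second, so the two structures point at each other through dataset and includedInDataCatalog.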

My top 5 immediate observations:

* Embedding JSON-LD itself is not hard.  More challenging is interpreting which schema.org properties to use and how to use them (e.g. CreativeWork.about or 
Thing.description?).

* Being able to link DataCatalog and Dataset (via the dataset and includedInDataCatalog properties) feels like a big win for embedding standardized structure in a 
website.  In my case, however, I have 2m+ 'datasets', which may cause issues if they are all embedded in a single DataCatalog structure (in my implementation I've 
artificially limited this to 500).  This may be due to my abuse of Dataset but the same problem could crop up in other contexts.

* Also in linking DataCatalog and Dataset, I am just embedding the Dataset URL in the DataCatalog and assuming that software will navigate to the 
Dataset and extract more information from that page.

* The GSDTT is essential for checking the markup, and having some equivalent implementation for Bioschemas specifications will be very useful.

* The GSDTT for some reason does not show multiple entries for the same property (e.g. shows only one citation in [5] even though there are many).  I presume 
this is just a GSDTT limitation.
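
As a rough illustration of the 500 cap and the URL-only entries mentioned above, here is a small Python sketch.  The helper name and the way report URLs are generated are hypothetical, not how InterMine does it; only the limit of 500 comes from my implementation.

```python
from itertools import islice

MAX_EMBEDDED = 500  # the artificial limit mentioned above


def dataset_stubs(report_urls, limit=MAX_EMBEDDED):
    """Return at most `limit` URL-only Dataset entries for a DataCatalog.

    `report_urls` can be any iterable, including a lazy one, so the full
    set of 2m+ URLs never has to be held in memory.
    """
    return [{"@type": "Dataset", "url": u} for u in islice(report_urls, limit)]


if __name__ == "__main__":
    # Hypothetical ids, only to show that the cap applies to a large stream.
    urls = (
        "http://beta.synbiomine.org/synbiomine/report.do?id=%d" % i
        for i in range(2_000_000)
    )
    stubs = dataset_stubs(urls)
    print(len(stubs))
```

Each entry carries only the type and URL, on the assumption that a consumer will follow the URL to the report page for the full Dataset markup.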

Overall, imo, it feels really nice to embed structured bio information directly in the website and this could be really valuable if all the markup is 
consistent.  Tooling here like GSDTT may be a big help.
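
For a quick local sanity check alongside the GSDTT, something like the following Python sketch could pull the JSON-LD back out of a page and count how many values a property has (e.g. to confirm that all the citations really are in the markup).  It uses only the standard library; the class and function names and the sample citation strings are made up for illustration, and it makes no claim about how the GSDTT itself works.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def count_values(doc, prop):
    """Number of values a top-level property has (0 if it is absent)."""
    value = doc.get(prop)
    if value is None:
        return 0
    return len(value) if isinstance(value, list) else 1


if __name__ == "__main__":
    # Placeholder page; the citation strings are not real references.
    page = """<html><head><script type="application/ld+json">
    {"@context": "http://schema.org", "@type": "Dataset",
     "name": "Q816S6_BACCR",
     "citation": ["Example citation A", "Example citation B", "Example citation C"]}
    </script></head><body></body></html>"""
    parser = JsonLdExtractor()
    parser.feed(page)
    parser.close()
    for doc in parser.documents:
        print(doc["@type"], count_values(doc, "citation"))
```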

[1] http://intermine.org/
[2] http://beta.synbiomine.org/synbiomine/begin.do
[3] http://beta.synbiomine.org/synbiomine/report.do?id=112968868
[4] https://search.google.com/structured-data/testing-tool#url=http%3A%2F%2Fbeta.synbiomine.org%2Fsynbiomine%2Fbegin.do
[5] https://search.google.com/structured-data/testing-tool#url=http%3A%2F%2Fbeta.synbiomine.org%2Fsynbiomine%2Freport.do%3Fid%3D112968868

Regards,

--
Justin Clark-Casey, Synbiomine/InterMine Developer
http://synbiomine.org
http://twitter.com/justincc

Received on Tuesday, 16 May 2017 15:13:01 UTC