Fwd: Very rough prototype implementation of DataCatalog/Dataset schema.org markup in InterMine from J. Clark-Casey on 2017-05-17 (public-bioschemas@w3.org from May 2017)

From: J. Clark-Casey <jc955@cam.ac.uk>
Date: Wed, 17 May 2017 07:16:40 +0100
To: public-bioschemas@w3.org
Message-ID: <d89ea51e7d5c668b3a639d1fe7205222@cam.ac.uk>
To acknowledge, this bit of prototyping work was done under Wellcome 
Trust support (for InterMine) and EPSRC/Flowers Consortium support (for 
Synbiomine).

Just as a quick additional note, many of the citation URLs in the 
Dataset structure are currently null only because it was a bit difficult 
for me to get at all the info in the quick prototype.  However, this 
does seem a very good way of embedding standardized structured 
attribution (and perhaps later citation) information.

-------- Original Message --------
Subject: Very rough prototype implementation of DataCatalog/Dataset 
schema.org markup in InterMine
Date: 2017-05-16 15:54
From: Justin Clark-Casey <jc955@cam.ac.uk>
To: public-bioschemas@w3.org

Hi all.  In advance of the Bioschemas meeting next week, I've hacked up 
a very rough implementation of schema.org markup in InterMine [1].  
Specifically, this is in an installation of InterMine called Synbiomine 
[2], a data warehouse for synthetic biology that I've been working on.  
This compiles information from many sources (EBI, NCBI, etc.) into 
integrated biological object reports (genes, proteins, parts, etc.).

In lieu of 'proper' Bioschemas structures, I've put in DataCatalog 
and Dataset.  In fact, I'm abusing Dataset to represent integrated 
objects (e.g. protein Q816S6_BACCR) but I wanted to experiment with 
linking structures (in this case DataCatalog and Dataset).  The front 
page embeds the DataCatalog and individual report pages (e.g. [3]) embed 
Dataset.  You can see the Google Structured Data Testing Tool (GSDTT) 
analysis of the front page at [4] and a particular report page at [5].
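
To make the embedding step concrete, here is a minimal Python sketch of 
wrapping a JSON-LD structure in the script element that tools such as 
the GSDTT read.  This is not the InterMine code; the function name 
jsonld_script_tag is hypothetical, and pairing the name Q816S6_BACCR 
with the report URL from [3] is an assumption made only for 
illustration.

```python
import json


def jsonld_script_tag(data: dict) -> str:
    """Serialize a JSON-LD dict and wrap it in an HTML script element."""
    # Escape "</" so that a string value cannot close the script element
    # early; "\/" is a valid JSON escape for "/", so parsers still accept it.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Illustrative Dataset for a single report page (values are assumptions).
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Q816S6_BACCR",
    "url": "http://beta.synbiomine.org/synbiomine/report.do?id=112968868",
}

print(jsonld_script_tag(dataset))
```

The resulting string can be placed anywhere in the page's HTML; the 
front page would carry a DataCatalog in the same way.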

My top 5 immediate observations:

* Embedding JSON-LD itself is not hard.  More challenging is 
interpreting which schema.org properties to use and how to use them 
(e.g. CreativeWork.about or Thing.description?).

* Being able to link DataCatalog and Dataset (via dataset and 
includedInDataCatalog properties) feels like a big win for embedding 
standardized structure in a website.  In my case, however, I have 2m+ 
'datasets' and this may cause issues embedding in a single DataCatalog 
structure (in my implementation I've artificially limited this to 500).  
This may be due to my abuse of Dataset but the same problem could crop 
up in other contexts.

* Also in linking DataCatalog and Dataset, I am just embedding the 
Dataset url in the DataCatalog, for instance, and assuming software will 
navigate to the Dataset and extract more information from that page.

* The GSDTT is essential for checking the markup and having some 
implementation for Bioschemas specifications will be very useful.

* The GSDTT for some reason does not show multiple entries for the same 
property (e.g. shows only one citation in [5] even though there are 
many).  I presume this is just a GSDTT limitation.
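
The linking points above can be sketched roughly as follows, in Python 
rather than InterMine's own code.  The function names, the example 
catalog name, the generated report ids and the citation strings are all 
hypothetical.  The sketch assumes the DataCatalog carries only the URL 
of each Dataset, caps the list at 500 as described above, and emits 
repeated citation values as a JSON array.

```python
import json

MAX_DATASETS = 500  # the artificial cap mentioned above


def build_catalog(catalog_url, name, dataset_urls, limit=MAX_DATASETS):
    """DataCatalog that points at each Dataset by URL only."""
    return {
        "@context": "http://schema.org",
        "@type": "DataCatalog",
        "name": name,
        "url": catalog_url,
        # Only the first `limit` entries are embedded; consumers are
        # assumed to follow each URL to find the full Dataset markup.
        "dataset": [
            {"@type": "Dataset", "url": u} for u in dataset_urls[:limit]
        ],
    }


def build_dataset(dataset_url, name, catalog_url, citations):
    """Dataset that links back to its catalog and lists every citation."""
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": name,
        "url": dataset_url,
        "includedInDataCatalog": {"@type": "DataCatalog", "url": catalog_url},
        # Multiple values for one property are expressed as an array.
        "citation": [{"@type": "CreativeWork", "name": c} for c in citations],
    }


base = "http://beta.synbiomine.org/synbiomine"
# Hypothetical report ids, generated only to exercise the cap.
urls = [f"{base}/report.do?id={i}" for i in range(600)]

catalog = build_catalog(f"{base}/begin.do", "Synbiomine", urls)
report = build_dataset(urls[0], "Example report", f"{base}/begin.do",
                       ["Example source A", "Example source B"])

print(len(catalog["dataset"]))
print(json.dumps(report, indent=2))
```

With 2m+ objects, a cap like this keeps the front page small but leaves 
most Datasets undiscoverable from the DataCatalog alone, which is the 
trade-off raised in the second point.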

Overall, imo, it feels really nice to embed structured bio information 
directly in the website and this could be really valuable if all the 
markup is consistent.  Tooling here like GSDTT may be a big help.

[1] http://intermine.org/
[2] http://beta.synbiomine.org/synbiomine/begin.do
[3] http://beta.synbiomine.org/synbiomine/report.do?id=112968868
[4] https://search.google.com/structured-data/testing-tool#url=http%3A%2F%2Fbeta.synbiomine.org%2Fsynbiomine%2Fbegin.do
[5] https://search.google.com/structured-data/testing-tool#url=http%3A%2F%2Fbeta.synbiomine.org%2Fsynbiomine%2Freport.do%3Fid%3D112968868

Regards,

--
Justin Clark-Casey, Synbiomine/InterMine Developer
http://synbiomine.org
http://twitter.com/justincc
Received on Wednesday, 17 May 2017 06:17:11 UTC