
RE: DataRecord and Dataset Search

From: Shimoyama, Mary <shimoyama@mcw.edu>
Date: Mon, 10 Sep 2018 15:31:24 +0000
To: "Clark, Timothy W." <TWCLARK@mgh.harvard.edu>, Rafael Jimenez <rafael.jimenez@elixir-europe.org>
CC: Leyla Garcia <ljgarcia@ebi.ac.uk>, "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk>, Dan Brickley <danbri@google.com>, "public-bioschemas@w3.org" <public-bioschemas@w3.org>, Natasha Noy <noy@google.com>, Vicki Tardif Holland <vtardif@google.com>
Message-ID: <c84bd25f4b254b1ca5ff465acf84db08@MCWMB3c.mcwcorp.net>
Absolutely agree with (b): the elements on a web page change regularly; for most of the MODs, these can change daily, weekly, or monthly. He is right that the elements displayed on a webpage are integrated from queries and present views of data from multiple datasets within the overall structure of the resource. As for (c), on a gene page there are elements that each have unique identifiers.

From: Clark, Timothy W. [mailto:TWCLARK@mgh.harvard.edu]
Sent: Monday, September 10, 2018 10:06 AM
To: Rafael Jimenez <rafael.jimenez@elixir-europe.org>
Cc: Shimoyama, Mary <shimoyama@mcw.edu>; Leyla Garcia <ljgarcia@ebi.ac.uk>; Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>; Dan Brickley <danbri@google.com>; public-bioschemas@w3.org; Natasha Noy <noy@google.com>; Vicki Tardif Holland <vtardif@google.com>
Subject: Re: DataRecord and Dataset Search

(a) Pointing out that since a set may have cardinality = 1, a data record is certainly a data set.

(b) I wonder if using the concept “record” to mean the contents of a web page could be problematic when pages are constructed by queries and views on underlying data resources and assembled not based on normalization rules but for best UX purposes and contain a melange of many elements some of which are repeating.

(c) For example, suppose we assign FOO:0010 to identify a web page containing some information which as a whole is not in 1st normal form, i.e. it contains some unique attributes and some repeating groups. And those group elements have their own identifiers assigned, e.g. FOO:0001, FOO:0002, etc. What are we looking at? Does FOO:0010 identify a dataset or a data record?

(d) But if you stick with dataset “all the way down” you may be better off, FOO:0010, FOO:0001, and FOO:0002 are all datasets.

Something to consider.
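
A minimal sketch of option (d), assuming the FOO identifiers from (c) and schema.org's Dataset type with its hasPart property. The helper names (`dataset`, `all_types`) and the dict layout are illustrative assumptions, not part of any Bioschemas profile. Every node, whether the page or one of its repeating group elements, is typed as a Dataset, so the dataset-or-record question in (c) does not come up.

```python
import json

def dataset(identifier, parts=()):
    """Build a schema.org-style Dataset node; parts are nested Dataset nodes."""
    node = {"@type": "Dataset", "identifier": identifier}
    if parts:
        node["hasPart"] = list(parts)
    return node

def all_types(node):
    """Collect the @type of a node and of everything nested under hasPart."""
    types = [node["@type"]]
    for part in node.get("hasPart", []):
        types.extend(all_types(part))
    return types

# FOO:0010 is the page; FOO:0001 and FOO:0002 are its repeating group elements.
page = dataset("FOO:0010", [dataset("FOO:0001"), dataset("FOO:0002")])
page["@context"] = "https://schema.org"

print(json.dumps(page, indent=2))
```

The trade-off is that the markup no longer says which node is "the page" and which is "a row"; that distinction would have to be carried by other properties.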


On Sep 10, 2018, at 10:40 AM, Rafael C. Jimenez <rafael.jimenez@elixir-europe.org> wrote:


To me, a data record could belong to one or more datasets. It depends on the structure and organisation of the data resource. Data records could be organised in datasets in many different ways. For instance, by the species they belong to, the disease they have been classified under (e.g. cardiomegaly) or the experiment they were identified in.

To give some examples, below are some links pointing to different types of data records:

  - Protein record in UniProt: http://identifiers.org/uniprot:P69905
  - Protein record in PDB: http://identifiers.org/pdb:4n7n
  - Chemical record in ChEBI: http://identifiers.org/CHEBI:27732
  - Gene record in ENSEMBL: http://identifiers.org/ensembl:ENSG00000244734
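
The four links above share one visible pattern: a namespace prefix and an accession joined by a colon under http://identifiers.org/. A small sketch of that pattern follows; the function name `identifiers_org_url` is a hypothetical helper, and the code only reproduces the URL shape seen in these examples, not the resolver's full rules.

```python
def identifiers_org_url(prefix, accession):
    """Join a namespace prefix and an accession into an identifiers.org URL."""
    return f"http://identifiers.org/{prefix}:{accession}"

# The (prefix, accession) pairs are taken from the links listed above.
records = [
    ("uniprot", "P69905"),
    ("pdb", "4n7n"),
    ("CHEBI", "27732"),
    ("ensembl", "ENSG00000244734"),
]

for prefix, accession in records:
    print(identifiers_org_url(prefix, accession))
```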

I like that in Bioschemas we are trying to annotate all our data resources using a few types and relationships: DataCatalog -> DataSet -> DataRecord[BioChemEntity]. Some of our data resources like EGA or OmicsDI will have a high number of datasets, but I think the majority of our resources (UniProt, PDB, ChEBI or ENSEMBL mentioned above) will have a high number of data records and few datasets. Sometimes some of our data resources might even have just one dataset for all their data records.

The alternative Alasdair is talking about is to use the DataSet type for the concept of DataRecord, i.e. to change to DataCatalog -> DataSet -> DataSet[BioChemEntity]. Though for some people it might not be semantically correct, I think this approach has some advantages: 1. We do not need to propose a new DataRecord type to schema.org; 2. The properties we wanted to use for DataRecord are already in the DataSet type; 3. Our data records will also be displayed in the Google dataset search; 4. It does not really change much the way we have been working in Bioschemas.
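
To make the two options concrete, here is a hedged sketch that builds the same DataCatalog -> DataSet -> record chain as JSON-LD and switches only the record's type. The function name `annotate`, the catalog and dataset names, and the use of `hasPart` and `mainEntity` to connect the levels are assumptions for illustration; DataRecord and BioChemEntity are the types under discussion in this thread, not terms guaranteed to exist in schema.org.

```python
import json

def annotate(record_type):
    """Return a DataCatalog -> Dataset -> record -> entity chain as JSON-LD.

    record_type is "DataRecord" (the proposed new type) or "Dataset"
    (the re-use alternative); nothing else in the markup changes.
    """
    return {
        "@context": "https://schema.org",
        "@type": "DataCatalog",
        "name": "Example catalog",        # hypothetical name
        "dataset": {
            "@type": "Dataset",
            "name": "Example dataset",    # hypothetical name
            "hasPart": {
                "@type": record_type,
                "@id": "http://identifiers.org/uniprot:P69905",
                "mainEntity": {"@type": "BioChemEntity"},
            },
        },
    }

proposed = annotate("DataRecord")
reused = annotate("Dataset")
print(json.dumps(reused, indent=2))
```

In this sketch the two variants differ in a single `@type` value, which is one way to read the point that the change would not alter much of the existing way of working.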

Bringing back the question from Alasdair, which I think is important: should we push for a new DataRecord type in Schema.org, or should we re-use the DataSet type instead?


On Mon, 10 Sep 2018 at 15:19, Shimoyama, Mary <shimoyama@mcw.edu> wrote:
Is there some thought to the idea of a data record belonging to multiple datasets? For example, there is an annotation for the rat A2m gene indicating it is associated with cardiomegaly. Does this A2m-cardiomegaly record belong to the dataset of the A2m gene and all of the data related to A2m, does it belong to the dataset of Cardiomegaly and all of the genes associated with cardiomegaly, does it belong to the dataset of  all the annotations and data taken from PMID:12494268/RGDID:1549856, does it belong to the dataset of all rat genes and their disease annotations or does it belong to the dataset of the entire RGD corpus of data?
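
A sketch of what "belonging to multiple datasets" could look like in markup, assuming the record simply lists several values for schema.org's isPartOf. The field names in `annotation`, the helper `candidate_datasets`, and the dataset names it produces are illustrative assumptions; only the gene, disease, reference, species and resource values come from the question above.

```python
import json

# One annotation record; every value here comes from the question above.
annotation = {
    "gene": "A2m",
    "species": "rat",
    "disease": "cardiomegaly",
    "reference": "PMID:12494268",
    "resource": "RGD",
}

def candidate_datasets(record):
    """List the dataset groupings named in the question for one record."""
    return [
        f"all data for gene {record['gene']}",
        f"all genes associated with {record['disease']}",
        f"all annotations from {record['reference']}",
        f"all {record['species']} gene-disease annotations",
        f"the entire {record['resource']} corpus",
    ]

record_markup = {
    "@type": "Dataset",  # or the proposed DataRecord type
    "name": "A2m-cardiomegaly annotation",
    "isPartOf": [
        {"@type": "Dataset", "name": name}
        for name in candidate_datasets(annotation)
    ],
}

print(json.dumps(record_markup, indent=2))
```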

-----Original Message-----
From: Clark, Timothy W. [mailto:TWCLARK@mgh.harvard.edu]
Sent: Monday, September 10, 2018 8:04 AM
To: ljgarcia <ljgarcia@ebi.ac.uk>
Cc: Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>; Dan Brickley <danbri@google.com>; public-bioschemas@w3.org; Natasha Noy <noy@google.com>; Vicki Tardif Holland <vtardif@google.com>; Shimoyama, Mary <shimoyama@mcw.edu>
Subject: Re: DataRecord and Dataset Search


Just adding in Mary Shimoyama, PI of RGD, to this discussion.

> On Sep 10, 2018, at 8:35 AM, ljgarcia <ljgarcia@ebi.ac.uk> wrote:
> Hi Alasdair,
> It sounds to me like you have covered it all. Maybe just some more information about how we link sdo:Dataset, bs:DataRecord and bs:BioChemEntity: sdo:Dataset sdo:hasPart bs:DataRecord (DataRecord actually extends Dataset), and in turn bs:DataRecord sdo:isPartOf sdo:Dataset. A bs:DataRecord has sdo:mainEntity bs:BioChemEntity, and a bs:BioChemEntity is sdo:mainEntityOfPage of a bs:DataRecord.
> DataRecord includes two additional properties:
> * sdo:additionalProperty, because we want everybody to be able to add unnamed properties as needed
> * bs:seeAlso, so there can be links to related data records in other datasets; this one is very important in Life Sciences.
> Note: I am using sdo for schema.org and bs for bioschemas, although bioschemas types along with their properties should go to schema.org at some point (hopefully soon).
> Regards,
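
The links described in this message come in inverse pairs (hasPart with isPartOf, mainEntity with mainEntityOfPage). The following is a rough sketch of a consistency check over such a graph, assuming a plain dict-of-dicts representation; the example.org identifiers, the `graph` layout and the function `missing_inverse_links` are all hypothetical and not part of any Bioschemas tooling.

```python
DATASET_ID = "https://example.org/dataset"   # hypothetical identifier
RECORD_ID = "https://example.org/record/1"   # hypothetical identifier
ENTITY_ID = "https://example.org/entity/1"   # hypothetical identifier

graph = {
    DATASET_ID: {"@type": "sdo:Dataset", "sdo:hasPart": [RECORD_ID]},
    RECORD_ID: {
        "@type": "bs:DataRecord",
        "sdo:isPartOf": [DATASET_ID],
        "sdo:mainEntity": [ENTITY_ID],
        "sdo:additionalProperty": [],  # free-form extra properties
        "bs:seeAlso": [],              # related records in other datasets
    },
    ENTITY_ID: {"@type": "bs:BioChemEntity", "sdo:mainEntityOfPage": [RECORD_ID]},
}

INVERSES = {
    "sdo:hasPart": "sdo:isPartOf",
    "sdo:mainEntity": "sdo:mainEntityOfPage",
}

def missing_inverse_links(graph):
    """Return (subject, property, object) triples whose inverse link is absent."""
    both_ways = {**INVERSES, **{v: k for k, v in INVERSES.items()}}
    missing = []
    for subject, node in graph.items():
        for prop, inverse in both_ways.items():
            for obj in node.get(prop, []):
                if subject not in graph.get(obj, {}).get(inverse, []):
                    missing.append((subject, prop, obj))
    return missing

print(missing_inverse_links(graph))
```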
> On 2018-09-09 19:03, Gray, Alasdair J G wrote:
>> Hi Dan
>> In the life sciences datasets, the individual records tend to get
>> their own web page, i.e. each concept in the database would have its
>> own page. The idea for the DataRecord is to be able declare that the
>> page about a concept is part of a Dataset.
>> I believe the approach is agnostic to the underlying storage, i.e.
>> the page could be generated from a relational database which pulls
>> data about the concept from multiple tables, a triplestore, or some
>> other form of database. It is more about the granularity of this
>> being about a single concept, e.g. row in a relational database with
>> its foreign keys.
>> Leyla, Rafa, Susanna, what do you think? Have I characterised this
>> correctly or are there things in Dan’s email that I am missing.
>> Alasdair
>>> On 7 Sep 2018, at 18:12, Dan Brickley <danbri@google.com> wrote:
>>> (+Natasha Noy, +Vicki Tardif Holland) On Fri, 7 Sep 2018 at 15:54,
>>> Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk> wrote:
>>>> Hi Dan,
>>>> Great to see the announcement this week about the Google Dataset
>>>> search. Here is a link to a blog post for anyone who has not seen
>>>> it yet
>>>> https://www.blog.google/products/search/making-it-easier-discover-datasets/
>>>> Within Bioschemas, we have been building up a profile usage of
>>>> DataCatalog containing Dataset(s) which themselves contain
>>>> DataRecords. A DataRecord is something that we would be proposing
>>>> as an addition to schema.org [1]. The idea is that a DataRecord is
>>>> contained within a Dataset and would specify the types of entity
>>>> that the record is about, e.g. Protein.
>>>> http://bioschemas.org/types/DataRecord/specification/
>>>> We would like to understand whether
>>>> DataRecord is an idea to which the schema.org [1] community would
>>>> be receptive. An alternative approach would be to use Dataset for
>>>> both records within a Dataset and the Dataset itself.
>>> It is certainly a direction worth exploring and discussing.
>>> One issue to think through (and I think I raised this at a
>>> bioschemas f2f last year) is that "Dataset" is a very broad notion.
>>> Some but not all datasets are tabular for example. And tabular (e.g.
>>> csv, sql) structures have non-trivial mappings to "entity"-oriented
>>> and "record"-oriented representations. Other formats will have
>>> different (and possibly simpler) ideas about "records". Thinking
>>> about tabular first, there are complex mapping languages like D2RQ
>>> or
>>> https://www.w3.org/TR/r2rml/ and the RDF graph it generates versus a
>>> rows-as-records view, how would your draft design deal with
>>> multi-table datasets?
>>> Nearby in this world are specs like W3C CSVW, Data Cube, ... lots of
>>> overlaps. It would be great to work through some examples in
>>> detail...
>>> Dan
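
The multi-table question can be illustrated with a toy two-table database, contrasting a rows-as-records view with an entity-oriented view that follows the foreign key. This is only a sketch under assumptions: the `gene` and `annotation` tables, their columns and their contents are invented for illustration, and the code uses Python's built-in sqlite3 rather than D2RQ or R2RML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (id TEXT PRIMARY KEY, symbol TEXT);
    CREATE TABLE annotation (gene_id TEXT REFERENCES gene(id), disease TEXT);
    INSERT INTO gene VALUES ('G1', 'GENE1');
    INSERT INTO annotation VALUES ('G1', 'disease A');
    INSERT INTO annotation VALUES ('G1', 'disease B');
""")

def rows_as_records(conn):
    """One record per physical row, per table."""
    records = []
    for table in ("gene", "annotation"):
        for row in conn.execute(f"SELECT * FROM {table}"):
            records.append((table, row))
    return records

def entity_records(conn):
    """One record per gene, pulling related rows in through the foreign key."""
    records = {}
    query = """
        SELECT g.id, g.symbol, a.disease
        FROM gene g LEFT JOIN annotation a ON a.gene_id = g.id
        ORDER BY a.rowid
    """
    for gene_id, symbol, disease in conn.execute(query):
        entry = records.setdefault(gene_id, {"symbol": symbol, "diseases": []})
        if disease is not None:
            entry["diseases"].append(disease)
    return records

print(len(rows_as_records(conn)), "row-level records")
print(len(entity_records(conn)), "entity-level records")
```

The same data yields a different number of "records" depending on which view is taken, which is the ambiguity a DataRecord definition would need to settle.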
>>>> Thanks
>>>> Alasdair
>>>> --
>>>> Alasdair J G Gray
>>>> Associate Professor in Computer Science, School of Mathematical and
>>>> Computer Sciences Heriot-Watt University, Edinburgh, UK.
>>>> Email: A.J.G.Gray@hw.ac.uk
>>>> Web: http://www.macs.hw.ac.uk/~ajg33
>>>> ORCID: http://orcid.org/0000-0002-5711-4872
>>>> Office: Earl Mountbatten Building 1.39
>>>> Twitter: @gray_alasdair
>> Links:
>> ------
>> [1] http://schema.org/



Rafael C Jimenez

ELIXIR Chief Data Architect

ELIXIR Hub, South Building
Wellcome Genome Campus
Hinxton, Cambridge, CB10 1SD, UK
Tel: +44 (0) 1223 49 2574
E-Mail: rafael.jimenez@elixir-europe.org


Received on Monday, 10 September 2018 15:33:08 UTC
