Re: DataRecord and Dataset Search

To me, a data record could belong to one or more datasets, depending on
the structure and organisation of the data resource. Data records could be
organised into datasets in many different ways: for instance, by the species
they belong to, the disease they have been associated with (e.g.
cardiomegaly) or the experiment they were identified in.
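
As a rough sketch of the point above (not how any particular resource
actually organises its data), the Python below places one annotation record
into several datasets, one per facet. The field names and the dataset naming
scheme are assumptions made for illustration; the values are borrowed from
the rat A2m example in the quoted message further down.

```python
from collections import defaultdict

# Hypothetical annotation record; field names are illustrative only.
record = {
    "id": "A2m-cardiomegaly",
    "species": "rat",
    "disease": "cardiomegaly",
    "source": "PMID:12494268",
}


def group_into_datasets(records, facets):
    """Return {dataset_name: [record ids]}, with one dataset per facet value."""
    datasets = defaultdict(list)
    for rec in records:
        for facet in facets:
            if facet in rec:
                datasets[f"{facet}:{rec[facet]}"].append(rec["id"])
    return dict(datasets)


# The same record ends up in three datasets at once.
datasets = group_into_datasets([record], ["species", "disease", "source"])
for name, members in datasets.items():
    print(name, members)
```

Each facet value becomes its own dataset, so membership is many-to-many
rather than a strict hierarchy.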

To give some examples, below are links to different types of data
records:

  - Protein record in UniProt: http://identifiers.org/uniprot:P69905
  - Protein record in PDB: http://identifiers.org/pdb:4n7n
  - Chemical record in ChEBI: http://identifiers.org/CHEBI:27732
  - Gene record in ENSEMBL: http://identifiers.org/ensembl:ENSG00000244734

I like that in Bioschemas we are trying to annotate all our data resources
using a few types and relationships: DataCatalog -> Dataset ->
DataRecord[BioChemEntity]. Some of our data resources, like EGA or OmicsDI,
will have a high number of datasets, but I think the majority of our
resources (UniProt, PDB, ChEBI or ENSEMBL, mentioned above) will have a high
number of data records and few datasets. Some of our data resources might
even have just one dataset for all their data records.
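
To make the chain concrete, here is a minimal sketch of what the markup
could look like, built as a Python dictionary and printed as JSON-LD. It
uses the hasPart / isPartOf / mainEntity links described in the quoted
message below. DataRecord and BioChemEntity are Bioschemas draft types
rather than schema.org types, and the catalog and dataset URLs are
placeholders I made up; only the UniProt identifier comes from the list
above.

```python
import json

# Assumptions: the example.org URLs are placeholders, and DataRecord and
# BioChemEntity are Bioschemas draft types, not schema.org types.
CATALOG_ID = "https://example.org/catalog"
DATASET_ID = "https://example.org/catalog/proteins"
RECORD_ID = "http://identifiers.org/uniprot:P69905"


def build_markup():
    """Build a DataCatalog -> Dataset -> DataRecord[BioChemEntity] tree."""
    entity = {"@type": "BioChemEntity", "identifier": "uniprot:P69905"}
    record = {
        "@type": "DataRecord",
        "@id": RECORD_ID,
        "isPartOf": {"@id": DATASET_ID},
        "mainEntity": entity,
    }
    dataset = {
        "@type": "Dataset",
        "@id": DATASET_ID,
        "includedInDataCatalog": {"@id": CATALOG_ID},
        "hasPart": [record],
    }
    return {
        "@context": "http://schema.org",
        "@type": "DataCatalog",
        "@id": CATALOG_ID,
        "dataset": [dataset],
    }


markup = build_markup()
print(json.dumps(markup, indent=2))
```

A resource with a single dataset would simply have one entry in "dataset"
and a long "hasPart" list.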

The alternative Alasdair is talking about is to use the Dataset type for
the concept of DataRecord, i.e. to change to DataCatalog -> Dataset ->
Dataset[BioChemEntity]. Though for some people it might not be entirely
correct semantically, I think this approach has some advantages:

  1. We do not need to propose a new DataRecord type to schema.org.
  2. The properties we wanted to use for DataRecord are already in the
     Dataset type.
  3. Our data records will also be displayed in the Google dataset search.
  4. It does not really change much the way we have been working in
     Bioschemas.
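
To illustrate how small the change would be in the markup itself, here is a
sketch that retypes a DataRecord-style node as a Dataset and leaves every
other property untouched. The function name and the sample node are
hypothetical, and this says nothing about how consumers such as the dataset
search would treat the result.

```python
import copy


def as_dataset(node):
    """Return a copy of a DataRecord-style node retyped as a Dataset."""
    converted = copy.deepcopy(node)
    if converted.get("@type") == "DataRecord":
        converted["@type"] = "Dataset"
    return converted


# Hypothetical record node; the dataset URL is a placeholder.
record = {
    "@type": "DataRecord",
    "@id": "http://identifiers.org/uniprot:P69905",
    "isPartOf": {"@id": "https://example.org/catalog/proteins"},
    "mainEntity": {"@type": "BioChemEntity", "identifier": "uniprot:P69905"},
}

converted = as_dataset(record)
print(converted["@type"])
```

Only the "@type" value differs between the two versions; isPartOf and
mainEntity are carried over as they are.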

Coming back to Alasdair's question, which I think is important: should we
push for a new DataRecord type in schema.org, or should we re-use the
Dataset type instead?

Regards,
Rafa


On Mon, 10 Sep 2018 at 15:19, Shimoyama, Mary <shimoyama@mcw.edu> wrote:

> Is there some thought to the idea of a data record belonging to multiple
> datasets? For example, there is an annotation for the rat A2m gene
> indicating it is associated with cardiomegaly. Does this A2m-cardiomegaly
> record belong to the dataset of the A2m gene and all of the data related to
> A2m, does it belong to the dataset of Cardiomegaly and all of the genes
> associated with cardiomegaly, does it belong to the dataset of  all the
> annotations and data taken from PMID:12494268/RGDID:1549856, does it belong
> to the dataset of all rat genes and their disease annotations or does it
> belong to the dataset of the entire RGD corpus of data?
>
> -----Original Message-----
> From: Clark, Timothy W. [mailto:TWCLARK@mgh.harvard.edu]
> Sent: Monday, September 10, 2018 8:04 AM
> To: ljgarcia <ljgarcia@ebi.ac.uk>
> Cc: Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>; Dan Brickley <
> danbri@google.com>; public-bioschemas@w3.org; Natasha Noy <noy@google.com>;
> Vicki Tardif Holland <vtardif@google.com>; Shimoyama, Mary <
> shimoyama@mcw.edu>
> Subject: Re: DataRecord and Dataset Search
>
> Just adding in Mary Shimoyama, PI of RGD, to this discussion.
>
> > On Sep 10, 2018, at 8:35 AM, ljgarcia <ljgarcia@ebi.ac.uk> wrote:
> >
> > Hi Alasdair,
> >
> > It sounds to me like you have covered it all. Maybe just some more
> > information about how we link sdo:Dataset, bs:DataRecord and
> > bs:BioChemEntity: sdo:Dataset sdo:hasPart bs:DataRecord (DataRecord
> > actually extends from Dataset) and then bs:DataRecord sdo:isPartOf
> > sdo:Dataset. A bs:DataRecord has sdo:mainEntity bs:BioChemEntity and
> > then a bs:BioChemEntity is sdo:mainEntityOfPage of a bs:DataRecord.
> >
> > DataRecord includes two additional properties:
> > * sdo:additionalProperty, because we want everybody to be able to add
> > otherwise unnamed properties as needed
> > * bs:seeAlso, so there can be links to related data records in other
> > datasets; this one is very important in Life Sciences.
> >
> > Note: I am using sdo for schema.org and bs for bioschemas, although
> > bioschemas types along with their properties should go to schema.org at
> > some point (hopefully soon).
> >
> > Regards,
> >
> > On 2018-09-09 19:03, Gray, Alasdair J G wrote:
> >> Hi Dan
> >> In the life sciences datasets, the individual records tend to get
> >> their own web page, i.e. each concept in the database would have its
> >> own page. The idea for the DataRecord is to be able to declare that
> >> the page about a concept is part of a Dataset.
> >> I believe the approach is agnostic to the underlying storage, i.e.
> >> the page could be generated from a relational database which pulls
> >> data about the concept from multiple tables, a triplestore, or some
> >> other form of database. It is more about the granularity of this
> >> being about a single concept, e.g. row in a relational database with
> >> its foreign keys.
> >> Leyla, Rafa, Susanna, what do you think? Have I characterised this
> >> correctly or are there things in Dan’s email that I am missing?
> >> Alasdair
> >>> On 7 Sep 2018, at 18:12, Dan Brickley <danbri@google.com> wrote:
> >>> (+Natasha Noy, +Vicki Tardif Holland) On Fri, 7 Sep 2018 at 15:54,
> >>> Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk> wrote:
> >>>> Hi Dan,
> >>>> Great to see the announcement this week about the Google Dataset
> >>>> search. Here is a link to a blog post for anyone who has not seen
> >>>> it yet
> >>>> https://www.blog.google/products/search/making-it-easier-discover-datasets/
> >>>> Within Bioschemas, we have been building up a profile usage of
> >>>> DataCatalog containing Dataset(s) which themselves contain
> >>>> DataRecords. A DataRecord is something that we would be proposing
> >>>> as an addition to schema.org [1]. The idea is that a DataRecord is
> >>>> contained within a Dataset and would specify the types of entity
> >>>> that the record is about, e.g. Protein.
> >>>> http://bioschemas.org/types/DataRecord/specification/
> >>>> We would like to understand whether
> >>>> DataRecord is an idea to which the schema.org [1] community would
> >>>> be receptive. An alternative approach would be to use Dataset for
> >>>> both records within a Dataset and the Dataset itself.
> >>> It is certainly a direction worth exploring and discussing.
> >>> One issue to think through (and I think I raised this at a
> >>> bioschemas f2f last year) is that "Dataset" is a very broad notion.
> >>> Some but not all datasets are tabular for example. And tabular (e.g.
> >>> csv, sql) structures have non-trivial mappings to "entity"-oriented
> >>> and "record"-oriented representations. Other formats will have
> >>> different (and possibly simpler) ideas about "records". Thinking
> >>> about tabular first, there are complex mapping languages like D2RQ
> >>> or https://www.w3.org/TR/r2rml/ and the RDF graph it generates
> >>> versus a rows-as-records view, how would your draft design deal
> >>> with multi-table datasets?
> >>> Nearby in this world are specs like W3C CSVW, Data Cube, ... lots of
> >>> overlaps. It would be great to work through some examples in
> >>> detail...
> >>> Dan
> >>>> Thanks
> >>>> Alasdair
> >>>> --
> >>>> Alasdair J G Gray
> >>>> Associate Professor in Computer Science, School of Mathematical and
> >>>> Computer Sciences Heriot-Watt University, Edinburgh, UK.
> >>>> Email: A.J.G.Gray@hw.ac.uk
> >>>> Web: http://www.macs.hw.ac.uk/~ajg33
> >>>> ORCID: http://orcid.org/0000-0002-5711-4872
> >>>> Office: Earl Mountbatten Building 1.39
> >>>> Twitter: @gray_alasdair
> >> Links:
> >> ------
> >> [1]
> >> http://schema.org/
> >
>


-- 

*Rafael C Jimenez*
ELIXIR Chief Data Architect
www.elixir-europe.org

ELIXIR Hub, South Building
Wellcome Genome Campus
Hinxton, Cambridge, CB10 1SD, UK
Tel: +44 (0) 1223 49 2574
E-Mail: rafael.jimenez@elixir-europe.org

Received on Monday, 10 September 2018 14:40:09 UTC