RE: DataRecord and Dataset Search

Absolutely agree with (b): the elements on a web page change regularly. For most of the MODs, these can change daily, weekly, or monthly. He is right that the elements displayed on a web page are integrated from queries and present views of data from multiple datasets within the overall structure of the resource. As for (c), on a gene page there are elements that each have unique identifiers.

Subject: Re: DataRecord and Dataset Search

(a) Pointing out that since a set may have cardinality = 1, a data record is certainly a data set.

(b) I wonder if using the concept “record” to mean the contents of a web page could be problematic. Pages are constructed by queries and views on underlying data resources; they are assembled for best UX purposes rather than according to normalization rules, and they contain a melange of many elements, some of which are repeating.

(c) For example, suppose we assign FOO:0010 to identify a web page containing some information that is not all in 1st normal form, i.e. it contains some unique attributes and some repeating groups. And suppose those group elements have their own identifiers assigned, e.g. FOO:0001, FOO:0002, etc. What are we looking at? Does FOO:0010 identify a dataset or a data record?

(d) But if you stick with dataset “all the way down” you may be better off: FOO:0010, FOO:0001, and FOO:0002 are all datasets.

Something to consider.
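
The "dataset all the way down" reading in (d) can be sketched as nested JSON-LD-style dictionaries. This is only a rough illustration: the FOO identifiers come from the example above, `hasPart` and `identifier` are existing schema.org properties, and the `walk` helper is a made-up name for this sketch.

```python
# Hypothetical page FOO:0010 whose repeating groups carry their own identifiers.
# Every level is typed as a Dataset, as suggested in (d).
page = {
    "@type": "Dataset",
    "identifier": "FOO:0010",
    "hasPart": [
        {"@type": "Dataset", "identifier": "FOO:0001"},
        {"@type": "Dataset", "identifier": "FOO:0002"},
    ],
}


def walk(node):
    """Yield the identifier of a dataset and of every nested part, depth first."""
    yield node["identifier"]
    for part in node.get("hasPart", []):
        yield from walk(part)


print(list(walk(page)))
```

With a single type at every level, a consumer does not need to decide where "dataset" stops and "record" starts; it only follows `hasPart`.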


To me, a data record could belong to one or more datasets. It depends on the structure and organisation of the data resource. Data records could be organised in datasets in many different ways, for instance by the species they belong to, the disease they have been classified under (e.g. cardiomegaly), or the experiment they were identified in.

To give some examples, here are some links pointing to different types of data records:

  - Protein record in UniProt: http://identifiers.org/uniprot:P69905
  - Protein record in PDB: http://identifiers.org/pdb:4n7n
  - Chemical record in ChEBI: http://identifiers.org/CHEBI:27732
  - Gene record in ENSEMBL: http://identifiers.org/ensembl:ENSG00000244734
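
The four links above share one pattern, `http://identifiers.org/<prefix>:<accession>`. A minimal sketch of that pattern follows; the function name `identifiers_org_url` is an assumption made for illustration, and the prefixes are copied exactly as they appear in the links.

```python
def identifiers_org_url(prefix: str, accession: str) -> str:
    """Build a URL of the form used in the list above."""
    return f"http://identifiers.org/{prefix}:{accession}"


# (prefix, accession) pairs taken from the four example links.
examples = [
    ("uniprot", "P69905"),
    ("pdb", "4n7n"),
    ("CHEBI", "27732"),
    ("ensembl", "ENSG00000244734"),
]

for prefix, accession in examples:
    print(identifiers_org_url(prefix, accession))
```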

I like that in Bioschemas we are trying to annotate all our data resources using a few types and relationships: DataCatalog -> DataSet -> DataRecord[BioChemEntity]. Some of our data resources, like EGA or OmicsDI, will have a high number of datasets, but I think the majority of our resources (UniProt, PDB, ChEBI or ENSEMBL, mentioned above) will have a high number of data records and few datasets. Some of our data resources might even have just one dataset for all their data records.

The alternative Alasdair is talking about is to use the DataSet type for the concept of DataRecord. It would mean changing to DataCatalog -> DataSet -> DataSet[BioChemEntity]. Though some people might not find it semantically correct, I think this approach has some advantages: 1. We do not need to propose a new DataRecord type to schema.org. 2. The properties we wanted to use for DataRecord are already in the DataSet type. 3. Our data records will also be displayed in the Google dataset search. 4. It does not really change much the way we have been working in Bioschemas.
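
To make the comparison concrete, here is a rough sketch of the markup for one record page under both options. The assumptions are: `describe_record` is a made-up helper, the example.org URL is a placeholder, and "DataRecord" is the proposed Bioschemas type rather than an existing schema.org type. `isPartOf` and `mainEntity` are existing schema.org properties of CreativeWork, which Dataset inherits.

```python
import json


def describe_record(name, entity_type, dataset_url, record_type="DataRecord"):
    """Return JSON-LD-style markup for one record page.

    record_type is "DataRecord" (proposed Bioschemas type) or
    "Dataset" (existing schema.org type).
    """
    return {
        "@context": "http://schema.org",
        "@type": record_type,
        "name": name,
        "isPartOf": {"@type": "Dataset", "url": dataset_url},
        "mainEntity": {"@type": entity_type, "name": name},
    }


# Hypothetical dataset URL; the record name reuses the UniProt accession above.
proposed = describe_record("P69905", "BioChemEntity", "http://example.org/dataset")
reused = describe_record("P69905", "BioChemEntity", "http://example.org/dataset", "Dataset")

# Collect the keys whose values differ between the two options.
changed = {key for key in proposed if proposed[key] != reused[key]}
print(changed)
print(json.dumps(reused, indent=2))
```

In this sketch only the `@type` value differs between the two options, which is the sense in which the change would be small for people already producing the markup.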

Bringing back the question from Alasdair, which I think is important: should we push for a new DataRecord type in Schema.org or should we re-use the DataSet type instead?


Is there some thought to the idea of a data record belonging to multiple datasets? For example, there is an annotation for the rat A2m gene indicating it is associated with cardiomegaly. Does this A2m-cardiomegaly record belong to the dataset of the A2m gene and all of the data related to A2m, does it belong to the dataset of Cardiomegaly and all of the genes associated with cardiomegaly, does it belong to the dataset of all the annotations and data taken from PMID:12494268/RGDID:1549856, does it belong to the dataset of all rat genes and their disease annotations, or does it belong to the dataset of the entire RGD corpus of data?
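
One way to read this question is that membership is many-to-many, so a record's `isPartOf` could simply carry several values. The sketch below assumes that reading; the dataset labels are hypothetical names for the groupings listed above, and `index_by_dataset` is a made-up helper.

```python
from collections import defaultdict

# Hypothetical labels for the five groupings named in the question.
record = {
    "@type": "DataRecord",  # proposed Bioschemas type
    "name": "A2m-cardiomegaly annotation",
    "isPartOf": [
        "A2m gene data",
        "cardiomegaly gene associations",
        "annotations from PMID:12494268",
        "rat gene disease annotations",
        "RGD",
    ],
}


def index_by_dataset(records):
    """Map each dataset label to the names of the records that belong to it."""
    index = defaultdict(list)
    for rec in records:
        for dataset in rec["isPartOf"]:
            index[dataset].append(rec["name"])
    return dict(index)


index = index_by_dataset([record])
print(len(index))
```

Under this reading none of the five answers excludes the others; which groupings a resource actually publishes as datasets stays a decision for that resource.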

Subject: Re: DataRecord and Dataset Search

Just adding in Mary Shimoyama, PI of RGD, to this discussion.

> It sounds to me like you have covered it all. Maybe just some more information about how we link sdo:Dataset, bs:DataRecord and bs:BioChemEntity: sdo:Dataset sdo:hasPart bs:DataRecord (DataRecord actually extends Dataset), and then bs:DataRecord sdo:isPartOf sdo:Dataset. A bs:DataRecord has sdo:mainEntity bs:BioChemEntity, and then a bs:BioChemEntity is sdo:mainEntityOfPage of a bs:DataRecord.
> DataRecord includes two additional properties:
> * sdo:additionalProperty because we want everybody to be able to add
> no-named properties as needed
> * bs:seeAlso so there can be links to related data records in other datasets; this one is very important in Life Sciences.
> Note: I am using sdo for schema.org and bs for bioschemas, although bioschemas types along with their properties should go to schema.org at some point (hopefully soon).
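
The links described in this message can be sketched as three small JSON-LD-style nodes. The prefixes follow the note above (sdo for schema.org, bs for Bioschemas). Everything else is an assumption for illustration: the example.org identifiers, the `curationStatus` property name and its value, and the `links_are_mutual` helper.

```python
# All @id values are hypothetical placeholders.
dataset = {"@id": "http://example.org/dataset", "@type": "sdo:Dataset"}
record = {"@id": "http://example.org/record/1", "@type": "bs:DataRecord"}
entity = {"@id": "http://example.org/entity/1", "@type": "bs:BioChemEntity"}

# Dataset <-> DataRecord
dataset["sdo:hasPart"] = [{"@id": record["@id"]}]
record["sdo:isPartOf"] = {"@id": dataset["@id"]}

# DataRecord <-> BioChemEntity
record["sdo:mainEntity"] = {"@id": entity["@id"]}
entity["sdo:mainEntityOfPage"] = {"@id": record["@id"]}

# The two additional properties mentioned for DataRecord.
record["sdo:additionalProperty"] = [
    {"@type": "sdo:PropertyValue", "sdo:name": "curationStatus", "sdo:value": "reviewed"}
]
record["bs:seeAlso"] = [{"@id": "http://example.org/other-dataset/record/9"}]


def links_are_mutual(dataset, record, entity):
    """Check that each link has its inverse pointing back."""
    return (
        {"@id": record["@id"]} in dataset["sdo:hasPart"]
        and record["sdo:isPartOf"]["@id"] == dataset["@id"]
        and record["sdo:mainEntity"]["@id"] == entity["@id"]
        and entity["sdo:mainEntityOfPage"]["@id"] == record["@id"]
    )


print(links_are_mutual(dataset, record, entity))
```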
>> Hi Dan
>> In the life sciences datasets, the individual records tend to get
>> their own web page, i.e. each concept in the database would have its
>> own page. The idea for the DataRecord is to be able to declare that the
>> page about a concept is part of a Dataset.
>> I believe the approach is agnostic to the underlying storage, i.e.
>> the page could be generated from a relational database which pulls
>> data about the concept from multiple tables, a triplestore, or some
>> other form of database. It is more about the granularity of this
>> being about a single concept, e.g. row in a relational database with
>> its foreign keys.
>> Leyla, Rafa, Susanna, what do you think? Have I characterised this
>> correctly, or are there things in Dan’s email that I am missing?
>> Alasdair
>>>> Hi Dan,
>>>> Great to see the announcement this week about the Google Dataset
>>>> search. Here is a link to a blog post for anyone who has not seen
>>>> it yet
>>>> https://www.blog.google/products/search/making-it-easier-discover-datasets/
>>>> Within Bioschemas, we have been building up a profile usage of
>>>> DataCatalog containing Dataset(s) which themselves contain
>>>> DataRecords. A DataRecord is something that we would be proposing
>>>> as an addition to schema.org [1]. The idea is that a DataRecord is
>>>> contained within a Dataset and would specify the types of entity
>>>> that the record is about, e.g. Protein.
>>>> http://bioschemas.org/types/DataRecord/specification/
>>>> We would like to understand whether
>>>> DataRecord is an idea to which the schema.org [1] community would
>>>> be receptive. An alternative approach would be to use Dataset for
>>>> both records within a Dataset and the Dataset itself.
>>> It is certainly a direction worth exploring and discussing.
>>> One issue to think through (and I think I raised this at a
>>> bioschemas f2f last year) is that "Dataset" is a very broad notion.
>>> Some but not all datasets are tabular for example. And tabular (e.g.
>>> csv, sql) structures have non-trivial mappings to "entity"-oriented
>>> and "record"-oriented representations. Other formats will have
>>> different (and possibly simpler) ideas about "records". Thinking
>>> about tabular first, there are complex mapping languages like D2RQ
>>> or https://www.w3.org/TR/r2rml/ and the RDF graph it generates versus
>>> a rows-as-records view, how would your draft design deal with
>>> multi-table datasets?
>>> Nearby in this world are specs like W3C CSVW, Data Cube, ... lots of
>>> overlaps. It would be great to work through some examples in
>>> detail...
>>> Dan
>>>> Thanks
>>>> Alasdair
