Re: Worry to many Datasets => spam Was [Re: {Disarmed} Re: DataRecord and Dataset Search] from Rodrigo Lopez on 2018-09-28 (public-bioschemas@w3.org from September 2018)

From: Rodrigo Lopez <rls@ebi.ac.uk>
Date: Fri, 28 Sep 2018 21:58:40 +0100
To: public-bioschemas@w3.org
Message-ID: <6d423399-e247-af16-11e9-817e49990250@ebi.ac.uk>
Hi,

I agree with both and hope you will allow my humble point of view. For
me, the dataset as a unit is fine. But then, as in the case of most
biological databases, there are collections of datasets, and these need
to be defined & described within their own unique context and data
structures.

The purpose of the Google dataset IMHO (and I'm guessing here) is to be an
indexable unit for a predetermined purpose (e.g. SEO). Indexing the
whole of all structures in a collection of biological datasets is hard.
We, therefore, opted for an approach that made a minimal amount
of annotation compulsory in the indexable unit and ensured it
is possible to navigate from there to the mothership, where the whole
structure is available to the expert.
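
As a rough illustration of that idea (a minimal sketch only, not our
actual implementation), the indexable unit could carry a handful of
compulsory fields plus a link back to the full record. The function
name, the choice of fields and the example.org URLs are all assumptions
made up for this sketch; only the schema.org terms themselves (Dataset,
DataCatalog, name, description, url, includedInDataCatalog) are real.

```python
import json


def minimal_dataset_markup(name, description, landing_url,
                           catalog_name, catalog_url):
    """Build a minimal schema.org Dataset description as a plain dict.

    Hypothetical helper: which fields are compulsory is an assumption
    for this sketch, not something agreed on in this thread.
    """
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Link back to the "mothership": the page at the source
        # database where the whole structure is available.
        "url": landing_url,
        # The collection this dataset belongs to.
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
            "url": catalog_url,
        },
    }


if __name__ == "__main__":
    # Placeholder values; example.org is not a real resource.
    markup = minimal_dataset_markup(
        name="Example dataset",
        description="A short, human-readable summary of the dataset.",
        landing_url="https://example.org/dataset/1",
        catalog_name="Example catalog",
        catalog_url="https://example.org/",
    )
    print(json.dumps(markup, indent=2))
```

The point of keeping it this small is that the indexed record stays
cheap to produce and refresh, while everything else lives behind "url".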

To me, one of the challenges is keeping these indices up-to-date and
ensuring the datasets presented to the user are relevant. This is part of
a process for which we use different names in different collections of
datasets: annotation/curation/examination, and the many iterations a
single dataset may go through during its lifetime. Briefly: a dataset's
presentation that is not up-to-date is not optimal and search is always
optimised.

Kind regards,

R:)

On 28/09/2018 21:01, Karen Yook wrote:
> It seems that we really need to define what a Dataset is since I don't
> think we are talking about the same thing.
>
> To us, in its most basic form, a 'dataset' is a file of results from a
> single experiment. I can see this easily expanded to mean and contain
> a group of files that represent results from a number of similar
> studies, but beyond that, one has to be creative about its meaning to
> include the aggregation of results that the MODs do to create the page
> to which we are trying to apply these structured data tag sets. If we
> do that, we will at the very least be applying datasets to datasets
> that are made up of datasets, each potentially made with different
> elements.
>
> That said, I do appreciate trying to keep things simple, and not
> understanding the implications of going that route, I certainly
> wouldn't dismiss it.
>
> However, I suspect unwanted repercussions to google indexing if we go
> with the simplest, 'everything is a dataset' route right now.  I also
> want to avoid losing the opportunity to do this right from the
> beginning.  Or if we can't do it totally correct at this point, we
> should at least put something in place now so that it can be corrected
> later.
>
> So, to respond to EBI (yperez), I agree with you, there is a
> fundamental difference between a dataset and what we would call a
> datarecord (or other BioChemEntity), and thank you for bringing this up!
>
> Karen
>
> ~~~~
> Karen Yook
>
> Curator
> WormBase Caltech
> Tel: 415.306.4150
> e-mail: kyook@caltech.edu
> e-mail: karen@wormbase.org
> skype name: wbkaren
>
>
> On Fri, Sep 28, 2018 at 12:22 PM EBI <yperez@ebi.ac.uk> wrote:
>> Hi all:
>> I’m following this discussion but not participating directly. We implemented the omicsdi representation a long time ago.
>>
>> I have been reading this thread about data record and dataset. I may be wrong, but when someone looks for a dataset, they are looking for a collection of data (in this case biological data) and not individual entries from uniprot or ensembl. It can be counterproductive if we represent all this data in the same place where a dataset from omicsdi or an archive is represented. The main problem I see is, for example, that users can be looking for an actual dataset when the first 1000 entries are biological records.
>>
>> This is actually the main advantage of using schema.org over normal search in google (as far as I understand): you can “classify” the entry pages into some expert categories, which makes navigation easier.
>>
>> Regards
>> Sorry if I misunderstood the discussion.
>>
>>> On 28 Sep 2018, at 19:57, Karen Yook <karen@wormbase.org> wrote:
>>>
>>> Hi Jerven,
>>>
>>> Thank you Jerven for suggesting the
>>> "subtype to schema:StructuredValue e.g. bioschema:BioChemConcept"
>>>
>>> And raising the potential problem with sole use of 'Dataset' in the
>>> current proposed tag set. As you point out for UniProt,  that is one
>>> of the problems that would also affect Alliance pages. In addition,
>>> 'Dataset' is just not a good description of our pages, rather they are
>>> the living compilations of curation being created  from many
>>> 'datasets', which range from large scale datasets to single bioentity
>>> studies.
>>>
>>> Alasdair, if you need more specific examples of how 'Dataset' would be
>>> less than ideal for us, let me know.  However, for now, I am happy
>>> with what Jerven has proposed.  I will discuss this internally with
>>> the Alliance to see if there are more specific things we need to
>>> address.
>>>
>>> Best,
>>> Karen
>>>
>>>
>>>
>>>
>>> On Fri, Sep 28, 2018 at 1:37 AM Jerven Bolleman
>>> <jerven.bolleman@sib.swiss> wrote:
>>>> Hi Alasdair, All,
>>>>
>>>> Now that google dataset search exists I have a new worry of overusing
>>>> Dataset.
>>>>
>>>> Take www.uniprot.org as an example. It has a bit more than a billion
>>>> webpages. Marking them all up with Dataset for what was a DataRecord
>>>> before would mean we would have a bit over 3.5 billion Datasets.
>>>> Google has no problem with dealing with the volume, but I am worried
>>>> that their antispam logic/relevance would drown out the 7 or so Datasets
>>>> that I would like to see highly ranked in their toolbox search.
>>>>
>>>> Considering that most of this work is SEO related, I would vote to mark
>>>> up just 1 page with DataCatalog/Dataset on www.uniprot.org and not on
>>>> the other pages.
>>>>
>>>> A more specific concept would be quite nice. May I suggest using a
>>>> subtype to schema:StructuredValue e.g. bioschema:BioChemConcept.
>>>> For example the schema:mainEntity on
>>>> "https://wormbase.org/species/c_elegans/gene/WBGene00012939" would be of
>>>> type schema:StructuredValue.
>>>>
>>>> In (hand-typed) JSON-LD roughly this.
>>>>
>>>> {
>>>>    "@context" : "http://schema.org",
>>>>    "@id" : "https://wormbase.org/species/c_elegans/gene/WBGene00012939" ,
>>>>    "@type" : "WebPage" ,
>>>>    "identifier" : "WBGene00012939",
>>>>    "mainEntity" : {
>>>>         "@type" : "StructuredValue" ,
>>>>          "name"  : "subs-4" ,
>>>>          "hasPart" : {
>>>>                 "@type" : "PropertyValue" ,
>>>>                 "propertyID" : "Sequence",
>>>>                 "value" : "Y47D3B.1"
>>>>         }
>>>>     }
>>>> }
>>>>
>>>>
>>>> Regards,
>>>> Jerven
>>>>
>>>>
>>>>> On 09/28/2018 09:37 AM, Gray, Alasdair J G wrote:
>>>>> Hi Karen,
>>>>>
>>>>>> On 27 Sep 2018, at 22:38, Karen Yook <karen@wormbase.org
>>>>>> <mailto:karen@wormbase.org>> wrote:
>>>>>>
>>>>>> I just need to weigh in here as a voice in the Alliance of Genome
>>>>>> Resources before anything gets finalized wrt to DataRecord or DataSet.
>>>>>> While we are not tied to 'DataRecord' per se, we will need something
>>>>>> other than just 'DataSet' to tag our pages.
>>>>> Can you elaborate on what you mean by, “we will need something other
>>>>> than just ‘DataSet’ to tag our pages”?
>>>>>
>>>>>> We also believe specific distinctions via sub-types perhaps seems to
>>>>>> be the preferred way to do things by both schemas.org
>>>>>> <http://schemas.org/> and Google.  We
>>>>>> will try to come up with a more specific proposal by or at the
>>>>>> Biohackathon in Paris in a couple weeks.
>>>>> We would like to get these issues resolved before the hackathon so that
>>>>> we can have stable core profiles for use in marking up resources.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Alasdair
>>>>>
>>>>> --
>>>>> Alasdair J G Gray
>>>>> Associate Professor in Computer Science,
>>>>> School of Mathematical and Computer Sciences
>>>>> Heriot-Watt University, Edinburgh, UK.
>>>>>
>>>>> Email: A.J.G.Gray@hw.ac.uk <mailto:A.J.G.Gray@hw.ac.uk>
>>>>> Web: http://www.macs.hw.ac.uk/~ajg33
>>>>> ORCID: http://orcid.org/0000-0002-5711-4872
>>>>> Office: Earl Mountbatten Building 1.39
>>>>> Twitter: @gray_alasdair
>>>>>
>>>>>
>>>>
Received on Friday, 28 September 2018 20:59:04 UTC