Re: Worry to many Datasets => spam Was [Re: {Disarmed} Re: DataRecord and Dataset Search]

It seems that we really need to define what a Dataset is, since I don't
think we are talking about the same thing.

To us, in its most basic form, a 'dataset' is a file of results from a
single experiment. I can easily see this expanded to mean a group of
files that represent results from a number of similar studies. Beyond
that, though, one has to be creative about its meaning to include the
aggregation of results that the MODs do to create the pages to which
we are trying to apply these structured data tag sets. If we do that,
we will at the very least be applying 'Dataset' to datasets that are
made up of datasets, each potentially made with different elements.
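
To make the nesting concrete, here is a rough Python sketch of the
JSON-LD this would produce. It assumes schema.org's Dataset type with
its hasPart and variableMeasured properties. The helper functions, the
dataset names, and the variables are all invented for illustration and
are not taken from any real MOD page.

```python
import json

def make_dataset(name, variables=(), parts=()):
    """Build a schema.org Dataset node as a plain dict (hypothetical helper)."""
    node = {"@type": "Dataset", "name": name}
    if variables:
        node["variableMeasured"] = list(variables)
    if parts:
        node["hasPart"] = list(parts)
    return node

def nesting_depth(node):
    """Count the Dataset levels stacked under, and including, this node."""
    parts = node.get("hasPart", [])
    return 1 + max((nesting_depth(p) for p in parts), default=0)

# Hypothetical single-experiment result files, each measuring different things.
rnaseq = make_dataset("RNA-seq run (example)", variables=["expression level"])
rnai = make_dataset("RNAi screen (example)", variables=["phenotype"])

# A group of similar studies, still recognizably a dataset.
study_group = make_dataset("Grouped studies (example)", parts=[rnaseq, rnai])

# The curated page, if it too were tagged as a Dataset.
page = make_dataset("Curated gene page (example)", parts=[study_group])
page["@context"] = "http://schema.org"

print(json.dumps(page, indent=2))
print("Dataset levels:", nesting_depth(page))
```

In this toy case the page ends up three 'Dataset' levels deep, and the
two innermost datasets do not even share the same measured variables.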

That said, I do appreciate trying to keep things simple, and since I
do not fully understand the implications of going that route, I
certainly wouldn't dismiss it.

However, I suspect unwanted repercussions for Google indexing if we go
with the simplest, 'everything is a dataset' route right now. I also
want to avoid losing the opportunity to do this right from the
beginning. Or, if we can't do it totally correctly at this point, we
should at least put something in place now so that it can be corrected
later.

So, to respond to EBI (yperez), I agree with you: there is a
fundamental difference between a dataset and what we would call a
datarecord (or other BioChemEntity), and thank you for bringing this up!
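
As a rough illustration of that difference, here is a small Python
sketch of how a consumer that filters on @type could tell the two kinds
of page apart. It is only a sketch under assumptions: 'BioChemEntity'
is used as a placeholder type name from this thread, the dataset URL
and name are made up, and the helper functions are hypothetical. It
says nothing about how Google actually ranks or filters pages.

```python
DATASET_TYPES = {"Dataset", "DataCatalog"}

def types_of(node):
    """Return the @type values of a JSON-LD node as a set (string or list)."""
    value = node.get("@type", [])
    return {value} if isinstance(value, str) else set(value)

def describes_dataset(markup):
    """True if the page, or its mainEntity, is typed Dataset or DataCatalog."""
    main = markup.get("mainEntity", {})
    return bool((types_of(markup) | types_of(main)) & DATASET_TYPES)

# A record page: the gene page from this thread, with a placeholder entity type.
record_page = {
    "@context": "http://schema.org",
    "@type": "WebPage",
    "@id": "https://wormbase.org/species/c_elegans/gene/WBGene00012939",
    "mainEntity": {"@type": "BioChemEntity", "identifier": "WBGene00012939"},
}

# A dataset page: hypothetical URL and name, for illustration only.
dataset_page = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "@id": "https://example.org/datasets/example-experiment",
    "name": "Results file from a single experiment (example)",
}

for page in (record_page, dataset_page):
    kind = "dataset" if describes_dataset(page) else "record"
    print(page["@id"], "->", kind)
```

With distinct types, the record pages can be left out of a
dataset-only listing; if everything is typed 'Dataset', that
distinction is no longer available in the markup.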

Karen

~~~~
Karen Yook

Curator
WormBase Caltech
Tel: 415.306.4150
e-mail: kyook@caltech.edu
e-mail: karen@wormbase.org
skype name: wbkaren


On Fri, Sep 28, 2018 at 12:22 PM EBI <yperez@ebi.ac.uk> wrote:
>
> Hi all:
> I’m following this discussion but not participating directly. We implemented the omicsdi representation a long time ago.
>
> I have been reading this thread about data record and dataset. I may be wrong, but when someone looks for a dataset, they are looking for a collection of data (in this case biological data), not individual entries from UniProt or Ensembl. It can be counterproductive if we represent all this data in the same place where a dataset from omicsdi or an archive is represented. The main problem I see is, for example, that users can be looking for an actual dataset when the first 1000 entries are biological records.
>
> This is actually the main advantage of using schema.org over normal search in Google (as far as I understand): you can “classify” the entry pages into some expert categories, which makes navigation easier.
>
> Regards
> Sorry if I misunderstood the discussion.
>
> > On 28 Sep 2018, at 19:57, Karen Yook <karen@wormbase.org> wrote:
> >
> > Hi Jerven,
> >
> > Thank you Jerven for suggesting the
> > "subtype to schema:StructuredValue e.g. bioschema:BioChemConcept"
> >
> > And for raising the potential problem with the sole use of 'Dataset'
> > in the current proposed tag set. As you point out for UniProt, that
> > is one of the problems that would also affect Alliance pages. In
> > addition, 'Dataset' is just not a good description of our pages;
> > rather, they are living compilations of curation created from many
> > 'datasets', which range from large-scale datasets to single-bioentity
> > studies.
> >
> > Alasdair, if you need more specific examples of how 'Dataset' would be
> > less than ideal for us, let me know.  However, for now, I am happy
> > with what Jerven has proposed.  I will discuss this internally with
> > the Alliance to see if there are more specific things we need to
> > address.
> >
> > Best,
> > Karen
> >
> >
> >
> >
> > On Fri, Sep 28, 2018 at 1:37 AM Jerven Bolleman
> > <jerven.bolleman@sib.swiss> wrote:
> >>
> >> Hi Alasdair, All,
> >>
> >> Now that Google Dataset Search exists, I have a new worry about
> >> overusing Dataset.
> >>
> >> Take www.uniprot.org as an example. It has a bit more than a billion
> >> webpages. Marking them all up with Dataset for what was a DataRecord
> >> before would mean we would have a bit over 3.5 billion Datasets.
> >> Google has no problem with dealing with the volume, but I am worried
> >> that their antispam logic/relevance would drown out the 7 or so Datasets
> >> that I would like to see highly ranked in their toolbox search.
> >>
> >> Considering that most of this work is SEO related, I would vote to mark
> >> up just 1 page with DataCatalog/Dataset on www.uniprot.org and not on
> >> the other pages.
> >>
> >> A more specific concept would be quite nice. May I suggest using a
> >> subtype to schema:StructuredValue e.g. bioschema:BioChemConcept.
> >> For example the schema:mainEntity on
> >> "https://wormbase.org/species/c_elegans/gene/WBGene00012939" would be of
> >> type schema:StructuredValue.
> >>
> >> In (hand-typed) JSON-LD roughly this.
> >>
> >> {
> >>   "@context" : "http://schema.org",
> >>   "@id" : "https://wormbase.org/species/c_elegans/gene/WBGene00012939",
> >>   "@type" : "WebPage",
> >>   "identifier" : "WBGene00012939",
> >>   "mainEntity" : {
> >>     "@type" : "StructuredValue",
> >>     "name" : "subs-4",
> >>     "hasPart" : {
> >>       "@type" : "PropertyValue",
> >>       "propertyID" : "Sequence",
> >>       "value" : "Y47D3B.1"
> >>     }
> >>   }
> >> }
> >>
> >>
> >> Regards,
> >> Jerven
> >>
> >>
> >>> On 09/28/2018 09:37 AM, Gray, Alasdair J G wrote:
> >>> Hi Karen,
> >>>
> >>>> On 27 Sep 2018, at 22:38, Karen Yook <karen@wormbase.org
> >>>> <mailto:karen@wormbase.org>> wrote:
> >>>>
> >>>> I just need to weigh in here as a voice in the Alliance of Genome
> >>>> Resources before anything gets finalized with respect to DataRecord
> >>>> or DataSet. While we are not tied to 'DataRecord' per se, we will
> >>>> need something other than just 'DataSet' to tag our pages.
> >>>
> >>> Can you elaborate on what you mean by, “we will need something other
> >>> than just ‘DataSet’ to tag our pages”?
> >>>
> >>>>
> >>>> We also believe that specific distinctions via sub-types seem to
> >>>> be the approach preferred by both schemas.org <http://schemas.org/>
> >>>> and Google. We will try to come up with a more specific proposal
> >>>> by or at the Biohackathon in Paris in a couple of weeks.
> >>>
> >>> We would like to get these issues resolved before the hackathon so that
> >>> we can have stable core profiles for use in marking up resources.
> >>>
> >>> Thanks
> >>>
> >>> Alasdair
> >>>
> >>> --
> >>> Alasdair J G Gray
> >>> Associate Professor in Computer Science,
> >>> School of Mathematical and Computer Sciences
> >>> Heriot-Watt University, Edinburgh, UK.
> >>>
> >>> Email: A.J.G.Gray@hw.ac.uk <mailto:A.J.G.Gray@hw.ac.uk>
> >>> Web: http://www.macs.hw.ac.uk/~ajg33
> >>> ORCID: http://orcid.org/0000-0002-5711-4872
> >>> Office: Earl Mountbatten Building 1.39
> >>> Twitter: @gray_alasdair
> >>>
> >>>
> >>
> >>
> >
>

Received on Friday, 28 September 2018 20:01:49 UTC