hcls dataset description comments from Michael Miller on 2014-07-22 (public-semweb-lifesci@w3.org from July 2014)

From: Michael Miller <Michael.Miller@systemsbiology.org>
Date: Tue, 22 Jul 2014 15:36:26 -0700
To: w3c semweb hcls <public-semweb-lifesci@w3.org>
Message-ID: <85061d437adec22e5c924c8c9b748595@mail.gmail.com>
hi all,



tremendous work, very clear and well-written.  my group at ISB, the
Shmulevich lab, is looking to provide provenance for the analysis datasets
we are producing for TCGA.  we're not sure if we'll be able to 'go all the
way' but we want to make sure we have at hand all the information we need
so that we could, at least in theory, be compliant.  since i was reading
the document anyway, below are some notes.



general comments:

·         s4.4 'Dataset Linking': might mention also that datasets are
derived from other datasets?
'A dataset may incorporate, or link to, data in other datasets, e.g. in the
creation of a data mashup ' --> 'A dataset may incorporate, be derived
from, or link to, data in other datasets, e.g. in the analysis of original
datasets or in the creation of a data mashup '

·         the chembl example in s5 is not compliant with the property table
below.  it is probably only supposed to show the relationship of the three
terms, but that could be clarified

·         s6.2.12 could use the example filled in

·         6.3.2: not sure what an 'X level description' is

·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
individual organizations but three (8.4, 8.8, 8.9) have subsections for
different organizations.  maybe organize so all top level sections define a
type of organization with subsections beneath or make all top-level?

·         s8: many of the use cases could be more focused on how this note
will help them

·         s8.9: how do Data Catalogs fit into this note?  wasn't clear to
me how this note is relevant to them

·         would be nice to have a 'complete' example put together, maybe
based on chembl?



our use case questions:

·         how to reference 3rd party datasets that aren't described by this
standard, e.g. TCGA data from the DCC?  simply use 'pav:retrievedFrom' with
the IRI being the URL into the repository?

·         we have a lot of intermediary files that we won't publish.  the
software steps that create our published datasets from their sources form
a (branching) workflow, with the input to each step coming from the
previous step(s) in the workflow.  how best to represent this?  this note
doesn't seem to cover how a dataset is created, so any recommendations?
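
to make the two questions above concrete, here is a minimal sketch of one possible shape for such a description.  it is an assumption, not something the note specifies: it uses 'pav:retrievedFrom' for the third-party source and, as a swapped-in suggestion, the PROV-O terms prov:used, prov:wasGeneratedBy and prov:wasDerivedFrom for the workflow steps.  all IRIs under example.org, the step names, and the helper function 'workflow_to_turtle' are hypothetical.

```python
# Real vocabulary namespaces; everything under example.org below is hypothetical.
PREFIXES = {
    "pav": "http://purl.org/pav/",
    "prov": "http://www.w3.org/ns/prov#",
}


def workflow_to_turtle(retrieved, steps):
    """Render provenance statements as Turtle text.

    retrieved: dict mapping a local dataset IRI to the third-party URL
               it was downloaded from.
    steps:     list of (activity_iri, input_iris, output_iri) tuples; a step
               with several inputs is where branches of the workflow join.
    """
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    for local, source in retrieved.items():
        # Third-party data not described by the note: point at its URL.
        lines.append(f"<{local}> pav:retrievedFrom <{source}> .")
    for activity, inputs, output in steps:
        lines.append(f"<{activity}> a prov:Activity .")
        for used in inputs:
            lines.append(f"<{activity}> prov:used <{used}> .")
            lines.append(f"<{output}> prov:wasDerivedFrom <{used}> .")
        lines.append(f"<{output}> prov:wasGeneratedBy <{activity}> .")
    return "\n".join(lines)


BASE = "http://example.org/isb/"  # hypothetical base IRI
turtle = workflow_to_turtle(
    retrieved={BASE + "raw": "http://example.org/tcga-dcc/some-archive"},
    steps=[
        # unpublished intermediate dataset
        (BASE + "step/normalize", [BASE + "raw"], BASE + "normalized"),
        # branching: the final step reads both the intermediate and the raw data
        (BASE + "step/analyze",
         [BASE + "normalized", BASE + "raw"],
         BASE + "published"),
    ],
)
print(turtle)
```

in this sketch the intermediate dataset gets an IRI even though the file itself is never published, so the chain from the published dataset back to the source stays intact.  whether that is acceptable under the note is exactly the open question.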



text issues:

·         Figure 1: 'Overview of dataset description level metadata
profiles and their relationships': reference not resolved, image doesn't
show

·         Figure 2: 'Improve diagram. Multiple appearance of
concepts/description levels unclear.': reference not resolved, image
doesn't show.  add actual label



minor edits:

·         bottom of s.3: 'placeholde' should be 'placeholder'

·         use straight quotes rather than slant quotes in s6.2.2 example
(and elsewhere)?

·         the text runs out of the box in s6.2.3 'Description'

·         s6.2.3: 'Dates of Creation and Issuance': 'state the date the
dataset was generated using dct:created and/or the date the dataset was
made public using dct:created' should be 'state the date the dataset was
generated using dct:created and/or the date the dataset was made public
using dct:issued'?

·         there are two s6.2.3 sections

·         s6.2.4: 'Creation: ... The date of authorship' should be '...The
date of creation' and 'Curation:... The date of authorship' should be '...The
date of curation'?

·         s8.5: the author list has a closing parenthesis without an
opening parenthesis

·         s8.8.1: '... what period it is updated. To know when to...'
should be '...what period it is updated to know when to...'



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology
Received on Tuesday, 22 July 2014 22:36:51 UTC