Re: BP document - 2nd draft is ready to be published

Thanks - I'm working on it.

The version at now includes my SVG 
diagram. I'm currently in discussion with Robin Berjon on how to handle 
the interactivity better. I've used <embed /> which is old school, and 
I've had to add in some scripting to make clicking links in the embedded 
SVG interact with the parent HTML doc's DOM. Which seems rather 
over-complex but... it seems to work, unlike all the other methods I can 
find/have been advised to use. Grr... this has already swallowed *way* 
too much time...
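
For anyone following along, the kind of scripting described above might look roughly like the TypeScript sketch below. It is not the script used in the draft: the helper name wireSvgLinks, the small structural interfaces, and the choice to scroll the matching element of the parent document into view are all assumptions made for illustration, and it only handles fragment links of the form "#some-id".

```typescript
// Minimal structural types (assumptions) so the sketch type-checks
// without depending on a browser DOM library.
interface ClickEventLike {
  preventDefault(): void;
}
interface LinkLike {
  getAttribute(name: string): string | null;
  addEventListener(type: "click", handler: (ev: ClickEventLike) => void): void;
}
interface SvgDocLike {
  querySelectorAll(selector: string): ArrayLike<LinkLike>;
}
interface ParentDocLike {
  getElementById(id: string): { scrollIntoView(): void } | null;
}

// Hypothetical helper: for every fragment link inside the embedded SVG,
// forward a click to the element with the same id in the parent HTML
// document. Returns the number of links that were wired up.
function wireSvgLinks(svgDoc: SvgDocLike, parentDoc: ParentDocLike): number {
  const links = svgDoc.querySelectorAll("a");
  let wired = 0;
  for (let i = 0; i < links.length; i++) {
    const link = links[i];
    // Older SVG content carries xlink:href; newer content may use plain href.
    const href = link.getAttribute("href") || link.getAttribute("xlink:href");
    if (!href || href.charAt(0) !== "#") {
      continue; // leave external links alone
    }
    const targetId = href.slice(1);
    link.addEventListener("click", (ev) => {
      const target = parentDoc.getElementById(targetId);
      if (target) {
        ev.preventDefault(); // stop navigation inside the embed
        target.scrollIntoView();
      }
    });
    wired++;
  }
  return wired;
}
```

In a browser, the two arguments would be the SVG document loaded by the <embed /> element (same-origin only) and the parent page's own document, and the call would be made once the embedded SVG has finished loading.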


On 22/06/2015 14:46, Caroline Burle wrote:
> Thank you, Annette!
> @Phila, considering that, the second draft is ready to be published.
> Kind regards,
> Caroline
> On 19/06/15 20:46, Annette Greiner wrote:
>> Thanks, guys, it looks good for now.
>> cheers,
>> -Annette
>> --
>> Annette Greiner
>> NERSC Data and Analytics Services
>> Lawrence Berkeley National Laboratory
>> 510-495-2935
>> On Jun 19, 2015, at 12:10 PM, Caroline Burle wrote:
>>> Annette,
>>> we created Issues at comments tracker [1] and also at the Wiki [2].
>>> Using the Wiki's issues, we updated the document [3]. We kindly ask
>>> you to let us know if they meet your expectations.
>>> Thank you! Have a great weekend!
>>> Kind regards,
>>> Bernadette, Caroline and Newton
>>> [1]
>>> [2]
>>> [3]
>>> On 18/06/15 23:03, Annette Greiner wrote:
>>>> Hi folks,
>>>> I gave the BP doc a read-through today and noticed a few things that
>>>> I think we should address. Sorry this got rather long; it was really
>>>> short when I started it! I am finding the document is getting better
>>>> and better, and many of the new changes are things that make it
>>>> better still. Much of what follows is stuff I think we can address
>>>> in a later version. I’ll put an asterisk at the beginning of each
>>>> note that I think is important to consider before publishing the
>>>> next version.
>>>> -Annette
>>>> Provide metadata
>>>> How to Test: we say “access the same URL” to test machine
>>>> readability. I think it’s fine for the machine-readable metadata to
>>>> be separate, under a different URL.
>>>> Provide descriptive metadata
>>>> Possible approach to implementation: The list of metadata to be
>>>> included is not an implementation, so that should be moved up and
>>>> listed under intended outcome. Spatial coverage and temporal period
>>>> are irrelevant for lots of datasets, so they should be marked “if
>>>> relevant”. Keywords and themes/categories are dependent on the
>>>> context of a catalog, so I think we should leave them out of this list.
>>>> Data Quality
>>>> The introduction says that quality “can affect the potentiality of
>>>> the application that use data”. I don’t understand that phrase. The
>>>> text under Evidence should use the same format as in other BPs.
>>>> * Data Versioning
>>>> The chart describes time series data, not versions of data. I would
>>>> say that, if released independently, the items in yellow each
>>>> represent a different dataset (they report different data points),
>>>> not a different version. If you revised any of them, then the
>>>> original and the revision would be different versions. I think by
>>>> definition, versions attempt to report the same data.
>>>> Data Identification
>>>> The introductory text about URIs and URLs and IRIs is potentially
>>>> confusing and not necessary for our audience to understand the BPs
>>>> about identifiers. Also, URLs are for the internet, not just the
>>>> web. I also disagree with the representation of DOIs as something
>>>> that cannot be looked up, though the question is not something I
>>>> think we should make readers think about.
>>>> * I would like this section to limit itself to information that
>>>> applies to publishing *data*. The BP is about assigning persistent
>>>> identifiers to datasets, but the possible approach to implementation
>>>> is about much more than that. The list items are also not
>>>> consistent (one shows use of extensions, another says not to do
>>>> that). I worry that this will open up a holy war about how to
>>>> implement a REST API.
>>>> Use machine-readable standardized data formats
>>>> Possible approach to implementation: the first sentence is about
>>>> choosing a format that your users will be able to parse. The second
>>>> sentence mentions being nonproprietary. Proprietariness is a
>>>> different issue and separate BP. We should say “commonly used
>>>> standard formats include, but are not limited to, CSV, …”
>>>> Use standardized terms
>>>> should be “Standardized terms should be used to provide metadata
>>>> whenever they are available.” In scientific domains, often there are
>>>> no standard terms yet available.
>>>> I don’t think the other vocabulary-related BPs are in scope.
>>>> Vocabularies are not data.
>>>> Preserve people’s right to privacy
>>>> I agree with the concept completely, but we need to be careful about
>>>> how we state things. Can data about famous people be published
>>>> without their consent? What about public acts? I think the current
>>>> intended outcome is too restrictive. Maybe “Data that can identify a
>>>> private individual must not be published without their consent.”
>>>> Security can involve more than protecting privacy. I think we should
>>>> have a separate BP for security.
>>>> We have a BP saying to use REST for APIs, but we don’t have one
>>>> saying to make data available with an API. That strikes me as odd.
>>>> Provide real-time access
>>>> We say “where data is produced in real time” where I think we mean
>>>> “where data changes in real time.” It seems like real-time data is
>>>> always published by the producer. I can’t think of an example where
>>>> a publisher would be waiting for a producer to give them real-time
>>>> data.
>>>> Provide data up to date
>>>> * I think this needs editing. It’s difficult to understand the
>>>> actual requirement. At times it sounds like we are saying all data
>>>> should be published immediately, which is impractical for many
>>>> publishers. I think the goal should be to adhere to a published
>>>> schedule for updates.
>>>> Assess dataset coverage
>>>> I disagree that a chunk of web data is by definition dependent on
>>>> the rest of the global graph. I think a good dataset is sufficiently
>>>> self-explanatory that it can be used fully without requiring some
>>>> other pieces of the web to be present. One should avoid dependencies
>>>> on external resources that are not expected to persist.
>>>> Use a trusted serialization format for preserved data dumps
>>>> To the extent that this is in scope, it is covered under the BP
>>>> about using standardized formats. We could add a note to that
>>>> mentioning the value for preservation. I don’t think this needs to
>>>> be a separate BP.
>>>> Update the status of identifiers
>>>> To the extent that this is in scope, it should be covered under
>>>> versioning or unavailability. What are “preserved” datasets? Are
>>>> they available on the web? If not, it is out of scope. If they are,
>>>> then they are versions.
>>>> Feedback
>>>> We say “blogs and other publicly available feedback should be
>>>> displayed in a human-readable form through the user interface.” That
>>>> suggests that publishers should re-publish blog content, which is
>>>> probably not what we want (copyright issues, for one thing).
>>>> Publishers of data can’t control the format of other people’s
>>>> publications.
>>>> Gather feedback from data consumers
>>>> Possible approach to implementation: registration is not feedback. I
>>>> don’t think filling in a comment box is properly referred to as
>>>> “blogging”.
>>>> Data enrichment
>>>> * Enrichment yields derived data, not just metadata. For example,
>>>> you could take a dataset of scheduled and real bus arrival times and
>>>> enrich it by adding on-time arrival percentages. The percentages are
>>>> data, not metadata.
>>>> I don’t know the meaning of the word “topification”.
>>>> I would still like to see mention of the value of enabling users to
>>>> grab subsets of data.
>>>> --
>>>> Annette Greiner
>>>> NERSC Data and Analytics Services
>>>> Lawrence Berkeley National Laboratory
>>>> 510-495-2935


Phil Archer
W3C Data Activity Lead
+44 (0)7887 767755

Received on Monday, 22 June 2015 14:02:50 UTC