My review of the DWBP 21st Jan editor's draft

Hello everyone, here are the results of my BPs document walkthrough.
Sorry in advance for the huge email. There are quite a lot of comments, but
I think most of them are not blockers for a first draft publication. Maybe
the only part that may require some more attention IMO is the data
preservation section and its associated BPs.

Happy to further discuss anything that is not clear and also to help
edit and update whatever the group thinks should be incorporated
into the document.


INTRO

- I think the term "data re-users" may be more appropriate than "data
consumers" throughout the document, given that "re-users" are willing to
re-use data to do something (analysis, services, products, alternative
presentations, whatever) and consumers just consume data passively (using
products or services, reading analysis, etc.). IMO re-users are the real
beneficiaries of these BPs. Re-users (also sometimes called infomediaries)
are a sort of intermediary between publishers and consumers.

- "A basic knowledge of vocabularies and data models would be helpful to
better understand some aspects of this document." I would replace this with
"A basic knowledge of some specific technologies could also be helpful to
better understand certain implementation techniques of this document."

In general I would try to keep all literature technologically-neutral with
the exception of the content of the implementation sections.

CHALLENGES

- Some are not described in a technologically-neutral way and are thus
biased, e.g.

"How should URIs be designed and managed for persistence?" (should be "How
should IDs be designed and managed for persistence?")

"Data Vocabularies
How can existing vocabularies be used to provide semantic interoperability?
How can a new vocabulary be designed if needed?"

(should replace "data vocabularies" with "data models" everywhere)

LIFECYCLE

- The data creation phase is quite confusing, at least the name, because
data already exists somewhere in most cases. I think "data preparation"
or similar is more appropriate.

- I think that the refinement arrow should connect with the data
preparation (creation) phase, not with the publication one.

- I would think twice about including a "data archiving" phase, as it is
usually considered not a good practice to make things disappear from the
web (with some exceptions maybe)

TEMPLATE

- RFC keywords are currently used not only in the intended outcomes section,
as stated in the template description, but also in other places in the
document.

METADATA

- "In terms of metadata, the particular implementation method will depend
on the format of the dataset distribution, for example, metadata describing
a CSV file should be provided in a different way than for an RDF dataset."

After reading several times I still don't understand what we mean by this.
Maybe it is just me, but I think it is not really clear.

BP1

- The last point of implementation (license and rights) is already included
in the first one (DCAT).

- I don't know why we haven't also added JSON or XML as possible (and
frequent) implementations for this.
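
As a rough illustration of the JSON option mentioned above, here is a minimal
sketch in Python. The field names loosely follow DCAT / Dublin Core terms and
all values and URLs are made-up examples; none of this is taken from the
draft itself.

```python
import json

# Hypothetical dataset description. Keys loosely mirror DCAT / Dublin Core
# terms; the values and example.org URLs are invented for illustration.
dataset_metadata = {
    "title": "Bus stops",
    "description": "Location of bus stops in an example city",
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "format": "text/csv",
            "downloadURL": "http://example.org/bus-stops.csv",
        },
    ],
}


def to_json(metadata: dict) -> str:
    """Serialize a metadata record as a JSON document."""
    return json.dumps(metadata, indent=2, sort_keys=True)


print(to_json(dataset_metadata))
```

The same record could equally be serialized as XML; the point is only that
the license and rights information travels with the rest of the metadata.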

BP2

I have sent a separate email for this.

BP3

- We should keep using "terms" instead of "vocabularies" in the BP
description, as in the title, for consistency and technology neutrality.

- The description of the possible implementation is not about using standard
terms but about using self-descriptive formats, and that should be part of a
different BP - see also my separate email on BP2 and (deleted) BP4. We
should focus here on providing a list of well-known reference metadata
element sets that are widely used (e.g. dc, dcat, foaf...)

BP4

- I have already discussed this in a separate email.

BP5

- "Search tools must be able to discover datasets." I would say "user
agents" or "automated tools" or anything more generic than "search tools"

- What kind of access mechanism is "linked data platform"? What's the
difference with SPARQL endpoint?

BP5-BP6

- Is there any reason not to provide a more complete list of terms in
the implementation sections? (e.g. all those from DCAT)

BP7

- In how to test, using a formal specification (e.g. ISO) should also be a
valid option.

DATA IDENTIFICATION

- Remove "Just by adopting a common identification system we are making
possible basic identification and comparison processes by the different
stakeholders in a reliable way. These will be essential pre-conditions for
proper data management and to facilitate reuse." as it is duplicated in the
BP7 content.

BP7

- Remove all the IRI stuff from the why section to keep it technology
neutral.

- Remove IRIs from implementation, as I am really hesitant that we should be
recommending IRIs or mnemonics for IDs as a best practice; this needs more
discussion. Best practice is usually to keep IDs (and URIs) neutral
instead.

- Remove or complete "Apply the design rules" in the test, as it currently
means basically nothing.

- Missing link to "HTTP Status codes"
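
On the point above about keeping IDs neutral, one possible reading is that
identifiers carry no mnemonic or descriptive content at all. A minimal sketch
under that assumption follows; the base URI and the function name are
hypothetical and not part of the draft.

```python
import uuid

# Hypothetical base for identifiers; example.org is a placeholder.
BASE = "http://example.org/id/"


def mint_identifier() -> str:
    """Mint an opaque identifier that encodes nothing about the resource."""
    return BASE + str(uuid.uuid4())


print(mint_identifier())
```

Since nothing about the resource (name, owner, date) is encoded in the
identifier, it does not need to change when those things change.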

DATA FORMATS

- RDF and JSON examples should be removed from the introduction to keep
technology neutrality there.

BP8

- Remove the reference to proprietary or non-proprietary formats because (1)
it is not in the scope of this BP and (2) it is already covered by another BP.

BP9

- If we are going to include a BP on open standards I would also include
one on open licenses. Neither of those is required for having data on the
web, but both are good practices in order to increase audience, so they
deserve the same treatment.

- Include at least XML also in the list of open standards provided

BP10

- "Providing data in more than one format reduces costs incurred in data
transformation": we should clarify that this is for data re-users (it in
fact increases costs for data producers).

DATA VOCABULARIES

- Should be called data models or anything else more neutral (also for all
BP titles and descriptions in this section, possibly with the only
exception of implementation sections)
- Get rid of (or move to another more appropriate place) all the
introductory vocabularies, ontologies and SKOS stuff, as it is not
technology neutral at all

BP11

Same problems as for the analogous "document metadata" BP. The same
alternatives suggested there are also valid here.

BP13

The approach to implementation is really weak. It should suggest some
minimal versioning policy recommendations (I will be looking at that later).

BP15

The why section is not technologically neutral and needs to be rewritten.

HOW TO FIND VOCABULARIES

Has been integrated in BP15 and should be removed here

HOW TO CHOOSE VOCABULARIES

Should also be removed from here and integrated in BP15

DATA LICENSES

The intro is not technologically neutral; those references should be removed
and kept only as part of the BP17 possible implementation. Maybe
http://theodi.org/guides/publishers-guide-to-the-open-data-rights-statement-vocabulary
is more appropriate.

BP17

Looks like the ODI-LICENSING reference is not providing really useful
information here

DATA PROVENANCE

The provenance ontology reference should be removed from the intro as it is
an implementation-only question.

BP18

- "Data provenance is metadata that corresponds to data."  I don't really
understand this sentence.

- I can't understand the expected outcome either.

- All options in (3) at implementation are indeed machine-readable, not
only the first two.

DATA QUALITY

The ZAVERI reference for LOD techniques should be removed from the intro
for not being tech-neutral

BP19

Remove the reference to the data quality work from implementation, as it is
still work in progress (more appropriate as a note in the meantime).

SENSITIVE DATA

Reference to HTTPS should be removed from intro for being tech-specific.

BP20

The current test looks more like an implementation technique.

BP21

"From a consumer machine usage perspective, the Web HTML file could contain
Turtle or JSON-LD (for RDF) or it can be embdedded in the HTML page, again
as [JSON-LD], or [HTML-RDFA] or [Microdata]."

Don't really understand this: the web html file can also be embedded in the
HTML page?

DATA ACCESS

Too much content about the specific techniques in the intro IMO.

BP22

I don't think APIs/REST services could be suggested as a good *bulk*
download option

BP23

"Humans should be possible to access data using browser as a client." looks
like a quite strange desirable output, no? I wouldn't say that's a
desirable output by itself, more likely a side-effect.

BP24

It is somehow already contained in (or a specialization of) BP25.

BP25

- The BP should be more general, something like "Provide timely access to
data"
- "Update frequency" looks like a more appropriate term than "update cycle"

BP26

- "Good versioning helps them to determine when to update to a newer
version." I don't see how a versioning policy could help with this. Update
frequency from BP25 looks much more valuable for that.

- The track record of changes is the core of BP27 and should be removed from
here.

BP27

- I think that "Recommended" is not one of the RFCs, no?

- We could include for implementation a recommendation to include
references to other versions from each dataset (previous, first, last,
next, etc.)
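
To make the suggestion above about version references more concrete, here is
a minimal sketch in Python. It assumes the publisher keeps an ordered list of
version URIs (oldest first); the function name, the link names and the
example.org URIs are illustrative assumptions, not something the draft
defines.

```python
from typing import Dict, List, Optional


def version_links(versions: List[str], current: str) -> Dict[str, Optional[str]]:
    """Return first/previous/next/last references for `current`.

    `versions` is assumed to be ordered from oldest to newest.
    """
    i = versions.index(current)
    return {
        "first": versions[0],
        "previous": versions[i - 1] if i > 0 else None,
        "next": versions[i + 1] if i < len(versions) - 1 else None,
        "last": versions[-1],
    }


# Hypothetical version URIs.
versions = [
    "http://example.org/dataset/v1",
    "http://example.org/dataset/v2",
    "http://example.org/dataset/v3",
]
print(version_links(versions, "http://example.org/dataset/v2"))
```

Each dataset version would then publish these references as part of its own
metadata, so a re-user landing on any version can navigate to the others.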

BP28

- Shouldn't be a BP as is because it is technology-tied (API). Looks more
like a technique for BP26.

- Implementation should clarify that the difference between V1 and V2 should
be the data model or the functions or collections or similar, not the data
itself. In fact the same call for V1 and V2 should retrieve the same data
(although maybe in a different data model).
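
A minimal sketch of that last point, in Python: two hypothetical API versions
expose the same underlying records, and only the data model differs. The
record, field names and function names are invented for illustration.

```python
# Hypothetical underlying data shared by both API versions.
RECORDS = [
    {"id": 1, "name": "Central Station", "lat": 40.4, "lon": -3.7},
]


def get_stops_v1() -> list:
    """V1 data model: flat records with top-level lat/lon fields."""
    return [dict(r) for r in RECORDS]


def get_stops_v2() -> list:
    """V2 data model: coordinates grouped under 'location'; same data."""
    return [
        {
            "id": r["id"],
            "name": r["name"],
            "location": {"lat": r["lat"], "lon": r["lon"]},
        }
        for r in RECORDS
    ]


print(get_stops_v1())
print(get_stops_v2())
```

A client switching from V1 to V2 has to adapt to the new structure, but it
gets back exactly the same stops.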

DATA PRESERVATION

I feel quite uncomfortable with this section in general. I have some
problems trying to understand the underlying principles for these BPs, but
overall it looks to be about data archiving generally speaking instead
of about data persistence, which is indeed the best practice IMO and also
coherent with other BPs in the document (such as versioning). In fact data
archiving looks more like a bad practice to me than a best one.

BP29

I don't really understand the purpose of this BP

BP30

Same as for BP29, but even more confusing given the use of "coverage"
apparently with a different meaning from the one in DCAT, for example.

BP31

Why section should be tech-agnostic

BP32

As it currently stands, the BP is tech-dependent (only for URIs). It should
refer to IDs instead and mention specific tech only in implementation.

BP33

The reference to the data usage should not be part of the BP yet because it
is still work in progress. A note may be more appropriate at this stage.


GENERAL

- Several "how to test" sections are a little bit weak from my auditor
perspective (i.e. explained in a way that makes them difficult to test, or
not objective enough to ensure that two different tests by different people
will yield similar results), e.g. BP1; BP5; BP13; BP21; BP23; BP25; BP32;
BP33. In any case, that's something to review once the content of the BPs is
more stable.

- URIs/IRIs are used inconsistently throughout the document. I suggest
always using URIs for the sake of consistency and simplicity.


That's all folks!

Best,
 CI.
---

Carlos Iglesias.
Open Data Consultant.
+34 687 917 759
contact@carlosiglesias.es
@carlosiglesias
http://es.linkedin.com/in/carlosiglesiasmoro/en

Received on Wednesday, 21 January 2015 22:47:47 UTC