Re: My review of the DWBP 21st Jan editor's draft

Some comments and actions from me on this.

On 21/01/2015 22:47, Carlos Iglesias wrote:
> Hello everyone, here are the results of my BPs document walkthrough.
> Sorry for the huge email in advance. Quite a lot of comments but I think
> most of them are non-stoppers for a first draft publication. Maybe the only
> part that may require some more attention IMO could be the data
> preservation section and associated BPs
>
> Happy to further discuss anything that is not clear and also to help
> editing and updating whatever the group thinks should be incorporated
> into the document.
>
>
> INTRO
>
> - I think the term "data re-users" may be more appropriate than "data
> consumers" in all the document given that "re-users" are willing to re-use
> data to do something (analysis, services, products, alternative
> presentations, whatever) and consumers just consume data passively (using
> products or services, reading analysis, etc.). IMO re-users are the real
> beneficiaries of these BPs. Re-users (also sometimes called infomediaries)
> are a sort of intermediaries between publishers and consumers.

Personally I agree that re-users is better than consumers but the 
consensus in the WG so far has been consumer.

>
> - "A basic knowledge of vocabularies and data models would be helpful to
> better understand some aspects of this document." would replace this for "A
> basic knowledge of some specific technologies could also be helpful to
> better understand certain implementation techniques of this document."
>
> In general I would try to keep all literature technologically-neutral with
> the exception of the content of the implementation sections.

It depends on your POV. I like the term "data model", but Bernadette 
argues that it means something different in the context of an RDB. Even 
so, I think it's worth thinking about. Vocabularies typically have a 
UML-like diagram. The important thing about DCAT, for example, is the 
distinction between a dataset and a distribution. Whether you use DCAT 
or schema.org terms is less important IMO. But this is a WG discussion 
point.
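
To make the dataset/distribution distinction concrete, here is a minimal 
sketch in Python. It deliberately avoids any particular vocabulary: the 
class names, field names and example.org URLs are all hypothetical 
illustrations, not terms taken from DCAT or schema.org.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One concrete, downloadable form of a dataset (hypothetical model)."""
    download_url: str
    media_type: str


@dataclass
class Dataset:
    """The abstract collection of data, independent of any file format."""
    title: str
    distributions: List[Distribution] = field(default_factory=list)


# Hypothetical example: one dataset offered in two formats.
bus_stops = Dataset(
    title="Bus stops",
    distributions=[
        Distribution("http://example.org/bus-stops.csv", "text/csv"),
        Distribution("http://example.org/bus-stops.ttl", "text/turtle"),
    ],
)
```

The point is only that descriptive metadata such as the title hangs off 
the dataset, while format and location hang off each distribution.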

>
> CHALLENGES
>
> - Some are not described in a technologically-neutral way and thus biassed
> e.g.
>
> "How should URIs be designed and managed for persistence?" (should be "How
> should IDs be designed and managed for persistence?")

No. ID schemes other than URIs are not on the Web and therefore out of 
scope. Welcome to W3C.

>
> "Data Vocabularies
> How can existing vocabularies be used to provide semantic interoperability?
> How can a new vocabulary be designed if needed?"
>
> (should replace "data vocabularies" for "data models" everywhere)

Maybe. See above.

>
> LIFECYCLE
>
> - Data creation phase is quite confusing, at least the name, because data
> already exists somewhere in most of the cases. I think "data preparation"
> or similar is more appropriate.
>
> - I think that the refinement arrow should connect with the data
> preparation (creation) phase, not with the publication one.
>
> - I would think twice about including a "data archiving" phase, as it is
> usually considered not a good practice to make things disappear from the
> web (with some exceptions maybe)

This has been discussed (at TPAC and since IIRC). We talk only about the 
Web aspects, such as redirects, proper HTTP response codes etc. Actually 
I think we can sharpen up that advice perhaps with 303 and 410 codes 
etc. But that's a BP I'm currently looking at for a variety of reasons 
and will have a modified version to offer later.
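
As a rough illustration of the kind of advice I mean, here is a minimal 
sketch in Python. The rule it encodes (303 when there is somewhere 
useful to send the client, 410 otherwise) is my assumption for the sake 
of the example, not agreed WG text, and the function name and URL are 
hypothetical.

```python
from http import HTTPStatus
from typing import Dict, Optional, Tuple


def response_for_retired_dataset(
    successor_url: Optional[str] = None,
) -> Tuple[int, Dict[str, str]]:
    """Pick a status code and headers for a dataset URI no longer served directly.

    Assumed rule, for illustration only: redirect if a forwarding
    address is known, otherwise say the resource is gone.
    """
    if successor_url:
        # 303 See Other: send the client to a page or archived copy.
        return int(HTTPStatus.SEE_OTHER), {"Location": successor_url}
    # 410 Gone: the resource was deliberately removed, with no forwarding address.
    return int(HTTPStatus.GONE), {}
```

Either way the original URI keeps answering with something meaningful 
rather than a bare 404.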


>
> TEMPLATE
>
> - RFC keywords are currently not only used at the intended outcomes section
> as it is stated in the template description, but also other places in the
> document.

+1 This needs to be fixed - I'm on it.

>
> METADATA
>
> - "In terms of metadata, the particular implementation method will depend
> on the format of the dataset distribution, for example, metadata describing
> a CSV file should be provided in a different way than for an RDF dataset."
>
> After reading several times I still don't understand what we mean by this.
> Maybe it is just me, but I think it is not really clear.

Given the sentences preceding this one, I don't think it actually adds 
anything except, it seems, confusion. So I've simply removed it.

>
> BP1
>
> - Last point of implementation (license and rights) is already included in
> the first one (DCAT)

True. I have removed it.

>
> - I don't know why we haven't added also JSON or XML as possible (and
> frequent) implementations for this.

That would be useful. Care to suggest some text and links? (I'm trying 
to help the editors get this done in a bit of a hurry).

>
> BP2
>
> I have sent a separated mail for this.
>
> BP3
>
> - Should keep using terms instead of vocabularies in the BP description as in
> the title for consistency and technology neutrality.
>
> - Description of the possible implementation is not about using standard
> terms but about using self-descriptive formats and that should be part of a
> different BP - see also my separated email on BP2 and (deleted) BP4. We
> should focus here on providing a list of well-known reference metadata
> element sets that are widely used (i.e. dc; dcat; foaf...)

OK, again, suggested text would be helpful.

>
> BP4
>
> - Have already discussed this on a separated email
>
> BP5
>
> - "Search tools must be able to discover datasets." I would say "user
> agents" or "automated tools" or anything more generic than "search tools"

I agree, user agents it is.


>
> - What kind of access mechanism is "linked data platform"? What's the
> difference with SPARQL endpoint?

Reference to the LDP spec added. Perhaps therefore we should also add 
refs to SPARQL, SOAP, and, less easily, "REST interfaces."


>
> BP5-BP6
>
> - Is there any reason not to provide a more complete list of terms in
> the implementation sections? (e.g. all those from DCAT)

Brevity and readability. We have referred to DCAT a lot; what we're 
highlighting here is that it covers both discovery and admin aspects.


===
Out of time for this pass. I'll return to the e-mail and look at the 
rest when I can, hopefully later today.

Phil.


>
> BP7
>
> - In how to test, using a formal specification (e.g. ISO) should also be a
> valid option
>
> DATA IDENTIFICATION
>
> - Remove "Just by adopting a common identification system we are making
> possible basic identification and comparison processes by the different
> stakeholders in a reliable way. These will be essential pre-conditions for
> proper data management and to facilitate reuse." as it is duplicated in the
> BP7 content.
>
> BP7
>
> - Remove all the IRIs stuff from the why section to keep it technology
> neutral
>
> - Remove IRIs from implementation as I am really hesitant that we should be
> recommending IRIs or mnemonics for IDs as best practice; this needs more
> discussion. Best practice is usually to keep IDs (and URIs) neutral
> instead.
>
> - Remove or complete "Apply the design rules" from test, as it basically
> means nothing as it currently stands.
>
> - Missing link to "HTTP Status codes"
>
> DATA FORMATS
>
> - RDF and JSON examples should be removed from the introduction to keep
> technology neutrality there.
>
> BP8
>
> - Remove reference to proprietary or non-proprietary formats because (1) it
> is not the scope of this BP and (2) it is already covered by other BP
>
> BP9
>
> - If we are going to include a BP on open standards I would also include
> one on open licenses. Neither of those are required for having data on the
> web but both are good practices in order to increase audience, so deserve
> the same treatment
>
> - Include at least XML also in the list of open standards provided
>
> BP10
>
> - "Providing data in more than one format reduces costs incurred in data
> transformation" we should clarify this is for data re-users (increase costs
> in fact for data producers)
>
> DATA VOCABULARIES
>
> - Should be called data models or anything else more neutral (also for all
> BPs titles and descriptions in this section possibly with the only
> exception of implementation sections)
> - Get rid of (or move to a more appropriate place) all the
> introductory vocabularies, ontologies and SKOS stuff as it is not
> technology neutral at all
>
> BP11
>
> Same problems as for the analog "document metadata" BP. Same alternatives
> suggested are also valid here.
>
> BP13
>
> Implementation approach is really weak. We should suggest some minimal
> versioning policy recommendations (will be looking at that later)
>
> BP15
>
> Why section is not technologically neutral and needs to be rewritten
>
> HOW TO FIND VOCABULARIES
>
> Has been integrated in BP15 and should be removed here
>
> HOW TO CHOOSE VOCABULARIES
>
> Should also be removed from here and integrated in BP15
>
> DATA LICENSES
>
> Intro is not technologically neutral; those references should be removed and
> only be part of the BP17 possible implementation. Maybe
> http://theodi.org/guides/publishers-guide-to-the-open-data-rights-statement-vocabulary
> is more appropriate.
>
> BP17
>
> Looks like the ODI-LICENSING reference is not providing really useful
> information here
>
> DATA PROVENANCE
>
> The provenance ontology reference should be removed from the intro as it is
> an implementation-only question.
>
> BP18
>
> - "Data provenance is metadata that corresponds to data."  I don't really
> understand this sentence.
>
> - I also can't understand the expected outcome.
>
> - All options in (3) at implementation are indeed machine-readable, not
> only the two first.
>
> DATA QUALITY
>
> The ZAVERI reference for LOD techniques should be removed from the intro
> for not being tech-neutral
>
> BP19
>
> Remove reference to the data quality work from implementation as it is
> still work in progress (more appropriate as a note in the meanwhile)
>
> SENSITIVE DATA
>
> Reference to HTTPS should be removed from intro for being tech-specific.
>
> BP20
>
> Current test looks more like an implementation technique.
>
> BP21
>
> "From a consumer machine usage perspective, the Web HTML file could contain
> Turtle or JSON-LD (for RDF) or it can be embdedded in the HTML page, again
> as [JSON-LD], or [HTML-RDFA] or [Microdata]."
>
> Don't really understand this: the web html file can also be embedded in the
> HTML page?
>
> DATA ACCESS
>
> Too much content about the specific techniques in the intro IMO.
>
> BP22
>
> I don't think APIs/REST services could be suggested as a good *bulk*
> download option
>
> BP23
>
> "Humans should be possible to access data using browser as a client." looks
> like a quite strange desirable output, no? I wouldn't say that's a
> desirable output by itself, more likely a side-effect.
>
> BP24
>
> It is somehow already contained in (or is a specialization of) BP25
>
> BP25
>
> - The BP should be more general, something like "Provide timely access to
> data"
> - "Update frequency" looks like a more appropriate term than "update cycle"
>
> BP26
>
> - "Good versioning helps them to determine when to update to a newer
> version." I don't see how versioning policy could help on this. Update
> frequency from BP25 looks like much more valuable for that.
>
> - Track record of changes is the core of BP27 and should be removed from
> here.
>
> BP27
>
> - I think that "Recommended" is not one of the RFCs, no?
>
> - We could include for implementation a recommendation to include
> references to other versions from each dataset (previous, first, last,
> next, etc.)
>
> BP28
>
> - Shouldn't be a BP as is because it is technology-tied (API). Looks more like
> a technique for BP26
>
> - Implementation should clarify that difference between V1 and V2 should be
> the data model or the functions or collections or similar, not the data
> itself. In fact same call for V1 and V2 should retrieve the same data
> (although maybe in a different data model)
>
> DATA PRESERVATION
>
> I feel quite uncomfortable with this section in general. I have some
> problems trying to understand the underlying principles for these BPs, but
> overall it looks to be about data archiving generally speaking instead of
> about data persistence, which is indeed the best practice IMO and also
> coherent with other BPs in the document (such as versioning). In fact data
> archiving looks more like a bad practice to me than a best one.
>
> BP29
>
> I don't really understand the purpose of this BP
>
> BP30
>
> Same as for BP29, but even more confusing given the use of "coverage"
> apparently with a different meaning from the one in DCAT, for example.
>
> BP31
>
> Why section should be tech-agnostic
>
> BP32
>
> As it currently stands, the BP is tech-dependent (only for URIs). Should refer
> to IDs instead and mention specific tech only in implementation.
>
> BP33
>
> The reference to the data usage should not be part of the BP yet because it
> is work in progress. A note may be more appropriate at this stage.
>
>
> GENERAL
>
> - Several "how to test" sections are a little bit weak from my auditor
> perspective (i.e. explained in a way that is difficult to test or not
> objective enough to ensure two different tests by different people will
> yield similar results) e.g. BP1; BP5; BP13; BP21; BP23; BP25; BP32; BP33. In
> any case, that's something to review once the content of the BPs is more
> stable.
>
> - URIs/IRIs are used inconsistently around the document. Suggest always
> using URIs for the sake of consistency and simplicity.
>
>
> That's all folks!
>
> Best,
>   CI.
> ---
>
> Carlos Iglesias.
> Open Data Consultant.
> +34 687 917 759
> contact@carlosiglesias.es
> @carlosiglesias
> http://es.linkedin.com/in/carlosiglesiasmoro/en
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

Received on Thursday, 22 January 2015 14:45:12 UTC