RE: Use machine-readable standardized data formats / Use non-proprietary data formats

Dear all,

As previously commented, there are two main aspects to "web friendly data" (linked-data):

1. Identification
We should have a URI recommendation that takes full advantage of variants for at least format, language and version.

  http://example.com/foo        # resource
  http://example.com/foo.ur     # original data variant
  http://example.com/foo.csv    # CSV variant
  http://example.com/foo.pdf    # PDF variant
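
As a rough illustration of the variant scheme above, here is a minimal Python sketch that derives variant URIs from a resource URI by appending an extension. The function names and the VARIANTS table are hypothetical, invented for this example; only the three extensions come from the list above.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical variant table: extension -> description.
# Only these three extensions appear in the message; the table itself is an assumption.
VARIANTS = {
    "ur": "original data variant",
    "csv": "CSV variant",
    "pdf": "PDF variant",
}


def variant_uri(resource_uri: str, extension: str) -> str:
    """Return the URI of one variant by appending '.<extension>' to the path."""
    parts = urlsplit(resource_uri)
    path = parts.path.rstrip("/")
    return urlunsplit(parts._replace(path=f"{path}.{extension}"))


def all_variants(resource_uri: str) -> dict:
    """Map each known extension to the corresponding variant URI."""
    return {ext: variant_uri(resource_uri, ext) for ext in VARIANTS}


if __name__ == "__main__":
    for ext, uri in all_variants("http://example.com/foo").items():
        print(f"{uri}  # {VARIANTS[ext]}")
```

Language and version variants could be handled the same way with further suffixes, but how to order and name them is a design decision this sketch does not try to settle.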

 
2. Format
* Simplicity: we should recommend the *simplest* format for the task at hand; for example, CSV is simpler for tabular data than XML. The simpler the specification, the more likely it is to be used.

* Machine-readable: this is *essential* - we must recommend that at least one format is machine readable; for example, the same data could be available in CSV and XML.

* Other formats: if the data exists in its original format, we should recommend that this is also made available. The same applies to presentation: if the data exists in a good presentation format such as PDF, that should also be made available.

* Dirty data: if the data is available only in a very dirty and unfriendly format, make it available anyway: dirty data is better than no data.

* Combined: several aspects can be combined; for example, PDF is good for presentation and can also contain good machine-readable data in the form of XML.
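
To make the "same data in CSV and XML" point concrete, here is a small sketch using only the Python standard library. The sample records, the element names ("records", "record") and the function names are all invented for illustration, and the code assumes a non-empty list of rows that share the same keys.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample records, invented for illustration only.
ROWS = [
    {"id": "1", "title": "Act A", "year": "2014"},
    {"id": "2", "title": "Act B", "year": "2015"},
]


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Serialize the same rows as XML, one <record> element per row."""
    root = ET.Element("records")
    for row in rows:
        record = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(record, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_csv(ROWS))
    print(to_xml(ROWS))
```

Both outputs carry the same records; the CSV is shorter and simpler for tabular data, which is the argument for recommending it where the data really is tabular.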


"Structure the data, the how is secondary"
  http://dragoman.org/format

Regards
Tomas


________________________________________
From: Makx Dekkers [mail@makxdekkers.com]
Sent: 15 August 2015 10:11
To: 'DWBP WG'
Subject: RE: Use machine-readable standardized data formats / Use  non-proprietary  data formats

Just for context: my examples of online legislation come from work that I have been doing over the last year or so. Publication services around the world are moving from publishing PDFs on the Web to 'webby' publication with all the aspects that Erik lists.

It became clear to me that basically all the Best Practices of this group directly apply to that environment too. Persistent identifiers, URI templates, multiple formats (XML, HTML, PDF), metadata based on standard predicate vocabularies, common controlled value vocabularies, versioning, linking within and between acts and between national and supranational level, quality, timeliness, etc. etc., the whole lot.

They even have more issues that have to do with legacy data, something we don't cover, and I don't suggest we do: moving from legacy identifier systems to persistent URIs taking into account citation practices; converting legacy data formats and PDF to 'webby' formats; scanning and OCR'ing medieval acts.

Just look at legislation.gov.uk, and it's all there. I would even say that lots of what they do could be used as real examples of several of our best practices.

Makx.




> -----Original Message-----
> From: Erik Wilde [mailto:dret@berkeley.edu]
> Sent: 15 August 2015 03:48
> To: DWBP WG <public-dwbp-wg@w3.org>
> Cc: Laufer <laufer@globo.com>
> Subject: Re: Use machine-readable standardized data formats / Use non-
> proprietary data formats
>
> hello all.
>
> On 2015-08-14 20:50, Laufer wrote:
> > If we have BPs that orient publishers to provide metadata about
> > structure, license, etc., to provide version information, and a lot of
> > other BPs, why do we have to explain what data is ruled out or ruled in?
> > Why do we have to forbid some publishers from following our BPs?
>
> thanks for this, laufer, that was exactly what i was thinking when reading
> annette's email. what's the problem with legislation documents, if all the BP
> talks about is how to represent them well in a webby way as part of a
> legislative dataset? all the BP should talk about are the webby parts, so it can
> safely stay away from any issues that pertain to a specific aspect of the data
> that's not specifically about being webby.
>
> starting from https://github.com/dret/webdata, let's see how you could talk
> about webby legislative documents:
>
> 1: Linkable
>
> publish all your documents at stable URIs, so that they can be referenced. at
> the very least, give them unique and stable URIs, if you don't want to make
> them directly accessible.
>
> for legislation, fragments (any news about this from the group, btw?) would
> be essential, so that references can refer not just to documents, but to all
> relevant parts of them.
>
> but again, reference culture in legislation is complex and hard, but they
> should think about the things they want to reference (as resources and sub-
> resources), and make sure all of those get stable identifiers.
> the BP would simply tell them *to do it*, not *how to do it* for their
> particular scenario.
>
> 2: Parseable
>
> probably use XML, which is a good foundation for document-ish content. if
> you like SGML better, or whatever floats your boat, that's fine, too.
>
> 3: Understandable
>
> use or define a documented format for your legal documents. use whatever
> schema language makes you happy (DTD, XSD, RNG, ...), but define and
> document the schema so that people accessing your data know what the
> XML represents.
>
> 4: Linked
>
> when cross-referencing legislation (such as a law from a ruling), use the URI
> of the referenced resource so that references are established at the web
> level.
>
> 5: Usable
>
> label your document with a license, so that others know how they can use it.
> there are many licenses to choose from, and picking any one of those is
> better than not picking one at all.
>
> so, what's the difficulty in making these recommendations, and maybe
> adding more that's in the BP but not (currently) in web data? BP wouldn't
> have to go out on a limb and try to explain how to design an XML schema,
> that's a different issue. so why exclude this scenario?
>
> cheers,
>
> dret.
>
> --
> erik wilde | mailto:dret@berkeley.edu  -  tel:+1-510-2061079 |
>             | UC Berkeley  -  School of Information (ISchool) |
>             | http://dret.net/netdret http://twitter.com/dret |



Received on Saturday, 15 August 2015 09:14:29 UTC