Metadata for CSV on the Web

Dear WG,

My colleagues and I have taken a closer look at the Metadata Vocabulary for Tabular Data document (and closely related Working Drafts) to see how we could contribute to the work of your Working Group.

We have written a paper that contains thoughts on aspects of the "Schema language for CSV", that is, the metadata.
It has just been accepted to the World Wide Web Conference.

In essence, we observed that the current example for the metadata language (Example 2 in the Metadata document, http://www.w3.org/TR/2014/WD-tabular-metadata-20140710/) seems, in a sense, rather restricted when it comes to defining the structure of a CSV-like file. By this I mean that the metadata only seems to specify the format of rows, columns, and cells, whereas the Use Cases seem to require a bit more than that. We focused on further developing this aspect of the schema language and tried to write it down as clearly as we could (taking examples from the Use Cases, providing a formal semantics for them, and studying whether data can be validated efficiently). Our main message is that we think the metadata language / schema language can be made more expressive than it seems now while retaining user-friendliness and efficient validation.


The paper is available at http://arxiv.org/abs/1411.2351. The quickest way of getting an impression of what we're doing is probably Section 2.
That section contains schemas for Use Cases 2, 3, and 13 of the use case document. (Space for the PDB use case was getting tight.)
We also think that the core of the schema language may be a usable starting point for transformation languages (into RDF or XML); see Section 5. You probably don't need to pay much attention to the Appendix, which mainly just contains proofs.


We are aware that your current proposal is aiming towards a JSON syntax for specifying schemas, but I don't think that this is a big issue. Many of the core ideas that we write about are orthogonal to whether the schema is described in JSON or not. In the paper we aimed to be self-contained and understandable without requiring readers to have prior knowledge of formats like JSON. So we tried to make the syntax as simple as we could, in order to make the main ideas, such as the region selection expressions and content expressions, more transparent to readers. I have a feeling that, *if* one were interested in letting some of the ideas flow into the existing spec, it could be just a matter of defining a JSON syntax for them.


All the best,
Wim Martens

Received on Tuesday, 20 January 2015 08:55:19 UTC