Re: Spec review request: CSV on the Web from Gregg Kellogg on 2015-04-19 (public-csv-wg@w3.org from April 2015)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sun, 19 Apr 2015 10:19:52 -0700
To: ashok.malhotra@oracle.com
Cc: Jeni Tennison <jeni@jenitennison.com>, www-tag@w3.org, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-Id: <A1FC472D-89D4-4A56-ABC3-54278D4D1C0D@greggkellogg.net>
> On Apr 18, 2015, at 4:07 PM, ashok malhotra <ashok.malhotra@oracle.com> wrote:
> 
> Shouldn't it be possible to create Relational tables from tabular data?
> That is, after all, a popular use of tabular data.
> There are probably existing tools and standards to do this but I would
> think it was worth at least a mention.

Hi Ashok,

Note that the abstract tabular data model does include such relationships, although they are not used as part of the conversion process.

A Row [1] contains a “referenced rows” annotation:

[[[
• referenced rows — a possibly empty list of pairs of a foreign key and a row in a table within the same group of tables (which may be another row in the table in which this row appears).
]]]

From the perspective of the data model, are these not relational tables?

My implementation does not populate this annotation, but in principle it could. The main difference from the RDB2RDF serializations is that these relationships are used to output triples that make use of these relationships, while the CSVW conversions do not.

Getting conversions to make use of these relations would be a big change, as the abstract tabular data model is abstract, and not something which is necessarily constructed. A converter can process a row at a time over a potentially unbounded dataset; to make use of this would require creating an in-memory data model and a convention for URI template variables to access information through such relationships; now it is limited to cell values and specific metadata to act as variables.

For example, take the Foreign keys referencing candidate keys Example [2]: this generates the following triples:

@base <http://foo.example/DB/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<People/ID=7> rdf:type <People> .
<People/ID=7> <People#ID> 7 .
<People/ID=7> <People#fname> "Bob" .
<People/ID=7> <People#addr> 18 .
<People/ID=7> <People#ref-addr> <Addresses/ID=18> .
<People/ID=7> <People#deptName> "accounting" .
<People/ID=7> <People#deptCity> "Cambridge" .
<People/ID=7> <People#ref-deptName;deptCity> <Department/ID=23> .
<People/ID=8> rdf:type <People> .
<People/ID=8> <People#ID> 8 .
<People/ID=8> <People#fname> "Sue" .

<Addresses/ID=18> rdf:type <Addresses> .
<Addresses/ID=18> <Addresses#ID> 18 .
<Addresses/ID=18> <Addresses#city> "Cambridge" .
<Addresses/ID=18> <Addresses#state> "MA" .

<Department/ID=23> rdf:type <Department> .
<Department/ID=23> <Department#ID> 23 .
<Department/ID=23> <Department#name> "accounting" .
<Department/ID=23> <Department#city> "Cambridge" .
<Department/ID=23> <Department#manager> 8 .
<Department/ID=23> <Department#ref-manager> <People#ID=8> .

The <People#ref-deptName;deptCity> value references <Department/ID=23>.

As the conversions can act over cells for either actual or virtual columns, this relationship could only come from a virtual column, where the valueUrl would somehow expand to <Addresses/ID=18> using the ID column from the addresses table. This isn’t currently available as any kind of URI Template variable, and would need to come out of the referenced rows annotation. You could imagine something like the following:

“columns”: [
  …
  {
    "name": "ref",
    "virtual": true,
    "propertyUrl": "http://foo.example/DB/People#ref-deptName;deptCity",
    "valueUrl" : "http://foo.example/DB/People/ID=XX"
  }

The “XX” part of the valueUrl would need to use the foreignKey to find a row relationship and use that to find a cell for the column with name “ID” and use that cell value as the variable substitution (borrowing some Ruby syntax):

XX = cell.row.referencedRows(“deptName”,”deptCity”).row.cells.detect {|cell| cell.column.name == “ID”}.cellValue

We might imagine some way to describe these relationships, but they obviously come with quite a bit more complexity.

An alternative I suggested [3] was to create a transformation which takes the RDF result of the conversion, without such a relationship, and use some SPARQL to create these relationships. This seems like a better separation of concerns to me, as SPARQL is very well suited to traverse such relationships. This might form the basis of an example in the metadata document that describes both the relationship to the direct mapping and the use of SPARQL in a transformation to achieve these results.

Gregg

[1] http://www.w3.org/TR/tabular-data-model/#rows
[2] http://www.w3.org/TR/rdb-direct-mapping/#fk-candidate
[3] https://github.com/w3c/csvw/issues/503#issuecomment-94029765

> All the best, Ashok
> 
> On 4/18/2015 6:24 AM, Jeni Tennison wrote:
>> Hello TAG,
>> 
>> The CSV on the Web Working Group would like to request that the TAG review the following Working Drafts:
>> 
>>   Model for Tabular Data and Metadata on the Web -
>>     http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/
>>   Metadata Vocabulary for Tabular Data -
>>     http://www.w3.org/TR/2015/WD-tabular-metadata-20150416/
>>   Generating JSON from Tabular Data on the Web -
>>     http://www.w3.org/TR/2015/WD-csv2json-20150416/
>>   Generating RDF from Tabular Data on the Web -
>>     http://www.w3.org/TR/2015/WD-csv2rdf-20150416/
>> 
>> There are three things in particular that I’d like to draw the TAG’s attention to, where we have adopted a “pragmatic” rather than “correct” design:
>> 
>> 1. We have a facility to enable transformations over tabular data using templates or scripts [1], to provide for transformations beyond those we’ve defined for JSON and RDF. In doing this we need to be able to indicate the format of both the result of the transformation and the format of the template or script that is being used.
>> 
>> We think that the “correct” way of doing this would be to use media types. However, it’s quite rare for templating syntaxes (such as Mustache) to have a registered media type, so instead we have opted to use URLs to name those formats and encourage users to use URLs in the form http://www.iana.org/assignments/media-types/{mediatype} when there is a registered media type. Is this the right approach to take or should we be more insistent on the use of a media type?
>> 
>> 2. In the conversion to RDF, we want to use the ‘describes’ link relation defined in [2] to say that a particular row in the tabular data describes a particular thing (such as a person or event). Because this is RDF, the relationship has to have a URL.
>> 
>> However, as has been discussed elsewhere [3], IANA registered link relations do not have individual URLs and http://www.iana.org/assignments/link-relations/describes doesn’t resolve. Similarly, the link relation wiki doesn’t have individual URLs for link relations. We decided to create a URL for this relationship in our own namespace, with a reference to the proper definition (see discussion at [4]), but hope that this case might prompt the TAG to try to get some movement on this issue.
>> 
>> 3. The model of access that we’re assuming for CSV and other tabular data files is that someone will link directly to the CSV file (as currently) and that processors will need to retrieve a metadata file about that CSV based on the location of the CSV file. Note that metadata files are file-specific; we wouldn’t expect a single metadata file that includes information about every CSV file on a particular site.
>> 
>> We think that the “correct” way of getting this pointer to a metadata file (given that there is no scope for embedding information within the CSV file itself) is to use a Link header that points to the metadata file, and we have specified that here [5].
>> 
>> However, we recognise that there are many publishing environments in which it is impossible for users to set HTTP headers, particularly on an individual file basis. We have therefore specified two other mechanisms to retrieve metadata files, used only if the URL of the original CSV file doesn’t include a query string:
>> 
>>   * appending ‘-metadata.json’ to the end of the URL to get file-specific metadata [6]
>>   * resolving the URL ‘../metadata.json’ against the URL to get directory-level metadata [7]
>> 
>> Neither of these feels great: they require users who can’t use Link headers to structure their URL space in particular ways, and they use string concatenation on URLs which is horrible. However, we can’t see any better alternative to meet our requirement for what is in effect a file-specific well known URI.
>> 
>> We’d obviously welcome wider review of the documents if you have time, but these are the three issues on which we’d particularly like your opinion.
>> 
>> Thanks,
>> 
>> Jeni
>> 
>> [1] http://www.w3.org/TR/2015/WD-tabular-metadata-20150416/#transformation-definitions
>> [2] http://tools.ietf.org/html/rfc6892
>> [3] https://github.com/mnot/I-D/issues/39
>> [4] https://github.com/w3c/csvw/issues/297
>> [5] http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#link-header
>> [6] http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#standard-file-metadata
>> [7] http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#standard-directory-metadata
>> --
>> Jeni Tennison
>> http://www.jenitennison.com/
>> 
>
Received on Sunday, 19 April 2015 17:20:22 UTC