Re: Merging embedded metadata with defined metadata from Gregg Kellogg on 2015-01-11 (public-csv-wg@w3.org from January 2015)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sat, 10 Jan 2015 23:29:05 -0800
To: Ivan Herman <ivan@w3.org>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <C22D55C1-C0D3-429F-A0E4-92D92B3DBC35@greggkellogg.net>
On Jan 10, 2015, at 10:02 PM, Ivan Herman <ivan@w3.org> wrote:
> 
> At first glance something like that may work. Tiny issues/questions:
> 
> - when used as 'name', shouldn't the 'title' value be normalized somehow? Or should we say we normalize only when used as part of a URI? (I wonder what happens if those names appear in a URI template: do those templates allow for unnormalized values?)

Perhaps, but I do it when creating predicateUrl. It could be normalized when creating name using URI escaping; however, a derived name really can't be used in a URI template, since a template implies metadata, which would require an explicit "name" property, so it's something of a moot point.

> - what would be the relative priority of embedded metadata vs. user metadata? My feeling is that embedded is the first to consider then, and user metadata should be merged to it first, then the others.

Yes, I retrieve/create user-specified metadata, merge embedded metadata into that, and then merge found metadata, although I stop at the first found metadata I encounter; this should be up for discussion when talking about "import". I give user metadata a higher priority, as defined in the syntax document [1].
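
The priority order described above can be sketched roughly as follows. This is only a minimal Python illustration, not the rdf-tabular code: merge_metadata is a hypothetical helper, the three input dictionaries are made-up examples, and only flat top-level properties are handled, whereas the import/merge rules in the metadata document also cover columns, arrays and so on.

```python
def merge_metadata(*sources):
    """Merge metadata dicts given in priority order, highest first.

    A property set by an earlier (higher-priority) source is kept;
    later sources only fill in properties that are still missing.
    """
    merged = {}
    for source in sources:
        for key, value in source.items():
            merged.setdefault(key, value)
    return merged


# Hypothetical inputs: user-specified, then embedded, then found metadata.
user = {"dc:title": "My Trees"}
embedded = {"@id": "tree-ops.csv", "schema": {"columns": [{"title": "GID"}]}}
found = {"dc:title": "Tree Operations", "dc:modified": "2010-12-31"}

result = merge_metadata(user, embedded, found)
```

Here the user-supplied "dc:title" wins over the one in the found metadata, while "dc:modified" is still picked up from the found metadata because nothing with higher priority defines it.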

The main issue I have in reproducing the expected output for the tree-ops example in CSV2RDF is not knowing that dc:license should be a URI rather than a literal.

It turned out that the simplest way for me to turn common properties (and titles) into RDF is to use the JSON-LD toRdf API method, after creating a node with an @id and the property referencing the value to be output: [2] and [3].

Gregg
[1] http://w3c.github.io/csvw/syntax/#locating-metadata
[2] https://github.com/gkellogg/rdf-tabular/blob/develop/lib/rdf/tabular/metadata.rb#L543
[3] https://github.com/gkellogg/rdf-tabular/blob/develop/lib/rdf/tabular/metadata.rb#L566


> Ivan
> 
>> On 11 Jan 2015, at 02:05 , Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> 
>> I have been trying to reconcile how embedded metadata is merged with other metadata, and believe I've found a solution that works (at least for my interpretation). The problem is that, as currently defined in CSV2JSON, embedded metadata ends up creating a Table/Schema with columns based on headers from the CSV, which define "name", "table", and "predicateUrl". Instead, I think it is better to relax the requirement for "name" in Column metadata, and infer this from "title", if not otherwise defined.
>> 
>> The mechanism I use for creating Table metadata from a CSV is to create a Column entry with just "title" from the header; this is consistent with Jeni's examples in the syntax document. Presently, these columns would be invalid, as "name" is a required value. However, if this is relaxed, then "name" can be derived from "title", as can "predicateUrl". When merging (as defined in import metadata), columns match if they have the same name _or_ they have a common title; this allows creating metadata from the CSV and then merging with found metadata, say from foo.csv-metadata.json, and causes the columns to properly reconcile.
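
The matching rule in the paragraph above (same name, or a common title) might look something like this in Python. It is a sketch under assumptions: title_values and columns_match are hypothetical names, only an asserted "name" is compared, and language tags in a title map are ignored when looking for a common title.

```python
def title_values(column):
    """Return the set of title strings for a column.

    Accepts a plain string, a list of strings, or a language map
    such as {"und": ["GID", "Generic Identifier"]}.
    """
    title = column.get("title")
    if title is None:
        return set()
    if isinstance(title, str):
        return {title}
    if isinstance(title, dict):
        values = set()
        for entry in title.values():
            values.update([entry] if isinstance(entry, str) else entry)
        return values
    return set(title)


def columns_match(a, b):
    """Columns match if they share an asserted name or any title."""
    same_name = "name" in a and "name" in b and a["name"] == b["name"]
    return same_name or bool(title_values(a) & title_values(b))


# Column created from the CSV header vs. one from a metadata file.
from_header = {"title": "On Street"}
from_file = {"name": "on-street", "title": {"und": "On Street"}}
matched = columns_match(from_header, from_file)
```

With these inputs the two columns reconcile through the shared title "On Street", even though only one of them asserts a "name".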
>> 
>> When accessing the "name" of a Column, take the asserted value of "name", if available, otherwise, take the first value from "title", if available, otherwise, name is "_row=n" where n is the row number. (This comes into play when headerRowCount is zero).
>> 
>> Similarly, when accessing predicateUrl, take the asserted value, otherwise set to the table location (expanded @id of Table) using the URI-encoded value of "name" as a fragment identifier.
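
The two accessor rules above could be sketched as below. This is an illustrative Python version, not the actual implementation: effective_name and effective_predicate_url are hypothetical names, the position n is assumed to be 1-based, and the table @id is used as given rather than expanded. The message spells the fallback both as "_row=n" and as "_row_n"; the sketch arbitrarily uses the first spelling.

```python
from urllib.parse import quote


def effective_name(column, index):
    """Asserted "name", else the first "title", else a positional default."""
    if "name" in column:
        return column["name"]
    title = column.get("title")
    if isinstance(title, dict):
        # Language map: take the value of the first entry.
        title = next(iter(title.values()), None)
    if isinstance(title, list):
        title = title[0] if title else None
    if title:
        return title
    # Assumed spelling of the default; index is assumed to be 1-based.
    return "_row={}".format(index)


def effective_predicate_url(column, index, table_id):
    """Asserted "predicateUrl", else table id plus the escaped name as fragment."""
    if "predicateUrl" in column:
        return column["predicateUrl"]
    return table_id + "#" + quote(effective_name(column, index), safe="")


column = {"title": "On Street"}
name = effective_name(column, 2)
predicate = effective_predicate_url(column, 2, "tree-ops.csv")
```

For a column carrying only the title "On Street", this yields the name "On Street" and the predicate "tree-ops.csv#On%20Street".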
>> 
>> Looking at the example in 5.1.1.2 of the syntax doc [1], the embedded metadata would look like the following:
>> 
>> {
>> "@id": "tree-ops.csv",
>> "@context": "http://www.w3.org/ns/csvw",
>> "schema": {
>>   "columns": [
>>     {"title": "GID"},
>>     {"title": "On Street"},
>>     {"title": "Species"},
>>     {"title": "Trim Cycle"},
>>     {"title": "Inventory Date"}
>>   ]
>> }
>> }
>> 
>> If this were processed without any external metadata, it would be effectively expanded to the following:
>> 
>> {
>> "@id": "tree-ops.csv",
>> "@context": "http://www.w3.org/ns/csvw",
>> "schema": {
>>   "columns": [{
>>     "name": "GID",
>>     "title": "GID",
>>     "predicateUrl": "tree-ops.csv#GID"
>>   }, {
>>     "name": "On Street",
>>     "title": "On Street",
>>     "predicateUrl": "tree-ops.csv#On%20Street"
>>   }, {
>>     "name": "Species",
>>     "title": "Species",
>>     "predicateUrl": "tree-ops.csv#Species"
>>   }, {
>>     "name": "Trim Cycle",
>>     "title": "Trim Cycle",
>>     "predicateUrl": "tree-ops.csv#Trim%20Cycle"
>>   }, {
>>     "name": "Inventory Date",
>>     "title": "Inventory Date",
>>     "predicateUrl": "tree-ops.csv#Inventory%20Date"
>>   }]
>> }
>> }
>> 
>> However, if merged with tree-ops.csv-metadata.json, it would properly merge with that metadata creating something like the following:
>> 
>> {
>> "@id": "tree-ops.csv",
>> "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
>> "dc:title": "Tree Operations",
>> "dc:keywords": ["tree", "street", "maintenance"],
>> "dc:publisher": [{
>>   "sch:name": "Example Municipality",
>>   "sch:web": "http://example.org"
>> }],
>> "dc:license": "http://opendefinition.org/licenses/cc-by/",
>> "dc:modified": "2010-12-31",
>> "schema": {
>>   "columns": [{
>>     "name": "GID",
>>     "title": {"und": ["GID", "Generic Identifier"]},
>>     "dc:description": "An identifier for the operation on a tree.",
>>     "datatype": "string",
>>     "required": true
>>   }, {
>>     "name": "on-street",
>>     "title": {"und": "On Street"},
>>     "dc:description": "The street that the tree is on.",
>>     "datatype": "string"
>>   }, {
>>     "name": "species",
>>     "title": {"und": "Species"},
>>     "dc:description": "The species of the tree.",
>>     "datatype": "string"
>>   }, {
>>     "name": "trim-cycle",
>>     "title": {"und": "Trim Cycle"},
>>     "dc:description": "The operation performed on the tree.",
>>     "datatype": "string"
>>   }, {
>>     "name": "inventory-date",
>>     "title": {"und": "Inventory Date"},
>>     "dc:description": "The date of the operation that was performed.",
>>     "datatype": "date",
>>     "format": "M/D/YYYY"
>>   }],
>>   "primaryKey": "GID"
>> }
>> }
>> 
>> The advantage of this approach is that embedded metadata can be used to create a regular Metadata object and take advantage of the import (merge) semantics defined in [2].
>> 
>> An added step could also eliminate the need for "Mapping Core Tabular Data" by simply setting headerRowCount to 0. In this case, the effective embedded metadata is simply the following:
>> 
>> {
>> "@id": "tree-ops.csv",
>> "@context": "http://www.w3.org/ns/csvw",
>> "schema": {}
>> }
>> 
>> When processing the first row, an empty Column entry can be created for each column encountered:
>> 
>> {
>> "@id": "tree-ops.csv",
>> "@context": "http://www.w3.org/ns/csvw",
>> "schema": {
>>   "columns": [{}, {}, {}, {}, {}]
>> }
>> }
>> 
>> Then, the defaults for "name" and "predicateUrl" simply use "_row_n". This essentially reproduces the behavior in Mapping Core Tabular Data [3] without requiring a separate algorithm.
>> 
>> Gregg Kellogg
>> gregg@greggkellogg.net
>> 
>> [1] http://w3c.github.io/csvw/syntax/#using-a-metadata-file
>> [2] http://w3c.github.io/csvw/metadata/#importing-metadata
>> [3] http://w3c.github.io/csvw/csv2rdf/#map-core-tab
>> 
>> 
>> 
> 
> 
> ----
> Ivan Herman, W3C
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: http://orcid.org/0000-0003-0782-2704
> 
> 
> 
>
Received on Sunday, 11 January 2015 07:29:35 UTC