Merging embedded metadata with defined metadata from Gregg Kellogg on 2015-01-11 (public-csv-wg@w3.org from January 2015)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sat, 10 Jan 2015 17:05:21 -0800
To: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <D947F879-6900-4F14-88D3-23B9B2C766A2@greggkellogg.net>
I have been trying to reconcile how embedded metadata is merged with other metadata, and believe I've found a solution that works (at least for my interpretation). The problem is that, as currently defined in CSV2JSON, embedded metadata ends up creating a Table/Schema with columns based on headers from the CSV, which define "name", "title", and "predicateUrl". Instead, I think it better to relax the requirement for "name" in Column metadata and infer it from "title" if not otherwise defined.

The mechanism I use for creating Table metadata from a CSV is to create a Column entry with just "title" from the header; this is consistent with Jeni's examples in the syntax document. Presently, these columns would be invalid, as "name" is a required value. However, if this is relaxed, then "name" can be derived from "title", as can "predicateUrl". When merging (as defined in import metadata), columns match if they have the same name _or_ they have a common title; this allows creating metadata from the CSV and then merging it with found metadata, say from foo.csv-metadata.json, and causes the columns to properly reconcile.
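
As a rough illustration of the matching rule (same name, or a common title), here is a minimal Python sketch. The function names title_values and columns_match are hypothetical, and the sketch assumes "title" may be a string, a list of strings, or a language map like the ones in the examples in this message.

```python
def title_values(column):
    """Collect every title string of a column, whatever shape "title" takes."""
    title = column.get("title")
    if title is None:
        return set()
    if isinstance(title, str):
        return {title}
    if isinstance(title, list):
        return set(title)
    # Assumed language map, e.g. {"und": ["GID", "Generic Identifier"]}
    values = set()
    for entry in title.values():
        values.update([entry] if isinstance(entry, str) else entry)
    return values


def columns_match(a, b):
    """True if both columns assert the same name, or share at least one title."""
    name_a, name_b = a.get("name"), b.get("name")
    if name_a is not None and name_a == name_b:
        return True
    return bool(title_values(a) & title_values(b))


embedded = {"title": "On Street"}
defined = {"name": "on-street", "title": {"und": "On Street"}}
print(columns_match(embedded, defined))  # True
```

Here the embedded column has no asserted name, so the match comes from the shared title "On Street".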

When accessing the "name" of a Column, take the asserted value of "name" if available; otherwise take the first value from "title" if available; otherwise the name is "_row=n", where n is the row number. (This comes into play when headerRowCount is zero.)
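
The fallback chain for "name" could be sketched in Python as follows. The helper names first_title and effective_name are hypothetical, the "_row=n" form is copied as written in this paragraph, and taking the first entry of a language map as the "first value" of "title" is an assumption of the sketch.

```python
def first_title(column):
    """Return the first title value of a column, or None if it has no title."""
    title = column.get("title")
    if isinstance(title, dict):
        # Assumption: for a language map, use the first language's entry.
        title = next(iter(title.values()), None)
    if isinstance(title, list):
        title = title[0] if title else None
    return title


def effective_name(column, n):
    """Asserted "name", else first "title", else the "_row=n" default."""
    if "name" in column:
        return column["name"]
    title = first_title(column)
    if title is not None:
        return title
    return f"_row={n}"


print(effective_name({"title": "On Street"}, 2))  # On Street
print(effective_name({}, 3))                      # _row=3
```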

Similarly, when accessing "predicateUrl", take the asserted value; otherwise set it to the table location (expanded @id of Table), using the URI-encoded value of "name" as a fragment identifier.
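
A minimal Python sketch of this default, assuming the table location and the column's effective name have already been worked out and are passed in. The function name effective_predicate_url is hypothetical, and percent-encoding via urllib.parse.quote is one possible reading of "URI-encoded".

```python
from urllib.parse import quote


def effective_predicate_url(column, table_id, name):
    """Asserted "predicateUrl", else table location plus encoded name fragment."""
    if "predicateUrl" in column:
        return column["predicateUrl"]
    # safe="" so that every reserved character in the name is percent-encoded.
    return f"{table_id}#{quote(name, safe='')}"


print(effective_predicate_url({"title": "On Street"}, "tree-ops.csv", "On Street"))
# tree-ops.csv#On%20Street
```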

Looking at the example in 5.1.1.2 of the syntax doc [1], the embedded metadata would look like the following:

{
  "@id": "tree-ops.csv",
  "@context": "http://www.w3.org/ns/csvw",
  "schema": {
    "columns": [
      {"title": "GID"},
      {"title": "On Street"},
      {"title": "Species"},
      {"title": "Trim Cycle"},
      {"title": "Inventory Date"}
    ]
  }
}

If this were processed without any external metadata, it would be effectively expanded to the following:

{
  "@id": "tree-ops.csv",
  "@context": "http://www.w3.org/ns/csvw",
  "schema": {
    "columns": [{
      "name": "GID",
      "title": "GID",
      "predicateUrl": "tree-ops.csv#GID"
    }, {
      "name": "On Street",
      "title": "On Street",
      "predicateUrl": "tree-ops.csv#On%20Street"
    }, {
      "name": "Species",
      "title": "Species",
      "predicateUrl": "tree-ops.csv#Species"
    }, {
      "name": "Trim Cycle",
      "title": "Trim Cycle",
      "predicateUrl": "tree-ops.csv#Trim%20Cycle"
    }, {
      "name": "Inventory Date",
      "title": "Inventory Date",
      "predicateUrl": "tree-ops.csv#Inventory%20Date"
    }]
  }
}

However, if merged with tree-ops.csv-metadata.json, it would properly merge with that metadata, creating something like the following:

{
  "@id": "tree-ops.csv",
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "dc:title": "Tree Operations",
  "dc:keywords": ["tree", "street", "maintenance"],
  "dc:publisher": [{
    "sch:name": "Example Municipality",
    "sch:web": "http://example.org"
  }],
  "dc:license": "http://opendefinition.org/licenses/cc-by/",
  "dc:modified": "2010-12-31",
  "schema": {
    "columns": [{
      "name": "GID",
      "title": {"und": ["GID", "Generic Identifier"]},
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on-street",
      "title": {"und": "On Street"},
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": {"und": "Species"},
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim-cycle",
      "title": {"und": "Trim Cycle"},
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory-date",
      "title": {"und": "Inventory Date"},
      "dc:description": "The date of the operation that was performed.",
      "datatype": "date",
      "format": "M/D/YYYY"
    }],
    "primaryKey": "GID"
  }
}

The advantage of this approach is that embedded metadata can be used to create a regular Metadata object and take advantage of the import (merge) semantics defined in [2].

An added step could also eliminate the need for "Mapping Core Tabular Data" by simply setting headerRowCount to 0. In this case, the effective embedded metadata is simply the following:

{
  "@id": "tree-ops.csv",
  "@context": "http://www.w3.org/ns/csvw",
  "schema": {}
}

When processing the first row, an empty Column entry can be created for each column encountered:

{
  "@id": "tree-ops.csv",
  "@context": "http://www.w3.org/ns/csvw",
  "schema": {
    "columns": [{}, {}, {}, {}, {}]
  }
}

Then, the defaults for "name" and "predicateUrl" simply use "_row_n". This essentially reproduces the behavior in Mapping Core Tabular Data [3] without requiring a separate algorithm.
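
To make the two cases concrete, here is a hedged Python sketch that builds the embedded metadata from the first row of a CSV: one {"title": ...} Column per header cell by default, or one empty Column per cell when headerRowCount is 0. The function name embedded_metadata and its parameters are hypothetical, and the sketch folds "start with an empty schema, then add a Column per cell of the first row" into a single step.

```python
import csv
import io


def embedded_metadata(csv_text, table_id, header_row_count=1):
    """Build Table metadata whose columns come from the first row of the CSV."""
    first_row = next(csv.reader(io.StringIO(csv_text)), [])
    if header_row_count == 0:
        # No header row: one empty Column entry per cell of the first data row.
        columns = [{} for _ in first_row]
    else:
        # Header row present: each header cell becomes the column's title.
        columns = [{"title": cell} for cell in first_row]
    return {
        "@id": table_id,
        "@context": "http://www.w3.org/ns/csvw",
        "schema": {"columns": columns},
    }


header = "GID,On Street,Species,Trim Cycle,Inventory Date\n"
print(embedded_metadata(header, "tree-ops.csv")["schema"]["columns"][1])
# {'title': 'On Street'}
print(embedded_metadata(header, "tree-ops.csv", header_row_count=0)["schema"])
# {'columns': [{}, {}, {}, {}, {}]}
```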

Gregg Kellogg
gregg@greggkellogg.net

[1] http://w3c.github.io/csvw/syntax/#using-a-metadata-file
[2] http://w3c.github.io/csvw/metadata/#importing-metadata
[3] http://w3c.github.io/csvw/csv2rdf/#map-core-tab
Received on Sunday, 11 January 2015 01:05:51 UTC