Re: A draft outline for the CSV2RDF document

Sorry, catching up, so I'll likely have several replies on this thread.

On May 19, 2014, at 6:17 AM, Ivan Herman <ivan@w3.org> wrote:

> Putting my money where my mouth is, so to say :), I have made a modified version of the CSV2RDF document. Because there has been no WG discussion on it yet, I have put it into another branch on github; it can be viewed at 
> 
> http://htmlpreview.github.io/?https://github.com/w3c/csvw/blob/rdfconversion-ivan/csv2rdf/index.html
> 
> (for some reason the date goes wrong through this viewer, I am not sure why; let us not worry about that now...)
> 
> Thoughts?

In general, I like the approach of intuiting metadata if none is provided, with or without a header row.

3.1 – This seems to provide rules for generating triples for each row, but only where the column names are either a DC Term or @type; are other fields ignored? Or would they be generated as fragment identifiers against the document base?

As an alternative to describing RDF-specific rules for interpreting the data, in the spirit of the metadata intuition, it might be better to describe the construction of a document template, so given your first example, a template could be described such as the following:

@prefix : <http://www.example.org/file.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .

[ csvw:row "{_row_num}" ;
  :col_1 "{col_1}" ;
  :col_2 "{col_2}" ;
  :col_3 "{col_3}" ;
  :col_4 "{col_4}" ;
  :col_5 "{col_5}"
] .

(Note: if the column metadata has an associated type (datatype?), it may be used to set a literal type in the template. If the type indicates an IRI, it may be used to surround the value with <> and to enable RFC 6570 processing of the field value.)
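To make that concrete, here is a rough Python sketch of how a per-field template fragment might be derived from column metadata. The "name"/"type" keys, the xsd: datatype mapping, and the "iri" marker are all illustrative assumptions on my part, not anything the draft defines:

```python
# Sketch (assumptions, not from the draft): derive a Turtle
# predicate-object template fragment from one column's metadata.

def field_template(column):
    """Return a Turtle fragment with a {name} variable for one column."""
    name = column["name"]
    ctype = column.get("type")
    if ctype == "iri":
        # IRI-valued: surround the variable with <>, which would also
        # enable RFC 6570-style expansion of the field value.
        return ':%s <{%s}>' % (name, name)
    if ctype:
        # Typed literal: append the datatype to the quoted variable.
        return ':%s "{%s}"^^xsd:%s' % (name, name, ctype)
    return ':%s "{%s}"' % (name, name)

print(field_template({"name": "col_1"}))                     # :col_1 "{col_1}"
print(field_template({"name": "col_2", "type": "integer"}))  # :col_2 "{col_2}"^^xsd:integer
print(field_template({"name": "col_3", "type": "iri"}))      # :col_3 <{col_3}>
```

A template generator would then just join such fragments per row, as in the example above.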

Then, the processing of a template is consistent, whether the template was defined explicitly or inferred from metadata (which itself might be inferred). We could have format-specific rules for creating templates, which would handle Turtle, JSON-LD and RDF/XML. The equivalent JSON-LD template might look like the following:

{
  "@context": {
    "@vocab": "http://www.example.org/file.csv#",
    "csvw": "http://www.w3.org/ns/csvw#"
  },
  "@id": "{_row_num}",
  "col_1": "{col_1}",
  "col_2": "{col_2}",
  "col_3": "{col_3}",
  "col_4": "{col_4}",
  "col_5": "{col_5}"
}

3.2 – Sorry, this looks really complicated. I think that by working against a textual template, the processing steps can be simplified:

* The metadata specification in [tabular-metadata] requires the presence of the name property for each column.
* For each row "j",
  * Set the variable "_row_num" to the number of the row
  * For each field template found in the document template
    * If the template variable matches a column name
      * substitute the field template with the associated field value after performing per-field transformations as indicated in the column metadata
        * If the column type designates an IRI, also perform RFC6570 value modifications
    * Otherwise, replace the template variable with null (or some other format-appropriate value)
  * The result is the original document template with all field templates replaced by appropriately modified values.
    This result may be emitted directly, or further processed according to the associated template MIME type.

There may be some missing details, but this results in processing rules that are pretty consistent across different formats. We probably need to include some format-specific escape operations for value substitutions.
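For what it's worth, the per-row substitution steps above can be sketched in a few lines of Python. The {name} variable syntax, the "null" placeholder, and the minimal string escaping are illustrative assumptions; per-field transformations and RFC 6570 handling are only marked, not implemented:

```python
# Rough sketch of the per-row template processing (assumptions noted
# above); works against any textual template with {name} fields.
import re

def escape(value):
    """Minimal string escaping for substituted values (Turtle/JSON)."""
    return value.replace('\\', '\\\\').replace('"', '\\"')

def fill_template(template, row, row_num, null="null"):
    """Replace each {var} with the matching field value, or null."""
    def substitute(match):
        var = match.group(1)
        if var == "_row_num":
            return str(row_num)
        if var in row:
            # Per-field transformations from the column metadata (and
            # RFC 6570 value modifications for IRIs) would go here.
            return escape(row[var])
        return null  # template variable matches no column name
    return re.sub(r'\{([A-Za-z0-9_]+)\}', substitute, template)

template = '[ csvw:row "{_row_num}" ; :col_1 "{col_1}" ; :col_2 "{col_2}" ] .'
print(fill_template(template, {"col_1": "a", "col_2": "b"}, 1))
# [ csvw:row "1" ; :col_1 "a" ; :col_2 "b" ] .
```

The same loop works unchanged for the JSON-LD template, which is the point: only the template construction is format-specific.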

Gregg

> Ivan
> 
> On 18 May 2014, at 18:59 , Ivan Herman <ivan@w3.org> wrote:
> 
>> 
>> Andy, Gregg & all
>> 
>> on the call on Wednesday I suggested that, by putting the general description of the conversion into (for now) the metadata document, it may be necessary to restructure the current CSV2RDF document. I tried to draft what a structure & algorithm would look like; I give you here what I jotted down. Note that I rely on the fact that the template part would migrate as a general mechanism somewhere; there seems to be an agreement on this on the mailing list. I refer to it as a 'template' attribute in the metadata.
>> 
>> I think the changes should be in section 3 (see below). Following the flow in the metadata document I put the 'table level metadata' into a subsection of section 3, meaning section 4 in the current document can disappear. I would remove section 5, because that should become a more general topic, not bound to RDF; the minimal mapping is also part of section 3. I am not sure about current section 7 (I think that should move elsewhere). Note also that, I believe, this skeleton may be similar for XML and JSON, but I did not check that.
>> 
>> I believe that the three (or more, eventually, with JSON and XML) relevant documents (syntax, metadata, and conversions) should be in synchrony, and this before the next publication round...
>> 
>> With that, this is what I had in mind:
>> 
>> [[[
>> 3. Processing Model
>> 3.1 Conversion of core tabular data, or data annotated with embedded metadata only
>> 
>> The file's URI is also used as a 'namespace' for URIs in the generated triples, by concatenating the URI with '#' and the string for a column name (denoted by ':name' in what follows)
>> 
>> - if the file has a header, it yields a name for each column; if not, :col1, :col2, :col3, ... are used
>> - the generation is done as follows:
>>   - each row has the same subject, a new bnode (Bi)
>>   - each cell generates the triple (Bi :headerj "content of cell j")
>> 
>> Where :xxx means URIOFCSVFILE#xxx
>> 
>> 3.2 Conversion of annotated tabular data
>> 
>> 3.2.1 Table level metadata
>> The conversion uses the entries defined in section 3.1 of the metadata document to generate table level metadata triples as follows:
>> 
>> - @id is used as a subject for all table level metadata
>> - @type generates a (@id rdf:type @type) triple
>> - the fields defined by DC-TERMS are used directly, with @id as subject
>> 
>> The @id is also used as a 'namespace' for URIs in the generated triples, by concatenating the URI with '#' and the string for a column name (denoted by ':name' in what follows)
>> 
>> 3.2.2 Field level metadata
>> The conversion uses the entries defined in section 3.2 of the metadata document to generate field level metadata triples using the steps below. The processing is based on general metadata attributes as defined in that section; this specification adds one field level attribute: 'rdf_predicate_type', which can take the value of 'object' or 'literal'.
>> 
>> - each row generates a number of triples with a common subject. This subject is:
>>   - a new blank node for each row if no primaryKey attribute is defined, or
>>   - :field1-field2-...-fieldn, where fieldi are the (column) names appearing in the value of the primaryKey attribute, if that attribute contains a list of names, or
>>   - :field, where field is the column name appearing as the value of the primaryKey attribute
>> - for each cell in the field _that is not a primary key_, the following triple is generated:
>>   - subject is the subject defined for the row
>>   - predicate is :name, where 'name' is the value of the 'name' attribute in the field descriptor (3.2.2 in the metadata spec)
>>   - object:
>>     - if the column is defined as a foreign key through the 'foreignKeys' attribute, the object is an RDF URI Resource as defined by the foreign key reference (3.2.7 in the metadata spec)
>>     - otherwise, if a 'type' attribute is defined (3.2.4), the cell is converted into a literal of that type (in the case of a date, this may also use the 'format' attribute)
>>     - otherwise, if a 'template' attribute is defined, the template is used to generate a value; if the value of rdf_predicate_type is missing or set to 'literal', the object is an RDF literal of type xsd:string; otherwise, the object is an RDF URI Resource
>>     - otherwise, the value of the cell is used as an RDF literal of type xsd:string
>> 
>> 
>> Some open issues:
>> - do we need to add a (rowid csv:row "rownumber") kind of triple for each row? (probably yes)
>> - do we need to add a series of triples of the sort (@id csv:rows rowid) for each row, to make a "bridge" between the graph overall and its constituents? It may not be all that important for RDF, but it may be necessary for JSON.
>> ]]]
>> 
>> Does this make sense?
>> 
>> I may try to find some time editing the document, but would be good to have a minimal agreement from the group.
>> 
>> Ivan
>> 
>> 
>> ----
>> Ivan Herman, W3C 
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> GPG: 0x343F1A3D
>> WebID: http://www.ivan-herman.net/foaf#me
>> 
> 
> 
> ----
> Ivan Herman, W3C 
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> GPG: 0x343F1A3D
> WebID: http://www.ivan-herman.net/foaf#me
> 

Received on Wednesday, 21 May 2014 23:26:30 UTC