A draft outline for the CSV2RDF document

Andy, Gregg & all

On the call on Wednesday I suggested that moving the general description of the conversion into (for now) the metadata document may make it necessary to restructure the current CSV2RDF document. I tried to draft what a structure & algorithm would look like; here is what I jotted down. Note that I rely on the template part migrating somewhere as a general mechanism; there seems to be agreement on this on the mailing list. I refer to it as a 'template' attribute in the metadata.

I think the changes should be in section 3 (see below). Following the flow in the metadata document I put the 'table level metadata' into a subsection of section 3, meaning section 4 in the current document can disappear. I would remove section 5, because that should become a more general topic, not bound to RDF; the minimal mapping is also part of section 3. I am not sure about current section 7 (I think that should move elsewhere). Note also that, I believe, this skeleton may be similar for XML and JSON, but I did not check that.

I believe that the three (or more, eventually, with JSON and XML) relevant documents (syntax, metadata, and conversions) should be in sync, and this before the next publication round...

With that, this is what I had in mind:

[[[
3. Processing Model
3.1 Conversion of core tabular data, or of tabular data annotated with embedded metadata only

The file's URI is also used as a 'namespace' for URI-s in the generated triples, by concatenating the URI with '#' and with the string for a column name (denoted by ':name' in what follows)

- in this case the file either yields a header for each column or, if not, the names :col1, :col2, :col3, ... are defined
- the generation is done as follows:
 - all cells of a row share the same subject, a new bnode (Bi) for that row
 - each cell generates the triple (Bi :headerj "content of cell j")

Where :xxx means URIOFCSVFILE#xxx
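
To make the 3.1 mapping concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the function name minimal_triples, the has_header flag, the blank node labels and the example URI are all hypothetical, column names are not URI-escaped, and triples are represented as plain tuples rather than through an RDF library.

```python
import csv
import io


def minimal_triples(csv_text, file_uri, has_header=True):
    """Sketch of the 3.1 mapping: one blank node per row, one triple per cell.

    Returns a list of (subject, predicate, object) tuples; the object is the
    raw cell content, standing in for a plain literal.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: fall back to col1, col2, col3, ...
        names = [f"col{j}" for j in range(1, len(rows[0]) + 1)]
        data = rows
    triples = []
    for i, row in enumerate(data, start=1):
        subject = f"_:b{i}"  # a new blank node per row (label is arbitrary)
        for name, cell in zip(names, row):
            # :name means URIOFCSVFILE#name
            predicate = f"<{file_uri}#{name}>"
            triples.append((subject, predicate, cell))
    return triples


# Hypothetical input: two data rows, two columns.
example = minimal_triples(
    "name,age\nAnn,30\nBob,41\n", "http://example.org/people.csv"
)
```

Each data row contributes as many triples as it has cells, all sharing that row's blank node as subject.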

3.2 Conversion of annotated tabular data

3.2.1 Table level metadata
The conversion uses the entries defined in section 3.1 of the metadata document to generate table level metadata triples as follows:

- @id is used as a subject for all table level metadata
- @type generates a (@id rdf:type @type) triple
- the fields defined by DC-TERMS are used directly, with @id as subject

The @id is also used as a 'namespace' for URI-s in the generated triples, by concatenating the URI with '#' and with the string for a column name (denoted by ':name' in what follows)
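
A rough sketch of the table level step in Python, under stated assumptions: the metadata is taken to be an already-parsed dictionary, '@type' is assumed to hold a full URI, and the selection of DC-TERMS keys (title, description, license) is a hypothetical sample rather than the list in the metadata document. The function names are made up for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"


def table_level_triples(metadata, dcterms_keys=("title", "description", "license")):
    """Sketch: generate table level triples with @id as the common subject."""
    table_id = metadata["@id"]
    triples = []
    if "@type" in metadata:
        # (@id rdf:type @type); assumes @type is already a full URI
        triples.append((table_id, RDF_TYPE, metadata["@type"]))
    for key in dcterms_keys:
        # DC-TERMS fields are used directly as predicates (assumed key set)
        if key in metadata:
            triples.append((table_id, DCTERMS + key, metadata[key]))
    return triples


def column_uri(metadata, name):
    """':name' in the text: the @id, then '#', then the column name."""
    return f"{metadata['@id']}#{name}"
```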

3.2.2 Field level metadata
The conversion uses the entries defined in section 3.2 of the metadata document to generate the triples for the table's rows using the steps below. The processing is based on general metadata attributes as defined in that section; this specification adds one field level attribute: 'rdf_predicate_type', which can take the value of 'object' or 'literal'.

- each row generates a number of triples with a common subject. This subject is
 - a new blank node for each row if no primaryKey attribute is defined, or
 - :field1-field2-...-fieldn, where fieldi are the (column) names appearing in the value of the primaryKey attribute if that attribute contains a list of names, or
 - :field, where field is the column name appearing as the value of the primaryKey attribute
- for each cell in a field _that is not a primary key_, the following triple is generated
 - subject is the subject defined for the row
 - predicate is :name, where 'name' is the value of the 'name' attribute in the field descriptor (3.2.2 in the metadata spec)
 - object:
   - if the column is defined as a foreign key through the 'foreignKeys' attribute, the object is an RDF URI Resource as defined by the foreign key reference (3.2.7 in the metadata spec)
   - otherwise, if a 'type' attribute is defined (3.2.4) then the cell is converted into that type of literal (in case of date, this may also use the 'format' attribute)
   - otherwise, if a 'template' attribute is defined, then the template is used to generate a value; if the value of rdf_predicate_type is missing or is set to 'literal', the object is an RDF literal of type xsd:string; otherwise, the object is an RDF URI Resource.
   - otherwise, the value of the cell is used as an RDF literal of type xsd:string
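
The subject and object rules above can be sketched in Python as follows. This is an illustration only, with several assumptions: the function names are hypothetical; the subject rule follows the wording above literally and joins the primary key column names (the text does not say whether cell values should be used instead); the foreign key is passed in as a caller-supplied function because its resolution is defined in the metadata spec, not here; 'type' values are assumed to be XSD local names; the 'format' attribute is ignored; and the template syntax is assumed to be Python's str.format placeholders.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def row_subject(table_id, primary_key, row_index):
    """Sketch of the subject rule, read literally from the outline."""
    if primary_key is None:
        return f"_:row{row_index}"  # new blank node per row
    if isinstance(primary_key, str):
        primary_key = [primary_key]  # single column name
    # :field1-field2-...-fieldn
    return f"{table_id}#{'-'.join(primary_key)}"


def cell_object(cell, row, field, foreign_key_uri=None):
    """Sketch of the object rule; the first matching case wins.

    Returns ("uri", value) or ("literal", value, datatype).
    `row` maps column names to cell values; `field` is the field descriptor.
    """
    if foreign_key_uri is not None:
        # Column is a foreign key: resolution is delegated to the caller
        return ("uri", foreign_key_uri(cell))
    if "type" in field:
        # Assumes the type name is an XSD local name; 'format' is ignored
        return ("literal", cell, XSD + field["type"])
    if "template" in field:
        # Assumed template syntax: str.format placeholders such as {id}
        value = field["template"].format(**row)
        if field.get("rdf_predicate_type", "literal") == "literal":
            return ("literal", value, XSD + "string")
        return ("uri", value)
    return ("literal", cell, XSD + "string")
```

The ordering of the checks mirrors the "otherwise" chain: a foreign key takes precedence over 'type', which takes precedence over 'template'.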


Some open issues:
- do we need to add a (rowid csv:row "rownumber") kind of triple for each row? (probably yes)
- do we need to add a series of triples of the sort (@id csv:rows rowid) for each row, to make a "bridge" between the graph overall and its constituents? (It may not be all that important for RDF, but it may be necessary for JSON.)
]]]

Does this make sense?

I may try to find some time to edit the document, but it would be good to have a minimal agreement from the group first.

Ivan


----
Ivan Herman, W3C 
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me

Received on Sunday, 18 May 2014 16:59:52 UTC