Re: A draft outline for the CSV2RDF document from Ivan Herman on 2014-05-19 (public-csv-wg@w3.org from May 2014)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 19 May 2014 16:00:11 +0200
To: Andy Seaborne <andy@apache.org>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <836B654D-C17A-44FE-9542-4E85CAAED9D5@w3.org>
On 19 May 2014, at 15:24 , Andy Seaborne <andy@apache.org> wrote:

> On 18/05/14 17:59, Ivan Herman wrote:
>> 
>> Andy, Gregg & all
>> 
>> on the call on Wednesday I suggested that, by putting the general
>> description of the conversion into (for now) the metadata document, it
>> may be necessary to restructure the current CSV2RDF document. I tried to
>> draft what a structure & algorithm would look like; I give you here what
>> I jotted down.
> 
> Thank you.
> 
>> Note that I rely on the fact that the template part would
>> migrate as a general mechanism somewhere; there seems to be an agreement
>> on this on the mailing list. I refer to it as a 'template' attribute in
>> the metadata.
>> 
>> I think the changes should be in section 3 (see below). Following the
>> flow in the metadata document I put the 'table level metadata' into a
>> subsection of section 3, meaning section 4 in the current document can
>> disappear. I would remove section 5, because that should become a more
>> general topic, not bound to RDF; the minimal mapping is also part of
>> section 3. I am not sure about current section 7 (I think that should
>> move elsewhere). Note also that, I believe, this skeleton may be similar
>> for XML and JSON, but I did not check that.
>> 
>> I believe that the three (or more, eventually, with JSON and XML)
>> relevant documents (syntax, metadata, and conversions) should be in
>> synchrony, and this before the next publication round...
>> 
>> With that, this is what I had in mind:
>> 
>> [[[
>> 3. Processing Model
>> 3.1 Conversion of core tabular data, or of data annotated with embedded
>> metadata only
>> 
>> The file's URI is also used as a 'namespace' for URI-s in the generated
>> triples, by concatenating the URI with '#' and with the string for a
>> column name (denoted by ':name' in what follows)
>> 
>> - this case may yield a header for each column; if not, :col1,
>> :col2, :col3, ... are defined
>> - the generation is done by
>>  - each row has the same subject, a new bnode (Bi)
>>  - each cell generates the triple (Bi :headerj "content of cell j")
> 
> with number-like fields being numbers? (to follow what spreadsheets do)

Yep, in the current model (and in the document) I have not put any automatic datatype conversion. I guess this could be done for some of the very usual ones (numbers, anything else? maybe dates?). I am a little bit neutral on this, to be honest.
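
To make the minimal mapping of 3.1 concrete, here is a rough sketch, not a proposal for normative text. It assumes the outline as quoted above: one fresh blank node per row, one triple per cell, and predicates built as URIOFCSVFILE#name. The function names (minimal_triples, to_value), the '_:bN' blank node labels, and the optional detect_numbers switch are all hypothetical; the number detection only illustrates what an automatic conversion of "number-like" cells could look like, it is not in the current model.

```python
import csv
import io
import re
from itertools import count

# Assumption: "number-like" means a plain integer or a decimal with a dot.
_INT = re.compile(r"[+-]?\d+")
_DEC = re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")


def to_value(cell):
    """Return an int or float for number-like cells, else the cell unchanged."""
    text = cell.strip()
    if _INT.fullmatch(text):
        return int(text)
    if _DEC.fullmatch(text):
        return float(text)
    return cell


def minimal_triples(csv_text, file_uri, has_header=True, detect_numbers=False):
    """Yield (subject, predicate, object) tuples for the minimal mapping."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: fall back to col1, col2, col3, ...
        names = [f"col{j}" for j in range(1, len(rows[0]) + 1)]
        data = rows
    bnode_ids = count(1)
    for row in data:
        subject = f"_:b{next(bnode_ids)}"  # one new blank node per row
        for name, cell in zip(names, row):
            predicate = f"{file_uri}#{name}"  # ':name' means URIOFCSVFILE#name
            yield (subject, predicate, to_value(cell) if detect_numbers else cell)


for triple in minimal_triples("name,age\nAnna,42\n",
                              "http://example.org/data.csv",
                              detect_numbers=True):
    print(triple)
```

With detect_numbers left at False every object stays a plain string, which is what the outline currently says.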

> 
>> 
>> Where :xxx means URIOFCSVFILE#xxx
>> 
>> 3.2 Conversion of annotated tabular data
>> 
>> 3.2.1 Table level metadata
>> The conversion uses the entries defined in section 3.1 of the metadata
>> document to generate table level metadata triples as follows:
>> 
>> - @id is used as a subject for all table level metadata
>> - @type generates a (@id rdf:type @type) triple
>> - the fields defined by DC-TERMS are used directly, with @id as subject
>> 
>> The @id is also used as a 'namespace' for URI-s in the generated
>> triples, by concatenating the URI with '#' and with the string for a
>> column name (denoted by ':name' in what follows)
>> 
>> 3.2.2 Field level metadata
>> The conversion uses the entries defined in section 3.2 of the metadata
>> document to generate table level metadata triples using the steps below.
> 
> For me, a template (in RDF conversion) is the template for one complete row's worth of conversion.
> 
> I was thinking that if no template were explicitly given, the metadata would be used to define a template and the template be applied.  We could have descriptive text about what happens when there is no user-defined template.  Your outline seems to define the process when templateless separately from templates.
> 
> Generating a template, if none is provided, would keep the user-template driven mechanism and the metadata-generated template mechanism in step.  It would be clear that they aren't alternatives with (potentially) capabilities in the direct route that are not in the template route.  You could get the generated template and tweak it, for example.
> 

I would need an example to understand what you mean...

> The part in common is escaping syntax.  Building up URIs from fields in a row may involve URI query strings, URI path segments etc and these have slightly different rules for conversion from a character string to the URI form (e.g. spaces, use of ?, & and /).
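
A small illustration of that difference, using Python's standard urllib.parse only; the cell value and the 'dept' parameter name are made up for the example. The same string is escaped differently depending on whether it lands in a path segment or in a query string.

```python
from urllib.parse import quote, urlencode


def path_segment(value):
    """Escape a cell value for use as a single URI path segment."""
    # safe="" so that "/" is percent-encoded as well
    return quote(value, safe="")


def query_string(params):
    """Escape a dict of cell values as a URI query string."""
    # urlencode uses quote_plus, so a space becomes "+"
    return urlencode(params)


cell = "R&D / Ops?"
print(path_segment(cell))            # R%26D%20%2F%20Ops%3F
print(query_string({"dept": cell}))  # dept=R%26D+%2F+Ops%3F
```
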
> 
>> The processing is based on general metadata attributes as defined in
>> that section; this specification adds one field level attribute:
>> 'rdf_predicate_type', which can take the value of 'object' or 'literal'
>> - each row generates a number of triples with a common subject. This
>> subject is
>>  - a new blank node for each row if no primaryKey attribute is defined, or
>>  - :field1-field2-...-fieldn, where fieldi are the (column) names
>> appearing in the value of the primaryKey attribute if that attribute
>> contains a list of names, or
>>  - :field, where field is the column name appearing as the value of the
>> primaryKey attribute
>> - for each cell in the field _that is not a primary Key_, the following
>> triple is generated
>>  - subject is the subject defined for the row
>>  - predicate is :name, where 'name' is the value of the 'name'
>> attribute in the field descriptor (3.2.2 in the metadata spec)
>>  - object:
>>    - if the column is defined as a foreign key through the
>> 'foreignKeys' attribute, the object is an RDF URI Resource as defined by
>> the foreign key reference (3.2.7 in the metadata spec)
> 
> I think the term 'foreign key' brings a lot of baggage with it such as foreign key constraints, and guarantees, especially any assumption about whether the link target exists or not.

Yes, I agree. I am not even sure we need those; after all, the metadata can generate URIs and can tell that the value should be taken to be a URI.

At the moment I just tried to align with the metadata document. It is a more general issue than RDF.

> 
> I'd rather just talk about generating URIs as one "type", and reserve 'foreign key' for the case of a link within a group of tables converted together or associated in some way, where a foreign key is highly likely to mean the target of the link exists.

Right. And I am not sure whether the case for several tables in one file is in scope...

> 
>>      - otherwise, if a 'type' attribute is defined (3.2.4) then the
>> cell is converted into that type of literal (in case of date, this may
>> also use the 'format' attribute)
>>      - otherwise, if a 'template' attribute is defined, then the
>> template is used to generate a value; if the value of rdf_predicate_type
>> is missing or is set to 'literal', the object is an RDF literal of type
>> xsd:string; otherwise, the object is an RDF URI Resource.
> 
> So templates for you are templating individual RDF objects?  Jeremy's conversion example has a specific shape of RDF per row.

Yes. Of course, a specific template for a field can use the names of other fields, too. I modeled my thoughts on the R2RML approach which is also granular on fields.
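
To show what I mean by field-level granularity, here is a rough sketch of the per-row subject and the per-cell cascade from the outline; it is an illustration under several assumptions, not a definition. Assumptions: field descriptors are plain dicts; the template syntax is '{column}' placeholders (the outline does not define one); a 'type' value maps directly onto an XSD datatype name; the subject is built from the cell values of the primaryKey columns; the foreign key case is left out. All function names are hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def row_subject(row, base, primary_key=None, row_index=0):
    """Subject for one row; primary_key is None, a column name, or a list."""
    if primary_key is None:
        return f"_:row{row_index}"  # a new blank node for each row
    names = [primary_key] if isinstance(primary_key, str) else list(primary_key)
    # Assumption: the *cell values* of the key columns are joined with '-'
    return f"{base}#" + "-".join(row[name] for name in names)


def cell_object(field, row):
    """Return (kind, value, datatype) for one cell, following the cascade."""
    value = row[field["name"]]
    if "type" in field:
        # Typed literal; assumes the type name is an XSD local name
        return ("literal", value, XSD + field["type"])
    if "template" in field:
        # A field's template may use the values of other fields, too
        filled = field["template"].format_map(row)
        if field.get("rdf_predicate_type", "literal") == "literal":
            return ("literal", filled, XSD + "string")
        return ("uri", filled, None)
    # Fallback: the cell as an xsd:string literal
    return ("literal", value, XSD + "string")


row = {"id": "7", "name": "Anna", "age": "42"}
print(row_subject(row, "http://example.org/t", primary_key="id"))
print(cell_object({"name": "age", "type": "integer"}, row))
print(cell_object({"name": "name",
                   "template": "http://example.org/person/{id}",
                   "rdf_predicate_type": "object"}, row))
```

A row-level template in Andy's sense could be assembled from such per-field pieces, which may be one way to keep the two mechanisms in step.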

> 
>>      - otherwise, the value of the cell is used as an RDF Literal of type
>> xsd:string
>> 
>> 
>> Some open issues:
>> - do we need to add a (rowid csv:row "rownumber") kind of triple for
>> each row; (probably yes)
>> - do we need to add a series of triples of the sort (@id csv:rows rowid)
>> for each row, to make a "bridge" between the graph overall and its
>> constituents? (It may not be all that important for RDF, but it may be
>> necessary for JSON)
> 
> Agreed - we probably want to define triples generated that tie the RDF back to the CSV input.  Probably an "optional extra", as for a large CSV file there would be a significant increase in the size of the output.
> 
>> ]]]
>> 
>> Does this make sense?
>> 
>> I may try to find some time editing the document, but would be good to
>> have a minimal agreement from the group.
>> 
>> Ivan
> 
> 	Andy
> 
>> 
>> 
>> ----
>> Ivan Herman, W3C
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> GPG: 0x343F1A3D
>> WebID: http://www.ivan-herman.net/foaf#me
>> 
> 
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me
Received on Monday, 19 May 2014 14:00:44 UTC