Re: A draft outline for the CSV2RDF document

On May 19, 2014, at 6:24 AM, Andy Seaborne <andy@apache.org> wrote:

> On 18/05/14 17:59, Ivan Herman wrote:
>> 
>> Andy, Gregg & all
>> 
>> on the call on Wednesday I suggested that, by putting the general
>> description of the conversion into (for now) the metadata document, it
>> may be necessary to restructure the current CSV2RDF document. I tried to
>> draft what a structure & algorithm would look like; I give you here what
>> I jotted down.
> 
> Thank you.
> 
>> Note that I rely on the fact that the template part would
>> migrate as a general mechanism somewhere; there seems to be an agreement
>> on this on the mailing list. I refer to it as a 'template' attribute in
>> the metadata.
>> 
>> I think the changes should be in section 3 (see below). Following the
>> flow in the metadata document I put the 'table level metadata' into a
>> subsection of section 3, meaning section 4 in the current document can
>> disappear. I would remove section 5, because that should become a more
>> general topic, not bound to RDF; the minimal mapping is also part of
>> section 3. I am not sure about current section 7 (I think that should
>> move elsewhere). Note also that, I believe, this skeleton may be similar
>> for XML and JSON, but I did not check that.
>> 
>> I believe that the three (or more, eventually, with JSON and XML)
>> relevant documents (syntax, metadata, and conversions) should be in
>> synchrony, and this before the next publication round...
>> 
>> With that, this is what I had in mind:
>> 
>> [[[
>> 3. Processing Model
>> 3.1 Conversion of core tabular data, or of data annotated with embedded
>> metadata only
>> 
>> The file's URI is also used as a 'namespace' for URI-s in the generated
>> triples, by concatenating the URI with '#' and with the string for a
>> column name (denoted by ':name' in what follows)
>> 
>> - in this case the file either yields a header for each column or, if
>> not, :col1, :col2, :col3, ... are defined
>> - the generation is done by
>>  - each row has the same subject, a new bnode (Bi)
>>  - each cell generates the triple (Bi :headerj "content of cell j")
> 
> with number-like fields being numbers? (to follow what spreadsheets do)
> 
>> 
>> Where :xxx means URIOFCSVFILE#xxx
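The header-only mapping in 3.1 can be sketched roughly as follows. This is only an illustration under stated assumptions: triples are plain Python tuples, blank nodes are written as "_:bN", the function name minimal_triples and the example URI are made up, and cells are left as strings (so Andy's question about number-like fields is not addressed here). Column names are not escaped.

```python
import csv
import io


def minimal_triples(csv_text, file_uri, has_header=True):
    """Yield (subject, predicate, object) tuples for the header-only mapping.

    Hypothetical helper: the names and the tuple representation are
    assumptions made for illustration, not part of any draft.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: fall back to col1, col2, col3, ...
        names = [f"col{j}" for j in range(1, len(rows[0]) + 1)]
        data = rows
    for i, row in enumerate(data, start=1):
        subject = f"_:b{i}"  # one fresh blank node per row
        for name, cell in zip(names, row):
            # ':name' is the file URI, then '#', then the column name
            yield (subject, f"{file_uri}#{name}", cell)


triples = list(minimal_triples("name,age\nAlice,30\n",
                               "http://example.org/people.csv"))
```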
>> 
>> 3.2 Conversion of annotated tabular data
>> 
>> 3.2.1 Table level metadata
>> The conversion uses the entries defined in section 3.1 of the metadata
>> document to generate table level metadata triples as follows:
>> 
>> - @id is used as a subject for all table level metadata
>> - @type generates a (@id rdf:type @type) triple
>> - the fields defined by DC-TERMS are used directly, with @id as subject
>> 
>> The @id is also used as a 'namespace' for URI-s in the generated
>> triples, by concatenating the URI with '#' and with the string for a
>> column name (denoted by ':name' in what follows)
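As a rough illustration of the table level rules quoted above: the sketch below assumes the metadata is available as a Python dict, and that the DC-TERMS fields of interest are "title", "description" and "license". That list is my assumption, not something taken from the metadata document. The function names are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"


def table_metadata_triples(metadata,
                           dcterms_keys=("title", "description", "license")):
    """Return table level triples, all with @id as the subject.

    dcterms_keys is an assumed selection of DC-TERMS fields.
    """
    table_id = metadata["@id"]
    triples = []
    if "@type" in metadata:
        triples.append((table_id, RDF_TYPE, metadata["@type"]))
    for key in dcterms_keys:
        if key in metadata:
            triples.append((table_id, DCTERMS + key, metadata[key]))
    return triples


def column_uri(metadata, name):
    """Build ':name', using @id as the 'namespace' for column URIs."""
    return f"{metadata['@id']}#{name}"
```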
>> 
>> 3.2.2 Field level metadata
>> The conversion uses the entries defined in section 3.2 of the metadata
>> document to generate field level metadata triples using the steps below.
> 
> For me, a template (in RDF conversion) is the template for one complete rows-worth of conversion.

Yes for me too; in my last email, I suggested that we automatically construct such a template if none is provided, which I think simplifies subsequent processing.

> I was thinking that if no template were explicitly given, the metadata would be used to define a template and the template be applied.  We could have descriptive text about what happens when there is no user-defined template.  Your outline seems to define the process when templateless separately from templates.

+1

> Generating a template, if none is provided, would keep the user-template driven mechanism and the metadata-generated template mechanism in step.  It would be clear that they aren't alternatives with (potentially) capabilities in the direct route that are not in the template route.  You could get the generated template and tweak it, for example.

+1
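To make the "generate a template when none is given" idea concrete, here is a minimal sketch. The template representation (a list of predicate and value-pattern pairs, one list per row shape) and the Python str.format placeholder syntax are assumptions made for illustration; no template syntax has been agreed. Both function names are hypothetical.

```python
def default_row_template(column_names, base_uri):
    """Build a per-row template from column names alone.

    Each entry pairs a predicate URI with a value pattern; the "{name}"
    placeholder syntax is an assumption for this sketch.
    """
    return [(f"{base_uri}#{name}", "{" + name + "}") for name in column_names]


def apply_row_template(template, row, subject):
    """Apply a template (generated or user-supplied) to one row."""
    return [(subject, predicate, pattern.format(**row))
            for predicate, pattern in template]


template = default_row_template(["name", "age"], "http://example.org/t.csv")
# A user could take the generated template and tweak one entry:
template[1] = (template[1][0], "{age} years")
triples = apply_row_template(template, {"name": "Alice", "age": "30"}, "_:b1")
```

Because the same apply step runs for generated and hand-written templates, the two routes cannot drift apart in capability.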

> The part in common is escaping syntax.  Building up URIs from fields in a row may involve URI query strings, URI path segments etc and these have slightly different rules for conversion from a character string to the URI form (e.g. spaces, use of ?, & and /).

We need to be sure we can access the RFC6570 escape conventions; I suggested how we might do this in my processing instructions.
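To illustrate why the escaping rules matter, the sketch below approximates two RFC 6570 expansion styles with Python's urllib.parse.quote: simple expansion ({var}), which percent-encodes everything outside the unreserved set, and reserved expansion ({+var}), which lets reserved characters such as /, ? and & through. It is an approximation only: real reserved expansion also preserves existing percent-encoded triplets, which quote() does not. The function names are hypothetical.

```python
from urllib.parse import quote

# RFC 3986 reserved characters (gen-delims plus sub-delims)
RESERVED = ":/?#[]@!$&'()*+,;="


def expand_simple(value):
    """Approximate {var}: only unreserved characters stay unescaped."""
    return quote(value, safe="")


def expand_reserved(value):
    """Approximate {+var}: reserved characters are passed through as well."""
    return quote(value, safe=RESERVED)


cell = "a b/c?d&e"
as_path_segment = expand_simple(cell)
as_reserved = expand_reserved(cell)
```

The same cell value therefore produces different URI text depending on where in the URI it is meant to land.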

>> The processing is based on general metadata attributes as defined in
>> that section; this specification adds one field level attribute:
>> 'rdf_predicate_type', which can take the value of 'object' or 'literal'
>> define
>> - each row generates a number of triples with a common subject. This
>> subject is
>>  - a new blank node for each row if no primaryKey attribute is defined, or
>>  - :field1-field2-...-fieldn, where fieldi are the (column) names
>> appearing in the value of the primaryKey attribute if that attribute
>> contains a list of names, or
>>  - :field, where field is the column name appearing as the value of the
>> primaryKey attribute
>> - for each cell in the field _that is not a primary Key_, the following
>> triple is generated
>>  - subject is the subject defined for the row
>>  - predicate is :name, where 'name' is the value of the 'name'
>> attribute in the field descriptor (3.2.2 in the metadata spec)
>>  - object:
>>    - if the column is defined as a foreign key through the
>> 'foreignKeys' attribute, the object is a RDF URI Resource as defined by
>> the foreign Key reference (3.2.7 in the metadata spec)
> 
> I think the term 'foreign key' brings a lot of baggage with it such as foreign key constraints, and guarantees, especially any assumption about whether the link target exists or not.
> 
> I'd rather just talk about generating URIs as one "type", and reserve 'foreign key' for the case of a link within a group of tables converted together or associated in some way, where a foreign key is highly likely to mean the target of the link exists.

Agreed, I don't see any real value here; we need to be able to designate that the type in column metadata is an IRI (@id?).

>>      - otherwise, if a 'type' attribute is defined (3.2.4) then the
>> cell is converted into that type of literal (in case of date, this may
>> also use the 'format' attribute)

This is a metadata processing rule, as opposed to a template processing rule.

>>      - otherwise, if a 'template' attribute is defined, then the
>> template is used to generate a value; if the value of rdf_predicate_type
>> is missing or is set to 'literal', the object is an rdf literal of type
>> xsd:string; otherwise, the object is an RDF URI Resource.
> 
> So templates for you are templating individual RDF objects?  Jeremy's conversion example has a specific shape of RDF per row.
> 
>>      - otherwise, the value of the cell is used as an RDF Literal of
>> type xsd:string
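The row subject and cell object rules in the outline can be sketched as below. Several things are assumptions on my part: that the primaryKey case uses the cell values of the named columns (the outline literally says column names, which would give every row the same subject), that 'type' values map directly onto XSD datatype names, and that the per-cell 'template' uses Python str.format placeholders. The foreign key branch is left out, given the discussion above. All names are hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def row_subject(row, base, primary_key=None, row_index=0):
    """Blank node without a primaryKey, else base#value1-value2-..."""
    if not primary_key:
        return f"_:b{row_index}"
    keys = [primary_key] if isinstance(primary_key, str) else primary_key
    return f"{base}#" + "-".join(row[k] for k in keys)


def cell_object(value, field, row):
    """Return (kind, lexical form, datatype), checking rules in outline order."""
    if "type" in field:
        # Assumed: the metadata type name is also the XSD local name
        return ("literal", value, XSD + field["type"])
    if "template" in field:
        generated = field["template"].format(**row)
        if field.get("rdf_predicate_type", "literal") == "literal":
            return ("literal", generated, XSD + "string")
        return ("uri", generated, None)
    return ("literal", value, XSD + "string")


row = {"id": "7", "name": "Alice", "age": "30"}
subject = row_subject(row, "http://example.org/t", ["id", "name"])
```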
>> 
>> 
>> Some open issues:
>> - do we need to add a (rowid csv:row "rownumber") kind of triple for
>> each row; (probably yes)
>> - do we need to add a series of triples of the sort (@id csv:rows rowid)
>> for each row, to make a "bridge" between the graph overall and its
>> constituents? (It may not be all that important for RDF, but it may be
>> necessary for JSON.)
> 
> Agreed - we probably want to define triples generated that tie the RDF back to the CSV input.  Probably an "optional extra", as for a large CSV file there would be a significant increase in size of output.

There's some description in Ivan's version of the document that suggests this, but it isn't carried out in the examples.
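For what it's worth, the two open-issue triples could be emitted as an opt-in step along these lines. The csv: namespace URI below is a placeholder I made up, since none has been fixed, and the function name is hypothetical. The flag reflects Andy's point that this roughly adds two triples per row to the output.

```python
CSV_NS = "http://example.org/csv#"  # placeholder namespace, not an agreed URI


def provenance_triples(table_id, row_subjects, include=True):
    """Tie each row subject back to its row number and to the table.

    Returns an empty list when include is False, so the extra triples
    stay an optional part of the output.
    """
    if not include:
        return []
    triples = []
    for number, subject in enumerate(row_subjects, start=1):
        triples.append((subject, CSV_NS + "row", str(number)))
        triples.append((table_id, CSV_NS + "rows", subject))
    return triples


extra = provenance_triples("http://example.org/t", ["_:b1", "_:b2"])
```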

(BTW, I suggest we merge Ivan's changes in (making him an Editor) and use that as the basis going forward.)

Gregg

>> ]]]
>> 
>> Does this make sense?
>> 
>> I may try to find some time editing the document, but would be good to
>> have a minimal agreement from the group.
>> 
>> Ivan
> 
> 	Andy
> 
>> 
>> 
>> ----
>> Ivan Herman, W3C
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153 <tel:+31-641044153>
>> GPG: 0x343F1A3D
>> WebID: http://www.ivan-herman.net/foaf#me
>> 
> 
> 

Received on Wednesday, 21 May 2014 23:36:51 UTC