Re: A draft outline for the CSV2RDF document from Andy Seaborne on 2014-05-19 (public-csv-wg@w3.org from May 2014)

From: Andy Seaborne <andy@apache.org>
Date: Mon, 19 May 2014 14:24:12 +0100
To: public-csv-wg@w3.org
Message-ID: <537A05FC.2070909@apache.org>
On 18/05/14 17:59, Ivan Herman wrote:
>
> Andy, Gregg & all
>
> on the call on Wednesday I suggested that, by putting the general
> description of the conversion into (for now) the metadata document, it
> may be necessary to restructure the current CSV2RDF document. I tried to
> draft what a structure & algorithm would look like; I give you here what
> I jotted down.

Thank you.

> Note that I rely on the fact that the template part would
> migrate as a general mechanism somewhere; there seems to be an agreement
> on this on the mailing list. I refer to it as a 'template' attribute in
> the metadata.
>
> I think the changes should be in section 3 (see below). Following the
> flow in the metadata document I put the 'table level metadata' into a
> subsection of section 3, meaning section 4 in the current document can
> disappear. I would remove section 5, because that should become a more
> general topic, not bound to RDF; the minimal mapping is also part of
> section 3. I am not sure about current section 7 (I think that should
> move elsewhere). Note also that, I believe, this skeleton may be similar
> for XML and JSON, but I did not check that.
>
> I believe that the three (or more, eventually, with JSON and XML)
> relevant documents (syntax, metadata, and conversions) should be in
> synchrony, and this before the next publication round...
>
> With that, this is what I had in mind:
>
> [[[
> 3. Processing Model
> 3.1 Conversion of core tabular data, or data annotated with embedded
> metadata only
>
> The file's URI is also used as a 'namespace' for URI-s in the generated
> triples, by concatenating the URI with '#' and with the string for a
> column name (denoted by ':name' in what follows)
>
> - this case either yields a header for each column; if not, :col1,
> :col2, :col3, ... are defined
> - the generation is done by
>   - each row has the same subject, a new bnode (Bi)
>   - each cell generates the triple (Bi :headerj "content of cell j")

with number-like fields being numbers? (to follow what spreadsheets do)
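
To make the 3.1 mapping concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the function names, the tuple representation of triples and the "_:bN" blank node labels are mine, not from any draft, and the number detection simply leans on Python's own int/float parsing.

```python
import csv
import io


def to_value(cell):
    """Return an int or float when the cell looks numeric, else the string.

    Assumption: Python's int()/float() decide what is number-like. They also
    accept strings such as "nan" or "1_000", so a spec would need its own rule.
    """
    for cast in (int, float):
        try:
            return cast(cell)
        except ValueError:
            pass
    return cell


def minimal_triples(csv_text, file_uri, has_header=True):
    """Return (subject, predicate, object) tuples for the minimal mapping."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    if has_header:
        headers, body = rows[0], rows[1:]
    else:
        # No header row: fall back to col1, col2, col3, ...
        headers = [f"col{i}" for i in range(1, len(rows[0]) + 1)]
        body = rows
    triples = []
    for i, row in enumerate(body, start=1):
        subject = f"_:b{i}"  # one new blank node per row
        for header, cell in zip(headers, row):
            predicate = f"{file_uri}#{header}"  # :name is URIOFCSVFILE#name
            triples.append((subject, predicate, to_value(cell)))
    return triples
```

For "name,age" / "Ada,36" this gives two triples on the same blank node, with 36 as a number rather than a string.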

>
> Where :xxx means URIOFCSVFILE#xxx
>
> 3.2 Conversion of annotated tabular data
>
> 3.2.1 Table level metadata
> The conversion uses the entries defined in section 3.1 of the metadata
> document to generate table level metadata triples as follows:
>
> - @id is used as a subject for all table level metadata
> - @type generates a (@id rdf:type @type) triple
> - the fields defined by DC-TERMS are used directly, with @id as subject
>
> The @id is also used as a 'namespace' for URI-s in the generated
> triples, by concatenating the URI with '#' and with the string for a
> column name (denoted by ':name' in what follows)
>
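
A rough sketch of the table level step, assuming the metadata arrives as a plain dict. Which DC-TERMS fields the metadata document actually allows is not settled here, so the tuple of field names below is an assumption, as are the function names.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"


def table_level_triples(metadata, dcterms_fields=("title", "description", "license")):
    """Return table level triples, all with @id as the subject.

    Assumption: dcterms_fields is an illustrative selection, not the list
    from the metadata document.
    """
    subject = metadata["@id"]
    triples = []
    if "@type" in metadata:
        triples.append((subject, RDF_TYPE, metadata["@type"]))
    for field in dcterms_fields:
        if field in metadata:
            # DC-TERMS fields are used directly as predicates
            triples.append((subject, DCTERMS + field, metadata[field]))
    return triples


def column_uri(metadata, name):
    """Use @id as the 'namespace': @id + '#' + column name."""
    return f'{metadata["@id"]}#{name}'
```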
> 3.2.2 Field level metadata
> The conversion uses the entries defined in section 3.2 of the metadata
> document to generate table level metadata triples using the steps below.

For me, a template (in RDF conversion) is the template for one complete 
rows-worth of conversion.

I was thinking that if no template were explicitly given, the metadata 
would be used to define a template and the template be applied.  We 
could have descriptive text about what happens when there is no 
user-defined template.  Your outline seems to define the templateless 
process separately from the template-driven one.

Generating a template, if none is provided, would keep the user-template 
driven mechanism and the metadata-generated template mechanism in step. 
It would be clear that they aren't alternatives, with (potentially) 
capabilities in the direct route that are not in the template route.  You 
could get the generated template and tweak it, for example.

The part in common is escaping syntax.  Building up URIs from fields in 
a row may involve URI query strings, URI path segments etc and these 
have slightly different rules for conversion from a character string to 
the URI form (e.g. spaces, use of ?, & and /).
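
To illustrate the difference, a small sketch using Python's standard urllib.parse. The helper names and the example.org URI are made up; quote() and urlencode() are the real library calls.

```python
from urllib.parse import quote, urlencode


def path_segment(value):
    """Escape a cell value for use as one path segment.

    safe="" means "/" is escaped too, so the value cannot add extra segments.
    """
    return quote(value, safe="")


def build_uri(base, segments, params=None):
    """Build a URI from path segments and optional query parameters."""
    uri = base.rstrip("/") + "/" + "/".join(path_segment(s) for s in segments)
    if params:
        # urlencode uses form-style quoting: space becomes "+", "&" becomes %26
        uri += "?" + urlencode(params)
    return uri
```

The same string "a b" comes out as "a%20b" in a path segment but "a+b" in the query string, which is the kind of difference a template mechanism has to let the user control.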

> The processing is based on general metadata attributes as defined in
> that section; this specification adds one field level attribute:
> 'rdf_predicate_type', which can take the value of 'object' or 'literal'.
> - each row generates a number of triples with a common subject. This
> subject is
>   - a new blank node for each row if no primaryKey attribute is defined, or
>   - :field1-field2-...-fieldn, where fieldi are the (column) names
> appearing in the value of the primaryKey attribute if that attribute
> contains a list of names, or
>   - :field, where field is the column name appearing as the value of the
> primaryKey attribute
> - for each cell in the field _that is not a primary Key_, the following
> triple is generated
>   - subject is the subject defined for the row
>   - predicate is :name, where 'name' is the value of the 'name'
> attribute in the field descriptor (3.2.2 in the metadata spec)
>   - object:
>     - if the column is defined as a foreign key through the
> 'foreignKeys' attribute, the object is an RDF URI Resource as defined by
> the foreign Key reference (3.2.7 in the metadata spec)

I think the term 'foreign key' brings a lot of baggage with it such as 
foreign key constraints, and guarantees, especially any assumption about 
whether the link target exists or not.

I'd rather just talk about generating URIs as one "type", and reserve 
'foreign key' for the case of a link within a group of tables converted 
together, or associated in some way where a foreign key is highly likely 
to mean the target of the link exists.

>       - otherwise, if a 'type' attribute is defined (3.2.4) then the
> cell is converted into that type of literal (in case of date, this may
> also use the 'format' attribute)
>       - otherwise, if a 'template' attribute is defined, then the
> template is used to generate a value; if the value of rdf_predicate_type
> is missing or is set to 'literal', the object is an RDF literal of type
> xsd:string; otherwise, the object is an RDF URI Resource.

So templates for you are templating individual RDF objects?  Jeremy's 
conversion example has a specific shape of RDF per row.

>       - otherwise, the value of the cell is used as an RDF Literal of type
> xsd:string
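
The subject and object rules above could be sketched as follows. Several things are my assumptions rather than anything in the outline: the primary key subject is built from the cell values of the key columns (the column names alone would give every row the same subject), 'foreignKeys' is modelled as a plain target base URI, 'type' names are mapped straight onto XSD, and templates use {column} placeholders.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def row_subject(row, row_number, base, primary_key=None):
    """Return the common subject for all triples of one row."""
    if not primary_key:
        return ("bnode", f"b{row_number}")  # new blank node per row
    names = [primary_key] if isinstance(primary_key, str) else list(primary_key)
    # Assumption: join the key columns' cell values with "-"
    return ("uri", base + "#" + "-".join(row[n] for n in names))


def cell_object(value, field, row):
    """Pick the object for one cell, trying the rules in the order listed."""
    if "foreignKeys" in field:
        # Assumption: the attribute holds the base URI of the link target
        return ("uri", field["foreignKeys"] + "#" + value)
    if "type" in field:
        # Assumption: metadata type names coincide with XSD local names
        return ("literal", value, XSD + field["type"])
    if "template" in field:
        rendered = field["template"].format(**row)
        if field.get("rdf_predicate_type", "literal") == "literal":
            return ("literal", rendered, XSD + "string")
        return ("uri", rendered)
    return ("literal", value, XSD + "string")
```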
>
>
> Some open issues:
> - do we need to add a (rowid csv:row "rownumber") kind of triple for
> each row; (probably yes)
> - do we need to add a series of triples of the sort (@id csv:rows rowid)
> for each row, to make a "bridge" between the graph overall and its
> constituents? (It may not be all that important for RDF, but it may be
> necessary for JSON.)

Agreed - we probably want to define triples generated that tie the RDF 
back to the CSV input.  Probably an "optional extra", as for a large CSV 
file there would be a significant increase in size of output.
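
A sketch of how the optional link-back triples might look, to show the cost. The csv: namespace URI below is a placeholder, and the function name and "include" switch are hypothetical.

```python
CSV_NS = "http://example.org/csv#"  # placeholder; no csv: namespace is defined yet


def link_back_triples(table_id, row_subjects, include=True):
    """Return triples tying each row subject back to its CSV row and table.

    Emits two extra triples per row, which is why it is switchable.
    """
    if not include:
        return []
    triples = []
    for n, subject in enumerate(row_subjects, start=1):
        triples.append((subject, CSV_NS + "row", n))          # (rowid csv:row n)
        triples.append((table_id, CSV_NS + "rows", subject))  # (@id csv:rows rowid)
    return triples
```

For a narrow file with, say, three columns, two extra triples per row is a large fraction of the output.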

> ]]]
>
> Does this make sense?
>
> I may try to find some time editing the document, but would be good to
> have a minimal agreement from the group.
>
> Ivan

 Andy

>
>
> ----
> Ivan Herman, W3C
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153 <tel:+31-641044153>
> GPG: 0x343F1A3D
> WebID: http://www.ivan-herman.net/foaf#me
>
Received on Monday, 19 May 2014 13:24:43 UTC