Re: A draft outline for the CSV2RDF document from Gregg Kellogg on 2014-05-21 (public-csv-wg@w3.org from May 2014)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Wed, 21 May 2014 16:48:17 -0700
To: Ivan Herman <ivan@w3.org>
Cc: Andy Seaborne <andy@apache.org>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <752B9D00-0B20-4870-84E2-856D60FCCE4F@greggkellogg.net>
On May 19, 2014, at 10:09 AM, Ivan Herman <ivan@w3.org> wrote:

> Ok, now I understand the difference, thanks. Indeed, I use templates for one term; again, just as R2RML does.
> 
> I am a little bit afraid of the potential complexity of that approach. The one-term-template is pretty straightforward both for the implementation and the user, is syntax independent and can be easily re-used for XML or JSON, too. The per-row-template seems to be syntax dependent and more complex though, clearly, much more powerful. I have to think about it...

I think it's really pretty simple; I implemented something similar for another project I'm doing. In Ruby, it takes advantage of the ability to use "gsub" and pass it a block:

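    csv.each do |line|
      # Match each double-quoted string that contains an opening brace
      result = csvm.gsub(/"[^"]*\{[^"]*"/) { |match|
        # Replace each {field} reference inside that string
        match.gsub(/\{[^\}]*\}/) { |field_ref|
          ...
        }
      }
    end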

In this case, because JSON uses braces in its basic syntax, I look for braces contained within double quotes; the examples Andy and I use for Turtle are consistent with this approach.

For the non-Ruby literate, it basically says: match any double-quoted string that contains an opening curly brace ("{") and replace it with the result of the block/callback. The inner gsub then looks within that string for field references such as {...}. Note that the field reference may contain some RFC6570 processing elements in addition to the variable/column name, but these should only be performed if we've determined that the column type is IRI.
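To make the substitution step concrete, here is a minimal, self-contained sketch of the same idea. It is only an illustration under stated assumptions: the method name expand_row, the TEMPLATE constant, and the sample CSV are hypothetical, unknown field names are left untouched, and no RFC6570 processing or escaping of values is attempted.

```ruby
require "csv"

# Hypothetical row template. Field references such as {foo} appear only
# inside double-quoted strings, as described above.
TEMPLATE = <<~TURTLE
  [ :foo "{foo}" ;
    :bar "{bar}" ] .
TURTLE

# Expand every {field} reference found inside a double-quoted string,
# looking the value up in `row` (a Hash or a CSV::Row keyed by column name).
def expand_row(template, row)
  template.gsub(/"[^"]*\{[^"]*"/) do |quoted|
    quoted.gsub(/\{[^}]*\}/) do |field_ref|
      name = field_ref[1...-1] # strip the surrounding braces
      value = row[name]
      # Assumption: an unknown field is left as-is rather than raising.
      value.nil? ? field_ref : value.to_s
    end
  end
end

# Hypothetical sample data with a header row.
csv_text = "foo,bar\n1,two\n"
results = CSV.parse(csv_text, headers: true).map { |row| expand_row(TEMPLATE, row) }
puts results
```

Because only braces inside double quotes are considered, the braces that belong to JSON syntax itself are not treated as field references.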

Gregg

> Ivan
> 
> 
> 
> On 19 May 2014, at 18:16 , Andy Seaborne <andy@apache.org> wrote:
> 
>> On 19/05/14 15:23, Ivan Herman wrote:
>>> Let me try to see if I understand what you mean...
>>> 
>>> If there is no metadata assigned to the data then (at least conceptually) we say that we generate a metadata of, roughly, the form:
>>> 
>>> {
>>>   "@id" : "URI OF THE DATA",
>>>   "columns" : [{
>>>     "name" : "col1",
>>>     "template" : "{col1}"
>>>   },{
>>>     "name" : "col2",
>>>     "template" : "{col2}"
>>>   }]
>>> }
>> 
>> Where we seem to differ is "template" - that's a template for one term (the object of a triple).
>> 
>> The template I have in mind is a complete row:
>> 
>> Taking from:
>> 
>> https://github.com/w3c/csvw/blob/gh-pages/examples/simple-weather-observation.md
>> 
>> Date-time, Air temperature (Cel), Dew-point temperature (Cel)
>> 2013-12-13T08:00:00Z, 	11.2, 	10.2
>> 
>> 
>> <site/22580943/date-time/20131213T0800Z>
>>   a ssn:Observation ;
>>   ssn:observationSamplingTime
>>       [ time:inXSDDateTime "2013-12-13T08:00:00Z"^^xsd:dateTime ] ;
>>   ssn:observationResult [
>>       a ssn:SensorOutput ;
>>       def-op:airTemperature_C
>>            [ qudt:numericValue "11.2"^^xsd:double ] ;
>>       def-op:dewPointTemperature_C
>>           [ qudt:numericValue "10.2"^^xsd:double ] ] .
>> 
>> That could be created with a template like:
>> 
>> ----------------------------------------------
>> Columns:
>> 
>> "columns" : [{
>>     "name" : "date-time"
>>    },{
>>     "name" : "air-temperature"
>>    },{
>>     "name" : "dew-point"
>>  }]
>> 
>> 
>> ----------------------------------------------
>> <site/22580943/date-time/{date-time}>
>>   a ssn:Observation ;
>>   ssn:observationSamplingTime
>>       [ time:inXSDDateTime "{date-time}"^^xsd:dateTime ] ;
>>   ssn:observationResult [
>>       a ssn:SensorOutput ;
>>       def-op:airTemperature_C
>>            [ qudt:numericValue "{air-temperature}"^^xsd:double ] ;
>>       def-op:dewPointTemperature_C
>>           [ qudt:numericValue "{dew-point}"^^xsd:double ] ] .
>> ----------------------------------------------
>> 
>> skipping over the conversion of 2013-12-13T08:00:00Z to 20131213T0800Z
>> 
>> 	Andy
>> 
>>> 
>>> And, by doing that, we have only one generation algorithm instead of two branches like in my document now.
>>> 
>>> Yes, this works, I guess. It certainly makes the specification simpler and avoids getting out of sync. I am slightly worried that the end-user would be a bit screwed up, but that may have to go into a separate, tutorial-like text. So it may be worth doing it indeed...
>>> 
>>> (Would need a rewrite of the text I produced, but that is probably relatively easy; just that I would not do it today or tomorrow...)
>>> 
>>> Ivan
>>> 
>>> 
>>> 
>>> On 19 May 2014, at 16:14 , Andy Seaborne <andy@apache.org> wrote:
>>> 
>>>> On 19/05/14 15:00, Ivan Herman wrote:
>>>>>>> Generating a template, if none provided, would keep the user-template driven mechanism and metadata-generated template mechanism in step.  It would be clear that they aren't alternatives with (potentially) capabilities in the direct route not in the template route.  You could get the generated template and tweak it, for example.
>>>>>>> 
>>>>> I would need an example to understand what you mean...
>>>>> 
>>>> 
>>>> If the columns are "foo" and "bar" and no template is in the metadata then we define the process to be to create and use:
>>>> 
>>>> -------------------------
>>>> [
>>>>  :foo "{foo}" ;
>>>>  :bar "{bar}"
>>>> ] .
>>>> -------------------------
>>>> 
>>>> 	Andy
>>>> 
>>> 
>>> 
>>> ----
>>> Ivan Herman, W3C
>>> Digital Publishing Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> GPG: 0x343F1A3D
>>> WebID: http://www.ivan-herman.net/foaf#me
Received on Wednesday, 21 May 2014 23:48:53 UTC