Re: A draft outline for the CSV2RDF document from Ivan Herman on 2014-05-21 (public-csv-wg@w3.org from May 2014)

From: Ivan Herman <ivan@w3.org>
Date: Wed, 21 May 2014 12:35:37 +0200
To: Andy Seaborne <andy@apache.org>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <A291B85E-2ABD-478E-8E24-19FFD8028945@w3.org>
On 20 May 2014, at 23:00 , Andy Seaborne <andy@apache.org> wrote:

> On 20/05/14 11:59, Ivan Herman wrote:
>> 
>> On 20 May 2014, at 12:16 , Andy Seaborne <andy@apache.org> wrote:
>> 
>>> On 20/05/14 05:52, Ivan Herman wrote:
>>>> But also... If my application needs (forgive me:-) RDF/XML, but
>>>> the author of the metadata has put in the row-level template
>>>> using JSON-LD as a base syntax, then I need a JSON-LD parser to
>>>> make any sense of it, right? In other words, the field-level
>>>> template approach is RDF syntax independent. That seems to be
>>>> another major difference, too...
>>>> 
>>> 
>>> We're defining the correct output of a conversion process when the
>>> input is the metadata (without any user templates).  We aren't
>>> requiring the processor does exactly and only those steps.  It
>>> outputs whatever format(s) it supports.
>>> 
>>> Adding user templates is 'advanced' and if we want to allow
>>> control of the shape of the RDF emitted (c.f. Jeremy's example) we
>>> do need to have a language for describing shape. However, that's
>>> not the required mechanism for implementation of metadata\templates
>>> to RDF.
>>> 
>> 
>> I am still trying to turn my head around it; sorry if I am slow...
>> Is this so that (at least conceptually for the user):
>> 
>> - The 'field level templates', essentially as I described and used
>> in [1] can be used essentially as described there (what templates
>> exactly do is something that we still have to define, but I guess we
>> have an idea about a simple mechanism, like the one in R2RML)
>> - There is, _additionally_, the possibility to define a 'shape', ie, a
>> row level template; if present, that replaces the mechanism described
>> in [1]
> 
> Yes.
> 

Great! At least we have a common understanding:-)

> 'field level templates' has another, different dimension: {col} simply isn't enough to generate output (URI construction, transformation of values e.g. upper/lower case, trim, extracting part of a field, ... and all the ETL-like themes).

Yes, and I think I used the term 'template' in a kind of generic (and-to-be-defined) way. 'Transformation' may be a better term, and it may include some common features that are widely used and implemented:

- simple text replacement, like {...} for field names
- regular expression based replacement
- upper/lower case

In the metadata scheme one would probably have something like

"transformation" : [
   {
      "type"  : "template",
      "value" : "..."
   },
   {
      "type" : "regex",
      "value" :...
   }
]

and the execution would be serially done on the field.
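
To make "serially" concrete, here is a minimal sketch of how a processor might run such a list on one field, each step receiving the output of the previous one. Everything beyond "type" and the template "value" is an assumption made for illustration: the "pattern"/"replacement" keys of the regex step, the "upper"/"lower" step types, the {_} placeholder for the value produced so far, and the example.org URI are hypothetical and not part of any draft.

```python
import re


def apply_transformations(value, row, transformations):
    """Apply each transformation step to the field value, in list order."""
    for step in transformations:
        kind = step["type"]
        if kind == "template":
            # {name} is replaced by column `name` of the current row;
            # {_} (hypothetical) stands for the value produced so far.
            lookup = {**row, "_": value}
            value = re.sub(
                r"\{(\w+)\}",
                lambda m: lookup.get(m.group(1), m.group(0)),
                step["value"],
            )
        elif kind == "regex":
            # "pattern" and "replacement" are assumed key names.
            value = re.sub(step["pattern"], step["replacement"], value)
        elif kind == "upper":
            value = value.upper()
        elif kind == "lower":
            value = value.lower()
        else:
            raise ValueError(f"unknown transformation type: {kind!r}")
    return value


row = {"name": " Ivan Herman ", "country": "nl"}
steps = [
    {"type": "regex", "pattern": r"^\s+|\s+$", "replacement": ""},  # trim
    {"type": "regex", "pattern": r"\s+", "replacement": "-"},
    {"type": "lower"},
    {"type": "template", "value": "http://example.org/{country}/person/{_}"},
]
result = apply_transformations(row["name"], row, steps)
print(result)
```

Because the steps are applied in order, swapping the template and the lower-casing step would lower-case the whole URI rather than just the name, which is why the order in the metadata would matter.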


> 
> Templating for shape only uses field values (that needs to be tested - it might be insufficient).
> 
>> (Specification-wise, one can of course turn things upside down,
>> describe the 'shape' template mechanism and, if, for a specific
>> data, no shape is defined, one could virtually generate such a shape
>> from the metadata. But that is for specification writers and,
>> possibly, for implementers.)
> 
> That is what I am suggesting.
> 
> It means there is a smooth progression from simple to shape-based conversion.

Again, good we understand one another:-)

> 
>> 
>> I think that this, technically, works indeed. But I am not sold on
>> it...
>> 
>> - I have the impression that the generic shape mechanism is more
>> complicated to understand for a user and more complex to implement
> 
> ?? The user does not see it unless they want advanced translation that goes beyond what can be expressed in the basic field level conversion.

True.

> 
>> - Although I forgot to add this to [1] (and we were not sure whether
>> that should go into the metadata spec in the first place) we did say
>> that we can assign, say, an XSLT script for XML, or a SPARQL
>> CONSTRUCT pattern for RDF that would be executed on the result of the
>> RDF generation; such an extra step could take care of Jeremy's
>> example, right?
> 
> It is something that has been suggested but no one has worked through
> the details.
> 
> Certainly possible in XSLT, but SPARQL CONSTRUCT isn't as powerful as XSLT.  Gregg has made a suggestion for CSV-LD.  The XML publishing world commonly has XSLT.  Other communities don't necessarily have the same degree of conversion pipelines.

But all communities have something; at the minimum, one can refer back to a JavaScript or Python or whatever processing...

> 
> See Jeni's
> http://lists.w3.org/Archives/Public/public-csv-wg/2014May/0063.html
> want for conditionality and field level processing.
> 
> (where do you stand on that msg?)

It makes me scared. "In all the real-life conversions I’ve ever done I’ve always ended up needing conditional statements of some sort". Do we really want to go there?

For the RDF world, I do not see why plugging in either an http URI for a specific SPARQL engine call using CONSTRUCT, or a textual literal with SPARQL CONSTRUCT would not work to massage the output. After all, the SPIN people have already done things like that...

I am wary of going down the line of defining a complex pattern language. That is my problem. And Jeni's mail indicates that a simple replacement of {...} may not be enough. (Put it another way, even if we do use a template language, users will end up using SPARQL...)

> 
> If the output required is JSON-LD, I'd expect the CSV->JSON conversion would be a better starting point because it has control over the JSON.

This is a different issue, but I would hope that the RDF conversion and the JSON conversion would be in synchrony such that the difference between the two, when using JSON, is the presence or not of a @context. But Gregg should be the one telling us whether this is possible.
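
As an illustration of what "in synchrony" could mean, the sketch below builds the JSON output for one row and a JSON-LD output that is the same object plus an @context. It only shows the hoped-for relationship; whether a real conversion can be kept this simple is exactly the open question. The column names and the example.org vocabulary IRIs are hypothetical.

```python
def to_json_row(row):
    """Plain CSV -> JSON output for one row: column names become keys."""
    return dict(row)


def to_jsonld_row(row, context):
    """The same object, plus an @context mapping the keys to IRIs."""
    return {"@context": context, **row}


row = {"name": "Ivan", "country": "NL"}
# Hypothetical vocabulary, used only for this illustration.
context = {
    "name": "http://example.org/vocab#name",
    "country": "http://example.org/vocab#country",
}

plain = to_json_row(row)
linked = to_jsonld_row(row, context)

# Dropping @context from the JSON-LD output gives back the plain JSON output.
without_context = {k: v for k, v in linked.items() if k != "@context"}
```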


> 
>> It is, of course, a bit more complex to do this than
>> with shapes, but how frequently do I have to do this?
> 
> Having looked at all the conversions we (Epimorphics) have been involved in, the basic level of CSV -> simple RDF is not sufficient.   One conversion (LandRegistry, 400e6 triples) is actually SPARQL Update not Turtle.
> 

Showing that SPARQL works:-)


> Do we have a real example where simple is the required output? Jeremy's example needs reshaping.  Reshaping is putting knowledge/semantics/information into the data that wasn't completely there in the input.  A typical knowledge capture exercise.
> 
> A question I have is whether complete tables are the common case or whether there is commonly multi-row structure in tables, e.g. repeated fields or empty cells to represent a tree.
> 
> We need to ground out the requirements.

+1

> 
>> - I still do not see how you can get around the fact that the shape
>> is very language specific, ie, I am not sure how you would define
>> metadata that is RDF serialization syntax independent and, even more,
>> independent of whether the target is RDF, JSON, or XML (which works
>> much more easily with the scheme in [1])
> 
> RDF serialization syntax independence is your issue, not mine.
> 
> As far as I'm concerned, the metadata can provide a turtle template for Turtle.
> 
> If the output required is JSON-LD, I'd expect the CSV->JSON conversion would be a better starting point because it has control over JSON.
> 
> If RDF/XML is required, converting RDF formats isn't hard, at least not in that direction.  Managing the XML namespaces might mean that CSV to XML is a better route.
> 
> The weakness of the post-process argument is that if the conversion is so simple that it becomes a common need to reshape, then you are asking the end user to get involved with skills they may not have.  It's only half a standard from the consumer's POV.
> 

I do see that point. The question is whether the simple 'transformation' would be enough or not.

Ivan 

> 	Andy
> 
>> 
>> Cheers
>> 
>> Ivan
>> 
>> [1]
>> http://htmlpreview.github.io/?https://github.com/w3c/csvw/blob/rdfconversion-ivan/csv2rdf/index.html
>> 
>> 
>> 
>> 
>>> Andy
>>> 
>>>> Ivan
>> 
>> 
>> ---- Ivan Herman, W3C Digital Publishing Activity Lead Home:
>> http://www.w3.org/People/Ivan/ mobile: +31-641044153 GPG: 0x343F1A3D
>> WebID: http://www.ivan-herman.net/foaf#me


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me
Received on Wednesday, 21 May 2014 10:36:11 UTC