Re: Template as mechanism for CSV conversion.

Hi. What we've done is use XSLT, which is mostly suitable for XML 
output, but we were focusing on RDF output. I wrote a bit of software 
called Grinder (never name a tool without doing a bit of googling, eh?) 
which takes tabular data with a heading row and outputs XML suitable for 
passing to XSLT.
e.g.
colour, age, top speed, id
blue,10,100,1
green,2,80,2

becomes
<rows>
   <row>
      <colour>blue</colour>
      <age>10</age>
      <topSpeed>100</topSpeed>
      <id>1</id>
   </row>
   <row>
      <colour>green</colour>
      <age>2</age>
      <topSpeed>80</topSpeed>
      <id>2</id>
   </row>
</rows>
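
To make the conversion concrete, here is a rough sketch in Python. It is not Grinder's actual code; the function names (camel_case, table_to_xml) are made up for illustration, and it assumes headings reduce to valid XML element names (a heading starting with a digit would not).

```python
import csv
import io
import re
import xml.etree.ElementTree as ET


def camel_case(heading):
    """Turn a heading such as 'top speed' into an element name like 'topSpeed'."""
    words = re.findall(r"[A-Za-z0-9]+", heading)
    if not words:
        return "field"  # hypothetical fallback name for an empty heading
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


def table_to_xml(text):
    """Read CSV text with a heading row and return a <rows> element."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    headings = [camel_case(h) for h in next(reader)]
    rows = ET.Element("rows")
    for record in reader:
        if not any(cell.strip() for cell in record):
            continue  # ignore completely empty lines
        row = ET.SubElement(rows, "row")
        for name, value in zip(headings, record):
            ET.SubElement(row, name).text = value.strip()
    return rows


sample = "colour, age, top speed, id\nblue,10,100,1\ngreen,2,80,2\n"
xml_text = ET.tostring(table_to_xml(sample), encoding="unicode")
print(xml_text)
```

The output is unindented, but otherwise has the same shape as the XML above and can be handed to an XSLT processor.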

We found a number of cases where it was useful to add more processing:
- multiple values in a field, separated with a delimiter -- this is
quite common in tabular data and allows more complex data without using
a separate table, e.g.
building,occupants
1,"chemistry,physics"
2, physics
- skipping blank leading rows and columns
- carrying a value down to the next row if the cell below is blank. Some
reporting software only outputs an ID on the first row of a block.
- various ways to clean up values that were tricky in XSLT: md5,
sha1, camelcaps (to make a value suitable for being part of a URI)
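
As a hedged illustration of the processing steps listed above, here is how they might look in Python. Again, this is a sketch and not how Grinder implements them; all the function names are invented for the example.

```python
import hashlib


def split_multi(value, delimiter=","):
    """Split a multi-valued cell into a list of trimmed, non-empty values."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]


def strip_blank_leading(table):
    """Drop leading blank rows, then leading columns that are blank in every row."""
    def blank(cell):
        return cell.strip() == ""

    rows = [list(r) for r in table]
    while rows and all(blank(c) for c in rows[0]):
        rows.pop(0)
    # The first remaining row has a non-blank cell, so this loop terminates.
    while rows and all(not r or blank(r[0]) for r in rows):
        rows = [r[1:] for r in rows]
    return rows


def fill_down(rows, column):
    """Copy the last non-blank value in `column` into the blank cells below it."""
    last = ""
    filled = []
    for row in rows:
        row = list(row)
        if row[column].strip():
            last = row[column]
        else:
            row[column] = last
        filled.append(row)
    return filled


def md5_hex(value):
    """Hex MD5 digest of a cell value, e.g. for building an opaque identifier."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def sha1_hex(value):
    """Hex SHA-1 digest of a cell value."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


print(split_multi("chemistry,physics"))
print(fill_down([["1", "a"], ["", "b"], ["2", "c"]], 0))
```

Doing these steps before the XSLT stage keeps the stylesheet itself down to plain mapping.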

For a complex example see 
http://data.southampton.ac.uk/dumps/catering/2014-05-15/catering.cfg and 
http://data.southampton.ac.uk/dumps/catering/2014-05-15/openorg-pos.xsl
More details here: 
https://github.com/cgutteridge/Grinder/blob/master/bin/grinder#L91

I am not recommending XSLT, but we've solved a bunch of real world 
CSV=>RDF cases so it may give you some ideas. It does have the merit of 
being a standard, at least.

What we missed most was regular expression search and replace, which
would have massively reduced the complexity.
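
For a sense of what that would have saved, this is the kind of cleanup that is a couple of lines with regex search and replace. It is a Python sketch with a made-up function name and a made-up input value, not something taken from our stylesheets.

```python
import re


def clean_value(value):
    """Collapse whitespace, then replace runs of URI-unfriendly characters with '-'."""
    value = re.sub(r"\s+", " ", value.strip())
    return re.sub(r"[^A-Za-z0-9_-]+", "-", value).strip("-")


# Hypothetical cell value, purely for illustration.
print(clean_value("  Building 32 (East Wing) "))  # Building-32-East-Wing
```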

I recently did a templating project using Mustache, and we had to keep
tweaking our data structures so that it could handle them, which was
not ideal.


On 14/05/2014 19:15, Jeni Tennison wrote:
> Thanks Andy,
>
> I think it makes a lot of sense to have a general purpose template for mapping CSV to other formats (eg YAML, HTML). Open Refine does something similar as described here [1], which enables you to define:
>
>    * a prefix
>    * a template for each row
>    * a row separator
>    * a suffix
>
> What about using an existing templating system such as Mustache [2] which has the advantage of being implemented across lots of programming languages? Then you only have to define how the variables that get passed into the template get set up, not the syntax. (I’m not fixated on Mustache — I’d much prefer something more standard — it’s just that I’d really prefer not to have this Working Group invent a new syntax for templates.)
>
> I have three areas of concerns which mostly relate to the limited flexibility that something like Mustache gives you:
>
> 1. In all the real-life conversions I’ve ever done I’ve always ended up needing conditional statements of some sort. Which means having some kind of logical statements, which means adopting a particular programming language to express them in.
>
> 2. In all the real-life conversions I’ve ever done I’ve always ended up needing to process individual values in some way (ie some level of string parsing), which means defining functions.
>
> 3. In all the real-life conversions I’ve ever done that have involved text-based templating languages that need to produce something with a defined structured syntax I’ve always gotten it wrong and produced non-well-formed/valid output.
>
> All of which means that while I’m sure that templating is a useful thing to provide for general-purpose conversions, I still think there’s a need for more general purpose languages to “bug out” to. And I’m not 100% convinced (but we’ll only see by doing) that it will be possible to define useful conversions to other formats using templates. For example, just naming things like elements/attributes in XML and things like properties in JSON will require different approaches, I think, that it will be hard to express in generic templates.
>
> Cheers,
>
> Jeni
>
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Exporters
> [2] http://mustache.github.io/
>
> ------------------------------------------------------
> From: Andy Seaborne andy@apache.org
> Reply: Andy Seaborne andy@apache.org
> Date: 14 May 2014 at 18:16:10
> To: CSV on the Web Working Group public-csv-wg@w3.org
> Subject:  Template as mechanism for CSV conversion.
>
>> (from the telecon - JeniT asked for this to be made more visible on the
>> list)
>>   
>> Gregg has suggested that if all the conversions are based around the
>> template mechanism, then there could be one conversions document for all
>> of RDF, JSON and XML.
>>   
>> That makes sense to me although I also think that someone arrives at the
>> doc wanting, say, the details of JSON conversion, having them all in one
>> place makes for a less focused document.
>>   
>> e.g. RDF:
>> http://w3c.github.io/csvw/csv2rdf/#graph-template
>>   
>> The templating mechanism is text-based and does not require parsing of
>> some variant of the output syntax ("variant" because of the need for
>> template slots). A processor may provide additional validation of the
>> output but, at a minimum, it can generate output just by text processing
>> (and potentially get illegal syntax due to the lightweight nature of the
>> process).
>>   
>> A starting point for templates is URI Templates
>>   
>> http://tools.ietf.org/html/rfc6570
>>   
>> although there needs to be escaping per syntax support.
>>   
>> (*nix) Shell parameter expansion is a similar mechanism.
>>   
>> http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion
>> (not the array bits)
>>   
>> ${parameter/pattern/string} is a regex replace, for example.
>>   
>> Andy
>>   
>>   
>>   
> --
> Jeni Tennison
> http://www.jenitennison.com/
>

-- 
Christopher Gutteridge -- http://users.ecs.soton.ac.uk/cjg

University of Southampton Open Data Service: http://data.southampton.ac.uk/
You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/

Received on Thursday, 15 May 2014 07:23:05 UTC