Re: Architecture of mapping CSV to other formats from Ivan Herman on 2014-04-24 (public-csv-wg@w3.org from April 2014)

From: Ivan Herman <ivan@w3.org>
Date: Thu, 24 Apr 2014 13:14:18 +0200
To: Innovimax W3C <innovimax+w3c@gmail.com>
Cc: Jeni Tennison <jeni@jenitennison.com>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <D4910143-91D4-45C5-84CA-6BAF31DA10D0@w3.org>
I am not entirely sure that it is relevant. GRDDL is a way to associate an XSLT style sheet with an XML file in order to transform it into RDF. I.e., it is a tool (alas, hardly used in practice) for XML->RDF, which is not part of this charter...

Ivan

On 24 Apr 2014, at 12:52 , Innovimax W3C <innovimax+w3c@gmail.com> wrote:

> Dear all,
> 
> Just a side note perhaps, but we already have some existing material,
> namely GRDDL [1]
> 
> I was surprised that it was not mentioned in the charter
> 
> It would be good to keep GRDDL in mind when answering that
> question, in order to keep the link with existing W3C Specifications
> 
> Thanks
> 
> Mohamed
> 
> [1] http://www.w3.org/TR/grddl/
> 
> On Wed, Apr 23, 2014 at 9:13 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
>> Hi,
>> 
>> On the call today we discussed briefly the general architecture of mapping from CSV to other formats (eg RDF, JSON, XML, SQL), specifically where to draw the lines between what we specify and what is specified elsewhere.
>> 
>> To make this clear with an XML-based example, suppose that we have a CSV file like:
>> 
>> GID,On Street,Species,Trim Cycle,Inventory Date
>> 1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
>> 2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
>> 3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
>> 
>> This will have a basic mapping into XML which might look like:
>> 
>> <data>
>>  <row>
>>    <GID>1</GID>
>>    <On_Street>ADDISON AV</On_Street>
>>    <Species>Celtis australis</Species>
>>    <Trim_Cycle>Large Tree Routine Prune</Trim_Cycle>
>>    <Inventory_Date>10/18/2010</Inventory_Date>
>>  </row>
>>  ...
>> </data>
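
A minimal Python sketch of the kind of default mapping described above, using only the standard library. The function names and the rule for turning a column header into an element name (non-name characters become "_") are assumptions made for illustration; nothing here has been specified by the group.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

CSV_TEXT = """GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
"""


def element_name(header):
    # Assumed rule: replace each run of characters outside a conservative
    # XML-name subset with "_". Headers that start with a digit would still
    # need extra handling, which this sketch does not attempt.
    return re.sub(r"[^A-Za-z0-9_.-]+", "_", header.strip())


def default_mapping(csv_text):
    # One <row> per CSV record, one child element per column.
    root = ET.Element("data")
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, "row")
        for header, value in record.items():
            ET.SubElement(row, element_name(header)).text = value
    return root


xml_default = ET.tostring(default_mapping(CSV_TEXT), encoding="unicode")
print(xml_default)
```

Every value stays a plain string here (the date is still 10/18/2010), which is exactly the gap the more specific mapping below has to close.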
>> 
>> But the XML that someone actually wants the CSV to map into might be different:
>> 
>> <trees>
>>  <tree id="1" date="2010-10-18">
>>    <street>ADDISON AV</street>
>>    <species>Celtis australis</species>
>>    <trim>Large Tree Routine Prune</trim>
>>  </tree>
>>  ...
>> </trees>
>> 
>> There are (at least) four different ways of architecting this:
>> 
>> 1. We just specify the default mapping; people who want a more complex mapping can plug that into their own toolchains. The disadvantage of this is that it makes it harder for the original publisher to specify canonical mappings from CSV into other formats. It also requires people to know how to use a larger toolchain (but I think they probably have that anyway).
>> 
>> 2. We enable people to point from the metadata about the CSV file to an ‘executable’ file that defines the mapping (eg to an XSLT stylesheet or a SPARQL CONSTRUCT query or a Turtle template or a JavaScript module) and define how that gets used to perform the mapping. This gives great flexibility but means that everyone needs to hand-craft common patterns of mapping, such as of numeric or date formats into numbers or dates. It also means that processors have to support whatever executable syntax is defined for the different mappings.
>> 
>> 3. We provide specific declarative metadata vocabulary fields that enable configuration of the mapping. For example, each column might have an associated ‘xml-name’ and ‘xml-type’ (element or attribute), as well as (more usefully across all mappings) ‘datatype’ and ‘date-format’. This gives a fair amount of control within a single file.
>> 
>> 4. We have some combination of #2 & #3 whereby some things are configurable declaratively in the metadata file, but there’s an “escape hatch” of referencing out to an executable file that can override. The question is then about where the lines should be drawn: how much should be in the metadata vocabulary (3) and how much left to specific configuration (2).
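
To make options 3 and 4 concrete, here is a hedged Python sketch of a mapping driven by declarative per-column metadata, with an optional override callable standing in for the "escape hatch". The field names 'xml-name', 'xml-type', 'datatype' and 'date-format' come from the text above; everything else is an assumption for illustration: the COLUMNS dictionary layout, the use of strptime syntax for 'date-format', the root and row element names, and the override hook.

```python
import csv
import io
from datetime import datetime
import xml.etree.ElementTree as ET

CSV_TEXT = """GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
"""

# Hypothetical declarative metadata: one entry per CSV column.
COLUMNS = {
    "GID": {"xml-name": "id", "xml-type": "attribute"},
    "On Street": {"xml-name": "street", "xml-type": "element"},
    "Species": {"xml-name": "species", "xml-type": "element"},
    "Trim Cycle": {"xml-name": "trim", "xml-type": "element"},
    "Inventory Date": {
        "xml-name": "date",
        "xml-type": "attribute",
        "datatype": "date",
        "date-format": "%m/%d/%Y",  # assumed to use strptime syntax
    },
}


def convert(value, spec):
    # Normalise dates to ISO 8601; leave every other value untouched.
    if spec.get("datatype") == "date":
        return datetime.strptime(value, spec["date-format"]).date().isoformat()
    return value


def map_with_metadata(csv_text, columns, root_name, row_name, override=None):
    root = ET.Element(root_name)
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, row_name)
        for header, value in record.items():
            spec = columns[header]
            text = convert(value, spec)
            if spec["xml-type"] == "attribute":
                row.set(spec["xml-name"], text)
            else:
                ET.SubElement(row, spec["xml-name"]).text = text
    # Escape hatch (option 4): an optional callable may rewrite the tree.
    return override(root) if override is not None else root


xml_trees = map_with_metadata(CSV_TEXT, COLUMNS, "trees", "tree")
print(ET.tostring(xml_trees, encoding="unicode"))
```

The same 'datatype' and 'date-format' entries could also drive validation or documentation, which is the argument for keeping such fields format-neutral while fields like 'xml-name' stay specific to one mapping.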
>> 
>> My inclination is to aim for #4. I also think we should try to reuse existing mechanisms for the mapping as much as possible, and try to focus initially on metadata vocabulary fields that are useful across use cases (ie not just mapping to different formats but also in validation and documentation of CSVs).
>> 
>> What do other people think?
>> 
>> Jeni
>> --
>> Jeni Tennison
>> http://www.jenitennison.com/
>> 
> 
> 
> 
> -- 
> Innovimax SARL
> Consulting, Training & XML Development
> 9, impasse des Orteaux
> 75020 Paris
> Tel : +33 9 52 475787
> Fax : +33 1 4356 1746
> http://www.innovimax.fr
> RCS Paris 488.018.631
> SARL au capital de 10.000 €
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
FOAF: http://www.ivan-herman.net/foaf
Received on Thursday, 24 April 2014 11:14:47 UTC