
Re: Architecture of mapping CSV to other formats

From: Ivan Herman <ivan@w3.org>
Date: Sun, 27 Apr 2014 06:50:44 +0200
Cc: Jeni Tennison <jeni@jenitennison.com>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <7B21B9A8-3042-4BB3-9DA7-BA153C4B8E4B@w3.org>
To: Gregg Kellogg <gregg@greggkellogg.net>

Hi Gregg,

let me attempt to answer you, thereby, hopefully, improving my own understanding :-)

On 26 Apr 2014, at 21:01 , Gregg Kellogg <gregg@greggkellogg.net> wrote:

> On Apr 23, 2014, at 12:13 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
>> Hi,
>> On the call today we discussed briefly the general architecture of mapping from CSV to other formats (eg RDF, JSON, XML, SQL), specifically where to draw the lines between what we specify and what is specified elsewhere.
>> To make this clear with an XML-based example, suppose that we have a CSV file like:
>> GID,On Street,Species,Trim Cycle,Inventory Date
>> 1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
>> 2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
>> 3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010 
>> This will have a basic mapping into XML which might look like:
>> <data>
>>  <row>
>>    <GID>1</GID>
>>    <On_Street>ADDISON AV</On_Street>
>>    <Species>Celtis australis</Species>
>>    <Trim_Cycle>Large Tree Routine Prune</Trim_Cycle>
>>    <Inventory_Date>10/18/2010</Inventory_Date>
>>  </row>
>>  ...
>> </data>
>> But the XML that someone actually wants the CSV to map into might be different:
>> <trees>
>>  <tree id="1" date="2010-10-18">
>>    <street>ADDISON AV</street>
>>    <species>Celtis australis</species>
>>    <trim>Large Tree Routine Prune</trim>
>>  </tree>
>>  ...
>> </trees>
>> There are (at least) four different ways of architecting this:
>> 1. We just specify the default mapping; people who want a more complex mapping can plug that into their own toolchains. The disadvantage of this is that it makes it harder for the original publisher to specify canonical mappings from CSV into other formats. It also requires people to know how to use a larger toolchain (but I think they probably have that anyway).
> I would say that we have two general deliverables: 1) a default mapping to selected formats, and 2) an informed mapping, so I think that this point is required to meet the requirement for a default mapping.

As far as I am concerned, what you call 'informed mapping' is alternative #3 below. Whether we provide that or not is the subject of this discussion!
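To make Jeni's 'default' mapping concrete, here is a minimal sketch of what a processor implementing alternative #1 might do. The function name and the rule of replacing spaces with underscores in element names are my own illustration, not anything the WG has specified:

```python
import csv
import io
import xml.etree.ElementTree as ET

def default_mapping(csv_text):
    """Map a CSV file to the 'basic' XML form: one <row> per record,
    one child element per column."""
    root = ET.Element("data")
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, "row")
        for name, value in record.items():
            # Spaces are not legal in XML names, so (as one possible
            # convention) replace them with underscores.
            ET.SubElement(row, name.replace(" ", "_")).text = value
    return ET.tostring(root, encoding="unicode")

csv_text = """GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
"""
xml_out = default_mapping(csv_text)
```

The point of the default mapping is exactly that it needs no metadata at all: everything is derived mechanically from the headers and cells.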

>> 2. We enable people to point from the metadata about the CSV file to an ‘executable’ file that defines the mapping (eg to an XSLT stylesheet or a SPARQL CONSTRUCT query or a Turtle template or a Javascript module) and define how that gets used to perform the mapping. This gives great flexibility but means that everyone needs to hand craft common patterns of mapping, such as of numeric or date formats into numbers or dates. It also means that processors have to support whatever executable syntax is defined for the different mappings.
> I'm not sure I'm with you on this, as I see templates as being part of the metadata, not an external executable. In an RDF variant, I think it makes sense that general metadata be mixed with template information used to do such a mapping. In JSON, I think this is pretty much the same thing, with general metadata perhaps contained in a JSON-LD context, and template information as the body of the JSON-LD. For XML, I think it would be natural to include the XSLT under an element in a metadata document also containing general metadata information.

My understanding is that what you call templates is #3 below. The essence of #2, to be more crisp about it, is that I may add, to the metadata, a reference to a Javascript file containing an executable that performs or completes the transformation of the content into another format. Some sort of abstract callback, thus. I realize Javascript may not be the right tool here, because we may want to make it language independent, although we may even decide to leave this completely undefined. Or provide a WebIDL as a callback possibility. SPARQL CONSTRUCT or XSLT are just other examples: those are all Turing machines in various disguises...

(A detail to be covered is whether that Turing machine should be invoked on the original data, thereby completely bypassing the default or the informed mapping, or *after* those. I am not sure.)
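As a hypothetical sketch of #2, here is how a processor might honour a metadata-level pointer to an external transform, including the open question above of whether it bypasses or follows the default mapping. Every property name here ("transformation", "script", "applies") is invented for illustration; nothing like it has been defined:

```python
# Hypothetical metadata fragment, as a Python dict for illustration.
metadata = {
    "url": "tree-ops.csv",
    "transformation": {             # invented extension point
        "script": "tree-ops.xslt",  # e.g. XSLT, SPARQL CONSTRUCT, Javascript
        "applies": "after-default", # or "instead-of-default": the open
                                    # question of bypassing vs. post-processing
    },
}

def choose_pipeline(metadata):
    """Decide whether an external transform replaces the default mapping
    or runs on its output."""
    t = metadata.get("transformation")
    if t is None:
        return ["default-mapping"]
    if t.get("applies") == "instead-of-default":
        return [t["script"]]
    return ["default-mapping", t["script"]]
```

Whatever the answer, the processor itself stays small: the Turing machine lives entirely in the referenced script.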

>> 3. We provide specific declarative metadata vocabulary fields that enable configuration of the mapping. For example, each column might have an associated ‘xml-name’ and ‘xml-type’ (element or attribute), as well as (more usefully across all mappings) ‘datatype’ and ‘date-format’. This gives a fair amount of control within a single file.
> I see this as a solution between points #1 and #2. In general, I don't think it provides enough beyond a default mapping to be too interesting, but might fall out of a case in which metadata is defined but no template is provided.
> From the JSON-LD/CSV-LD perspective, this could be done by associating terms with the column names and providing @type and @container information along with that term, which could help in adapting field values into an otherwise default mapping.

As I said, I think I disagree on the formulation, and I have the impression that we all wildly agree on the essence. #3 is the 'informed' mapping, which may be as poor as providing @type only, or as rich as covering what is currently in CSV2RDF.
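A hypothetical sketch of what such declarative, per-column configuration (#3) could look like in practice. The property names ('datatype', 'date-format') follow Jeni's examples, but the format codes are Python's strptime notation, my own substitution for whatever date-pattern syntax the vocabulary would actually use:

```python
from datetime import datetime

# Invented per-column annotations in the spirit of option #3.
columns = {
    "GID":            {"datatype": "integer"},
    "Inventory Date": {"datatype": "date", "date-format": "%m/%d/%Y"},
}

def convert(column, value):
    """Apply a column's declared datatype when mapping a cell;
    undeclared columns pass through as plain strings."""
    meta = columns.get(column, {})
    if meta.get("datatype") == "integer":
        return int(value)
    if meta.get("datatype") == "date":
        return datetime.strptime(value, meta["date-format"]).date().isoformat()
    return value
```

This is exactly the kind of reusable, format-independent configuration (numbers, dates) that Jeni flags as useful across validation and documentation too, not only mapping.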

>> 4. We have some combination of #2 & #3 whereby some things are configurable declaratively in the metadata file, but there’s an “escape hatch” of referencing out to an executable file that can override. The question is then about where the lines should be drawn: how much should be in the metadata vocabulary (3) and how much left to specific configuration (2).
>> My inclination is to aim for #4. I also think we should try to reuse existing mechanisms for the mapping as much as possible, and try to focus initially on metadata vocabulary fields that are useful across use cases (ie not just mapping to different formats but also in validation and documentation of CSVs).
> Other than not necessarily buying into using an external executable format, I agree that #4 is the way to go. I think a template-based approach, such as is outlined for both CSV-LD and CSV2RDF is pretty generally applicable, and with some restrictions, could be considered an entirely textual process, where you don't really even need to know the details for the format it's used in. For example, a template file could be treated as text with regular expression substitution of embedded {} sequences expanded simply using RFC-6570 rules. Other than for a potential envelope surrounding the result of expanding each template, a processor wouldn't need to know if it's dealing with Turtle, JSON or XML.

Yes, this is what I was asking on our last call: is this really true? I believe there may be target format specificities, though (choice of attribute vs. element for XML, definition of @type for RDF, etc.). The one which is probably the simplest is pure JSON (and we may want to start with that one first, working out the details, to see how far it goes!)
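Gregg's purely textual expansion can be sketched as follows, assuming only RFC 6570-style simple string substitution of {} references (the regex-based implementation is my own simplification, not a full RFC 6570 processor):

```python
import re

def expand(template, row):
    """Purely textual expansion of {name} references: the processor
    needs no knowledge of whether the template is Turtle, JSON or XML."""
    return re.sub(r"\{([^{}]+)\}", lambda m: row[m.group(1)], template)

row = {"GID": "1", "Species": "Celtis australis"}

# The same engine serves two target formats, treated as opaque text:
turtle = expand('<#tree-{GID}> ex:species "{Species}" .', row)
xml    = expand('<tree id="{GID}"><species>{Species}</species></tree>', row)
```

Which also illustrates the limit Gregg notes just below: the moment the target format imposes its own escaping or structural rules (sub-fields, attribute vs. element), pure text substitution stops being enough.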

> Where this breaks down is with format-specific rules that mostly come in when a field has sub-fields.

And that may be, for me, a case when an external Turing machine would come into play as far as this WG is concerned...


> Gregg
>> What do other people think?
>> Jeni
>> --  
>> Jeni Tennison
>> http://www.jenitennison.com/

Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
FOAF: http://www.ivan-herman.net/foaf

Received on Sunday, 27 April 2014 04:51:15 UTC
