Re: Architecture of mapping CSV to other formats from Alfredo Serafini on 2014-04-24 (public-csv-wg@w3.org from April 2014)

From: Alfredo Serafini <seralf@gmail.com>
Date: Thu, 24 Apr 2014 13:50:23 +0200
To: Innovimax W3C <innovimax+w3c@gmail.com>
Cc: Ivan Herman <ivan@w3.org>, Jeni Tennison <jeni@jenitennison.com>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-ID: <CADawF4MvexAV=F7ecRvfRNis4Q7eZeRFFh5DgUgF=x7ZUP-kfw@mail.gmail.com>
Hi

given a default mapping, I would use the combination of 3/4 to plug in
specific components designed for small changes (aggregating/disaggregating
fields, character normalization, and so on), or even for a completely new
mapping. This way the standard workflow will not break too much, and it
stays open to very different technologies.
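
To make the idea a bit more concrete, here is a rough sketch in Python of
what I have in mind. It is only an illustration under assumptions: the keys
'xml-name', 'xml-type' and 'date-format' are the ones Jeni mentioned, but
treating 'date-format' as a strptime pattern, the function names, and the
"override" hook used as the escape hatch are all hypothetical, not
something the group has defined.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET
from datetime import datetime


def default_name(header):
    """Default mapping: turn a CSV header into a usable XML name."""
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", header)
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name  # XML names cannot start with a digit, "." or "-"
    return name


def csv_to_xml(text, columns=None, root="data", row="row", override=None):
    """Map CSV text to an XML tree.

    columns  -- hypothetical declarative metadata, keyed by CSV header
    override -- hypothetical escape hatch: callable(record, row_element)
                that may adjust the element built by the default workflow
    """
    columns = columns or {}
    root_el = ET.Element(root)
    for record in csv.DictReader(io.StringIO(text)):
        row_el = ET.SubElement(root_el, row)
        for header, value in record.items():
            if value is None:  # short row: the cell is missing
                continue
            meta = columns.get(header, {})
            name = meta.get("xml-name", default_name(header))
            if value and "date-format" in meta:
                # Assumption: the format is a strptime pattern.
                parsed = datetime.strptime(value, meta["date-format"])
                value = parsed.date().isoformat()
            if meta.get("xml-type") == "attribute":
                row_el.set(name, value)
            else:
                ET.SubElement(row_el, name).text = value
        if override is not None:
            override(record, row_el)
    return root_el


SAMPLE = (
    "GID,On Street,Species,Trim Cycle,Inventory Date\n"
    "1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010\n"
)

# Declarative part (option 3): per-column configuration.
TREE_COLUMNS = {
    "GID": {"xml-name": "id", "xml-type": "attribute"},
    "On Street": {"xml-name": "street"},
    "Species": {"xml-name": "species"},
    "Trim Cycle": {"xml-name": "trim"},
    "Inventory Date": {
        "xml-name": "date",
        "xml-type": "attribute",
        "date-format": "%m/%d/%Y",
    },
}


def title_case_street(record, row_el):
    """Pluggable component (option 2): a small character normalization."""
    street = row_el.find("street")
    if street is not None and street.text:
        street.text = street.text.title()
```

Calling csv_to_xml(SAMPLE) alone gives the default <data>/<row> mapping,
while csv_to_xml(SAMPLE, TREE_COLUMNS, root="trees", row="tree",
override=title_case_street) gives something close to Jeni's <trees>
example, with the small normalization applied on top.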

For the XML part I strongly suggest avoiding too many specific
attributes while, on the other hand, fixing the ID as an attribute: this
could make it easy to obtain the back mapping with query languages like
XPath, for example.
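
As an illustration of the back mapping, here is a small Python sketch that
uses the limited XPath subset of the standard xml.etree.ElementTree
module. The document reuses the <trees> shape from Jeni's example; the
helper names find_row and row_to_csv_fields are only hypothetical, and the
date is left in its ISO form rather than converted back to the original
CSV notation.

```python
import xml.etree.ElementTree as ET

TREES_XML = """<trees>
  <tree id="1" date="2010-10-18">
    <street>ADDISON AV</street>
    <species>Celtis australis</species>
    <trim>Large Tree Routine Prune</trim>
  </tree>
  <tree id="2" date="2010-06-02">
    <street>EMERSON ST</street>
    <species>Liquidambar styraciflua</species>
    <trim>Large Tree Routine Prune</trim>
  </tree>
</trees>"""

ROOT = ET.fromstring(TREES_XML)


def find_row(root, gid):
    """Locate a row by its ID attribute with an XPath predicate."""
    # Illustration only: gid is assumed not to contain quote characters.
    return root.find(f"./tree[@id='{gid}']")


def row_to_csv_fields(tree_el):
    """Return the values in the original CSV column order."""
    return [
        tree_el.get("id"),
        tree_el.findtext("street"),
        tree_el.findtext("species"),
        tree_el.findtext("trim"),
        tree_el.get("date"),
    ]
```

Because the ID sits in a fixed attribute, the same predicate works no
matter how the other columns were mapped to elements or attributes.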


Alfredo




2014-04-24 13:42 GMT+02:00 Innovimax W3C <innovimax+w3c@gmail.com>:

> Then sorry
>
> I thought the question was about architecture
>
> Regards,
>
> Mohamed
>
> On Thu, Apr 24, 2014 at 1:40 PM, Ivan Herman <ivan@w3.org> wrote:
> > I still do not get it.
> >
> > GRDDL is a way to tell an XML (including XHTML) processor: "here is an
> XSLT file that you can use to transform this XML file into RDF".
> >
> > What we may provide is a reference to an XSLT file that may say "if the
> CSV file is transformed into XML, here is an XSLT file that you can use to
> massage the result to produce another XML file". There is no mention of RDF
> in there. So, while there is a vague resemblance to GRDDL, I think
> referring to GRDDL might only muddy the waters:-(
> >
> > Ivan
> >
> > On 24 Apr 2014, at 13:33 , Innovimax W3C <innovimax+w3c@gmail.com>
> wrote:
> >
> >> Sure!
> >>
> >> But the tool we will end up providing will be in the family of "**->
> >> RDF", which includes GRDDL.
> >> The same will apply if we do CSV -> XML: we will have to deal with the
> >> XSLT and XQuery Serialization spec, for example
> >>
> >> Regards,
> >>
> >> Mohamed
> >>
> >> On Thu, Apr 24, 2014 at 1:14 PM, Ivan Herman <ivan@w3.org> wrote:
> >>> I am not absolutely sure whether it is indeed relevant. GRDDL is a way
> to associate an XSLT style sheet to an XML file to transform it into RDF.
> Ie, it is a tool (alas! almost not in use in practice) for XML->RDF, which
> is not part of this charter...
> >>>
> >>> Ivan
> >>>
> >>> On 24 Apr 2014, at 12:52 , Innovimax W3C <innovimax+w3c@gmail.com>
> wrote:
> >>>
> >>>> Dear all,
> >>>>
> >>>> Just a side note perhaps, but we already have some existing material,
> >>>> namely GRDDL [1]
> >>>>
> >>>> I was surprised that it was not mentioned in the charter
> >>>>
> >>>> It would be good to keep GRDDL in mind with respect to answering that
> >>>> question in order to keep the link with existing W3C specifications
> >>>>
> >>>> Thanks
> >>>>
> >>>> Mohamed
> >>>>
> >>>> [1] http://www.w3.org/TR/grddl/
> >>>>
> >>>> On Wed, Apr 23, 2014 at 9:13 PM, Jeni Tennison <jeni@jenitennison.com>
> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On the call today we discussed briefly the general architecture of
> mapping from CSV to other formats (eg RDF, JSON, XML, SQL), specifically
> where to draw the lines between what we specify and what is specified
> elsewhere.
> >>>>>
> >>>>> To make this clear with an XML-based example, suppose that we have a
> CSV file like:
> >>>>>
> >>>>> GID,On Street,Species,Trim Cycle,Inventory Date
> >>>>> 1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
> >>>>> 2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine
> Prune,6/2/2010
> >>>>> 3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine
> Prune,6/2/2010
> >>>>>
> >>>>> This will have a basic mapping into XML which might look like:
> >>>>>
> >>>>> <data>
> >>>>> <row>
> >>>>>   <GID>1</GID>
> >>>>>   <On_Street>ADDISON AV</On_Street>
> >>>>>   <Species>Celtis australis</Species>
> >>>>>   <Trim_Cycle>Large Tree Routine Prune</Trim_Cycle>
> >>>>>   <Inventory_Date>10/18/2010</Inventory_Date>
> >>>>> </row>
> >>>>> ...
> >>>>> </data>
> >>>>>
> >>>>> But the XML that someone actually wants the CSV to map into might be
> different:
> >>>>>
> >>>>> <trees>
> >>>>> <tree id="1" date="2010-10-18">
> >>>>>   <street>ADDISON AV</street>
> >>>>>   <species>Celtis australis</species>
> >>>>>   <trim>Large Tree Routine Prune</trim>
> >>>>> </tree>
> >>>>> ...
> >>>>> </trees>
> >>>>>
> >>>>> There are (at least) four different ways of architecting this:
> >>>>>
> >>>>> 1. We just specify the default mapping; people who want a more
> complex mapping can plug that into their own toolchains. The disadvantage
> of this is that it makes it harder for the original publisher to specify
> canonical mappings from CSV into other formats. It also requires people to
> know how to use a larger toolchain (but I think they probably have that
> anyway).
> >>>>>
> >>>>> 2. We enable people to point from the metadata about the CSV file to
> an ‘executable’ file that defines the mapping (eg to an XSLT stylesheet or
> a SPARQL CONSTRUCT query or a Turtle template or a Javascript module) and
> define how that gets used to perform the mapping. This gives great
> flexibility but means that everyone needs to hand craft common patterns of
> mapping, such as of numeric or date formats into numbers or dates. It also
> means that processors have to support whatever executable syntax is defined
> for the different mappings.
> >>>>>
> >>>>> 3. We provide specific declarative metadata vocabulary fields that
> enable configuration of the mapping. For example, each column might have an
> associated ‘xml-name’ and ‘xml-type’ (element or attribute), as well as
> (more usefully across all mappings) ‘datatype’ and ‘date-format’. This
> gives a fair amount of control within a single file.
> >>>>>
> >>>>> 4. We have some combination of #2 & #3 whereby some things are
> configurable declaratively in the metadata file, but there’s an “escape
> hatch” of referencing out to an executable file that can override. The
> question is then about where the lines should be drawn: how much should be
> in the metadata vocabulary (3) and how much left to specific configuration
> (2).
> >>>>>
> >>>>> My inclination is to aim for #4. I also think we should try to reuse
> existing mechanisms for the mapping as much as possible, and try to focus
> initially on metadata vocabulary fields that are useful across use cases
> (ie not just mapping to different formats but also in validation and
> documentation of CSVs).
> >>>>>
> >>>>> What do other people think?
> >>>>>
> >>>>> Jeni
> >>>>> --
> >>>>> Jeni Tennison
> >>>>> http://www.jenitennison.com/
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Innovimax SARL
> >>>> Consulting, Training & XML Development
> >>>> 9, impasse des Orteaux
> >>>> 75020 Paris
> >>>> Tel : +33 9 52 475787
> >>>> Fax : +33 1 4356 1746
> >>>> http://www.innovimax.fr
> >>>> RCS Paris 488.018.631
> >>>> SARL au capital de 10.000 €
> >>>>
> >>>
> >>>
> >>> ----
> >>> Ivan Herman, W3C
> >>> Digital Publishing Activity Lead
> >>> Home: http://www.w3.org/People/Ivan/
> >>> mobile: +31-641044153
> >>> GPG: 0x343F1A3D
> >>> FOAF: http://www.ivan-herman.net/foaf
> >>>
Received on Thursday, 24 April 2014 11:50:51 UTC