Re: CSV2RDF and R2RML from Alfredo Serafini on 2014-02-19 (public-csv-wg@w3.org from February 2014)

From: Alfredo Serafini <seralf@gmail.com>
Date: Wed, 19 Feb 2014 17:23:47 +0100
To: James McKinney <james@opennorth.ca>
Cc: Ivan Herman <ivan@w3.org>, Andy Seaborne <andy@apache.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CADawF4Orxs6wkiHHKjKzNVLeBbt0of9+D+AJ1AEnayDXMzEN1A@mail.gmail.com>

Hi
this is a really interesting topic, with really good ideas :-)

I suggest leaving the multiple-tables problem out of the general
discussion and thinking only about multiple sheets: with multiple tables
in the same sheet it is really difficult to imagine how they could be mapped
automatically. It seems to me a task involving some NLP, much like parsing
unstructured (or badly structured, or badly formatted, etc.) Word files.

Alfredo




2014-02-19 17:09 GMT+01:00 James McKinney <james@opennorth.ca>:

> >
> > What this tells me, though, is that there is only that much we can do on
> > providing clean data. At this moment we are talking about the conversion to
> > JSON, RDF, or XML or whatever: in all cases there is a level of cleanup that
> > _will_ be in the realm of the data consumer, no matter what. We should not
> > try to cover all the pathological cases...
> >
> > To take the example above with
> >
> > Country,Population,2010,2011,2012,2013
> >
> > if the generated JSON is simply a copy of that, ie,
> >
> > {
> > "2000" : "true",
> > "2010" : "false",
> > ...
> > }
> >
> > one can easily produce a post-processing program that transforms this data
> > in a more proper way for that specific case, but I have difficulties to
> > imagine how we would define some sort of a generic almost-turing-complete
> > language to define transformations in general... For this case even the
> > @context of JSON-LD would not help.
> >
> > I guess what we may do is to analyze the use cases to see how frequent the
> > various pathological cases are, and we may then be able to add metadata
> > information signaling those. But we will not cover all.
>
> I agree that covering all cases is out of scope :) I can see how
> pathological CSV might be converted to JSON or XML. Would the RDF then have
> a bunch of invented terms like ex:2000, ex:2010?
>
>
> > As for the multiple tables with the same file: do you mean that the data is
> > such that its structure is not homogeneous, ie, that it is as if several csv
> > files, with different structures, were concatenated together? Now *that* is
> > really messy :-(
> >
> > Ivan
> >
> > B.t.w., my original remark referred to the 'foreign key' issue; ie, that we
> > can forget about those RDB terms for CSV... I hope that does hold although
> > your remark about several tables within the same CSV files made me scared.
>
> Re: multiple tables within a single CSV: it's not uncommon for an Excel
> user to start a table at cell (0,0) (perhaps containing the "raw" data they
> are dealing with), and to then start another table (maybe one that
> summarizes or categorizes the information in the first table) somewhere to
> the right at cell (20,0). That way, they just need to scroll over to switch
> between the two tables, instead of reaching down to Excel's tabs and having
> to refer to cells across sheets when building the second table.
>
> In other words, the Excel sheet is used as a canvas, on which the user
> puts a bunch of tables (not necessarily starting in the first row).
>
> In my experience, most individuals create, open, and work with CSV in
> spreadsheet programs like Excel (LibreOffice, etc. users exhibit the same
> behavior as described above). When those users then try to upload their
> data to Tableau, etc. to visualize it, they are frequently disappointed
> that Tableau, for example, did not understand that the header "2010" is a
> value for the variable "year" and not the name of a variable.
>
> James
>
>
>
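
As a rough illustration of the post-processing step discussed in the quoted
thread (headers such as "2010" that are really values of a "year" variable),
here is a minimal Python sketch. The function name unpivot_years, the
id_columns parameter, the simplified header and the sample values are all
made up for illustration; nothing here is a proposal from the group, and it
only handles the one specific case of four-digit year headers.

```python
import csv
import io


def unpivot_years(csv_text, id_columns=("Country",)):
    """Turn year-named columns into one (year, value) row per cell.

    Hypothetical helper: a header counts as a year only if it is exactly
    four digits; every other non-id column is ignored.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        for header, value in record.items():
            if header in id_columns:
                continue
            if len(header) == 4 and header.isdigit():
                # Copy the identifying columns, then add the year as data.
                row = {name: record[name] for name in id_columns}
                row["year"] = int(header)
                row["value"] = value
                rows.append(row)
    return rows


# Placeholder data, not taken from any real dataset.
sample = "Country,2010,2011\nA,10,11\nB,20,21\n"
long_rows = unpivot_years(sample)
for row in long_rows:
    print(row)
```

Each output row then carries "year" as a value rather than as a column name,
which is the shape a tool expecting one variable per column would need. A
generic mapping would still have to be told which headers are years, which is
the kind of metadata hint mentioned in the thread.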

Received on Wednesday, 19 February 2014 16:24:15 UTC