Re: CSV2RDF and R2RML from Andy Seaborne on 2014-02-20 (public-csv-wg@w3.org from February 2014)

From: Andy Seaborne <andy@apache.org>
Date: Thu, 20 Feb 2014 14:44:51 +0000
To: Gregg Kellogg <gregg@greggkellogg.net>, Alfredo Serafini <seralf@gmail.com>
CC: James McKinney <james@opennorth.ca>, Ivan Herman <ivan@w3.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <530614E3.9090200@apache.org>
On 19/02/14 16:50, Gregg Kellogg wrote:
> On Feb 19, 2014, at 8:23 AM, Alfredo Serafini <seralf@gmail.com
> <mailto:seralf@gmail.com>> wrote:
>
>> Hi
>> this is a really interesting topic! really good ideas :-)
>>
>> I suggest leaving the multiple-tables problem out of the general
>> discussion and thinking only about multiple sheets: with multiple
>> tables in the same sheet, it is really difficult to imagine how they
>> could be mapped automatically. It seems to me a task involving some
>> NLP, as with parsing unstructured (or badly structured, or badly
>> formatted, etc.) Word files.
>
> I could see how, using my CSV-LD proposal, we could identify the break
> in tables and associate a new context with the next set. The idea would
> be to use an empty line (no columns, just a line separator) to
> essentially terminate processing of the previous table and start
> processing anew as if this were the first line of a new CSV.
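
A minimal sketch of the empty-line rule described above, assuming the
file is already available as a string. The function name split_tables
and the sample data are hypothetical, not part of the CSV-LD proposal;
the sketch only shows that a line with no columns can act as a table
terminator.

```python
import csv
import io


def split_tables(text):
    """Split CSV text into tables separated by completely empty lines."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            # csv.reader yields an empty list for a line with no columns:
            # close the current table and start a new one.
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables


# Hypothetical sample: two tables with different headers in one file.
sample = "a,b\n1,2\n\nx,y,z\n3,4,5\n"
tables = split_tables(sample)
```

Each returned table starts with its own header row, so a converter
could process it as if it were the first line of a new CSV.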

Maybe put the location of each data table within a single CSV file into 
the associated metadata: a package description for a single file. 
Identifying multiple tables in one file is then not the responsibility 
of the converter, which needs no rules about what makes a new table.
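
As a rough sketch of what that could look like, assuming a metadata
record that gives each table a name and a zero-based, inclusive cell
range. The key names (first_row, last_row, first_col, last_col) and
the function extract_table are made up for illustration; no metadata
format has been agreed.

```python
import csv
import io


def extract_table(text, region):
    """Return one table as a list of dicts, using a metadata-supplied range."""
    rows = list(csv.reader(io.StringIO(text)))
    r0, r1 = region["first_row"], region["last_row"]
    c0, c1 = region["first_col"], region["last_col"]
    # Slice out the rectangle; its first row is taken to be the header.
    block = [row[c0:c1 + 1] for row in rows[r0:r1 + 1]]
    header, body = block[0], block[1:]
    return [dict(zip(header, r)) for r in body]


# Hypothetical file with two tables side by side, and its metadata.
sample = "Country,Population,,Region,Count\nFR,65,,Europe,2\nDE,80,,Asia,1\n"
metadata = {
    "tables": [
        {"name": "raw", "first_row": 0, "last_row": 2,
         "first_col": 0, "last_col": 1},
        {"name": "summary", "first_row": 0, "last_row": 2,
         "first_col": 3, "last_col": 4},
    ]
}
extracted = {t["name"]: extract_table(sample, t) for t in metadata["tables"]}
```

The converter here only reads the ranges it is given; deciding where
a table starts and ends is left to whoever writes the metadata.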

	Andy

>
> Gregg
>
>> Alfredo
>>
>>
>>
>>
>> 2014-02-19 17:09 GMT+01:00 James McKinney <james@opennorth.ca
>> <mailto:james@opennorth.ca>>:
>>
>>     >
>>     > What this tells me, though, is that there is only so much we can
>>     > do on providing clean data. At this moment we are talking about
>>     > the conversion to JSON, RDF, or XML or whatever: in all cases
>>     > there is a level of cleanup that _will_ be in the realm of the
>>     > data consumer, no matter what. We should not try to cover all the
>>     > pathological cases...
>>     >
>>     > To take the example above with
>>     >
>>     > Country,Population,2010,2011,2012,2013
>>     >
>>     > if the generated JSON is simply a copy of that, ie,
>>     >
>>     > {
>>     > "2000" : "true",
>>     > "2010" : "false",
>>     > ...
>>     > }
>>     >
>>     > one can easily produce a post-processing program that transforms
>>     > this data in a more proper way for that specific case, but I have
>>     > difficulty imagining how we would define some sort of generic
>>     > almost-Turing-complete language to define transformations in
>>     > general... For this case even the @context of JSON-LD would not
>>     > help.
>>     >
>>     > I guess what we may do is to analyze the use cases to see how
>>     > frequent the various pathological cases are, and we may then be
>>     > able to add metadata information signaling those. But we will not
>>     > cover all.
>>
>>     I agree that covering all cases is out of scope :) I can see how
>>     pathological CSV might be converted to JSON or XML. Would the RDF
>>     then have a bunch of invented terms like ex:2000, ex:2010?
>>
>>
>>     > As for the multiple tables within the same file: do you mean that
>>     > the data is such that its structure is not homogeneous, ie, that
>>     > it is as if several csv files, with different structures, were
>>     > concatenated together? Now *that* is really messy :-(
>>     >
>>     > Ivan
>>     >
>>     > B.t.w., my original remark referred to the 'foreign key' issue;
>>     > ie, that we can forget about those RDB terms for CSV... I hope
>>     > that does hold, although your remark about several tables within
>>     > the same CSV file made me scared.
>>
>>     Re: multiple tables within a single CSV: it's not uncommon for an
>>     Excel user to start a table at cell (0,0) (perhaps containing the
>>     "raw" data they are dealing with), and to then start another table
>>     (maybe one that summarizes or categorizes the information in the
>>     first table) somewhere to the right at cell (20,0). That way, they
>>     just need to scroll over to switch between the two tables, instead
>>     of reaching down to Excel's tabs and having to refer to cells
>>     across sheets when building the second table.
>>
>>     In other words, the Excel sheet is used as a canvas, on which the
>>     user puts a bunch of tables (not necessarily starting in the first
>>     row).
>>
>>     In my experience, most individuals create, open, and work with CSV
>>     in spreadsheet programs like Excel (LibreOffice, etc. users
>>     exhibit the same behavior as described above). When those users
>>     then try to upload their data to Tableau, etc. to visualize it,
>>     they are frequently disappointed that Tableau, for example, did
>>     not understand that the header "2010" is a value for the variable
>>     "year" and not the name of a variable.
>>
>>     James
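
For the "2010 is a value of year" case above, the post-processing
step is a wide-to-long reshape. A minimal sketch, assuming the caller
states which columns identify a row; the function name unpivot, the
output column names "year" and "value", and the sample figures are
all invented for illustration.

```python
import csv
import io


def unpivot(text, id_columns, variable="year", value="value"):
    """Turn every non-identifier column header into a value of `variable`."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, cell in row.items():
            if column in id_columns:
                continue
            # One output record per (row, non-identifier column) pair.
            record = {c: row[c] for c in id_columns}
            record[variable] = column
            record[value] = cell
            long_rows.append(record)
    return long_rows


# Made-up sample: headers "2010" and "2011" are really values of "year".
sample = "Country,2010,2011\nA,10,11\nB,20,21\n"
long_rows = unpivot(sample, id_columns=["Country"])
```

Nothing in the file itself says that "2010" is a year, so the list of
identifier columns has to come from the user or from metadata.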
>>
>>
>>
>
Received on Thursday, 20 February 2014 14:45:22 UTC