Re: CSV2RDF and R2RML

On Feb 19, 2014, at 8:23 AM, Alfredo Serafini <seralf@gmail.com> wrote:

> Hi
> this is a really interesting topic! really good ideas :-)
> 
> I suggest leaving the multiple-tables problem out of the general discussion and thinking only about multiple sheets, since with multiple tables in the same sheet it's really difficult to imagine how they could be mapped automatically. It seems to me a task involving some NLP, as well as parsing of destructured (or badly structured, badly formatted, etc.) word-processor files.

I could see how, using my CSV-LD proposal, we could identify the break between tables and associate a new context with the next set. The idea would be to use an empty line (no columns, just a line separator) to terminate processing of the previous table and start processing anew, as if this were the first line of a new CSV.
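A minimal sketch of that splitting rule in Python (the function name and sample data are mine, purely for illustration — this is not part of the CSV-LD proposal itself):

```python
import csv
import io

def split_tables(text):
    """Split CSV text into separate tables wherever an empty line
    (no columns, just a line separator) occurs. The empty line
    terminates the previous table; the next non-empty line is then
    treated as if it were the first line of a new CSV."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:            # blank line: close out the current table
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:                # flush the final table
        tables.append(current)
    return tables

sample = "a,b\n1,2\n\nc,d,e\n3,4,5\n"
print(split_tables(sample))
# → [[['a', 'b'], ['1', '2']], [['c', 'd', 'e'], ['3', '4', '5']]]
```

Each returned table could then be paired with its own context, as described above.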

Gregg

> Alfredo
> 
> 
> 
> 
> 2014-02-19 17:09 GMT+01:00 James McKinney <james@opennorth.ca>:
> >
> > What this tells me, though, is that there is only that much we can do on
> > providing clean data. At this moment we are talking about the conversion to
> > JSON, RDF, or XML or whatever: in all cases there is a level of cleanup that
> > _will_ be in the realm of the data consumer, no matter what. We should not try
> > to cover all the pathological cases...
> >
> > To take the example above with
> >
> > Country,Population,2010,2011,2012,2013
> >
> > if the generated JSON is simply a copy of that, i.e.,
> >
> > {
> >   "2010" : "...",
> >   "2011" : "...",
> >   ...
> > }
> >
> > one can easily produce a post-processing program that transforms this data in a
> > more proper way for that specific case, but I have difficulty imagining how we
> > would define some sort of generic, almost-Turing-complete language for
> > transformations in general... For this case even the @context of JSON-LD would
> > not help.
> >
> > I guess what we can do is analyze the use cases to see how frequent the various
> > pathological cases are; we may then be able to add metadata signaling those. But
> > we will not cover them all.
> 
> I agree that covering all cases is out of scope :) I can see how pathological CSV might be converted to JSON or XML. Would the RDF then have a bunch of invented terms like ex:2010, ex:2011?
> 
> 
> > As for the multiple tables within the same file: do you mean that the data is
> > such that its structure is not homogeneous, i.e., that it is as if several CSV
> > files with different structures were concatenated together? Now *that* is really messy :-(
> >
> > Ivan
> >
> > B.t.w., my original remark referred to the 'foreign key' issue; i.e., that we
> > can forget about those RDB terms for CSV... I hope that still holds, although
> > your remark about several tables within the same CSV file made me scared.
> 
> Re: multiple tables within a single CSV: it's not uncommon for an Excel user to start a table at cell (0,0) (perhaps containing the "raw" data they are dealing with), and then to start another table (maybe one that summarizes or categorizes the information in the first) somewhere to the right, at cell (20,0). That way, they just need to scroll over to switch between the two tables, instead of reaching down to Excel's sheet tabs and having to refer to cells across sheets when building the second table.
> 
> In other words, the Excel sheet is used as a canvas, on which the user puts a bunch of tables (not necessarily starting in the first row).
> 
> In my experience, most individuals create, open, and work with CSV in spreadsheet programs like Excel (LibreOffice, etc. users exhibit the same behavior described above). When those users then try to upload their data to a tool like Tableau to visualize it, they are frequently disappointed that the tool did not understand that the header "2010" is a value of the variable "year" and not the name of a variable.
> 
> James
> 
> 
> 

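The post-processing step Ivan and James allude to — recognizing that a header like "2010" is a value of a variable "year", not a variable name — might look something like this melt/unpivot sketch (the function name and data are invented for illustration):

```python
def melt_years(record, year_keys):
    """Turn one record with year-named keys, e.g.
    {"Country": ..., "2010": ..., "2011": ...},
    into one record per year with an explicit 'year' variable."""
    # Keep the non-year fields as the shared base of every output record.
    base = {k: v for k, v in record.items() if k not in year_keys}
    # Emit one record per year, with the former header as a value.
    return [dict(base, year=int(y), value=record[y]) for y in sorted(year_keys)]

record = {"Country": "Exampleland", "2010": "100", "2011": "110"}
print(melt_years(record, {"2010", "2011"}))
```

This is the transformation a consumer would write by hand for that specific case; the open question in the thread is whether any generic mapping language could express it.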
Received on Wednesday, 19 February 2014 16:51:07 UTC