- From: Andy Seaborne <andy@apache.org>
- Date: Thu, 20 Feb 2014 14:44:51 +0000
- To: Gregg Kellogg <gregg@greggkellogg.net>, Alfredo Serafini <seralf@gmail.com>
- CC: James McKinney <james@opennorth.ca>, Ivan Herman <ivan@w3.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
On 19/02/14 16:50, Gregg Kellogg wrote:
> On Feb 19, 2014, at 8:23 AM, Alfredo Serafini <seralf@gmail.com> wrote:
>
>> Hi
>> this is a really interesting topic! really good ideas :-)
>>
>> I suggest leaving the multiple tables problem outside the general
>> discussion, and thinking only about multiple sheets, as with multiple
>> tables in the same sheet it's really difficult to imagine how they
>> can be mapped automatically. It seems to me a task involving some NLP
>> as well as parsing from unstructured (or badly structured or badly
>> formatted, etc) word files.
>
> I could see how, using my CSV-LD proposal, we could identify the break
> in tables and associate a new context with the next set. The idea would
> be to use an empty line (no columns, just a line separator) to
> essentially terminate processing of the previous table and start
> processing anew as if this were the first line of a new CSV.

Maybe put the location of the data table within a single CSV file into
the associated metadata: a package description for a single file.

With multiple tables in one file, it is then not the responsibility of
the converter to have rules as to what makes a new table.

	Andy

>
> Gregg
>
>> Alfredo
>>
>> 2014-02-19 17:09 GMT+01:00 James McKinney <james@opennorth.ca>:
>>
>> > What this tells me, though, is that there is only so much we can do
>> > on providing clean data. At this moment we are talking about the
>> > conversion to JSON, RDF, or XML or whatever: in all cases there is a
>> > level of cleanup that _will_ be in the realm of the data consumer,
>> > no matter what. We should not try to cover all the pathological
>> > cases...
>> >
>> > To take the example above with
>> >
>> > Country,Population,2010,2011,2012,2013
>> >
>> > if the generated JSON is a simple copy of that, ie,
>> >
>> > {
>> >   "2000" : "true",
>> >   "2010" : "false",
>> >   ...
>> > }
>> >
>> > one can easily produce a post-processing program that transforms
>> > this data in a more proper way for that specific case, but I have
>> > difficulty imagining how we would define some sort of a generic
>> > almost-turing-complete language to define transformations in
>> > general... For this case even the @context of JSON-LD would not
>> > help.
>> >
>> > I guess what we may do is to analyze the use cases to see how
>> > frequent the various pathological cases are, and we may then be
>> > able to add metadata information signaling those. But we will not
>> > cover all.
>>
>> I agree that covering all cases is out of scope :) I can see how
>> pathological CSV might be converted to JSON or XML. Would the RDF
>> then have a bunch of invented terms like ex:2000, ex:2010?
>>
>> > As for the multiple tables within the same file: do you mean that
>> > the data is such that its structure is not homogeneous, ie, that it
>> > is as if several csv files, with different structures, were
>> > concatenated together? Now *that* is really messy:-(
>> >
>> > Ivan
>> >
>> > B.t.w., my original remark referred to the 'foreign key' issue; ie,
>> > that we can forget about those RDB terms for CSV... I hope that
>> > does hold although your remark about several tables within the
>> > same CSV files made me scared.
>>
>> Re: multiple tables within a single CSV: it's not uncommon for an
>> Excel user to start a table at cell (0,0) (perhaps containing the
>> "raw" data they are dealing with), and to then start another table
>> (maybe one that summarizes or categorizes the information in the
>> first table) somewhere to the right at cell (20,0). That way, they
>> just need to scroll over to switch between the two tables, instead
>> of reaching down to Excel's tabs and having to refer to cells
>> across sheets when building the second table.
>>
>> In other words, the Excel sheet is used as a canvas, on which the
>> user puts a bunch of tables (not necessarily starting in the first
>> row).
>>
>> In my experience, most individuals create, open, and work with CSV
>> in spreadsheet programs like Excel (LibreOffice, etc. users
>> exhibit the same behavior as described above). When those users
>> then try to upload their data to Tableau, etc. to visualize it,
>> they are frequently disappointed that Tableau, for example, did
>> not understand that the header "2010" is a value for the variable
>> "year" and not the name of a variable.
>>
>> James
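
The suggestion above, recording where each table sits in the metadata
associated with the file, could look roughly like the following Python
sketch. Everything named here is an assumption made for illustration
only: the PACKAGE layout, the first_row/last_row/first_col/last_col keys
(zero-based, inclusive) and the extract_table helper are hypothetical,
not part of any proposal on this list. The point is just that the
converter slices out the region the metadata names and needs no rules of
its own for spotting a new table.

```python
import csv
import io


def extract_table(csv_text, location):
    """Cut one table out of a larger CSV 'canvas'.

    `location` is a hypothetical metadata record with zero-based,
    inclusive bounds: first_row, last_row, first_col, last_col.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    selected = rows[location["first_row"]:location["last_row"] + 1]
    return [
        row[location["first_col"]:location["last_col"] + 1]
        for row in selected
    ]


# Hypothetical package description for a single file holding two tables
# side by side, as in the "sheet used as a canvas" case.
PACKAGE = {
    "tables": {
        "raw": {"first_row": 0, "last_row": 2,
                "first_col": 0, "last_col": 1},
        "summary": {"first_row": 0, "last_row": 1,
                    "first_col": 3, "last_col": 4},
    }
}

# Made-up sample data: a raw table on the left, a summary on the right.
SAMPLE = (
    "Country,Population,,Metric,Value\n"
    "AT,8.5,,Total,73.5\n"
    "FR,65.0,,,\n"
)

tables = {
    name: extract_table(SAMPLE, location)
    for name, location in PACKAGE["tables"].items()
}
```

Here tables["raw"] holds the three left-hand rows and tables["summary"]
the two right-hand ones, each of which can then be converted as if it
were a CSV file of its own.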
Received on Thursday, 20 February 2014 14:45:22 UTC