- From: Ivan Herman <ivan@w3.org>
- Date: Wed, 19 Feb 2014 10:28:19 +0100
- To: James McKinney <james@opennorth.ca>
- CC: Andy Seaborne <andy@apache.org>, public-csv-wg@w3.org
- Message-ID: <53047933.1020500@w3.org>
James McKinney wrote:
>
> On 2014-02-18, at 6:50 AM, Ivan Herman wrote:
>
>> my gut feeling tells me that most of the CSV data out there are
>> structurally simple (albeit possibly huge...)
>
> The structure may be simple (to a human), but it is not consistent or easily
> guessable by a computer. For example, consider the headers:
>
> Country,Population,2010,2011,2012,2013
>
> A CSV that is optimized for computer consumption would instead have a "year"
> header and repeat the country and population values for each year. The UN
> publishes lots of CSVs like this.
>
> You can also have CSVs which are transpositions. For example:
>
> Variable,Doctor Who,Sherlock,Downton Abbey
> Genre,...
> Seasons,...
> Episodes,...
> Rating,...
>
> All very simple for a human, but the variability really confuses a computer.
>
> Another class of problems comes from the fact that many CSVs are created from
> within Excel, in which case you'll often have multiple tables on the same
> sheet, usually with empty cells between them. You'll also often have all
> sorts of presentation mixed in with the data. For example, many financial
> statements will have headers interspersed with the data. I don't know if this
> class of "messy tables" is in scope for the working group, though.

Sigh:-) I guess, for the time being, messy tables are not out of scope (yet?)...

What this tells me, though, is that there is only so much we can do to provide clean data. At the moment we are talking about conversion to JSON, RDF, XML, or whatever: in all cases there is a level of cleanup that _will_ remain in the realm of the data consumer, no matter what. We should not try to cover all the pathological cases...

To take the example above with

Country,Population,2010,2011,2012,2013

if the generated JSON is simply a copy of that, i.e.,

{ "Country" : "...", "Population" : "...", "2010" : "...", "2011" : "...", ... }

one can easily write a post-processing program that transforms this data into a more appropriate shape for that specific case (see the sketch at the end of this message), but I have difficulty imagining how we would define some sort of generic, almost-Turing-complete language for describing such transformations in general... For this case even the @context of JSON-LD would not help.

I guess what we may do is analyze the use cases to see how frequent the various pathological cases are; we may then be able to add metadata signaling those cases. But we will not cover them all.

As for multiple tables within the same file: do you mean that the data is such that its structure is not homogeneous, i.e., that it is as if several CSV files, with different structures, were concatenated together? Now *that* is really messy:-(

Ivan

B.t.w., my original remark referred to the 'foreign key' issue, i.e., that we can forget about that RDB term for CSV... I hope that still holds, although your remark about several tables within the same CSV file made me scared.

>
> James
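
A minimal sketch of the kind of case-specific post-processing step described above, written in Python; the function name, the "Year"/"Value" keys, and the exact input shape are assumptions made for this illustration only, not anything defined by the working group:

import json

def wide_to_long(record, year_keys=("2010", "2011", "2012", "2013")):
    # Keep the non-year columns (Country, Population) as the base
    # that gets repeated for every year.
    base = {k: v for k, v in record.items() if k not in year_keys}
    # Emit one record per year, mimicking a CSV that has an explicit
    # "year" column instead of one column per year.
    return [dict(base, Year=year, Value=record[year])
            for year in year_keys if year in record]

# Hypothetical input record, shaped like the JSON copy of the CSV row above.
row = {"Country": "...", "Population": "...",
       "2010": "...", "2011": "...", "2012": "...", "2013": "..."}
print(json.dumps(wide_to_long(row), indent=2))

Such a script is trivial once the specific layout is known; the point made above is precisely that no single generic mapping covers every such layout.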
Received on Wednesday, 19 February 2014 09:28:56 UTC