- From: Andy Seaborne <andy@apache.org>
- Date: Thu, 20 Feb 2014 14:44:51 +0000
- To: Gregg Kellogg <gregg@greggkellogg.net>, Alfredo Serafini <seralf@gmail.com>
- CC: James McKinney <james@opennorth.ca>, Ivan Herman <ivan@w3.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
On 19/02/14 16:50, Gregg Kellogg wrote:
> On Feb 19, 2014, at 8:23 AM, Alfredo Serafini <seralf@gmail.com
> <mailto:seralf@gmail.com>> wrote:
>
>> Hi
>> this is a really interesting topic! really good ideas :-)
>>
>> I suggest leaving the multiple-tables problem outside the general
>> discussion and thinking only about multiple sheets, as it's really
>> difficult to imagine how multiple tables in the same sheet could be
>> mapped automatically. It seems to me a task involving some NLP, much
>> like parsing unstructured (or badly structured or badly formatted,
>> etc.) word files.
>
> I could see how, using my CSV-LD proposal, we could identify the break
> in tables and associate a new context with the next set. The idea would
> be to use an empty line (no columns, just a line separator) to
> essentially terminate processing of the previous table and start
> processing anew as if this were the first line of a new CSV.
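A minimal sketch of that blank-line rule, in Python. The function name
split_tables and the sample data are made up for illustration, and it
assumes no quoted field contains an embedded empty line:

```python
import csv
import io

def split_tables(text):
    """Split CSV text into tables at empty lines (no columns, just a line separator)."""
    chunks, current = [], []
    for line in text.splitlines():
        if line == "":
            # An empty line terminates the previous table, if there was one.
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)
    # Parse each chunk as if its first line were the header of a new CSV.
    return [list(csv.DictReader(io.StringIO("\n".join(chunk)))) for chunk in chunks]

# Made-up sample: two tables with different headers in one file.
sample = "Country,Population\nFR,65\nDE,80\n\nYear,Total\n2013,145\n"
tables = split_tables(sample)
```

Here tables holds one list of row dictionaries per table, each keyed by
that table's own header row.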
Maybe put the location of the data table within a single CSV file into
the associated metadata: a package description for a single file.
With multiple tables in one file, the converter then does not need its
own rules for what makes a new table.
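A rough sketch of what I mean, in Python. The metadata layout and key
names (first_row, last_col, etc.) are invented here for illustration
only, not a proposed format; rows and columns are counted from zero and
ranges are inclusive:

```python
import csv
import io

# Hypothetical metadata: says where each table sits inside the one CSV file.
metadata = {
    "tables": [
        {"name": "raw", "first_row": 0, "last_row": 2, "first_col": 0, "last_col": 1},
        {"name": "summary", "first_row": 0, "last_row": 1, "first_col": 3, "last_col": 4},
    ]
}

def extract_tables(text, metadata):
    """Cut each described table out of the CSV; the first row of each is its header."""
    rows = list(csv.reader(io.StringIO(text)))
    result = {}
    for t in metadata["tables"]:
        block = [
            row[t["first_col"]:t["last_col"] + 1]
            for row in rows[t["first_row"]:t["last_row"] + 1]
        ]
        header, body = block[0], block[1:]
        result[t["name"]] = [dict(zip(header, r)) for r in body]
    return result

# Made-up sample: a second table placed to the right of the first.
sample = "Country,Population,,Year,Total\nFR,65,,2013,145\nDE,80,,,\n"
extracted = extract_tables(sample, metadata)
```

The converter only follows the description; it never has to guess where
one table ends and the next begins.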
Andy
>
> Gregg
>
>> Alfredo
>>
>>
>>
>>
>> 2014-02-19 17:09 GMT+01:00 James McKinney <james@opennorth.ca
>> <mailto:james@opennorth.ca>>:
>>
>> >
>> > What this tells me, though, is that there is only that much we can do on
>> > providing clean data. At this moment we are talking about the conversion to
>> > JSON, RDF, or XML or whatever: in all cases there is a level of cleanup that
>> > _will_ be in the realm of the data consumer, no matter what. We should not try
>> > to cover all the pathological cases...
>> >
>> > To take the example above with
>> >
>> > Country,Population,2010,2011,2012,2013
>> >
>> > if the generated JSON is simply a copy of that, ie,
>> >
>> > {
>> > "2000" : "true",
>> > "2010" : "false",
>> > ...
>> > }
>> >
>> > one can easily produce a post-processing program that transforms this data in a
>> > more proper way for that specific case, but I have difficulties to imagine how
>> > we would define some sort of a generic almost-turing-complete language to define
>> > transformations in general... For this case even the @context of JSON-LD would
>> > not help.
>> >
>> > I guess what we may do is to analyze the use cases to see how frequent the
>> > various pathological cases are, and we may then be able to add metadata
>> > information signaling those. But we will not cover all.
>>
>> I agree that covering all cases is out of scope :) I can see how
>> pathological CSV might be converted to JSON or XML. Would the RDF
>> then have a bunch of invented terms like ex:2000, ex:2010?
>>
>>
>> > As for the multiple tables with the same file: do you mean that the data is such
>> > that its structure is not homogeneous, ie, that it is as if several csv files,
>> > with different structures, were concatenated together? Now *that* is really messy:-(
>> >
>> > Ivan
>> >
>> > B.t.w., my original remark referred to the 'foreign key' issue; ie, that we
>> > can forget about those RDB terms for CSV... I hope that does hold although your
>> > remark about several tables within the same CSV file made me scared.
>>
>> Re: multiple tables within a single CSV: it's not uncommon for an
>> Excel user to start a table at cell (0,0) (perhaps containing the
>> "raw" data they are dealing with), and to then start another table
>> (maybe one that summarizes or categorizes the information in the
>> first table) somewhere to the right at cell (20,0). That way, they
>> just need to scroll over to switch between the two tables, instead
>> of reaching down to Excel's tabs and having to refer to cells
>> across sheets when building the second table.
>>
>> In other words, the Excel sheet is used as a canvas, on which the
>> user puts a bunch of tables (not necessarily starting in the first
>> row).
>>
>> In my experience, most individuals create, open, and work with CSV
>> in spreadsheet programs like Excel (LibreOffice, etc. users
>> exhibit the same behavior as described above). When those users
>> then try to upload their data to Tableau, etc. to visualize it,
>> they are frequently disappointed that Tableau, for example, did
>> not understand that the header "2010" is a value for the variable
>> "year" and not the name of a variable.
>>
>> James
>>
>>
>>
>
Received on Thursday, 20 February 2014 14:45:22 UTC