- From: Ivan Herman <ivan@w3.org>
- Date: Sat, 22 Feb 2014 13:07:01 +0100
- To: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
- Cc: Jeni Tennison <jeni@jenitennison.com>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
- Message-Id: <F4A9DECF-4322-4F9F-A1C0-0FA23B74F479@w3.org>
Stasinos, I think (without having delved into the details) that what you write makes a lot of sense, except... we should not forget that this Working Group is chartered until the end of August 2015. Many of us in the group have experience with Working Groups; we know that a precise formulation of a specification, its testing, going through the hoops of comment management, etc., etc., takes a lot of time. I.e., a year and a half is not a long time...

I think your point (a) is in line with what I said. Well, at this point, I would be happy if we succeeded in doing that. At this point, I would stop there. If we have the time and energy, if there is a real use case banging on our doors, then we can consider looking into (b), (c), etc., possibly asking for a new charter for a new group, whatever. But that is way down the line...

Ivan

On 22 Feb 2014, at 12:49 , Stasinos Konstantopoulos <konstant@iit.demokritos.gr> wrote:

> Jeni, Ivan, all,
>
> There are some, relatively few, CSV variants that will cover most of the datasets out there. Adoption of whatever it is we produce should be straightforward for those. There will be some stranger and more exotic ones; adoption might be harder, but should IMO not be excluded.
>
> Especially for the more exotic ones, any attempt to provide semantics, interoperate, etc. will almost always be an afterthought: many of these datasets will start as a quick and dirty solution for data that doesn't seem to be anybody else's business at the time; some will stay so, but some might evolve into something that now needs to interoperate and to be parsed and understood by other people's tools. So no warning along the lines of "follow our best practices or be doomed" is going to be effective.
>
> (a) Proper and simple CSV files, where the problem is one of assigning meaning to the columns so that they can be interoperable with other people's data. This can be as simple as providing the prefix that should be prepended to the "local" column names, so that they become proper URIs; or a mapping from each column name to a URI; or a mapping from column numbers to URIs, ignoring column names.
>
> It can also include very simple syntactic elements for letting the reader know which is the delimiter, the first and last row that should be read, and so on.
>
> I feel that it is within this WG's scope to provide:
> (a1) A formal definition of what exactly a simple and proper CSV file is
> (a2) A representation for specifying these syntactic and semantic parameters
> (a3) Tooling that reads this and validates/converts/something else? these proper and simple CSV files (a rough sketch follows the (b) items below)
>
> (b) But there is no reason to stop there: I feel that the motivation and the expertise are here to push beyond this and look into more complex, but still well-defined, cases such as NetCDFs (or another, or multiple such formats; I will just write "NetCDF" to mean "one or more well-defined, widely used tabular formats"). There already is a representation for providing some semantics (such as value types), but there is no way to provide data integration semantics. So we can also:
>
> (b1) Re-formulate the definition of what exactly a NetCDF file is, using the same language we used for (a1)
> (b2) A representation for specifying the semantic parameters that turn column names into meaningful URIs, possibly identical to (a2)
> (b3) Tooling that reads this and validates/converts/something else? these files
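> To make (a2)/(a3) concrete, here is a minimal sketch in Python. The description keys ("delimiter", "prefix", "columns") and the example URIs are invented for illustration; this is not a proposed representation:
>
>     import csv, io
>
>     # Hypothetical (a2)-style description: syntactic parameters plus a
>     # column-to-URI mapping, with a prefix for unmapped column names.
>     desc = {
>         "delimiter": ",",
>         "prefix": "http://example.org/voc/",
>         "columns": {"name": "http://xmlns.com/foaf/0.1/name"},
>     }
>
>     def convert(text, desc):
>         """Minimal (a3)-style tool: yield (row number, property URI, value)."""
>         reader = csv.DictReader(io.StringIO(text), delimiter=desc["delimiter"])
>         for n, row in enumerate(reader, start=1):
>             for col, value in row.items():
>                 yield (n, desc["columns"].get(col, desc["prefix"] + col), value)
>
>     sample = "name,age\nAlice,42\nBob,23\n"
>     for triple in convert(sample, desc):
>         print(triple)
>
> The point is only that a handful of declared parameters is enough for a generic tool to turn a simple and proper CSV file into something URI-addressable.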
> (c) Pushing still further into the CSV jungle, generalize (a) and (b) into a general specification for writing machine-readable "data descriptions" (or some other name; "data descriptions" was mentioned on the list, I think), effectively definitions of classes of tabular data files. The WG will evaluate its own specification by using it to author (a1), (a2), (b1), (b2), and some characteristic "ugly" CSV variants.
>
> The WG will also need a reference implementation for an engine that reads in "data descriptions" and can then validate/convert/something else tabular data following the description. The engine will be evaluated on the (a), (b), and "ugly" data that has been described.
>
> (d) Make sure that data descriptions are extensible, so that one can take an existing description as a starting point and build upon it. So that the standard CSV description can be trivially extended to TSV, and with a bit more effort into a description where, for example, columns 5-10 are *values* for the "year" property but all other columns are property names.
>
> This motivates people to stay as close as possible to best practices by minimizing the effort needed to get access to existing tools, without excluding anybody.
>
> (e) There is no reason to have to read in data descriptions; one can implement an engine that operates on a particular type of file. This can be more efficient than the general-purpose engine or have some other advantage; the data description in this case only documents the engine's operation.
>
> FWIW, I am attaching a figure I made while trying to get all these straight in my head.
>
> What do you think?
> Stasinos
>
> PS There is still the issue of linking descriptions to the files described and vice versa. The POWDER WG had a similar problem and had:
>
> - Descriptors that declared the things they were describing
> - Things that optionally pointed to the files that describe them, through HTTP headers or through rel in the HTML.
>
> This allowed POWDER descriptions to apply to things that did not even know they were being described (e.g., third-party annotations of other people's Web resources), which could be useful for us as well: it might sometimes be the data consumer who undertakes to assign semantics to the CSV data and not the data publisher. This can be imported and built upon or copy-pasted and re-used; or not.
>
> But it might be too early to delve into this; and in a different thread.
>
> s
>
> On 22 February 2014 11:50, Ivan Herman <ivan@w3.org> wrote:
>> Hi Jeni,
>>
>> (To be clear, this is not some sort of an 'official' standpoint of W3C, but my personal one.)
>>
>> Thanks for raising this; it is indeed important that we find consensus on this at the beginning. I must admit the fact that we have to make this choice was not clear to me either...
>>
>> As far as I am concerned, I do not believe that we can impose any new format on data publishers. Data has been and is being published in CSV, it is messy, and we have to live with it. The most we can do (and I think this _is_ what we should do), as Alf has said, is to define some sort of 'best practices' based on the available use cases. This may allow TSV and other dialects (and we may want to contribute to efforts like CSVDDF [1]), and also some further restrictions, like whether, and if so how, several logical tables can be included in one CSV file (something we already discussed a bit).
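>> (To make the "dialect" idea concrete: what CSVDDF captures is essentially the same handful of syntactic knobs that, say, Python's csv module already exposes. A sketch, with "data.tsv" as a placeholder file name:)
>>
>>     import csv
>>
>>     # A dialect is just a bundle of syntactic parameters.
>>     csv.register_dialect(
>>         "tsv-strict",
>>         delimiter="\t",
>>         quotechar='"',
>>         doublequote=True,
>>         lineterminator="\r\n",
>>         quoting=csv.QUOTE_MINIMAL,
>>     )
>>
>>     with open("data.tsv", newline="") as f:
>>         for row in csv.reader(f, dialect="tsv-strict"):
>>             print(row)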
>> Our conversions to XML/JSON/RDF, our metadata, etc., should rely on data that abides by those best practices. But I do not think that defining a new format, one that requires current tools to change their exports, would have any chance of being adopted at this point...
>>
>> (There is of course the issue of where one finds the metadata related to a CSV file, and we may have to rely on HTTP, or some URI schemes; things that the manager of a Web site may control, which is different from the tools used to export the CSV data. But that is not the same as defining a new format.)
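>> (As a sketch of the HTTP route: a server could point from the CSV file to its description with a Link header; POWDER defined rel="describedby" for exactly this. The URL is a placeholder and the header parsing below is deliberately naive:)
>>
>>     from urllib.request import urlopen
>>
>>     # Discover a CSV file's description via an HTTP Link header, e.g.
>>     #   Link: <http://example.org/metadata.json>; rel="describedby"
>>     with urlopen("http://example.org/data.csv") as resp:
>>         link = resp.headers.get("Link", "")
>>         if 'rel="describedby"' in link:
>>             metadata_url = link.split(";")[0].strip(" <>")
>>             print("description at:", metadata_url)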
>> I guess this puts me in the "APPROACH 1" camp...
>>
>> Ivan
>>
>> [1] http://dataprotocols.org/csv-dialect/
>>
>> On 21 Feb 2014, at 17:31 , Jeni Tennison <jeni@jenitennison.com> wrote:
>>
>>> Hi,
>>>
>>> [Only just got net connection to enable me to send this.]
>>>
>>> A scoping question occurred to me during the call on Wednesday.
>>>
>>> There seem to be two approaches that we should explicitly choose between.
>>>
>>> APPROACH 1: Work with what’s there
>>>
>>> We are trying to create a description/metadata format that would enable us to layer processing semantics over the top of all the various forms of tabular data that people publish, so that it can be interpreted in a standard way.
>>>
>>> We need to do a survey of what tabular data exists in its various formats so that we know what the description/metadata format needs to describe. When we find data that uses different separators, pads out the actual data using empty rows and columns, incorporates two or more tables inside a single CSV file, or uses Excel spreadsheets or DSPL packages or SDF packages or NetCDF or the various other formats that people have invented, we need to keep note of these so that whatever solution and processors we create will work with these files.
>>>
>>> APPROACH 2: Invent something new
>>>
>>> We are trying to create a new format that would enable publishers to publish tabular data in a more regular way while preserving the same meaning, to make it easier for consumers of that data.
>>>
>>> We need to do a survey of what tabular data exists so that we can see what publishers are trying to say with their data, but the format that they are currently publishing that data in is irrelevant, because we are going to invent a new format. When we find data that includes metadata about tables and cells, or groups or has cross-references between tables, or has columns whose values are of different types, we need to keep note of these so that we ensure the format we create can capture that meaning.
>>>
>>> We also need to understand existing data so that we have a good backwards-compatibility story: it would be useful if the format we invent can be used with existing tools, and if existing data didn’t have to be changed very much to put it into the new format. But there will certainly be files that do have to be changed, and sometimes substantially.
>>>
>>> My focus is definitely on the second approach, as I think taking the first approach is an endless and impossible task. But some recent mails and discussions have made me think that some people are taking the first approach. Any thoughts?
>>>
>>> Cheers,
>>>
>>> Jeni
>>> --
>>> Jeni Tennison
>>> http://www.jenitennison.com/
>>
>> ----
>> Ivan Herman, W3C
>> Digital Publishing Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> GPG: 0x343F1A3D
>> FOAF: http://www.ivan-herman.net/foaf
>
> <csvw.pdf>

----
Ivan Herman, W3C
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
FOAF: http://www.ivan-herman.net/foaf
Received on Saturday, 22 February 2014 12:07:26 UTC