Report on CSV files on the Web

From: Juergen Umbrich <juergen.umbrich@wu.ac.at>
Date: Wed, 19 Nov 2014 12:30:30 +0100
Message-Id: <B4B53C3C-4CD3-4E46-8C53-C2269E71A55A@wu.ac.at>
Cc: Sebastian Neumaier <sebastian.neumaier@wu.ac.at>
To: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Hi all, 

As announced last week, here is a first, early report on our findings from examining 65k CSV files published as Open Data on the Web.

"This study reports on our findings about 74395 CSV files published on the Web as Open Data. The documents are extracted from 91 Open Data CKAN portals for which the metadata indicate a comma/character-separated-values file. Our analysis includes inspection of the HTTP response headers, encoding detection, and guessing of the delimiters used. We also determine how far the data tables deviate from a canonical form [1].

Our findings show that the majority of the CSV files adhere to the RFC 4180 specification, meaning the use of csv as the file extension, text/csv as the content-type in the HTTP response header, and ’,’ as the delimiter. We also show that there is almost no information about the content encoding in the HTTP headers. The major observed deviations are that data tables contain rows in which one or several data cells occupy multiple columns and that one or several data cells are empty."
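
For readers who want to try the three checks named in the abstract on a single file, here is a minimal sketch using only the Python standard library. It is not the code used in the study: the function name, the keys of the returned dictionary, and the set of candidate delimiters are all assumptions made for illustration. It only reports whether a charset parameter is declared in the header and does not attempt encoding detection.

```python
import csv
import posixpath
from email.message import Message
from urllib.parse import urlparse


def check_csv_conventions(url, content_type, sample_text):
    """Report which conventions a single file follows.

    Hypothetical helper for illustration; not the study's implementation.
    """
    # Parse the Content-Type header value with the standard library.
    header = Message()
    header["Content-Type"] = content_type
    media_type = header.get_content_type()    # e.g. "text/csv"
    charset = header.get_content_charset()    # None if no charset parameter

    # File extension taken from the URL path, ignoring any query string.
    extension = posixpath.splitext(urlparse(url).path)[1].lower()

    # Guess the delimiter from a text sample.
    # The candidate set ",;\t|" is an assumption, not taken from the report.
    try:
        delimiter = csv.Sniffer().sniff(sample_text, delimiters=",;\t|").delimiter
    except csv.Error:
        delimiter = None

    return {
        "csv_extension": extension == ".csv",
        "text_csv_type": media_type == "text/csv",
        "comma_delimiter": delimiter == ",",
        "charset_declared": charset is not None,
    }
```

Called with a URL, the raw Content-Type header value, and the first few lines of the decoded file, the function returns four booleans, one per check, which can then be counted across a collection of files.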


Dr. Jürgen Umbrich
WU Vienna, Institute for Information Business

Received on Wednesday, 19 November 2014 11:31:01 UTC
