Report on CSV files on the Web from Juergen Umbrich on 2014-11-19 (public-csv-wg@w3.org from November 2014)

From: Juergen Umbrich <juergen.umbrich@wu.ac.at>
Date: Wed, 19 Nov 2014 12:30:30 +0100
To: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Cc: Sebastian Neumaier <sebastian.neumaier@wu.ac.at>
Message-Id: <B4B53C3C-4CD3-4E46-8C53-C2269E71A55A@wu.ac.at>

Hi all, 

As “announced” last week, here is our first early report on our findings from looking into 65k CSV files published as Open Data on the Web.

"This study reports on our findings about 74395 CSV files published on the Web as Open Data. The documents are extracted from 91 Open Data CKAN portals for which the metadata indicates a comma/character-separated-values file. Our analysis includes inspecting the HTTP response headers, detecting the encoding, and guessing the delimiters used. We also determine how far the data tables deviate from a canonical form [1].

Our findings show that the majority of the CSV files adhere to the RFC 4180 specification, that is, they use csv as the file extension, text/csv as the Content-Type in the HTTP response header, and ',' as the delimiter. We also show that there is almost no information about the content encoding in the HTTP headers. The major observed deviations are that data tables contain rows in which one or several data cells occupy multiple columns and that one or several data cells are empty."
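
For readers who want to try similar checks on their own files, here is a minimal Python sketch of three of the steps named in the abstract: reading the Content-Type header, guessing the delimiter, and flagging irregular rows. It is not the profiler used for the report. The function names, the candidate delimiter list, and the "most common row width" heuristic are all assumptions made for illustration, and encoding detection is left out.

```python
import csv
import io
from collections import Counter


def parse_content_type(header):
    """Split a Content-Type header value into (media_type, charset or None)."""
    parts = [p.strip() for p in header.split(";")]
    media_type = parts[0].lower()
    charset = None
    for param in parts[1:]:
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset":
            charset = value.strip().strip('"') or None
    return media_type, charset


def guess_delimiter(text, candidates=(",", ";", "\t", "|")):
    """Hypothetical heuristic: pick the candidate that splits the most
    lines into the same number (>1) of fields; default to ','."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    best, best_score = ",", 0
    for cand in candidates:
        widths = Counter(
            len(next(csv.reader([ln], delimiter=cand))) for ln in lines
        )
        width, freq = widths.most_common(1)[0] if widths else (1, 0)
        score = freq if width > 1 else 0
        if score > best_score:
            best, best_score = cand, score
    return best


def profile_rows(text, delimiter):
    """Flag rows whose width differs from the most common width and rows
    that contain empty cells. A rough stand-in, not the canonical-form
    measure of [1]."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    width = Counter(len(r) for r in rows).most_common(1)[0][0] if rows else 0
    ragged = [i for i, r in enumerate(rows) if len(r) != width]
    with_empty = [
        i for i, r in enumerate(rows) if any(c.strip() == "" for c in r)
    ]
    return {"width": width, "ragged_rows": ragged, "rows_with_empty": with_empty}


if __name__ == "__main__":
    sample = "a;b;c\n1;2;3\n4;;6\n7;8\n"
    delim = guess_delimiter(sample)
    print(parse_content_type("text/csv; charset=UTF-8"))
    print(repr(delim), profile_rows(sample, delim))
```

The delimiter guess scores only line-level consistency, so it will misjudge files with quoted multi-line cells. Python's csv.Sniffer is an alternative with different trade-offs.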

Best 
  Jürgen 

--
Dr. Jürgen Umbrich
WU Vienna, Institute for Information Business

Attachments

application/pdf attachment: csv_profiler_report_copy.pdf

Received on Wednesday, 19 November 2014 11:31:01 UTC