Re: Report on CSV files on the Web

From: Yakov Shafranovich <yakov-ietf@shaftek.org>
Date: Wed, 19 Nov 2014 16:17:34 -0500
Message-ID: <CAPQd5oQLiTEMreNugLp5T01Av-nfFf0xYMnCbMyC77j5njS98A@mail.gmail.com>
To: Juergen Umbrich <juergen.umbrich@wu.ac.at>
Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>, Sebastian Neumaier <sebastian.neumaier@wu.ac.at>
This is great stuff, thank you!

I am wondering if there is a correlation between the correct MIME type
being used and the software being used as identified by the "Server"
header. Is there any chance you may have that data?
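
To make the question concrete, here is a rough sketch of the kind of
cross-tabulation I have in mind. It assumes each crawled file is available
as a (Server header, Content-Type header) pair; the function names and the
idea of reducing the Server value to a product "family" are my own
assumptions, not anything taken from the report.

```python
from collections import Counter, defaultdict


def server_family(server_header):
    """Reduce a Server header such as 'Apache/2.2.22 (Ubuntu)' to 'apache'.

    Hypothetical normalisation: keep only the product token before the
    first '/' or space.
    """
    if not server_header or not server_header.strip():
        return "(none)"
    return server_header.strip().split("/")[0].split()[0].lower()


def media_type(content_type):
    """Strip parameters such as '; charset=utf-8' from a Content-Type value."""
    if not content_type or not content_type.strip():
        return "(none)"
    return content_type.split(";")[0].strip().lower()


def crosstab(records):
    """Count media types per server family.

    `records` is an iterable of (server_header, content_type) pairs.
    """
    table = defaultdict(Counter)
    for server, ctype in records:
        table[server_family(server)][media_type(ctype)] += 1
    return table


def text_csv_share(table):
    """Fraction of responses per server family that were sent as text/csv."""
    return {
        family: counts["text/csv"] / sum(counts.values())
        for family, counts in table.items()
    }
```

Feeding the header pairs from the crawl into crosstab() and then
text_csv_share() would give one number per server product, which should be
enough to see whether particular servers account for most of the incorrect
MIME types.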

Thanks,
Yakov

On Wed, Nov 19, 2014 at 6:30 AM, Juergen Umbrich <juergen.umbrich@wu.ac.at>
wrote:

> Hi all,
>
> as "announced" last week, here is our first early report about our
> findings by looking into 65k CSV files, published as OpenData on the Web.
>
> "This study reports on our findings about 74395 CSV files published on the
> Web as Open Data. The documents are extracted from 91 Open Data CKAN
> portals for which the metadata indicate a comma/character-separated-values
> file. Our analysis includes the inspection of the HTTP response headers,
> encoding detection and guessing of used delimiters. We also determine the
> deviation of data tables compared to a canonical form [1].
>
> Our findings show that the majority of the CSV files adhere to the RFC4180
> specification, meaning the use of csv as file extension, text/csv as the
> HTTP response header content-type, and ',' as delimiter. We also show that
> there exists nearly no information about the content encoding in the HTTP
> headers. The major observed deviations are that data tables contain rows
> in which one or several data cells occupy multiple columns and that one or
> several data cells are empty."
>
>
>
> Best
>   Jürgen
>
> --
> Dr. Jürgen Umbrich
> WU Vienna, Institute for Information Business
>
Received on Wednesday, 19 November 2014 21:18:31 UTC
