Re: Report on CSV files on the Web

Hi Yakov,

> 
> I am wondering if there is a correlation between the correct MIME type being used and the software being used as identified by the "Server" header. Is there any chance you may have that data?
Sure, this data is available. I won't be able to compile the numbers this week, but we can hopefully send them at the beginning of next week.
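
In the meantime, a breakdown like that could be tallied roughly as follows. This is only a minimal sketch, not the code behind the report: it assumes the crawled headers are available as (Server, Content-Type) pairs, and the function names normalize_server, media_type and tally are made up for illustration.

```python
from collections import Counter, defaultdict


def normalize_server(server_header):
    """Reduce a Server header such as 'Apache/2.2.22 (Ubuntu)' to 'apache'."""
    if not server_header or not server_header.strip():
        return "(none)"
    # Drop the version after '/' and any trailing comment such as '(Ubuntu)'.
    product = server_header.strip().split("/")[0].split()[0]
    return product.lower()


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' from a Content-Type header."""
    if not content_type_header or not content_type_header.strip():
        return "(none)"
    return content_type_header.split(";")[0].strip().lower()


def tally(records):
    """Count, per server product, responses with and without text/csv.

    records is an iterable of (server_header, content_type_header) pairs
    (an assumed input format). The result maps each server product to a
    Counter keyed by True (text/csv) and False (anything else).
    """
    table = defaultdict(Counter)
    for server, ctype in records:
        is_csv = media_type(ctype) == "text/csv"
        table[normalize_server(server)][is_csv] += 1
    return table
```

Dividing the True count by the row total for each server product would then give the share of responses using text/csv.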

Best
  Jürgen 
> 
> Thanks,
> Yakov
> 
> On Wed, Nov 19, 2014 at 6:30 AM, Juergen Umbrich <juergen.umbrich@wu.ac.at> wrote:
> Hi all,
> 
> As announced last week, here is a first, early report on our findings from examining 65k CSV files published as Open Data on the Web.
> 
> "This study reports on our findings about 74395 CSV files published on the Web as Open Data. The documents are extracted from 91 Open Data CKAN portals for which the metadata indicates a comma/character-separated-values file. Our analysis includes the inspection of the HTTP response headers, encoding detection and guessing of used delimiters. We also determine the deviation of data tables compared to a canonical form [1].
> 
> Our findings show that the majority of the CSV files adhere to the RFC4180 specification, meaning the use of csv as file extension, text/csv as the HTTP response header content-type, and ',' as delimiter. We also show that there exists nearly no information about the content encoding in the HTTP headers. The major observed deviations are that data tables contain rows in which one or several data cells occupy multiple columns and that one or several data cells are empty."
> 
> 
> 
> Best
>   Jürgen
> 
> --
> Dr. Jürgen Umbrich
> WU Vienna, Institute for Information Business
> 

--
Dr. Jürgen Umbrich
WU Vienna, Institute for Information Business

Received on Thursday, 20 November 2014 12:32:25 UTC