Re: Report on CSV files on the Web

On 19 November 2014 11:30, Juergen Umbrich <juergen.umbrich@wu.ac.at> wrote:
> Hi all,
>
> as “announced” last week, here is our first early report on our findings from looking into 65k CSV files published as Open Data on the Web.
>
> "This study reports on our findings about 74395 CSV files published on the Web as Open Data. The documents are extracted from 91 Open Data CKAN portals for which the metadata indicate a comma/character-separated-values file. Our analysis includes the inspection of the HTTP response headers, encoding detection and guessing of used delimiters. We also determine the deviation of data tables compared to a canonical form [1].
>
> Our findings show that the majority of the CSV files adhere to the RFC4180 specification, meaning the use of csv as file extension, text/csv as the HTTP response header content-type, and ’,’ as delimiter. We also show that there is almost no information about the content encoding in the HTTP headers. The major observed deviations are that data tables contain rows in which one or several data cells occupy multiple columns and that one or several data cells are empty."


This is great! But may I ask for more? :)

Did you make any investigation into the use of date and time formats /
conventions?
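
To make the question concrete, here is a rough sketch of what such a check could look like. It is only an illustration and not taken from the report: the list of candidate formats and the function names matching_formats and tally_formats are assumptions, and a real survey would need far more formats plus locale handling. Note that values like 03/04/2014 are inherently ambiguous between day-first and month-first conventions, which the sketch counts separately.

```python
from collections import Counter
from datetime import datetime

# Hypothetical candidate formats (an assumption for illustration only);
# a real survey would need a much longer, locale-aware list.
CANDIDATE_FORMATS = {
    "ISO 8601 date": "%Y-%m-%d",
    "ISO 8601 datetime": "%Y-%m-%dT%H:%M:%S",
    "day.month.year": "%d.%m.%Y",
    "slash-separated, day first": "%d/%m/%Y",
    "slash-separated, month first": "%m/%d/%Y",
}


def matching_formats(cell):
    """Return the names of all candidate formats that parse the cell value."""
    matches = []
    for name, fmt in CANDIDATE_FORMATS.items():
        try:
            datetime.strptime(cell.strip(), fmt)
        except ValueError:
            continue  # this format does not fit the value
        matches.append(name)
    return matches


def tally_formats(cells):
    """Count, per format, how many cell values match exactly one format.

    Values matching several formats are counted as "ambiguous",
    values matching none as "no match".
    """
    counts = Counter()
    for cell in cells:
        names = matching_formats(cell)
        if not names:
            counts["no match"] += 1
        elif len(names) > 1:
            counts["ambiguous"] += 1
        else:
            counts[names[0]] += 1
    return counts


if __name__ == "__main__":
    sample = ["2014-11-19", "19/11/2014", "03/04/2014", "n/a"]
    for name, count in tally_formats(sample).items():
        print(name, count)
```

Run over one column of a CSV file, a tally like this would show which convention dominates and how many values cannot be classified without further context.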

Dan

>
> Best
>   Jürgen
>
> --
> Dr. Jürgen Umbrich
> WU Vienna, Institute for Information Business

Received on Wednesday, 19 November 2014 15:38:46 UTC