Fwd: Analysis of different datatypes per column

Maybe the attachments were too big?

---------- Forwarded message ----------
From: Juergen Umbrich <juergen.umbrich@wu.ac.at>
Date: 25 February 2015 at 07:16
Subject: Fwd: Analysis of different datatypes per column
To: Dan Brickley <danbri@danbri.org>, Jeni Tennison <jeni@jenitennison.com>


Hi,
here the email,
i forgot to gzip one of the text files, and the email was 7MB- > that
might be the reason that the email was not delivered


> Begin forwarded message:
>
> Hi all,
>
> attached is a short survey of the amount of different “datatypes” per columns in CSV files and our logs containing the CSV URL and at what column we detected which types.
>
> DISCLAIMER:
> We are not sure how representative these statistics are for now due to the following reasons:
>       *) we only analysed 9000 csv files due to temporarily limited amount of resources and some not performance  optimised code
>       *) we used our datatype detection algorithm which are far from optimal
>
> Sorry, that we were not able to compile the report with more documents and a more accurate datatype detection.
> Anyway, i hope it sheds some light and can be of any help.
>
> Below are some more details to better understand the results
> The explanation of the data file:
>       first column = the different data types for a column ( separated by “_”=
>       second column = the count of tables for which we detected more than 1 datatype  in one column

>
>
> Sorting them by their count, we can see that
>       *) the most prominent pair of different datatypes per column is TEXT and NUMERICAL ( e.g ALPHA_FLOAT, ALPHA_INTEGER, ….)
>       *) we observed also very often a mix of empty cell values in a column
>       *) mix of different numerical values ( INTEGER and FLOAT , or various ranges of integer values)
>
> Little bit more details:
>       *) 2882 tables contain a column with text and float values (ALPHA_FLOAT)
>       *) 2513 tables contain a column with either characters+numbers  or only float
>               roughly the same amount of tables have in addition also at least one empty cell in the column
>       *) 1848 tables have a column with FLOAT or Integer values
>
> Explanation of our datatypes detection algorithm and datatypes.
>
>       for each cell in a column:
>               1) EMPTY -check: cell is empty or contains “null”
>               2) EMAIL based on email_regex = '[a-zA-Z0-9_\.\+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-\.]+’
>               3) RESOURCE = check if the cell value might be xx://xxx resource ( this is very buggy code :S)
>               4) if cell contains numbers
>                       4a) ALPAHNUM = numbers and characters
>                       4b) DATE if it matches one of the date_regex ( see below)
>                       4c) FLOAT typecasting to float
>                       4d) YEAR  if val >= 1400 and int_val <= 2100
>                       4e) YEAR_MONTH = if val >= 197000 and int_val < 210000
>                       4f) INTEGER if no comma and does not match any of YEAR/Date/etc.
>               5) if contains alpha
>                       5a) text if len(cell) >50
>                       5b) ALPHA = all characters in the string are alphabetic and there is at least one character
>                       5c) COMPLEX = if nothing of the above matches
> !!! You already recognised that this is far from optimal or very accurate and by looking again at the code and the results we discovered *many* improvements and input for an improved datatype algorithm
>

--
Dr. Jürgen Umbrich
WU Vienna, Institute for Information Business

Received on Wednesday, 25 February 2015 15:58:40 UTC