Re: Analysis of different datatypes per column

Hi David,

Seems there happen some concatenating errors, ALPHA++ should be ALPHA+, the same for FLOAT+ which should be FLOAT

best
  Jürgen

> On 04 Mar 2015, at 15:32, David Booth <david@dbooth.org> wrote:
> 
> What does ALPHA++ mean?  It appears more than once in the results, but it is not listed in the key below.  ALPHA+ is listed, but not ALPHA++ .
> 
> Thanks,
> David Booth
> 
> On 03/04/2015 08:44 AM, Juergen Umbrich wrote:
>> Hi all,
>> 
>> we compiled a new list of common datatype tuples in columns of
> > csv files ( see attached file).
>> 
>> Overall, we parsed 50007 csv files and parsed for the following datatypes:
>> 
>> EMPTY= empty string or “null”
>> ALPHA = all characters in the string are alphabetic and there is at least one character
>> ALPHA+ = alphabetic characters + special characters
>> EMAIL: email patterns
>> URL = based on a regex to detect IP, URL, FTP …. ( from the python django library)
>> ALPHANUM = alphabetical and numerical characters
>> ALPHANUM+ = alphabetical and numerical characters, plus special characters
>> DATE = if parseable by the dateutil.parser.parse(cell) (python)
>> NUMBER = only digits
>> FLOAT = digits with “.” or “,” separation
>> NUMERIC+= NUMBERS + special character which are not “.” or “,”
>> UNKNOWN = everything which does not fit the top part
>> 
>> 
>> best
>>   Jürgen
>>> On 25 Feb 2015, at 16:58, Dan Brickley <danbri@danbri.org> wrote:
>>> 
>>> Maybe the attachments were too big?
>>> 
>>> ---------- Forwarded message ----------
>>> From: Juergen Umbrich <juergen.umbrich@wu.ac.at>
>>> Date: 25 February 2015 at 07:16
>>> Subject: Fwd: Analysis of different datatypes per column
>>> To: Dan Brickley <danbri@danbri.org>, Jeni Tennison <jeni@jenitennison.com>
>>> 
>>> 
>>> Hi,
>>> here the email,
>>> i forgot to gzip one of the text files, and the email was 7MB- > that
>>> might be the reason that the email was not delivered
>>> 
>>> 
>>>> Begin forwarded message:
>>>> 
>>>> Hi all,
>>>> 
>>>> attached is a short survey of the amount of different “datatypes” per columns in CSV files and our logs containing the CSV URL and at what column we detected which types.
>>>> 
>>>> DISCLAIMER:
>>>> We are not sure how representative these statistics are for now due to the following reasons:
>>>>      *) we only analysed 9000 csv files due to temporarily limited amount of resources and some not performance  optimised code
>>>>      *) we used our datatype detection algorithm which are far from optimal
>>>> 
>>>> Sorry, that we were not able to compile the report with more documents and a more accurate datatype detection.
>>>> Anyway, i hope it sheds some light and can be of any help.
>>>> 
>>>> Below are some more details to better understand the results
>>>> The explanation of the data file:
>>>>      first column = the different data types for a column ( separated by “_”=
>>>>      second column = the count of tables for which we detected more than 1 datatype  in one column
>>> 
>>>> 
>>>> 
>>>> Sorting them by their count, we can see that
>>>>      *) the most prominent pair of different datatypes per column is TEXT and NUMERICAL ( e.g ALPHA_FLOAT, ALPHA_INTEGER, ….)
>>>>      *) we observed also very often a mix of empty cell values in a column
>>>>      *) mix of different numerical values ( INTEGER and FLOAT , or various ranges of integer values)
>>>> 
>>>> Little bit more details:
>>>>      *) 2882 tables contain a column with text and float values (ALPHA_FLOAT)
>>>>      *) 2513 tables contain a column with either characters+numbers  or only float
>>>>              roughly the same amount of tables have in addition also at least one empty cell in the column
>>>>      *) 1848 tables have a column with FLOAT or Integer values
>>>> 
>>>> Explanation of our datatypes detection algorithm and datatypes.
>>>> 
>>>>      for each cell in a column:
>>>>              1) EMPTY -check: cell is empty or contains “null”
>>>>              2) EMAIL based on email_regex = '[a-zA-Z0-9_\.\+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-\.]+’
>>>>              3) RESOURCE = check if the cell value might be xx://xxx resource ( this is very buggy code :S)
>>>>              4) if cell contains numbers
>>>>                      4a) ALPAHNUM = numbers and characters
>>>>                      4b) DATE if it matches one of the date_regex ( see below)
>>>>                      4c) FLOAT typecasting to float
>>>>                      4d) YEAR  if val >= 1400 and int_val <= 2100
>>>>                      4e) YEAR_MONTH = if val >= 197000 and int_val < 210000
>>>>                      4f) INTEGER if no comma and does not match any of YEAR/Date/etc.
>>>>              5) if contains alpha
>>>>                      5a) text if len(cell) >50
>>>>                      5b) ALPHA = all characters in the string are alphabetic and there is at least one character
>>>>                      5c) COMPLEX = if nothing of the above matches
>>>> !!! You already recognised that this is far from optimal or very accurate and by looking again at the code and the results we discovered *many* improvements and input for an improved datatype algorithm
>>>> 
>>> 
>>> --
>>> Dr. Jürgen Umbrich
>>> WU Vienna, Institute for Information Business
>>> <csv_urls_with_type.txt.gz><csv_datatypes.csv><signature.asc>
>> 
>> --
>> Dr. Jürgen Umbrich
>> WU Vienna, Institute for Information Business
>> 
>> 
>> 
>> 
>> 
> 

--
Dr. Jürgen Umbrich
WU Vienna, Institute for Information Business

Received on Wednesday, 4 March 2015 14:35:17 UTC